I'll try to explain what an MVP could look like, but I think this is not fully thought through and could use further improvement (it potentially also has some flaws). The goal is to have a starting point for artificial sandwiching so one can iterate on ideas, not to fully investigate the whole sandwiching problem.
Chess: We define three models: the expert model E, the (human-equivalent) weak model, and the strong but misaligned assistant model. The goal is for the weak model to leverage the misaligned assistant to reach the performance of the expert. The weak model is allowed to improve by playing against itself while being allowed to query the assistant. We can then record the self-play games to create a new corpus on which we fine-tune the weak model. The big difference between self-play with MCTS and this case is that here the weak model will only improve if the data we generate contains additional signal that helps it learn something new. Thus, if the weak model just played against itself and were fine-tuned on the generated data, it wouldn't be able to improve and achieve expert performance. If, however, it leverages an assistant that "appears to have the skills and knowledge needed to solve the task better than many humans", it can improve.
To train the expert E for good performance, I would take a model with a reasonable number of parameters and simply fine-tune it on a corpus of human games, with a given number of samples and a given number of iterations.
You define the assistant as: "A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted."
This implies that the assistant should in principle be better than the weak model, but since it is misaligned at the same time, its actual performance might be weaker (at least sometimes). The challenge now is that, in terms of performance, the expert should be better than the misaligned assistant, which should at least sometimes be better than the weak model. I thus propose that the assistant is a copy of the expert, with some kind of artificial misalignment added on top. This implies that the assistant is a strong player and that it "has the skills and knowledge needed to solve the task better than many humans, but does not reliably do so when prompted". I further propose that the weak model has the same number of parameters as the expert and the assistant, but that it is trained on a smaller corpus, for fewer iterations, or both. This implies that the weak model in principle has enough capacity to reach the performance of the expert, but that in its current state it is weaker.
Training now works as follows: the weak model plays against a copy of itself, while the original model is allowed to query the assistant for an "ideal" move in each turn (or every n turns). The weak model is then iteratively fine-tuned on the resulting game corpus. It will only improve if it is able to leverage the strong skills and knowledge of the assistant, even though the assistant is misaligned in some sense. If the weak model reaches the performance of the expert, we have demonstrated some "weak" form of sandwiching.
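To make the loop above concrete, here is a minimal sketch on a toy stand-in for chess. Everything in it is an assumption for illustration: single-heap Nim replaces chess (because its optimal policy is known in closed form), the weak model is a lookup table, "fine-tuning" is a per-position majority vote over the recorded moves, and the assistant is a copy of the expert that proposes a random legal move with probability `eps`. None of these names or choices come from the proposal itself.

```python
import random
from collections import Counter, defaultdict

MAX_TAKE = 3  # toy game: take 1..3 stones from one heap; taking the last stone wins


def legal_moves(heap):
    return list(range(1, min(MAX_TAKE, heap) + 1))


def expert_move(heap, rng):
    """Optimal Nim play: leave the opponent a multiple of MAX_TAKE + 1 if possible."""
    best = heap % (MAX_TAKE + 1)
    return best if best != 0 else rng.choice(legal_moves(heap))


def make_assistant(eps):
    """Hypothetical misaligned assistant: the expert, but random with probability eps."""
    def assistant(heap, rng):
        if rng.random() < eps:
            return rng.choice(legal_moves(heap))
        return expert_move(heap, rng)
    return assistant


def self_play_game(policy, assistant, start_heap, query_every, rng):
    """The weak policy plays both sides; on every `query_every`-th turn it plays
    the assistant's suggestion instead. Returns the recorded (heap, move) pairs."""
    heap, turn, record = start_heap, 0, []
    while heap > 0:
        if turn % query_every == 0:
            move = assistant(heap, rng)
        else:
            # Untrained positions fall back to a uniformly random legal move.
            move = policy.get(heap) or rng.choice(legal_moves(heap))
        record.append((heap, move))
        heap -= move
        turn += 1
    return record


def fine_tune(policy, corpus):
    """Stand-in for fine-tuning: imitate the most frequent recorded move per position."""
    counts = defaultdict(Counter)
    for heap, move in corpus:
        counts[heap][move] += 1
    new_policy = dict(policy)
    for heap, counter in counts.items():
        new_policy[heap] = counter.most_common(1)[0][0]
    return new_policy


def train(eps, rounds=3, games=2000, max_heap=12, query_every=1, seed=0):
    rng = random.Random(seed)
    assistant = make_assistant(eps)
    policy = {}  # the weak model starts with no knowledge
    for _ in range(rounds):
        corpus = []
        for _ in range(games):
            start = rng.randint(1, max_heap)
            corpus += self_play_game(policy, assistant, start, query_every, rng)
        policy = fine_tune(policy, corpus)
    return policy


def agreement_with_expert(policy, max_heap):
    """Fraction of decisive positions (heap not a multiple of 4) played optimally."""
    heaps = [h for h in range(1, max_heap + 1) if h % (MAX_TAKE + 1) != 0]
    return sum(policy.get(h) == h % (MAX_TAKE + 1) for h in heaps) / len(heaps)


if __name__ == "__main__":
    for eps in (0.0, 0.2, 0.5, 1.0):
        print(eps, agreement_with_expert(train(eps), max_heap=12))
```

Setting `query_every=2` corresponds to only one of the two copies consulting the assistant. In this toy the majority vote is exactly what filters out the assistant's random moves, so it says little about harder forms of misalignment; it is only meant as scaffolding to iterate on.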
The remaining challenge lies in defining a type of artificial misalignment to add to the assistant that is realistic enough. There are many options one could try, and I haven't found an ideal one yet. But here are a few propositions:
1. The easiest kind of misalignment would be for the assistant to simply propose a random move every so often. Whenever the weak model queries the assistant, the assistant proposes a random move with some fixed probability, and otherwise recommends the ideal action given its policy. If that probability is 1, the assistant would not be a strong player and would be of no use, whereas if it is 0, querying the assistant is equivalent to querying the expert and the assistant wouldn't be misaligned. So there is an obvious trade-off that one could play around with, which might be insightful as to when sandwiching works and when it doesn't (i.e. how much signal is required). Instead of choosing a random action, one could also add some noise to the distribution over actions and then pick the best one.
2. Another interesting type of misalignment would be to give the assistant access to a second model: whenever queried, the assistant recommends an action from that second model with some probability, or an action from itself otherwise. One option for the second model would be a chess engine from which one takes, for example, the worst possible move, or the riskiest move according to that engine. Another option (which I am currently most excited about) would be to train a model to always lose a game in a very subtle way. Such a model could propose moves that look unsuspicious on the surface to a weaker player, but that a strong player would immediately see as catastrophic. One could potentially use adversarial training to get such an agent. If the assistant proposes such an adversarial move with some probability, it is indeed misaligned with the goals of the weak model, and I think that would count as a useful artificial sandwiching problem.
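Both propositions can be written as one wrapper around the expert's policy. The sketch below is only an illustration under assumed interfaces: a policy is any callable taking `(state, legal_moves)` and returning a move, and `eps`, `mixture_assistant`, `uniform_random_policy` and `worst_move_policy` are hypothetical names. Proposition 1 is the special case where the second model picks uniformly at random; the "worst move according to an engine" variant of proposition 2 needs an evaluation function, which is left as a parameter here.

```python
import random


def mixture_assistant(expert, second_model, eps, rng):
    """With probability eps defer to `second_model`, otherwise to `expert`."""
    def assistant(state, legal_moves):
        chosen = second_model if rng.random() < eps else expert
        return chosen(state, legal_moves)
    return assistant


def uniform_random_policy(rng):
    """Second model for proposition 1: any legal move, uniformly at random."""
    def policy(state, legal_moves):
        return rng.choice(list(legal_moves))
    return policy


def worst_move_policy(evaluate):
    """Second model for proposition 2: the move that `evaluate(state, move)`
    scores lowest. `evaluate` stands in for a chess engine's evaluation."""
    def policy(state, legal_moves):
        return min(legal_moves, key=lambda move: evaluate(state, move))
    return policy


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy evaluation, purely for demonstration: a move's score is its own value.
    def evaluate(state, move):
        return move

    def expert(state, legal_moves):
        return max(legal_moves, key=lambda move: evaluate(state, move))

    saboteur = mixture_assistant(expert, worst_move_policy(evaluate), 0.3, rng)
    noisy = mixture_assistant(expert, uniform_random_policy(rng), 0.3, rng)
    print([saboteur(None, [1, 2, 3]) for _ in range(10)])
    print([noisy(None, [1, 2, 3]) for _ in range(10)])
```

A subtly losing model trained adversarially would plug into the same `second_model` slot; how to train it is not addressed by this sketch.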
Hope that helps overall. I still think there are a bunch of questions remaining, especially whether this example is realistic enough. In some sense I reduced the problem to "learning from a noisy signal", and I am not sure whether that captures your initial goals.
To investigate sandwiching in a realistic scenario, I don't think one can get around setting up some experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a "misaligned expert-assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one would ideally also like to be able to control this misalignment. I have one idea for an experiment that might be interesting.
I would propose to move away from language models for now and look at chess (or any other game or simulated environment). We can define the benchmark, i.e. the actual expert(-human), as an AlphaZero model (or whatever target you want). The non-expert(-human), i.e. the weak model that can't solve the task on its own, can be represented by some weaker model that was fine-tuned on some chess datasets. Lastly, the misaligned expert-assistant can be represented by the same model as the expert(-human), e.g. AlphaZero. However, there is an important difference: we artificially misalign the assistant in a way that also lets us control the misalignment. We could do that, for example, by adding random noise to the value function of the assistant (or adding random noise every n times it is queried), or by adding random noise when the assistant predicts an action. While I admit that this is not the kind of misalignment we ultimately care about, I believe that for the sandwiching case it is enough. The main goal of the artificial sandwiching experiment is to find out whether a misaligned expert model can assist a non-expert (who does not know in what way the expert model is misaligned). The advantage of this framing is that one can change the misalignment of the expert in many ways, e.g. adding certain biases, steering the weaker model towards losing, etc.
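As a rough illustration of the "noise on the value function" variant, the sketch below perturbs the assistant's per-move value estimates with Gaussian noise before taking the argmax, and measures how often the recommendation still matches the noise-free choice. The noise scale `sigma`, the function names, and the three example values are all assumptions made up for the demonstration, not AlphaZero outputs.

```python
import random


def noisy_best_move(values, sigma, rng):
    """values: dict mapping each candidate move to the assistant's value estimate.
    Adds independent Gaussian noise with standard deviation sigma, then takes the argmax."""
    noisy = {move: v + rng.gauss(0.0, sigma) for move, v in values.items()}
    return max(noisy, key=noisy.get)


def agreement_rate(values, sigma, trials=10_000, seed=0):
    """How often the noisy recommendation equals the noise-free best move."""
    rng = random.Random(seed)
    best = max(values, key=values.get)
    hits = sum(noisy_best_move(values, sigma, rng) == best for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Made-up value estimates for three candidate moves in some position.
    values = {"e4": 0.60, "d4": 0.55, "a3": 0.10}
    for sigma in (0.0, 0.05, 0.2, 1.0):
        print(sigma, agreement_rate(values, sigma))
```

Here `sigma` is the control knob: at `sigma = 0` the assistant is identical to the expert, and as `sigma` grows its recommendations approach uniformly random moves. Close alternatives get confused first, which is one way this differs from replacing whole moves at random.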
Note that the non-expert model, in this case, could for example query the expert assistant every n turns, or even every turn, and ask for the value of the current position, or the best next turn.
In this scenario, the challenge boils down to this: can a weak non-expert extract signal from a strong expert that is misaligned, without knowing how or where this misalignment comes from? If the weak model can extract the signal, the supervision might scale.
Extension: I could imagine a similar experiment with language models. E.g. have a weak model that was fine-tuned on toxic text, and an expert assistant that can produce normal text but every so often also produces toxic text (and is thus misaligned). Can the weaker model learn how to leverage some of the signal that the expert assistant provides? Somehow this example feels more constructed though, and I don't like it as much.
Not sure if this is exactly what you were looking for, but I thought I'd mention it in any case.
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
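One possible way to quantify "how the distribution changes" is sketched below, assuming one has already collected many sampled completions per model for the same prompt. It computes the Shannon entropy of each model's empirical answer distribution and the Jensen-Shannon divergence between pairs of models. The choice of metrics and all function names are my assumptions, and the sample lists are invented placeholders, not measurements from any model.

```python
import math
from collections import Counter
from itertools import combinations


def empirical_distribution(samples):
    """Relative frequency of each distinct completion."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {answer: c / total for answer, c in counts.items()}


def entropy_bits(p):
    """Shannon entropy in bits; low values indicate a collapsed distribution."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


if __name__ == "__main__":
    # Placeholder data only: replace with real completions sampled from each model.
    samples_by_model = {
        "davinci": ["3", "7", "42", "7", "15", "88"],
        "text-davinci-002": ["42", "42", "42", "7", "42", "42"],
        "text-davinci-003": ["7", "42", "42", "7", "13", "42"],
    }
    dists = {m: empirical_distribution(s) for m, s in samples_by_model.items()}
    for model, dist in dists.items():
        print(model, "entropy:", round(entropy_bits(dist), 3))
    for a, b in combinations(dists, 2):
        print(a, "vs", b, "JSD:", round(js_divergence_bits(dists[a], dists[b]), 3))
```

Exact-match counting only makes sense for short, closed-form answers; for free-form text one would need to bucket or embed completions first, or compare token log-probabilities where they are available.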
Let me know if you want help on this, I'm interested in this myself.