Pushing back a little on this part of the appendix:
Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.
I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models which, even though they have an RL component in their training, their outputs are all human-understandable (and incorporated into the oversight process).
But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads given how much more dangerous they would be and harder to evaluate.
Nice project and writeup. I particularly liked the walkthrough of thought processes throughout the project
Decision square's Euclidean distance to the top-right 5×5 corner, positive (+1.326).We are confused and don't fully understand which logical interactions produce this positive regression coefficient.
Decision square's Euclidean distance to the top-right 5×5 corner, positive (+1.326).
We are confused and don't fully understand which logical interactions produce this positive regression coefficient.
I'd be weary about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.
It might be worth making a cross-correlation plot of the features. This won't give you a new coefficients to put faith in, but it might help you decide how much to trust the ones you have. It can also be useful looking at how unstable the coefficients are during training (or e.g. when trained on a different dataset).