Re: Black box methods like "asking the model if it has hidden goals."
I'm worried that these methods only seem very powerful (e.g., Evans group's Tell me about yourself, the pre-fill black box methods in Auditing language models) because the text outputs of the models in those papers haven't undergone much optimization pressure.
Outputs from real-world models might undergo lots of scrutiny/optimization pressure[1] so that the model appears to be a "friendly chatbot." AI companies put much more care into crafting those personas than model organisms researchers would, and so the AI could learn to "say nice things" much more effectively.
So it's possible that model internals will be much more faithful relative to model outputs in real-world settings compared to academic settings.
Or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions of people. Honestly hard to say nowadays.