Sodium

Comments

Sodium

Re: Black box methods like "asking the model if it has hidden goals."

I'm worried that these methods seem very powerful (e.g., the Evans group's Tell me about yourself, the prefill black-box methods in Auditing language models) because the text outputs of the models in those papers haven't undergone a lot of optimization pressure.

Outputs from real-world models might undergo lots of scrutiny/optimization pressure[1] so that the model appears to be a "friendly chatbot." AI companies put much more care into crafting those personas than model organisms researchers would, and thus the AI could learn to "say nice things" much better.

So it's possible that model internals will be much more faithful relative to model outputs in real-world settings than in academic settings.

  1. ^

    or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions of people. Honestly hard to say nowadays.

Sodium

Thanks for putting this up! Just to double-check: there aren't any restrictions against doing multiple AISC projects at the same time, right?