AI ALIGNMENT FORUM

Tom Davidson

How will we update about scheming?
Tom Davidson · 8mo

Thanks for this!

Compared to you, I'm more inclined to expect that we'd find evidence of scheming if it exists.

You argue weak schemers might just play nice. But if so, we can use them to do loads of intellectual labour building sophisticated behavioral red-teaming and interpretability tools to catch out the next generation of AI.

More generally, the plan of bootstrapping to increasingly complex behavioral tests and control schemes seems likely to work. If one model has spent a lot of thinking time designing a control scheme, then another model would have to be much smarter to cause a catastrophe zero-shot without the scheme detecting it. Analogies with humans suggest this, for example.
