Consider the following vignette: > It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate...
Playing Schelling games is a key dangerous capability for schemers: it’s much harder to control AIs that are very capable at Schelling coordination. In this post, I discuss the threat model and describe concrete projects for empirical work. Suppose you wish to control models that adversarially try to cause a...
"I" refers to Olli. Thanks to Buck, Ryan Greenblatt, Fabien Roger and Zach Stein-Perlman for feedback. The word “schemer” is used to refer to multiple things, and in particular to the following two concepts: Schemer as a power-motivated instrumental training-gamer: following Carlsmith, a schemer is a model that optimizes for...
In this post, I’ll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find that it A) is directly important in practice, B) more readily distinguishes between the capability for scheming and actual scheming, and C) is clearer to evaluate for. I’ll...
Abstract: > We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete; these tasks span writing assistance, information retrieval, and programming. We then introduce situations where the model might...
Produced as part of the Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan, Sam Svenningsen and Bogdan-Ionut Cirstea for suggestions and discussions on the topic. Summary: Arguments for deceptive alignment rely, among other things, on a claim of the form “if an AI has a good world...