A PDF version of this report is available here. Summary: In this report we argue that AI systems capable of large-scale scientific research will likely pursue unwanted goals, and that this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current...
TLDR: We might want to use some form of oversight technique to avoid inner misalignment failures. Models will be too large and complicated for a human to understand, so we will use models to oversee models (or to help humans oversee models). In many proposals this overseer model is an...
In this post I want to lay out some framings and thoughts about deception in misaligned AI systems. Types of Deception: There seem to be two different things that people mean by ‘deception’, which have different causes and likely different effects. Because these are both often called ‘deception’, they are...
This post was written under Evan Hubinger’s direct guidance and mentorship, as part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Additional thanks to Michele Campolo, Oliver Zhang, Adam Shimi, and Leo Gao for their thoughts and feedback on this post. This work is additionally...