Executive summary

Our mission at Apollo Research is to reduce catastrophic risks from AI by auditing advanced AI systems for misalignment and dangerous capabilities, with an initial focus on deceptive alignment. In our announcement post, we presented a brief theory of change for our organization, which explains why we expect...
Crossposted from my personal blog. Everybody knows about the hedonic treadmill. Your hedonic state adjusts to your circumstances over time and quickly reverts to a mostly stable baseline. This is true of basic physiological needs – you feel hungry; you seek out food; you eat; you feel sated; and you...
Some quotes:

> Our approach
>
> Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
>
> To align the first automated alignment researcher, we will need to 1) develop a scalable...
TL;DR

1. We are a new AI evals research organization called Apollo Research, based in London.
2. We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting...
This is a linkpost to a set of slides containing an update to a project that was the subject of a previous post ([Interim research report] Taking features out of superposition with sparse autoencoders). The update is very small and scrappy. We haven't had much time to devote to this...
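For readers unfamiliar with the referenced technique: the earlier report trains a sparse autoencoder on a model's internal activations so that features stored in superposition separate out into individual dictionary elements. Below is a minimal sketch of that general setup, assuming the standard L1-penalised overcomplete autoencoder formulation; the class names, dimensions, and coefficient are illustrative and not taken from the slides or the report.

```python
# Minimal sparse-autoencoder sketch (illustrative; not the authors' code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is typically several times d_activation.
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        # ReLU keeps the feature codes non-negative; the L1 penalty keeps them sparse.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Trade reconstruction fidelity against sparsity of the feature codes.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

Under this framing, the decoder's weight columns act as a learned feature dictionary, and each activation vector is explained as a sparse combination of those directions.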
This is crossposted from my personal blog. Epistemic Status: Much of this draws from my studies in neuroscience and ML. Many of the ideas in this post are heavily inspired by the work of Steven Byrnes and the authors of Shard Theory. However, it speculates quite a long way in...
We thank Eric Winsor, Lee Sharkey, Dan Braun, Carlos Ramon Guevara, and Misha Wagner for helpful suggestions and comments on this post. This post builds upon our last post on basic facts about language model internals and was written as part of the work done at Conjecture. We will shortly...