This post benefitted greatly from comments, suggestions, and ongoing discussions with David Duvenaud, David Krueger, and Jan Kulveit. All errors are my own. A few months ago, my coauthors and I published Gradual Disempowerment (GD hereafter). It was mostly about how things might go wrong, but naturally a lot of...
TLDR:
* Though we tend to have LMs play the role of a helpful assistant, they can in fact generate the responses of any persona you ask for
* It's not just explicit: the persona you get is also steered by implicit cues, and confined by what it's easy...
Full version on arXiv | X

Executive summary

AI risk scenarios usually portray a relatively sudden loss of human control to AIs that outmaneuver individual humans and human institutions, driven by a sudden increase in AI capabilities or a coordinated betrayal. However, we argue that even an incremental increase in AI...
What is an agent? It's a slippery concept with no commonly accepted formal definition, but informally it seems useful. One angle on it is Dennett's Intentional Stance: we think of an entity as an agent if we can predict it more easily by treating it as...
TLDR: Agents made out of conditioned predictive models are not utility maximisers and won't, for instance, try to resist certain kinds of shutdown, despite generally performing well. This is just a short, cute example that I've explained in conversation enough times that I'm now hastily writing it...
tldr: a consistent LLM failure suggests a possible avenue for alignment and control

epistemic status: somewhat hastily written, speculative in places, but with lots of graphs of actual model probabilities

Today in 'surprisingly simple tasks that even the most powerful large language models can't do': writing out the alphabet but skipping over one...
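For a sense of the kind of probe that post is about, here is a minimal sketch (my own construction, not the post's actual code) that compares the probability a small open model assigns to the correct next letter against the letter it was told to skip. The gpt2 checkpoint, the prompt wording, and the choice of skipped letter are all assumptions for illustration.

```python
# Minimal sketch (assumed setup, not the post's code): after prompting a model
# to write the alphabet while skipping one letter, compare the probability it
# assigns to the correct next letter vs. the letter it was told to skip.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

skip = "C"  # assumed: which letter to skip
letters = [c for c in "ABCDEFGHIJ" if c != skip]  # A B D E F G H I J
prompt = f"The alphabet, skipping the letter {skip}: " + " ".join(letters[:5])

with torch.no_grad():
    ids = tok(prompt, return_tensors="pt").input_ids
    probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

# The correct continuation is letters[5] ("G"); the failure mode of interest
# is the model putting mass back on the skipped letter instead.
for letter in (letters[5], skip):
    token_id = tok.encode(" " + letter)[0]
    print(f"P({letter!r} next) = {probs[token_id].item():.4f}")
```

Sweeping the skipped letter and the prompt length, and plotting the two probabilities, would give graphs of the sort the post describes.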