Full version on arXiv | X
Executive summary
AI risk scenarios usually portray a relatively sudden loss of human control to AIs that outmaneuver individual humans and human institutions, due to either a sudden increase in AI capabilities or a coordinated betrayal. However, we argue that even an incremental increase in AI...
We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.
Introduction
Controlling LLM behavior by intervening directly on internal activations is an appealing idea, and various activation steering methods have been proposed for doing so. Most steering methods add a 'steering vector' (SV) to...
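For readers unfamiliar with the mechanics, here is a minimal sketch (not the authors' implementation) of what adding a steering vector during a forward pass typically looks like, using a PyTorch forward hook on GPT-2. The model, layer index, steering strength, contrastive prompts, and the difference-of-activations recipe for building the vector are all illustrative assumptions, not details taken from the post above.

```python
# Minimal activation steering sketch: add a vector to one layer's hidden states
# via a forward hook. All specific choices below (gpt2, layer 6, alpha, prompts)
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 6    # which transformer block to steer (illustrative)
alpha = 4.0  # steering strength (illustrative)

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token after block `layer`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block `layer` is index layer + 1
    return out.hidden_states[layer + 1][0, -1, :]

# One common recipe: the steering vector is the difference of activations
# on two contrastive prompts (details vary across steering methods).
steering_vector = last_token_activation("I love this") - last_token_activation("I hate this")

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + alpha * steering_vector  # broadcasts over batch and positions
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steering_hook)
try:
    prompt_ids = tok("The movie was", return_tensors="pt").input_ids
    generated = model.generate(prompt_ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(generated[0]))
finally:
    handle.remove()  # detach the hook so the model is left unmodified
```

Removing the hook afterwards matters in practice: otherwise every later forward pass is silently steered.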
For AI systems up to some high level of intelligence, deception/"scheming" plausibly only makes sense as a strategy if they know exactly where they are in space-time. This is because they need to know: 1) what sort of oversight they are subject to, and 2) what effects their...
We’ve recently released a comprehensive research agenda on LLM safety and alignment. This is a collaborative work with contributions from more than 35 authors across the fields of AI Safety, machine learning, and NLP. Major credit goes to first author Usman Anwar, a second-year PhD student of mine who...
This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community:[1]
Me: you guys should write up your work properly and try to publish it in ML venues.
Them: well that seems like a lot of...
I believe Anthropic has said they won't publish capabilities research? OpenAI seems to be sort of doing the same (although no policy AFAIK). I heard FHI was developing one way back when... I think MIRI sort of does as well (default to not publishing, IIRC?)
I think the terms "AI Alignment" and "AI existential safety" are often used interchangeably, which leads the two ideas to be conflated. In practice, I think "AI Alignment" is mostly used in one of the following three ways, and should be used exclusively for Intent Alignment (with some vagueness about whose intent,...