AI ALIGNMENT FORUM
AF

I have referred back to this post a lot since writing it. I still think it's underrated, because without understanding what we mean by "alignment research" it's easy to get all sorts of confused about what the field is trying to do.

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Neel Nanda1mo100Review for 2024 Review

Out-of-context reasoning, the phenomenon where models can learn much more general, unifying structure when fine-tuned on something fairly specific, was a pretty important update to my mental model of how neural networks work.

This paper wasn't the first, but it was one of the cleaner and more compelling early examples (though emergent misalignment is now the most famous).

After staring at it for a while, I now feel less surprised by out-of-context reasoning. Mechanistically, there's no reason the model couldn't learn the generalizing solution. And on a task li... (read more)

Mechanistically Eliciting Latent Behaviors in Language Models

Neel Nanda1mo50Review for 2024 Review

I like this post. It's a simple idea that was original to me, and seems to basically work.

In particular, it seems able to discover things about a model we might not have expected. I generally think that each additional unsupervised technique, i.e. one that can discover unexpected insights, is valuable, because each additional technique is another shot on goal that might find what's really going on. So the more the better!

I have not, in practice, seen MELBO used that much, which is a shame. But I think the core idea seems sound.

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda1mo90Review for 2024 Review

I feel pretty great about this post. It likely took five to ten hours of my time, and I think it has been useful to a lot of people. I have pointed many people to this post since writing it, and I imagine many other newcomers to the field have read it.

I generally think there is a gap where experienced researchers can use their accumulated knowledge to create field-building materials fairly easily that are extremely useful to newcomers to the field, but don't (typically because they're busy - see how I haven't updated this yet). I'd love to see more people ... (read more)

Refusal in LLMs is mediated by a single direction

Neel Nanda1mo151Review for 2024 Review

This is probably one of the most influential papers that I've supervised, and my most cited MATS paper (400+ citations).

For a period, a common answer when I asked people what got them into mechanistic interpretability was this paper.
I often meet people who incorrectly think that this paper introduced the technique of steering vectors.
This inspired at least some research within all of the frontier labs.
There have been a bunch of follow-on papers; one of my favourites was this Meta paper on guarding against the technique.
The technique has been widely us

... (read more)

Many arguments for AI x-risk are wrong

Davidmanheim1mo2-4Review for 2024 Review

Having read the post, and debates in the comments, and Vanessa Kosoy's review, I think this post is valuable and important, even though I agree that there are significant weaknesses in various places, certainly with respect to the counting arguments and the measure of possible minds - as I wrote about here in intentionally much simpler terms than Vanessa has done.

The reason I think it is valuable is that weaknesses in one part of their specific counterargument do not obviate the variety of valid and important points in the post, though I'd be far happier i... (read more)

SAE feature geometry is outside the superposition hypothesis

Adam Scherlis2mo30Review for 2024 Review

This post makes the excellent point that the paradigm that motivated SAEs -- the superposition hypothesis -- is incompatible with widely-known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn't enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn't be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circu... (read more)

Defining alignment research

Vanessa Kosoy2mo120Review for 2024 Review

In the post Richard Ngo talks about delineating "alignment research" vs. "capability research", i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:

Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. "usual" behavior. Arguably, net-beneficial research tends to be on the

Vanessa Kosoy2mo70Review for 2024 Review

This post is an overview of Steven Byrnes' AI alignment research programme, which I think is interesting and potentially very useful.

In a nutshell, Byrnes' goal is to reverse engineer the human utility function, or at least some of its central features. I don't think this will succeed in the sense of, we'll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:

Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad e

... (read more)

Many arguments for AI x-risk are wrong

Vanessa Kosoy2mo2242Review for 2024 Review

This is a deeply confused post.

In this post, Turner sets out to debunk what he perceives as "fundamentally confused ideas" which are common in the AI alignment field. I strongly disagree with his claims.

In section 1, Turner quotes a passage from "Superintelligence", in which Bostrom talks about the problem of wireheading. Turner declares this to be "nonsense" since, according to Turner, RL systems don't seek to maximize a reward.

First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reas... (read more)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication

Vanessa Kosoy2mo120Review for 2024 Review

This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:

It updates me towards natural latents being a useful concept for foundational questions in agent theory, as opposed to being some artifact of overindexing on Bayesian networks as the "right" ontology.
The proof technique involves defining an algorithmic information theory analogue of Bayesian networks, which is something I haven't seen be

Jan_Kulveit2mo50Review for 2024 Review

(Self-review) The post offered an alternative and possibly more neutral framing to the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness

I think parts of that aged fairly well
- the suspicion that models often implicitly know they are being evaluated, or that the setup is fishy, was validated in multiple papers
- non-tr... (read more)

Hierarchical Agency: A Missing Piece in AI Alignment

Vanessa Kosoy2mo5-1Review for 2024 Review

In this post Jan Kulveit calls for creating a theory of "hierarchical agency", i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.

The form of the post is a dialogue between Kulveit and Claude (the AI). I don't like this format. I think that dialogues are a bad format in general, disorganized and not skimming-friendly. The only case where, IMO, dialogues are defensible is when it's a real dialogue: real people with different world-views that are trying to bridge and/or argue their differences.

Now, about... (read more)

Alignment Faking in Large Language Models

johnswentworth2mo1814Review for 2024 Review

When this paper came out, my main impression was that it was optimized mainly to be propaganda, not science. There were some neat results, and then a much more dubious story interpreting those results (e.g. "Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences."), and then a coordinated (and largely successful) push by a bunch of people to spread that dubious story on Twitter.

I have not personally paid enough attention to have a whole discussion about the dubiousness of... (read more)

Modern Transformers are AGI, and Human-Level

Vanessa Kosoy2mo80Review for 2024 Review

In this post, Abram Demski argues that existing AI systems are already "AGI". They are clearly general in a way previous generations of AI were not, and claiming that they are still not AGI smells of moving the goalposts.

Abram also helpfully edited the post to summarize and address some of the discussion in the comments. The commenters argued, and Abram largely agreed, that there are still important abilities that modern AI lacks. However, there is still the question of whether that should disqualify it from the moniker "AGI", or maybe we need new terminol... (read more)

Linear infra-Bayesian Bandits

Vanessa Kosoy2mo50Review for 2024 Review

This work^[1] was the first^[2] foray into proving non-trivial regret bounds in the robust (infra-Bayesian) setting. The specific bound I got was later slightly improved in Diffractor's and my later paper. This work studied a variant of linear bandits, due to the usual reasons linear models are often studied in learning theory: it is a conveniently simple setting where we actually know how to prove things, even with computationally efficient algorithms. (Although we still don't have a computationally efficient algorithm for the robust version: not bec... (read more)

Infra-Bayesian haggling

Vanessa Kosoy2mo50Review for 2024 Review

TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.

Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.

Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a gener... (read more)

Decomposing Agency — capabilities without desires

owencb2mo20Review for 2024 Review

I like this post and am glad that we wrote it.

Despite that, I feel keenly aware that it's asking a lot more questions than it's answering. I don't think I've got massively further in the intervening year in having good answers to those questions. The way this thinking seems to me to be most helpful is as a background model to help avoid confused assumptions when thinking about the future of AI. I do think this has impacted the way I think about AI risk, but I haven't managed to articulate that well yet (maybe in 2026 ...).

The Checklist: What Succeeding at AI Safety Will Involve

Raymond Douglas2mo52Review for 2024 Review

I think this post is on the frontier for some mix of:

Giving a thorough plan for how one might address powerful AI
Conveying something about how people in labs are thinking about what the problem is and what their role in it is
Not being overwhelmingly filtered through PR considerations

Obviously one can quibble with the plan and its assumptions but I found this piece very helpful in rounding out my picture of AI strategy - for example, in thinking about how to decipher things that have been filtered through PR and consensus filters, or in situating work that ... (read more)

In Defense of Open-Minded UDT

Vanessa Kosoy2mo30Review for 2024 Review

This post discusses an important point: it is impossible to be simultaneously perfectly priorist ("updateless") and learn. Learning requires eventually "passing to" something like a posterior, which is inconsistent with forever maintaining "entanglement" with a counterfactual world. This is somewhat similar to the problem of traps (irreversible transitions): being prudent about risking traps requires relying on your prior, which prevents you from learning every conceivable opportunity.

My own position on this cluster of questions is that you should be prior... (read more)

Alignment Faking in Large Language Models

Jan_Kulveit2mo2112Review for 2024 Review

Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light

A year later, in my view
- the research direction itself was very successful, and led to many followups and extensions
- the 'alignment faking' term and the negative frame were also successful and are sticky: I've just checked the valence with which the paper is cited in the 10 most recent pa... (read more)

A Three-Layer Model of LLM Psychology

Jan_Kulveit2mo60Review for 2024 Review

I'm quite happy about this post: even while people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.

In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and the result is closer to humans than expected. Our brains are also able to run many different characters, but the default "you" character is somewhat unique, privileged and able to steer the underlying computation.

Similarly the understanding ... (read more)

Catching AIs red-handed

Buck2mo122Review for 2024 Review

Before this post, I'm not aware of anything people had written on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.

I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AIs behaving badly, and research on few-shot catastrophe detection techniques.

AI catastrophes and rogue deployments

Buck2mo40Review for 2024 Review

The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.

A basic systems architecture for AI agents that do autonomous research

Buck2mo40Review for 2024 Review

The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.

Since I wrote this post, agent scaffolds are used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but isn't the design used by agents that you run on your own computer like Claude Code or Gemini CLI or whatever. I think agents will move in the direction that I described, especially as people want to be able to work with more of them, want to give them longer t... (read more)

Different senses in which two AIs can be “the same”

Buck2mo62Review for 2024 Review

I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo2mo50Review for 2024 Review

I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot

I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the imp... (read more)

Self-Other Overlap: A Neglected Approach to AI Alignment

Thomas Kwa2mo32Review for 2024 Review

I'm giving this +1 review point despite not having originally been excited about this in 2024. Last year, I and many others were in a frame where alignment plausibly needed a brilliant idea. But since then, I've realized that execution and iteration on ideas we already have is highly valuable. Just look at how much has been done with probes and steering!

Ideas like this didn't match my mental picture of the "solution to alignment", and I still don't think it's in my top 5 directions, but with how fast AI safety has been growing, we can assign 10 researchers... (read more)

My motivation and theory of change for working in AI healthtech

Thomas Kwa2mo102Review for 2024 Review

I didn't believe the theory of change at the time and still don't. The post doesn't really make a full case for it, and I doubt it really convinced anyone to work on this for the right reasons.

No BOTEC
Models of social dynamics are handwavy
In alignment and other safety work, we have force multipliers like neglectedness and influencing government. What's the multiplier here? Or is there no intervention that has multiplier effects in the long-term multipolar risk space?
Why wouldn't the effect just get swamped by the dozens of medical AI startups that are rais

Vanessa Kosoy3mo90Review for 2024 Review

The interpretation of quantum mechanics is a philosophical puzzle that has baffled physicists and philosophers for about a century. In my view, this confusion is a symptom of us lacking a rigorous theory of epistemology and metaphysics. At the same time, creating such a theory seems to me like a necessary prerequisite for solving the technical AI alignment problem. Therefore, once we created a candidate theory of metaphysics (Formal Computation Realism (FCR), formerly known as infra-Bayesian Physicalism), the interpretation of quantum mechanics stood out ... (read more)

Self-Other Overlap: A Neglected Approach to AI Alignment

Gordon Seidoh Worley3mo33Review for 2024 Review

I continue to be excited about this class of approaches. To explain why is roughly to give an argument for why I think self-other overlap is relevant to normative reasoning, so I will sketch that argument here:

agents (purposeful, closed, negative feedback systems) care about stuff
what an agent cares about forms the basis for reasoning what norms it thinks are good to follow
some agents, like humans, care what other agents think
therefore, the norms an agent follows depend in part on what other agents care about
the less an agent considers itself as distinct fr

... (read more)

Why does generalization work?

Gordon Seidoh Worley3mo10Review for 2024 Review

This post still stands out to me as making an important and straightforward point about observer dependence of knowledge that is still, in my view, underappreciated (enough so that I wrote a book about it and related epistemological ideas!). I continue to think this is quite important for understanding AI, and in particular addressing interpretability concerns as they relate to safety, since lacking a general theory of why and how generalization happens, we may risk mistakes in building aligned AIs if they categorize the world in unusual ways that we don't anticipate or understand.

The 2024 Review
