Out-of-context reasoning, the phenomenon where models can learn much more general, unifying structure when fine-tuned on something fairly specific, was a pretty important update to my mental model of how neural networks work.
This paper wasn't the first, but it was one of the cleaner and more compelling early examples (though emergent misalignment is now the most famous).
After staring at it for a while, I now feel less surprised by out-of-context reasoning. Mechanistically, there's no reason the model couldn't learn the generalizing solution. And on a task li...
I like this post. It's a simple idea that was original to me, and seems to basically work.
In particular, it seems able to discover things about a model we might not have expected. I generally think that each additional unsupervised technique, i.e. one that can discover unexpected insights, is valuable, because each additional technique is another shot on goal that might find what's really going on. So the more the better!
I have not, in practice, seen MELBO used that much, which is a shame. But I think the core idea is sound.
I feel pretty great about this post. It likely took five to ten hours of my time, and I think it has been useful to a lot of people. I have pointed many people to this post since writing it, and I imagine many other newcomers to the field have read it.
I generally think there is a gap where experienced researchers can use their accumulated knowledge to fairly easily create field-building materials that are extremely useful to newcomers to the field, but don't (typically because they're busy - see how I haven't updated this yet). I'd love to see more people ...
This is probably one of the most influential papers that I've supervised, and my most cited MATS paper (400+ citations).
Having read the post, and debates in the comments, and Vanessa Kosoy's review, I think this post is valuable and important, even though I agree that there are significant weaknesses in various places, certainly with respect to the counting arguments and the measure of possible minds - as I wrote about here in intentionally much simpler terms than Vanessa has done.
The reason I think it is valuable is that weaknesses in one part of their specific counterargument do not obviate the variety of valid and important points in the post, though I'd be far happier i...
This post makes the excellent point that the paradigm that motivated SAEs -- the superposition hypothesis -- is incompatible with widely known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn't enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn't be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circu...
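For concreteness, here is a minimal sketch of the kind of check this is pointing at, with a random stand-in decoder matrix where a real analysis would load a trained SAE's weights (the variable names are mine, not the post's):

```python
import numpy as np

# Stand-in for a trained SAE's decoder matrix: one row per learned feature
# direction. A real analysis would load actual decoder weights instead.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(1000, 512))  # (n_features, d_model)

# Normalize rows so that dot products are cosine similarities.
W_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos_sims = W_unit @ W_unit.T

# For each feature, list its nearest neighbours (excluding itself).
np.fill_diagonal(cos_sims, -np.inf)
nearest = np.argsort(-cos_sims, axis=1)[:, :5]
print(nearest[:3])
```

Under a strict reading of superposition, these nearest-neighbour lists would just reflect incidental crowding of directions; the post's observation is that for real SAEs they tend to be semantically related features, i.e. the cosine structure carries meaning.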
In the post, Richard Ngo talks about delineating "alignment research" vs. "capability research", i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI risk rather than harmful. He comes up with a taxonomy based on two axes:
This post is an overview of Steven Byrnes' AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes' goal is to reverse engineer the human utility function, or at least some of its central features. I don't think this will succeed in the sense that we'll find an explicit representation that can be hard-coded into an AI. However, I believe that this kind of research is useful for two main reasons:
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as "fundamentally confused ideas" which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from "Superintelligence", in which Bostrom talks about the problem of wireheading. Turner declares this to be "nonsense" since, according to Turner, RL systems don't seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reas...
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
(Self-review) The post offered an alternative and possibly more neutral framing of the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know when they are being evaluated or when the setup is fishy was validated in multiple papers
- non-tr...
In this post Jan Kulveit calls for creating a theory of "hierarchical agency", i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.
The form of the post is a dialogue between Kulveit and Claude (the AI). I don't like this format. I think that dialogues are a bad format in general: disorganized and not skim-friendly. The only case where dialogues are IMO defensible is when it's a real dialogue: real people with different worldviews who are trying to bridge and/or argue out their differences.
Now, about...
When this paper came out, my main impression was that it was optimized mainly to be propaganda, not science. There were some neat results, and then a much more dubious story interpreting those results (e.g. "Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences."), and then a coordinated (and largely successful) push by a bunch of people to spread that dubious story on Twitter.
I have not personally paid enough attention to have a whole discussion about the dubiousness of...
In this post, Abram Demski argues that existing AI systems are already "AGI". They are clearly general in a way previous generations of AI were not, and claiming that they are still not AGI smells of moving the goalposts.
Abram also helpfully edited the post to summarize and address some of the discussion in the comments. The commenters argued, and Abram largely agreed, that there are still important abilities that modern AI lacks. However, there is still the question of whether that should disqualify it from the moniker "AGI", or maybe we need new terminol...
This work[1] was the first[2] foray into proving non-trivial regret bounds in the robust (infra-Bayesian) setting. The specific bound I got was later slightly improved in Diffractor's and my later paper. This work studied a variant of linear bandits, for the usual reason linear models are often studied in learning theory: it is a conveniently simple setting where we actually know how to prove things, even with computationally efficient algorithms. (Although we still don't have a computationally efficient algorithm for the robust version: not bec...
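For readers unfamiliar with the term, here is the standard (non-robust) notion of regret for stochastic linear bandits, in my own notation rather than the paper's; as I understand it, the robust version replaces the single unknown environment with a credal set and evaluates a worst case, but the basic shape of the quantity is the same:

```latex
% Pseudo-regret in the standard stochastic linear bandit setting
% (my notation, not the paper's): action set \mathcal{A} \subset \mathbb{R}^d,
% unknown parameter \theta^*, action a_t chosen at round t.
\[
R_T \;=\; \max_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle \theta^{*}, a \rangle
\;-\; \sum_{t=1}^{T} \langle \theta^{*}, a_t \rangle .
\]
% A "non-trivial regret bound" is a guarantee that R_T grows sublinearly in T,
% i.e. the average per-round loss relative to the best fixed action vanishes.
```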
TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.
Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.
Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a gener...
I like this post and am glad that we wrote it.
Despite that, I feel keenly aware that it's asking a lot more questions than it's answering. I don't think I've got massively further in the intervening year in having good answers to those questions. The way this thinking seems to me to be most helpful is as a background model to help avoid confused assumptions when thinking about the future of AI. I do think this has impacted the way I think about AI risk, but I haven't managed to articulate that well yet (maybe in 2026 ...).
I think this post is on the frontier for some mix of:
Obviously one can quibble with the plan and its assumptions but I found this piece very helpful in rounding out my picture of AI strategy - for example, in thinking about how to decipher things that have been filtered through PR and consensus filters, or in situating work that ...
This post discusses an important point: it is impossible to be simultaneously perfectly priorist ("updateless") and learn. Learning requires eventually "passing to" something like a posterior, which is inconsistent with forever maintaining "entanglement" with a counterfactual world. This is somewhat similar to the problem of traps (irreversible transitions): being prudent about risking traps requires relying on your prior, which prevents you from pursuing every conceivable learning opportunity.
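To make the tension concrete, here is a toy formalization in my own notation and with my own example (not the post's): an updateless agent commits to a complete policy up front, optimizing against the prior, and a trap can make that commitment permanently bar it from an otherwise learnable opportunity.

```latex
% Toy setup (my notation): environments e in a finite set E with prior \xi,
% policies \pi mapping observation histories to actions, utility U(\pi, e).
% The updateless agent chooses once, before any observations:
\[
\pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}_{e \sim \xi}\bigl[\,U(\pi, e)\,\bigr].
\]
% Suppose some exploratory action a is an irreversible trap in e_1 but the
% only way to discover a large payoff in e_2. In the simplest two-environment,
% one-shot version, \pi^* refuses a whenever
\[
\xi(e_1)\,\lvert U_{\mathrm{trap}}\rvert \;>\; \xi(e_2)\,U_{\mathrm{payoff}},
\]
% and no amount of later (safe) observation can revise that commitment:
% prudence about traps is purchased with forgone learning.
```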
My own position on this cluster of questions is that you should be prior...
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' name and the negative frame were also successful and sticky: I've just checked the valence with which the paper is cited in the 10 most recent pa...
I'm quite happy about this post: even while people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.
In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and the result is closer to humans than expected. Our brains are also able to run many different characters, but the default "you" character is somewhat unique, privileged and able to steer the underlying computation.
Similarly the understanding ...
I'm not aware of anything people had written before this post on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AI's behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds have come to be used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but isn't the design used by agents that you run on your own computer, like Claude Code or Gemini CLI or whatever. I think agents will move in the direction that I described, especially as people want to be able to work with more of them, want to give them longer t...
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot
I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the imp...
I'm giving this +1 review point despite not having originally been excited about this in 2024. Last year, I and many others were in a frame where alignment plausibly needed a brilliant idea. But since then, I've realized that execution and iteration on ideas we already have is highly valuable. Just look at how much has been done with probes and steering!
Ideas like this didn't match my mental picture of the "solution to alignment", and I still don't think it's in my top 5 directions, but with how fast AI safety has been growing, we can assign 10 researchers...
I didn't believe the theory of change at the time and still don't. The post doesn't really make a full case for it, and I doubt it really convinced anyone to work on this for the right reasons.
The interpretation of quantum mechanics is a philosophical puzzle that has been baffling physicists and philosophers for about a century. In my view, this confusion is a symptom of us lacking a rigorous theory of epistemology and metaphysics. At the same time, creating such a theory seems to me like a necessary prerequisite for solving the technical AI alignment problem. Therefore, once we had created a candidate theory of metaphysics (Formal Computation Realism (FCR), formerly known as infra-Bayesian Physicalism), the interpretation of quantum mechanics stood out ...
I continue to be excited about this class of approaches. To explain why is roughly to give an argument for why I think self-other overlap is relevant to normative reasoning, so I will sketch that argument here:
This post still stands out to me as making an important and straightforward point about the observer dependence of knowledge that is still, in my view, underappreciated (enough so that I wrote a book about it and related epistemological ideas!). I continue to think this is quite important for understanding AI, and in particular for addressing interpretability concerns as they relate to safety: lacking a general theory of why and how generalization happens, we may risk mistakes in building aligned AIs if they categorize the world in unusual ways that we don't anticipate or understand.
I have referred back to this post a lot since writing it. I still think it's underrated, because without understanding what we mean by "alignment research" it's easy to get all sorts of confused about what the field is trying to do.