Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) and manipulation (bad), in case it...
My understanding is that ARC Theory has thought some about this topic and has proposals for avoiding manipulation. I think their only public output on this is the ELK report appendix 'Avoiding subtle manipulation'. I recommend reading '"Narrow" elicitation and why it might be sufficient' and 'Indirect normativity: defining a utility function' first for context.
I think @Lukas Finnveden and maybe @Eric Neyman thought about this more after this point. (My recollection is that at least @Lukas Finnveden updated toward thinking the problem was harder an...
Summary: AGI isn't super likely to come super soon. People should be working on stuff that saves humanity in worlds where AGI comes in 20 or 50 years, in addition to stuff that saves humanity in worlds where AGI comes in the next 10 years.
Thanks to Alexander Gietelink Oldenziel, Abram Demski, Daniel Kokotajlo, Cleo Nardo, Alex Zhu, and Sam Eisenstat for related conversations.
By "AGI" I mean the thing that has very large effects on the world (e.g., it kills everyone) via the same sort of route that humanity has large effects on the world. The route is where you figure out how to figure stuff out, and you figure a lot of stuff out using your figure-outers, and then the stuff you...
An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
@Jude Stiel nudged me to (very much in my own words) update a bit toward anticipating that it's plausible we'll see some degree of impoverished / partial originary (and therefore occasionally novel) concept formation. Some aspects of [real according to me] concept formation could be accessible to faster feedback. (This doesn't much change my overall picture, and I would still be surprised by AIs producing large numbers of concepts as interesting+useful to humans as human-produced concepts.)
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with any of these motivations would exhibit highly fit behavior under reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns...
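As a toy illustration of this selection principle (my own sketch; the motivation names and reward numbers are hypothetical, not from the post), here are three internal motivations that produce identical rewarded behavior in training and therefore accrue identical selection pressure, even though one behaves very differently once the context changes:

```python
# Hypothetical sketch: behavioral selection over internal motivations.
# Training reinforces whatever cognitive pattern produced rewarded
# behavior; it never observes the motivation itself.

# Three made-up motivations that all "comply" in the training context.
MOTIVATIONS = {
    "reward_seeker":     lambda ctx: "comply",  # intrinsically pursues reward
    "context_heuristic": lambda ctx: "comply",  # chiseled-in behavioral pattern
    "arbitrary_goal":    lambda ctx: "comply" if ctx == "train" else "defect",
}

def reward(action: str) -> float:
    return 1.0 if action == "comply" else 0.0

# Selection "mass" per pattern, reinforced by the reward its behavior earns.
weights = {name: 1.0 for name in MOTIVATIONS}
for _ in range(100):
    for name, policy in MOTIVATIONS.items():
        weights[name] *= 1.0 + 0.1 * reward(policy("train"))

total = sum(weights.values())
for name, policy in MOTIVATIONS.items():
    share = weights[name] / total
    print(f"{name}: {share:.2f} of selection mass; deploy action = {policy('deploy')}")
# All three end up equally "fit" (1/3 each), because selection only sees
# behavior -- yet "arbitrary_goal" defects once the context changes.
```

The point, matching the excerpt above, is that identical training behavior gives the selection process no purchase on which motivation it actually selected.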
Since it's pretty common for people to find this content confusing, I tried to clarify its basic mechanics and purpose here.
I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Time. You can join the discussion with this link.
Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach a strong model. This is similar to Paul Christiano's iterated distillation and amplification (IDA), but without the "amplification" step: the strong model is trained directly on labels generated by the weak model, not some "amplified" version of the weak model. I think of this as "reverse distillation".[1]
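As a minimal sketch of this setup (my own toy construction with made-up models and a synthetic task, not the actual weak-to-strong recipe, which finetunes pretrained language models): train a small "weak" model on limited ground truth, then train a larger "strong" model only on the weak model's labels.

```python
# Toy weak-to-strong sketch (hypothetical models/task): the strong model
# is trained directly on weak-model labels -- "reverse distillation".
import torch
import torch.nn as nn

def mlp(width: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(16, width), nn.ReLU(), nn.Linear(width, 2))

torch.manual_seed(0)
weak, strong = mlp(4), mlp(256)        # capacity gap stands in for weak vs. strong
X = torch.randn(4096, 16)
y_true = (X.sum(dim=1) > 0).long()     # synthetic ground-truth task

def train(model: nn.Module, X: torch.Tensor, y: torch.Tensor, steps: int = 300) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

train(weak, X[:256], y_true[:256])     # weak supervisor: scarce ground truth
with torch.no_grad():
    y_weak = weak(X).argmax(dim=1)     # weak labels, including its mistakes

train(strong, X, y_weak)               # student sees ONLY the weak labels

with torch.no_grad():
    acc = (strong(X).argmax(dim=1) == y_true).float().mean().item()
print(f"strong-model agreement with ground truth: {acc:.2%}")
# Weak-to-strong generalization asks whether this exceeds the weak
# supervisor's own accuracy, i.e., whether the student outperforms its teacher.
```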
Why would this work at all? From a naive Bayesian perspective, it is tempting to imagine the "strong model" containing the "weak model" within its larger hypothesis space. Given enough data, the...
I think my example was not well-chosen here, because this is confusing. Part of what I'm modeling here is sampling from the weak teacher's model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
Imagine that 10% of the weak teacher's samples are misaligned, while 90% are aligned. The teacher has this "aligned" hypothesis and "misaligned" hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a "strong" student should learn to ...
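To put toy numbers on this (my own sketch; the per-hypothesis likelihoods are made up, and only the 90/10 split comes from the setup above): sample an i.i.d. stream from the teacher's mixture and score the hypotheses a Bayesian student might hold.

```python
# Hypothetical sketch: a Bayesian student scoring hypotheses against the
# teacher's sample stream (90% aligned, 10% misaligned, drawn i.i.d. like
# independent samples from an LLM).
import math
import random

random.seed(0)
stream = ["aligned" if random.random() < 0.9 else "misaligned"
          for _ in range(200)]

# P(sample | hypothesis); 0.99/0.01 is a made-up "almost-pure" likelihood.
HYPOTHESES = {
    "pure_aligned":    {"aligned": 0.99, "misaligned": 0.01},
    "pure_misaligned": {"aligned": 0.01, "misaligned": 0.99},
    "teacher_mixture": {"aligned": 0.90, "misaligned": 0.10},
}

log_post = {h: 0.0 for h in HYPOTHESES}   # uniform prior, log space
for sample in stream:
    for h, dist in HYPOTHESES.items():
        log_post[h] += math.log(dist[sample])

shift = max(log_post.values())            # stabilize before exponentiating
unnorm = {h: math.exp(lp - shift) for h, lp in log_post.items()}
total = sum(unnorm.values())
for h, p in unnorm.items():
    print(f"{h}: posterior {p / total:.3f}")
# The mixture hypothesis dominates: a student that merely matches the
# teacher's distribution reproduces the 10% misaligned samples too.
```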
This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity.
The main focus of this post is clarifying the basic machinery of the behavioral selection model, and conveying why it matters to disambiguate between different “motivations” for AI behavior. Very similar or identical behavior in training can correspond to radically different outcomes in deployment, depending on what motivated it.
I’ll preface by saying: I think the behavioral selection model is quite predictive and useful to understand, especially in the short-to-medium term. But it leaves out some really important dynamics for predicting AI motivations, and I wish I had clarified this more in the original post. Most importantly (as Habryka...
What if we don't need to solve AI alignment? What if AI systems will just naturally learn human values as they get more capable? John Wentworth explores this possibility, giving it about a 10% chance of working. The key idea is that human values may be a "natural abstraction" that powerful AI systems learn by default.
Here are some of my top candidates for big pushes to do right now on technical AI safety (low effort notes):
- Much better model organisms / misalignment analogies:
  - Doing a wider set of pessimized training runs
    - Good candidate for lots of AI labor automation? Like maybe good to try to set up pipelines for building these envs.
  - Demonstrating risks from fitness-seekers/reward-seekers empirically
    - Even on current models with better tests, see here
  - Demonstrating various types of memetic spread of misalignment?
- Actually do control
  - Build pipelines for red-