I wouldn’t bet on any current alignment proposal. Yet I think that the field is making progress and abounds with interesting opportunities to do even more, giving us a shot. Isn’t there a contradiction?
No, because research progress so rarely looks like having a clearly correct insight that clarifies everything; instead it often looks like building on apparently unpromising ideas, or studying the structure of the problem. Copernican heliocentrism didn’t initially predict observations as well as Ptolemaic astronomy; both ionic theory and the determination of basic molecular formulas came from combining multiple approaches in chemistry, each getting some bits but not capturing the whole picture; Computer Science emerged from the arid debate over the foundations of mathematics; and Computational Complexity Theory has made more progress by looking at why some of its problems are hard than by waiting for clean solutions.
In the end you do want to solve the problem, obviously. But the road from here to there goes through many seemingly weird and insufficient ideas that are corrected, adapted, refined, often discarded except for a small bit. Alignment is no different, including “strong” alignment.
Research advances through productive mistakes, not perfect answers.
I’m taking this terminology from Goro Shimura’s characterization of his friend Yutaka Taniyama, with whom he formulated the Taniyama-Shimura Conjecture, a special case of which Andrew Wiles proved in order to prove Fermat’s Last Theorem.
(Yutaka Taniyama and his time. Very personal recollections, Goro Shimura, 1989)
Though he was by no means a sloppy type, he was gifted with the special capability of making many mistakes, mostly in the right direction. I envied him for this, and tried in vain to imitate him, but found it quite difficult to make good mistakes.
So much of scientific progress takes the form of many people proposing different ideas that end up being partially right, where we can look back later and be like “damn, that was capturing a chunk of the solution.” It’s very rare that people arrive at the solution of any big scientific problem in one nice sweep of a clearly adequate idea. Even when it looks like it (Einstein is an example people like to bring up), they so often build on many of the weird and contradictory takes that came before, as well as the understanding of how the problem works at all (in Einstein’s case, this includes the many, many unconvincing attempts to unify mechanics and electromagnetism, the shape of Maxwell’s equations, the ether drag hypothesis, and Galileo’s relativity principle; he also made a lot of productive mistakes of his own).
Paul Graham actually says the same thing about startups that end up becoming great successes.
(What Microsoft Is This The Altair Basic Of?, Paul Graham, 2015)
One of the most valuable exercises you can try if you want to understand startups is to look at the most successful companies and explain why they were not as lame as they seemed when they first launched. Because they practically all seemed lame at first. Not just small, lame. Not just the first step up a big mountain. More like the first step into a swamp.
Graham proposes a change of polarity in considering lame ideas: instead of looking for flaws, he encourages us to steelman not the idea itself, but how it could lead to greatness.
(What Microsoft Is This The Altair Basic Of?, Paul Graham, 2015)
Most people's first impulse when they hear about a lame-sounding new startup idea is to make fun of it. Even a lot of people who should know better.
When I encounter a startup with a lame-sounding idea, I ask "What Microsoft is this the Altair Basic of?" Now it's a puzzle, and the burden is on me to solve it. Sometimes I can't think of an answer, especially when the idea is a made-up one. But it's remarkable how often there does turn out to be an answer. Often it's one the founders themselves hadn't seen yet.
It’s this mindset that makes me excited about ongoing conceptual alignment research.
I look at ARC’s ELK, and I have disagreement about the constraints, and the way of stating the problem, and about each proposed solution; but I also see how much productive discussion ELK has generated by pushing people to either solve it or articulate why it’s impossible or why it falls short of capturing the key problems that we want to solve.
I look at Steve’s Brain-like AGI Alignment work, and I’m not convinced that we will build brain-like AGI before ML-based AGI or automated economies; but I also see that Steve has been pushing the thinking around value learning and its subtleties, and has found a robust way of transferring results and models from neuroscience to alignment.
I look at John’s Natural Abstraction work, and I’m still unsure whether the natural abstraction hypothesis is correct, and whether it might lead at all to tractable extraction/analysis of the abstractions used in prediction; but I also see how it reframes the thinking and ideas around fragility of value, and provides ideas for forcing an ontological lock (if the natural abstraction hypothesis doesn’t hold by default).
I look at Evan’s training stories, and I’m unclear whether this is the right frame to argue for alignment guarantees, and whether it has blind spots; but I also see how it clarifies misunderstandings around inner alignment, and provides the first step toward a common language to discuss failure modes in prosaic alignment.
I look at Alex’s power-seeking theorems, and I wonder if they’re not missing a crucial component about how power is spent, and if the set of permutations considered fits with how goals are selected in real life; but I also realize that the formalization made these subtleties of instrumental convergence more salient, and provided some intuitions about ways of sampling goals that might reduce power-seeking incentives.
I look at Vanessa’s Infra-bayesianism work, and I worry that it’s not tackling the crucial part of inferring and capturing human values, as well as going for too much generality at the cost of shareability; but I also see that it looks particularly good for tackling questions of realizability and breaking self-reference, while yielding powerful enough math that I expect progress on the agenda.
I look at Critch’s RAAPs work, and I don’t know if competitive pressure is a strong enough mechanism to cause that kind of problem, nor am I so sure that the agentic and structural effects can be disentangled; but I also appreciate the attempt to bring more structural-type thinking into alignment, and how this addresses a historical gap in how to think about AI risk and alignment strategies.
And so on for many other works on the AF.
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
None of these approaches looks good enough on its own, and I expect many to shift, get redirected, or even be abandoned to iterate on a new version. I also expect to criticize their development and disagree with the researchers involved. Yet I still see benefits and insights they might deliver, and want more work to be put into them for that reason.
But isn’t all that avoiding the real problem of finding a solution to the alignment problem right now? No, because they each give us better tools and ideas and handles for tackling the problem, and none of our current proposals work.
That doesn’t look fast, you might answer. And I agree that fundamental science and solving new and complex problems have historically taken way too long for the short timelines we seem to be on. But that’s not a reason to refuse to do the necessary work, or despair; it’s a reason to find ways of accelerating science! For example, looking at what historically hampered progress, and removing it as much as possible. Or looking at how hidden bits of evidence were revealed, and leveraging that to explore the space of ideas and approaches faster.
Okay, but shouldn’t we focus all our efforts on finding smarter and smarter people to work on this problem instead of pushing for the small progress we’re able to make now? I think this misses the point: we don’t want smartness, we want the ability to reveal hidden bits of evidence. That’s certainly correlated with smartness, but with one big difference: there are often diminishing returns to the bits of evidence you can get from one angle, and that leads to wanting a more diverse portfolio of researchers who are good at harnessing and revealing different streams of evidence. That’s one thing which the common question “Which alignment researcher would you want to have 10 copies of?” misses: we want variety, because no one is that good at revealing bits from all relevant streams of evidence.
To go back to the Einstein example, he was clearly less of a math genius than most of his predecessors who attempted to unify mechanics and electromagnetism, like Poincaré. But that didn’t matter, because what Einstein had was a knack for revealing the hidden bits of evidence in what we already knew about physics and the shape of our laws of physics. And after he did that, many mathematicians and physicists with better math chops pushed his theory and ideas and revealed incredibly rich models and insights and predictions.
How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.
So if you’re worried about AI risk, and want to know if there’s anything that can be done, the answer is a resounding yes. There are so many ways of improving our understanding and thus our chances: participating in current research programs and agendas, coming up with new weird takes and approaches, exploring the mechanism, history, and philosophy of science to accelerate the process as much as we can…
I don’t know if we’ll make it in time. 5 to 15 years is a tight deadline indeed, and the strong alignment problem is incredibly complex and daunting. But I know this: if we solve the problem and get out of this alive, this will not be by waiting for an obviously convincing approach; it will come instead from making as many productive mistakes as we can, and learning from them as fast as we can.
I’m not discussing applied alignment research here, like the work of Redwood, but I also find this part crucial and productive. It’s just that such work is less about “formulating a solution” and more about “exploring the models and the problems experimentally”, which fits well with the model I’m drawing here.
I’m currently finishing a sequence arguing for more pluralism in alignment and providing an abstraction of the alignment problem that I find particularly good for generating new approaches and understanding how all the different takes and perspectives relate.
The range where many short timelines put the bulk of their probability mass.