I wouldn’t bet on any current alignment proposal. Yet I think that the field is making progress and abounds with interesting opportunities to do even more, giving us a shot. Isn’t there a contradiction?

No, because research progress so rarely looks like having a clearly correct insight that clarifies everything; instead it often looks like building on apparently unpromising ideas, or studying the structure of the problem. Copernican heliocentrism didn’t initially predict observations as well as Ptolemaic astronomy; both ionic theory and the determination of basic molecular formula came from combining multiple approaches in chemistry, each getting some bits but not capturing the whole picture; Computer Science emerged from the arid debate over the foundations of mathematics; and Computational Complexity Theory has made more progress by looking at why some of its problems are hard than by waiting for clean solutions.

In the end you do want to solve the problem, obviously. But the road from here to there goes through many seemingly weird and insufficient ideas that are corrected, adapted, refined, often discarded except for a small bit. Alignment is no different, including “strong” alignment.

Research advances through productive mistakes, not perfect answers.

I’m taking this terminology from Goro Shimura’s characterization of his friend Yutaka Taniyama, with whom he formulated the Taniyama-Shimura Conjecture that Andrew Wiles proved in order to prove Fermat’s Last Theorem.

(Yutaka Taniyama and his time. Very personal recollections, Goro Shimura, 1989)

Though he was by no means a sloppy type, he was gifted with the special capability of making many mistakes, mostly in the right direction. I envied him for this, and tried in vain to imitate him, but found it quite difficult to make good mistakes.

So much of scientific progress takes the form of many people proposing different ideas that end up being partially right, where we can look back later and be like “damn, that was capturing a chunk of the solution.” It’s very rare that people arrive at the solution of any big scientific problem in one nice sweep of a clearly adequate idea. Even when it looks like it (Einstein is an example people like to bring up), they so often build on many of the weird and contradictory takes that came before, as well as the understanding of how the problem works at all (in Einstein’s case, this includes the many, many unconvincing attempts to unify mechanics and electromagnetism, the shape of Maxwell’s equations, the ether drag hypothesis, and Galileo’s relativity principle; he also made a lot of productive mistakes of his own).

Paul Graham actually says the same thing about startups that end up becoming great successes.

(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)

One of the most valuable exercises you can try if you want to understand startups is to look at the most successful companies and explain why they were not as lame as they seemed when they first launched. Because they practically all seemed lame at first. Not just small, lame. Not just the first step up a big mountain. More like the first step into a swamp.

Graham proposes a change of polarity in considering lame ideas: instead of looking for flaws, he encourages us to steelman not the idea itself, but how it could lead to greatness.

(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)

Most people's first impulse when they hear about a lame-sounding new startup idea is to make fun of it. Even a lot of people who should know better.

When I encounter a startup with a lame-sounding idea, I ask "What Microsoft is this the Altair Basic of?" Now it's a puzzle, and the burden is on me to solve it. Sometimes I can't think of an answer, especially when the idea is a made-up one. But it's remarkable how often there does turn out to be an answer. Often it's one the founders themselves hadn't seen yet.

It’s this mindset that makes me excited about ongoing conceptual alignment research.

I look at ARC’s ELK, and I have disagreements about the constraints, and the way of stating the problem, and about each proposed solution; but I also see how much productive discussion ELK has generated by pushing people to either solve it or articulate why it’s impossible or why it falls short of capturing the key problems that we want to solve.

I look at Steve’s Brain-like AGI Alignment work, and I’m not convinced that we will build brain-like AGI before ML-based AGI or automated economies; but I also see that Steve has been pushing the thinking around value learning and its subtleties, and has found a robust way of transferring results and models from neuroscience to alignment.

I look at John’s Natural Abstraction work, and I’m still unsure whether the natural abstraction hypothesis is correct, and whether it might lead at all to tractable extraction/analysis of the abstractions used in prediction; but I also see how it reframes the thinking and ideas around fragility of value, and provides ideas for forcing an ontological lock (if the natural abstraction hypothesis doesn’t hold by default).

I look at Evan’s training stories, and I’m unclear whether this is the right frame to argue for alignment guarantees, and whether it has blind spots; but I also see how it clarifies misunderstandings around inner alignment, and provides the first step toward a common language to discuss failure modes in prosaic alignment.

I look at Alex’s power-seeking theorems, and I wonder whether they’re missing a crucial component about how power is spent, and whether the set of permutations considered fits with how goals are selected in real life; but I also realize that the formalization made these subtleties of instrumental convergence more salient, and provided some intuitions about ways of sampling goals that might reduce power-seeking incentives.

I look at Vanessa’s Infra-bayesianism work, and I worry that it’s not tackling the crucial part of inferring and capturing human values, as well as going for too much generality at the cost of shareability; but I also see that it looks particularly good for tackling questions of realizability and breaking self-reference, while yielding powerful enough math that I expect progress on the agenda.

I look at Critch’s RAAPs work, and I don’t know if competitive pressure is a strong enough mechanism to cause that kind of problem, nor am I so sure that the agentic and structural effects can be disentangled; but I also appreciate the attempt to bring more structural-type thinking into alignment, and how this addresses a historical gap in how to think about AI risk and alignment strategies.

And so on for many other works on the AF.[1]

It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…

None of these approaches looks good enough on its own, and I expect many to shift, get redirected, or even be abandoned in favor of a new version. I also expect to criticize their development and disagree with the researchers involved. Yet I still see benefits and insights they might deliver, and want more work to be put into them for that reason.

But isn’t all that avoiding the real problem of finding a solution to the alignment problem right now? No, because they each give us better tools and ideas and handles for tackling the problem, and all our current proposals don’t work.

That doesn’t look fast, you might answer. And I agree that fundamental science and solving new and complex problems have historically taken way too long for the short timelines we seem to be on. But that’s not a reason to refuse to do the necessary work, or to despair; it’s a reason to find ways of accelerating science! For example, by looking at what historically hampered progress, and removing it as much as possible. Or by looking at how hidden bits of evidence were revealed, and leveraging that to explore the space of ideas and approaches faster.

Okay, but shouldn’t we focus all our efforts on finding smarter and smarter people to work on this problem instead of pushing for the small progress we’re able to make now? I think this misses the point: we don’t want smartness, we want the ability to reveal hidden bits of evidence. That’s certainly correlated with smartness, but with one big difference: there are often diminishing returns to the bits of evidence you can get from one angle, and that leads to wanting a more diverse portfolio of researchers who are good at harnessing and revealing different streams of evidence. That’s one thing which the common question “Which alignment researcher would you want to have 10 copies of?” misses: we want variety, because no one is that good at revealing bits from all relevant streams of evidence.

To go back to the Einstein example, he was clearly less of a math genius than most of his predecessors who attempted to unify mechanics and electromagnetism, like Poincaré. But that didn’t matter, because what Einstein had was a knack for revealing the hidden bits of evidence in what we already knew about physics and the shape of our laws of physics. And after he did that, many mathematicians and physicists with better math chops pushed his theory and ideas and revealed incredibly rich models and insights and predictions.

How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.

So if you’re worried about AI risk, and want to know if there’s anything that can be done, the answer is a resounding yes. There are so many ways of improving our understanding and thus our chances: participating in current research programs and agendas, coming up with new weird takes and approaches, exploring the mechanism, history, and philosophy of science to accelerate the process as much as we can…[2]

I don’t know if we’ll make it in time. 5 to 15 years[3] is a tight deadline indeed, and the strong alignment problem is incredibly complex and daunting. But I know this: if we solve the problem and get out of this alive, this will not be by waiting for an obviously convincing approach; it will come instead from making as many productive mistakes as we can, and learning from them as fast as we can.

  1. I’m not discussing applied alignment research here, like the work of Redwood, but I also find this part crucial and productive. It’s just that such work is less about “formulating a solution” and more about “exploring the models and the problems experimentally”, which fits well with the model I’m drawing here.

  2. I’m currently finishing a sequence arguing for more pluralism in alignment and providing an abstraction of the alignment problem that I find particularly good for generating new approaches and understanding how all the different takes and perspectives relate.

  3. The range where many short timelines put the bulk of their probability mass.

3 comments

Mostly I'd agree with this, but I think there needs to be a bit of caution and balance around:

How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.

Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.

However, there's some risk in Do(increase variety). The ideal is that we get many researchers thinking about the problem in a principled way, and variety happens. If we intentionally push too much for variety, we may end up with a lot of wacky approaches that abandon too much principled thinking too early. (I think I've been guilty of this at times.)

That said, I fully agree with the goal of finding a variety of approaches. It's just rather less clear to me how much an individual researcher should be thinking in terms of boosting variety. (it's very clear that there should be spaces that provide support for finding different approaches, so I'm entirely behind that; currently it's much more straightforward to work on existing ideas than to work on genuinely new ones)

Certainly many great ideas initially looked like garbage - but I'll wager a lot of garbage initially looked like garbage too. I'd be interested in knowing more about the hidden-greatness-garbage: did it tend to have any common recognisable qualities at the time? Did it tend to emerge from processes with common recognisable qualities? In environments with shared qualities?...

It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…

I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your idea that we shouldn’t expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk” and maybe what’s mentioned in the quote as well.

For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important.

My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.

Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target. The thing I'm unclear about is something like:

What target(s) do I aim for if I want to generate the set of ideas with greatest value?

I don't think that "Aim for full alignment solution" is the right target here.
I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this.
(we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots)

But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).

Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved] may not be relevant.

What we need is: [initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved]
It's not good enough if we find ideas that are useful for something; they need to be useful for this.

I expect the kinds of processes that work well to look different from those used where there's no fixed problem.