Some abstract, non-technical reasons to be non-maximally-pessimistic about AI alignment

Rob Bensinger

I basically agree with Eliezer’s picture of things in the AGI interventions post.

But I’ve seen some readers rounding off Eliezer’s ‘the situation looks very dire’-ish statements to ‘the situation is hopeless’, and ‘solving alignment still looks to me like our best shot at a good future, but so far we’ve made very little progress, we aren’t anywhere near on track to solve the problem, and it isn’t clear what the best path forward is’-ish statements to ‘let’s give up on alignment’.

It’s hard to give a technical argument for ‘alignment isn’t doomed’, because I don’t know how to do alignment (and, to the best of my knowledge in December 2021, no one else does either). But I can give some of the more abstract reasons I think that.

I feel sort of wary of sharing a ‘reasons to be less pessimistic’ list, because it’s blatantly filtered evidence, it makes it easy to overcorrect, etc. In my experience, people tend to be way too eager to classify problems as either ‘easy’ or ‘impossible’; just adding more evidence may cause people to bounce back and forth between the two rather than planting a flag in the middle ground.

I did write a version of 'reasons not to be maximally pessimistic' for a few friends in 2018. I’m warily fine with sharing that below, with the caveats ‘holy shit is this ever filtered evidence!’, ‘these are my own not-MIRI-vetted personal thoughts’, and 'this is a casual thing I jotted down for friends in 2018'.

Today, I would add some points (e.g., 'AGI may be surprisingly far off; timelines are hard to predict'), and I'd remove others (e.g., 'Nate and Eliezer feel pretty good about MIRI's current research'). Also, since the list is both qualitative and one-sided, it doesn’t reflect the fact that I’m quantitatively a bit more pessimistic now than I was in 2018.

Lo:

[...S]ome of the main reasons I'm not extremely pessimistic about artificial general intelligence outcomes.

(Warning: one-sided lists of considerations can obviously be epistemically bad. I mostly mean to correct for the fact that I see a lot of rationalists who strike me as overly pessimistic about AGI outcomes. Also, I don't try to argue for most of these points in any detail; I'm just trying to share my own views for others' stack.)

1. AGI alignment is just a technical problem, and humanity actually has a remarkably good record when it comes to solving technical problems. It's historically common for crazy-seeming goals to fall to engineering ingenuity, even in the face of seemingly insuperable obstacles.

Some of the underlying causes for this are 'it's hard to predict what clever ideas are hiding in the parts of your map that aren't filled in yet', and 'it's hard to prove a universal negation'. A universal negation is what you need in order to say that there's no clever engineering solution; whereas even if you've had ten thousand failed attempts, a single existence proof — a single solution to the problem — renders those failures totally irrelevant.

2. We don't know very much yet about the alignment problem. This isn't a reason for optimism, but it's a reason not to have confident pessimism, because no confident view can be justified by a state of uncertainty. We just have to learn more and do it the hard way and see how things go.

A blank map can feel like 'it's hopeless' for various reasons, even when you don't actually have enough Bayesian evidence to assert a technical problem is hopeless. For example: you think really hard about the problem and can't come up with a solution, which to some extent feels like there just isn't a solution. And: people aren't very good at knowing which parts of their map are blank, so it may feel like there aren't more things to learn even where there are. And: to the extent there are more things to learn, these can represent not only answers to questions you've posed, but answers to questions you never thought to pose; and can represent not only more information relevant to your current angle of attack on the problem, but information that can only be seen as relevant once you've undergone a perspective shift, ditched an implicit assumption, etc. This is to a large extent the normal way intellectual progress has worked historically, but hindsight bias makes this hard to notice and fully appreciate.

Or as Eliezer put it in his critique of Paul Christiano's approach to alignment on LW:

I restate that these objections seem to me to collectively sum up to “This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence”, rather than “Some further insights are required for this to work in practice.” But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

Eliezer is not a modest guy. This is not false humility or politeness. This is a statement about what technical progress looks like when you have to live through it and predict it in the future, as opposed to what it looks like with the benefit of hindsight: it looks like paradigm shifts and things going right in really weird and unexpected ways (that make perfect sense and look perfectly obvious in hindsight). If we want to avoid recapitulating the historical errors of people who thought a thing was impossible (or centuries away, etc.) because they didn't know how to do it yet, then we have to either have a flatter prior about how hard alignment is, or make sure to ground our confidence in very solid inside-view domain knowledge.

3. If you can get a few very specific things right, you can leverage AGI capabilities to bootstrap your way to getting everything else right, including solving various harder forms of the alignment problem. By the very nature of the AGI problem, you don't have to do everything by human ingenuity; you just have to get those few things firmly right. Neglecting this bootstrapping effect makes it easy to overestimate the expected difficulty of the problem.

4. AGI alignment isn't the kind of problem that requires massive coordination or a global mindset shift or anything like that. It's more like the moon landing or the Manhattan Project, in that it's a concrete goal that a specific project at a certain time or place can pull off all on its own, regardless of how silly the rest of the world is acting at the time.

Coordination can obviously make this task a lot easier. In general, the more coordination you have, the easier the technical challenge becomes; and the more technical progress you make, the lower a level of coordination and resource advantage you need. But at its core, the alignment problem is about building a machine with certain properties, and a team can just do that even if the world-at-large that they're operating in is badly broken.

5. Sufficiently well-informed and rational actors have extremely good incentives here. The source of the 'AI developers are racing to the brink' problem is bias and information asymmetry, not any fundamental conflict of interest.

6. Clear and rigorous thinking is helpful for AGI capabilities, and it's also helpful for understanding the nature and severity of AGI risk. This doesn't mean that there's a strong correlation today between the people who are best at capabilities and the people who are thinking most seriously about safety; but it does mean that there's a force pushing in the direction of such a correlation growing stronger over time (e.g., as conversations happen and the smartest people acquire more information, think about things more, and thereby get closer to truth).

7. Major governments aren't currently leaders in AI research, and there are reasons to think this is unlikely to change in the future. (This is positive from my perspective because I think state actors can make a lot of aspects of the problem more difficult and complicated.)

8. Deference to domain experts. Nate, Eliezer, Benya, and other researchers at MIRI think it's doable, and these are some of the folks I think are most reliably correct and well-calibrated about tricky questions like these. They're also the kind of people I think really would drop this line of research if the probability of success seemed too low to them, or if some other approach seemed more promising.

9. This one's hard to communicate, but: some kind of gestalt impression gathered from seeing how MIRI people approach the problem in near mode, and how they break the problem down into concrete smaller subproblems.

I don't think this is a strong reason to expect success, but I do think there's some kind of mindset switch that occurs when you are living and breathing nitty-gritty details related to alignment work, deployment strategy, etc., and when you see various relatively-concrete paths to success discussed in a serious and disciplined way.

I think a big part of what I'm gesturing at here is a more near-mode model of AGI itself: thinking of AGI as software whose properties we determine, where we can do literally anything we want with it (if we can figure out how to represent the thing as lines of code). A lot of people go too far with this and conclude the alignment problem is trivial because it's 'just software'; but I think there's a sane version of this perspective that's helpful for estimating the difficulty of the problem.

10. Talking in broad generalities, MIRI tends to think that you need a relatively principled approach to AGI in order to have a shot at alignment. But drilling down on the concrete details, it's still the case that it can be totally fine in real life to use clever hacks rather than deep principled approaches, as long as the clever hacks work. (Which they sometimes do, even in robust code.)

The key thing from the MIRI perspective isn't 'you never use cheats or work-arounds to make the problem easier on yourself', but rather 'it's not cheats and work-arounds all the way down; the high-level cleverness is grounded in a deep understanding of what the system is fundamentally doing'.

11. Relatedly, I have various causes for optimism that are more specific to MIRI's particular research approach; e.g., thinking it's easier to solve various conceptual problems because of inside-view propositions about the problems.

12. The problems MIRI is working on have been severely neglected by researchers in the past, so it's not like they're the kind of problem humanity has tried its hand at and found to be highly difficult. Some of the problems have accrued a mythology of being formidably difficult or even impossible, in spite of no one having really tried them before.

(A surprisingly large number of the problems MIRI has actually already solved are problems that various researchers in the field have told us are impossible for anyone to solve even in principle, which indicates that a lot of misunderstandings of things like reflective reasoning are really commonplace.)

13. People haven't tried very hard to find non-MIRI-ish approaches that might work.

14. Humanity sometimes builds robust and secure software. If the alignment problem is similar to other cases of robustness, then it's a hard problem, but not so hard that large, highly motivated, and rigorous teams (think NASA) can't solve it.

15. Indeed, there are already dedicated communities specializing in methodologically similar areas like computer security, and if they took some ownership of the alignment problem, things could suddenly start to look a lot sunnier.

16. More generally, there are various non-AI communities who make me more optimistic than AI researchers do on various dimensions, and to the extent I'm uncertain about the role those communities will play in AGI in the future, I'm more uncertain about AGI outcomes.

17. [redacted]

18. [redacted]
