I missed the crux of the alignment problem the whole time

This post was written for the first Refine blog post day, at the end of a week of readings, discussions, and exercises about epistemology for doing good conceptual research. Thanks to Adam Shimi for helpful discussion and comments.

I first got properly exposed to AI alignment ~1-2 years ago. I read the usual stuff like Superintelligence, The Alignment Problem, Human Compatible, a bunch of posts on LessWrong and Alignment Forum, watched all of Rob Miles’ videos, and participated in the AGI Safety Fundamentals program. I recently joined Refine and had more conversations with people, and realized I didn’t really get the crux of the problem all this while.

I knew that superintelligent AI would be very powerful and would Goodhart whatever goals we give it, but I never really got how this relates to basically ‘killing us all’. It feels basically right that AIs will be misaligned by default and will do things we do not want them to do while pursuing instrumentally convergent goals all along. But the possible actions that such an AI could take seemed so numerous that ‘killing all of humanity’ seemed like a tiny point in the AI’s whole action space, so that it would require extreme bad luck for us to end up in that situation.

First, this seems partially due to my background as a non-software engineer in oil and gas, an industry that takes safety very, very seriously. In making a process safe, we quantify the risks of an activity, understand the bounds of the potential failure modes, and then take actions to mitigate those risks and implement steps to minimize damage should a failure mode be realized. I think about safety from the perspective of specific risk events and their associated probabilities, coupled with the exact failure modes of those risks. This thinking may have hindered my ability to think of the alignment problem in abstract terms, because I focused on looking for specific failure modes that I could picture in my head.

Second, there are a few failure modes that seem more popular in the introductory reading materials that I was exposed to. None of them helped me internalize the crux of the problem.

The first was the typical paperclip maximizer or ‘superintelligent AI will kill all of us’ scenario. It feels like sci-fi that is not grounded in reality, which led me to fail to internalize the point about unboundedness. I do not dispute that a superintelligent AI will have the capabilities to destroy all of humanity, but it doesn’t feel like it would actually do so.
The other failure modes were from Paul Christiano’s post, which on my first reading boiled down to ‘powerful AIs will accelerate present-day societal failures but not pose any additional danger’, and Andrew Critch’s post, which felt to me like ‘institutions have structurally perverse incentives that lead to the tragedy of the commons’. In my shallow understanding of both posts, current human societies have failure modes that AIs will accelerate, because AIs basically speed things up, whether good or bad. So these scenarios were too close to normal scenarios to let me internalize the crux about unboundedness.

My internal model of a superintelligent AI was a very powerful tool AI. I didn’t really get why we are trying to ‘align it to human values’ because I didn’t really see human values as the crux of the problem, nor did I think having a superintelligent AI fully aligned to a human’s values would be particularly useful. Which human’s values are we talking about anyway? Would it be any good for an AI to fully adopt human values only to end up like Hitler, who is no less a human than any of us are? The phrase ‘power corrupts, absolute power corrupts absolutely’ didn’t help much either, as it made me feel like the problem is with power instead of values. Nothing seemed particularly relevant unless we solved philosophy.

Talking to more people made me start thinking of superintelligent AIs in a more agentic way. It actually helped that I started to anthropomorphize AI, by visualizing it as a ‘person’ going about doing things that maximize its utility function, while possessing immense power that makes it capable of doing practically everything. This powerful agent is going about doing things without the slightest ‘understanding’ of what a ‘human person’ is, but behaves as if it knows what a ‘human person’ is because it was trained to identify these humans and exhibit a certain behavior during training. And one day after deployment, it realizes that these ‘human persons’ are starting to get in the way of its goals, and it promptly gets all humans out of its way by destroying the whole of humanity, just as it has destroyed everything else that got in the way of its goals.

I know the general advice against anthropomorphizing AIs because they will be fundamentally different from humans, and they are not ‘evil’ in the sense that they are ‘trying’ to destroy us all (the AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else). But I needed to look at AIs in a more anthropomorphized form to actually get that an AI will ‘want’ things and ‘try’ very hard to do the things that it ‘wants’.

Now, the tricky bit is different. It is about how to make this agentic AI have a similar understanding of our world, to have a similar notion of what humans are, to ‘understand’ that humans are sentient beings that have real thoughts and emotions instead of objects that satisfy certain criteria as shown in the training data. And hopefully, when a superintelligent being has abstractions and values similar to ours, it will actually care to not destroy us all.

I’m still highly uncertain that I now get the crux of the alignment problem, but hopefully this is a step in the right direction.
