I'm leaving the Future of Humanity Institute, the best and most impactful job I've ever had, to co-found Aligned AI. For the first time in my research career, I feel the problem of AI alignment is solvable.


Alignment research: a history of partial failures

The history of AI safety is littered with failures and partial successes. The most common examples of failure are ideas that work well in typical situations, but which fail in extreme situations - and a superintelligent AI is perfectly capable of creating such situations.

  • Low-impact AIs were supposed to allow smart machines that interacted with humans without causing huge disruptions. They had some success at 'almost no impact'. But everyone - including me - failed at developing algorithms that were reliably low-impact. If the AI is allowed even a little bit of impact, it can make these low-impact restrictions irrelevant.
  • Corrigibility and interruptibility were designed to allow AIs to be reprogrammed even when active and powerful. They have good narrow uses, but aren't a general solution: though the AI would not interfere with the interruptibility process, it also has no incentive to preserve it or to ensure its subagents are also interruptible.
  • Oracles, question answering AIs (and their close relatives, tool AIs) are perennial suggestions, the idea being to limit the power of the AI by limiting it to answering questions or giving suggestions. But that fails, for instance when the AI is incentivised to manipulate humans through the contents of its answers or suggestions.
  • There were some interesting examples on limiting AI power, but these were ultimately vulnerable to the AI creating subagents.
  • The different forms of value learning confronted a surprising obstacle: values could not be learnt without making strong assumptions about human rationality, and human rationality could not be learnt without making strong assumptions about human values.

A litany of partial failures suggests that the next approach tried will be a failure as well - unless we can identify why the approaches above failed. Is there a common failure mode for all of them?

The common thread: lack of value extrapolations

It is easy to point at current examples of agents with low (or high) impact, at safe (or dangerous) suggestions, at low (or high) powered behaviours. So we have in a sense the 'training sets' for defining low-impact/Oracles/low-powered AIs.

It's extending these examples to the general situation that fails: definitions which cleanly divide the training set (whether produced by algorithms or humans) fail to extend to the general situation. Call this the 'value extrapolation problem'[1], with 'value' interpreted broadly as a categorisation of situations into desirable and undesirable.
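A toy illustration of this failure, as a minimal Python sketch. The features (`impact_size`, `reversible`), the data, and both rules are hypothetical, chosen only to show two definitions that divide the same training set perfectly and still disagree on a situation outside it.

```python
# Hypothetical toy data: each situation is (impact_size, reversible),
# labelled True if it is considered acceptable.
training_set = [
    ((1, True), True),
    ((2, True), True),
    ((8, False), False),
    ((9, False), False),
]

def rule_small_impact(situation):
    """Candidate definition 1: acceptable means 'small impact'."""
    impact_size, _ = situation
    return impact_size < 5

def rule_reversible(situation):
    """Candidate definition 2: acceptable means 'reversible'."""
    _, reversible = situation
    return reversible

def fits_training_set(rule):
    """True if the rule reproduces every label in the training set."""
    return all(rule(situation) == label for situation, label in training_set)

# A situation unlike anything in the training set: small but irreversible.
novel_situation = (1, False)

both_fit = fits_training_set(rule_small_impact) and fits_training_set(rule_reversible)
disagree = rule_small_impact(novel_situation) != rule_reversible(novel_situation)

print("both rules fit the training set:", both_fit)
print("rules disagree on the novel situation:", disagree)
```

Nothing in the training set says which rule is the "right" extension; that choice is what has to be extrapolated.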

Humans turn out to face similar problems. We have broadly defined preferences in familiar situations we have encountered in the world or in fiction. Yet, when confronted with situations far from these, we have to stop and figure out how our values might possibly extend[2]. Since these human values aren't - yet - defined, we can't directly input them into an algorithm, so AIs that can't solve value extrapolation can't be aligned with human values.

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data. Combined with well grounded basic human values, it will allow the algorithm to extrapolate as well as humans can - better, in fact, using its superhuman abilities.

If that's successful, AIs that value extrapolate and that start aligned will remain aligned even as they dramatically change the world and confront the unexpected, re-assessing their reward functions when their world-models change.


We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it. So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI.

To do this, we'll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won't immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.

We will get this service into widespread use (a process that may take some time), and gradually upgrade it to a full alignment process. That will involve drawing on our research and that of others - we will remain strongly engaged with other research groups, providing tools that they can use and incorporating their own results into our service.

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions. Contact us in the comments of this post or from our website.

Thanks to LessWrong

I want to thank LessWrong, as a collective entity, for getting us to the point where such a plan seems doable. We'll be posting a lot here, putting out ideas, asking for feedback - if you can continue giving the same quality of response that you always have (and checking that we ourselves haven't gone misaligned!), that's all we can ask from you :-)

  1. Formerly called the 'model splintering' problem. ↩︎

  2. Humans have demonstrated a skill with value extrapolation, during their childhoods and adolescences, when encountering new stories and thought-experiments, and when their situation changes dramatically. Though human value extrapolation can be contingent, it rarely falls into the extreme failure modes of AIs. ↩︎



25 comments

Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I know how to build aligned AI for real this time guys and the answer is [a thing folks have been disagreeing about whether or not it works for years]" makes me -1 unless there's some explanation of how it's different this time.

Sorry if this is a bit harsh. I don't want to be too down on this project, but I feel like a core chunk of the post is that there's some exciting development that leads Stuart to think something new is possible but then doesn't really tell us what that something new is, and I feel that by the standards of LW/AF that's good reason to complain and ask for more info.

Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.

Hmm, the only overlap I can see between your recent work and this description (including optimism about very-near-term applications) is the idea of training an ensemble of models on the same data, and then if the models disagree with each other on a new sample, then we're probably out of distribution (kinda like the Yarin Gal dropout ensemble thing and much related work).

And if we discover that we are in fact out of distribution, then … I don't know. Ask a human for help?

If that guess is at all on the right track (very big "if"!), I endorse it as a promising approach well worth fleshing out further (and I myself put a lot of hope on things in that vein working out). I do, however, think there are AGI-specific issues to think through, and I'm slightly worried that y'all will get distracted by the immediate deployment issues and not make as much progress on AGI-specific stuff. But I'm inclined to trust your judgment :)
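To make the guess in this comment concrete (it is the commenter's guess, not a description of Aligned AI's actual method), here is a minimal Python sketch of ensemble disagreement used as an out-of-distribution signal. Everything in it is an assumption for illustration: the toy data, the bootstrapped linear models, the `threshold` value, and returning `None` to mean "ask a human".

```python
import random
import statistics

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def train_ensemble(data, n_models=20, seed=0):
    """Fit each model on a bootstrap resample of the same data."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(data, k=len(data))) for _ in range(n_models)]

def disagreement(ensemble, x):
    """Spread (population std dev) of the ensemble's predictions at x."""
    return statistics.pstdev(slope * x + intercept for slope, intercept in ensemble)

def predict_or_defer(ensemble, x, threshold):
    """Return the mean prediction, or None (defer to a human) if models disagree."""
    if disagreement(ensemble, x) > threshold:
        return None
    return statistics.fmean(slope * x + intercept for slope, intercept in ensemble)

# Toy training data: x drawn from [0, 1], y = 2x plus a little noise.
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(50)]
data = [(x, 2 * x + data_rng.gauss(0, 0.1)) for x in xs]

ensemble = train_ensemble(data)
in_distribution = predict_or_defer(ensemble, 0.5, threshold=0.2)
far_out = predict_or_defer(ensemble, 100.0, threshold=0.2)

print("x=0.5   ->", in_distribution)
print("x=100.0 ->", far_out)
```

Inside the training range the bootstrapped models nearly coincide, so a prediction is returned; far outside it, small differences in slope are amplified and the sketch defers. What to do after deferring is exactly the open question raised above.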

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data.

I am missing something... The idea of correctly extrapolating human values is basically the definition of Eliezer's original proposal, CEV. In fact, it's right there in the name. What is the progress over the last decade?

CEV is based on extrapolating the person; the values are what the person would have had, had they been smarter, known more, had more self-control, etc... Once you have defined the idealised person, the values emerge as a consequence. I've criticised this idea in the past, mainly because the process to generate the idealised person seems vulnerable to negative attractors (Eliezer's most recent version of CEV has less of this problem).

Value extrapolation and model splintering are based on extrapolating features and concepts in models, to other models. This can be done without knowing human psychology or (initially) knowing anything about humans at all, including their existence. See for example the "value extrapolation partially resolves symbol grounding" post; I would never write "CEV partially resolves symbol grounding". On the contrary, CEV needs symbol grounding.

I wanted to look up CEV after reading this comment. Here's a link for anyone else looking: https://intelligence.org/files/CEV.pdf

That acronym stands for "Coherent Extrapolated Volition" not "Coherent Extrapolated Values". But from skimming the paper just now, I think I agree with shminux that it's basically the same idea.

A more recent explanation of CEV by Eliezer: https://arbital.com/p/cev/  

Aligned AI is a benefit corporation dedicated to solving the alignment problem

Is this a UK or US public-benefit corporation?

Who are the other founders?

Who are you capitalized by, and for how much?

Currently UK-based; Rebecca Gorman is the other co-founder.

This page says "We are located in Oxford, England." So I think they are a UK public-benefit corporation, but I could be mistaken.

If you think this is financially viable, then I'm fairly keen on this, especially if you provide internships and development opportunities for aspiring safety researchers.

Yes, those are important to provide, and we will.

Given that there's a lot of variation in how humans extrapolate values, whose extrapolation process do you intend to use?

We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.

This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and (by analogy) in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch book-able is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results.

Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.

I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum).

But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.
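The practical recipe mentioned in this reply (scale, add, maximise the sum) can be sketched as follows. This is only an illustration under stated assumptions: min-max scaling to [0, 1] stands in for "some not-too-unreasonable scale", and the agents, options, and utility numbers are hypothetical.

```python
def normalise(utilities):
    """Min-max scale one agent's utilities to [0, 1] (an assumed choice of scale)."""
    lowest, highest = min(utilities.values()), max(utilities.values())
    if highest == lowest:
        # An indifferent agent contributes nothing to the comparison.
        return {option: 0.0 for option in utilities}
    return {option: (u - lowest) / (highest - lowest) for option, u in utilities.items()}

def aggregate_and_choose(agents):
    """Scale each agent's utilities, add them per option, and maximise the sum."""
    scaled = [normalise(utilities) for utilities in agents.values()]
    options = list(scaled[0])
    totals = {option: sum(s[option] for s in scaled) for option in options}
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical agents whose raw utilities are on very different scales.
agents = {
    "alice": {"park": 10, "road": 0, "housing": 6},
    "bob": {"park": 0, "road": 1000, "housing": 700},
}

best, totals = aggregate_and_choose(agents)
print(best, totals)
```

Without the scaling step, bob's larger raw numbers would dominate the sum and "road" would win; with it, the compromise option "housing" does. The choice of scale is where the theoretical difficulties reappear, which is why it is flagged here as an assumption.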

I'm skeptical that many of the problems with aggregation don't both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it's nowhere near complete in addressing this issue.)

It's worth writing up your point and posting it - that tends to clarify the issue, for yourself as well as for others.

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! Will be interesting to see how you will end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

I'm encouraged by your optimism, and wish you the best of luck (British, and otherwise), but I hope you're not getting much of your intuition from the "Humans have demonstrated a skill with value extrapolation..." part. I don't think we have good evidence for this in a broad enough range of circumstances for it to apply well to the AGI case.

We know humans do pretty 'well' at this - when surrounded by dozens of other similar agents, in a game-theoretical context where it pays to cooperate, it pays to share values with others, and where extreme failure modes usually lead to loss of any significant power before they can lead to terrible abuse of that power.

Absent such game-theoretic constraints, I don't think we know much at all about how well humans do at this.

Further, I don't think I know what it means to do value extrapolation well - beyond something like "you're doing it well if you're winning" (what would it look like for almost all humans to do it badly?). That's fine for situations where cooperation with humans is the best way to win. Not so much where it isn't.

But with luck I'm missing something!

I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can the conditions of that be replicated?).