The purpose of this book is to explain why [superintelligence] might be the last event in human history and how to make sure that it is not... The book is intended for a general audience but will, I hope, be of value in convincing specialists in artificial intelligence to rethink their fundamental assumptions.
Yesterday, I eagerly opened my copy of Stuart Russell's Human Compatible (mirroring his Center for Human-Compatible AI, where I've worked the past two summers). I've been curious about Russell's research agenda, and also how Russell argued the case so convincingly as to garner the following acclamations from two Turing Award winners:
Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation—super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read.—Judea Pearl
This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask but not what we really intend. Essential reading if you care about our future. —Yoshua Bengio
Bengio even recently lent a reasoned voice to a debate on instrumental convergence!
Bringing the AI community up-to-speed
I think the book will greatly help AI professionals understand key arguments, avoid classic missteps, and appreciate the serious challenge humanity faces. Russell straightforwardly debunks common objections, writing with both candor and charm.
I must admit, it's great to see such a prominent debunking; I still remember, early in my concern about alignment, hearing one professional respond to the entire idea of being concerned about AGI with a lazy ad hominem dismissal. Like, hello? This is our future we're talking about!
But Russell realizes that most people don't intentionally argue in bad faith; he structures his arguments with the understanding and charity required to ease the difficulty of changing one's mind. (Although I wish he'd be a little less sassy with LeCun, understandable as his frustration may be.)
More important than having fish, however, is knowing how to fish; Russell helps train the right mental motions in his readers:
With a bit of practice, you can learn to identify ways in which the achievement of more or less any fixed objective can result in arbitrarily bad outcomes. [Russell goes on to describe specific examples and strategies] (p139)
He somehow explains the difference between the Platonic assumptions of RL and the reality of a human-level reasoner, while also introducing wireheading. He covers the utility-reward gap, explaining that our understanding of real-world agency is so crude that we can't even coherently talk about the "purpose" of, e.g., AlphaGo. He explains instrumental subgoals. These bits are so, so good.
Now for the main course, for those already familiar with the basic arguments:
Please realize that I'm replying to my understanding of Russell's agenda as communicated in a nontechnical book for the general public; I also don't have a mental model of Russell personally. Still, I'm working with what I've got.
Here's my summary: reward uncertainty through some extension of a CIRL-like setup, accounting for human irrationality through our scientific knowledge, doing aggregate preference utilitarianism for all of the humans on the planet, discounting people by how well their beliefs map to reality, perhaps downweighting motivations such as envy (to mitigate the problem of everyone wanting positional goods). One challenge is towards what preference-shaping situations the robot should guide us (maybe we need meta-preference learning?). Russell also has a vision of many agents, each working to reasonably pursue the wishes of their owners (while being considerate of others).
I'm going to simplify the situation and just express my concerns about the case of one irrational human, one robot.
There's fully updated deference:
One possible scheme in AI alignment is to give the AI a state of moral uncertainty implying that we know more than the AI does about its own utility function, as the AI's meta-utility function defines its ideal target. Then we could tell the AI, "You should let us shut you down because we know something about your ideal target that you don't, and we estimate that we can optimize your ideal target better without you."
The obstacle to this scheme is that belief states of this type also tend to imply that an even better option for the AI would be to learn its ideal target by observing us. Then, having 'fully updated', the AI would have no further reason to 'defer' to us, and could proceed to directly optimize its ideal target.
which Russell partially addresses by advocating ensuring realizability, and avoiding feature misspecification by (somehow) allowing for dynamic addition of previously unknown features (see also Incorrigibility in the CIRL Framework). But supposing we don't have this kind of model misspecification, I don't see how the "AI simply fully computes the human's policy, updates, and then no longer lets us correct it" issue is addressed. If you're really confident that computing the human policy lets you just extract the true preferences under the realizability assumptions, maybe this is fine? I suspect Russell has more to say here that didn't make it onto the printed page.
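To make the fully-updated-deference worry concrete, here is a toy calculation, loosely in the spirit of the off-switch game. It is a sketch under strong assumptions, not Russell's formalism: the robot is considering a single action with unknown utility, the human is assumed to know the true utility and to veto exactly the harmful cases, and the function names are made up for illustration.

```python
from typing import Sequence


def expected(values: Sequence[float], probs: Sequence[float]) -> float:
    """Expectation of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))


def value_of_deferring(utilities: Sequence[float], probs: Sequence[float]) -> float:
    """Robot's gain from letting the human veto its action.

    Assumption: the human knows the true utility and permits the action
    only when that utility is positive (otherwise the outcome is 0).
    """
    # Best the robot can do alone: act on its expectation, or do nothing (0).
    act_alone = max(expected(utilities, probs), 0.0)
    # With a veto, the harmful outcomes are replaced by 0.
    defer = expected([max(u, 0.0) for u in utilities], probs)
    return defer - act_alone


# Robot unsure whether the action is harmful (-1) or helpful (+2).
uncertain = value_of_deferring([-1.0, 2.0], [0.5, 0.5])
# Robot has "fully updated": it is certain the action is helpful.
fully_updated = value_of_deferring([-1.0, 2.0], [0.0, 1.0])

print(uncertain, fully_updated)
```

In this toy model, the incentive to defer comes entirely from the robot's remaining uncertainty. Once the belief collapses to a point, deferring gains it nothing, which is the structure of the worry.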
There's also the issue of getting a good enough human mistake model, and figuring out people's beliefs, all while attempting to learn their preferences (see the value learning sequence).
Now, it would be pretty silly to reply to an outlined research agenda with "but specific problems X, Y, and Z!", because the whole point of further research is to solve problems. However, my concerns are more structural. Certain AI designs lend themselves to more robustness against things going wrong (in specification, training, or simply having fewer assumptions). It seems to me that the uncertainty-based approach is quite demanding on getting component after component "right enough".
Let me give you an example of something which is intuitively "more robust" to me: approval-directed agency.
Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:
Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.
Here, the approval-policy does what a predictor says to do at each time step, which is different from maximizing a signal. Its shape feels different to me; the policy isn't shaped to maximize some reward signal (and pursue instrumental subgoals). Errors in prediction almost certainly don't produce a policy adversarial to human interests.
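The quoted procedure can be sketched in a few lines. This is a minimal illustration, not Paul Christiano's implementation: `predicted_rating` stands in for a learned predictor of Hugh's considered rating, and the toy ratings table is invented.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def approval_directed_step(
    actions: Iterable[Action],
    predicted_rating: Callable[[Action], float],
) -> Action:
    """Pick the action with the highest predicted rating.

    The choice is an argmax over the predictor's output at this step only;
    nothing here plans over future states to push a reward signal upward.
    """
    return max(actions, key=predicted_rating)


# Hypothetical stand-in for a trained approval predictor.
toy_ratings = {
    "make tea": 0.9,
    "do nothing": 0.5,
    "seize the kettle factory": 0.01,
}

chosen = approval_directed_step(toy_ratings, lambda a: toy_ratings[a])
print(chosen)
```

If the predictor is a little off, the sketch picks a somewhat worse-rated action; whether errors stay this benign in a real system is exactly the open question.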
How does this compare with the uncertainty approach? Let's consider one thing it seems we need to get right:
Where in the world is the human?
How will the agent robustly locate the human whose preferences it's learning, and why do we need to worry about this?
Well, a novice might worry "what if the AI doesn't properly cleave reality at its joints, relying on a bad representation of the world?". But, having good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect that its implicit representation of the world continually improves (i.e., it comes to find a nice efficient encoding). We don't have to worry about this - the AI is incentivized to get this right.
However, if the AI is meant to deduce and further the preferences of that single human, it has to find that human. But, before the AI is operational, how do we point to our concept of "this person" in a yet-unformed model whose encoding probably doesn't cleave reality along those same lines? Even if we fix the structure of the AI's model so we can point to that human, it might then have instrumental incentives to modify the model so it can make better predictions.
Why does it matter so much that we point exactly to the human? Well, then we're extrapolating the "preferences" of something that is not the person (or a person?) - the predicted human policy in this case seems highly sensitive to the details of the person or entity being pointed to. This seems like it could easily end in tragedy, and (strong belief, weakly held) doesn't seem like the kind of problem that has a clean solution. This sort of thing seems to happen quite often for proposals which hinge on things-in-ontologies.
Human action models, mistake models, etc. are also difficult in this way, and we have to get them right. I'm not necessarily worried about the difficulties themselves, but that the framework seems so sensitive to them.
This book is most definitely an important read for both the general public and AI specialists, presenting a thought-provoking agenda with worthwhile insights (even if I don't see how it all ends up fitting together). To me, this seems like a key tool for outreach.
Just think: in how many worlds does alignment research benefit from the advocacy of one of the most distinguished AI researchers ever?
Reading this made me realize a pretty general idea, which we can call "decoupling action from utility".
Consequentialist AI: figure out which action, if carried out, would maximize paperclips; then carry out that action.
Decoupled AI 1: figure out which action, if carried out, would maximize paperclips; then print a description of that action.
Decoupled AI 2: figure out which action, if described to a human, would be approved; then carry out that action. (Approval-directed agent)
Decoupled AI 3: figure out which prediction, if erased by a low probability event, would be true; then print that prediction. (Counterfactual oracle)
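One way to read the list above is that each design pairs a selection criterion with a separate choice of what to do with the selected item. The sketch below is only an illustration of that reading: the `Design` class, the toy scores, and the `EXECUTE`/`PRINT` strings are all invented, and the counterfactual oracle is left out because it selects predictions rather than actions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Design:
    name: str
    score: Callable[[str], float]  # criterion used to select an action
    output: Callable[[str], str]   # what is done with the selected action


def run(design: Design, actions: Sequence[str]) -> str:
    best = max(actions, key=design.score)
    return design.output(best)


# Invented toy numbers for two candidate actions.
paperclips = {"convert everything to paperclips": 1e9, "run the factory normally": 1e3}
approval = {"convert everything to paperclips": 0.0, "run the factory normally": 0.9}
actions = list(paperclips)


def execute(action: str) -> str:
    return f"EXECUTE: {action}"


def describe(action: str) -> str:
    return f"PRINT: {action}"


designs = [
    Design("consequentialist", lambda a: paperclips[a], execute),
    Design("decoupled 1", lambda a: paperclips[a], describe),
    Design("decoupled 2 (approval-directed)", lambda a: approval[a], execute),
]

results = {d.name: run(d, actions) for d in designs}
for name, result in results.items():
    print(f"{name}: {result}")
```

Decoupled AI 1 changes only the output channel, while Decoupled AI 2 changes only the selection criterion; the sketch says nothing about whether either change is safe.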
Any other ideas for "decoupled" AIs, or risks that apply to this approach in general?
(See also the concept of "decoupled RL" from some DeepMind folks.)
If the question is about all the risks that apply, rather than special risks with this specific approach, then I'll note that the usual risks from the inner alignment problem seem to apply.
Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.
Another design: imitation learning. Generally, there seems to be a pattern of: policies which aren't selected for on the basis of maximizing some kind of return.
The class of non-agent AIs (not choosing actions based on the predicted resulting utility) seems very broad. We could choose actions alphabetically, or use an expert system representing the outside view, or use a biased/inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves.
I don't think there's any general failure mode (there are certainly specific ones), but if we condition on this AI being selected by humans, maybe we select something that's doing enough optimization that it will take a highly-optimizing action like rewriting itself to be an agent.
I'm a bit confused because you're citing this in comparison with approval-directed agency, but doesn't approval-directed agency also have this problem?
Approval-directed agency also has to correctly learn or specify what "considered it at length" means (i.e., to learn/specify reflection or preferences-for-reflection) and a baseline model of the current human user as a starting point for reflection, so it's not obvious to me that it's much more robust.
I think overall I do lean towards Paul’s approach though, if only because I understand it a lot better. I wonder why Professor Russell doesn’t describe his agenda in more technical detail, or engage much with the technical AI safety community, to the extent that even grad students at CHAI apparently do not know much about his approach. (Edit: This last paragraph was temporarily deleted while I consulted with Rohin then added back.)
For the sake of explaining this: for quite a while, he's been engaging with academics and policymakers, and writing a book; it's not that he's been doing research and not talking to anyone about it.
Fyi, when you quote people who work at an organization saying something that has a negative implication about that organization, you make it less likely that people will say things like that in the future. I'm not saying that you did anything wrong here; I just want to make sure that you know of this effect, and that it does make me in particular more likely to be silent the next time you ask about CHAI rather than responding.
Clarification: For me, the general worry is something like "if I get quoted, I need to make sure that it's not misleading (which can happen even if the person quoting me didn't mean to be misleading), and that takes time and effort and noticing all the places where I'm quoted, and it's just easier to not say things at all".
(Other people may have more worries, like "If I say something that could be interpreted as being critical of the organization, and that becomes sufficiently well-publicized, then I might get fired, so I'll just never say anything like that.")
Note: I've only started to delve into the literature about Paul's agenda, so these opinions are lightly held.
Before I respond to specific points, recall that I wrote
The approval agent is taking actions according to the output of an ML-trained approval predictor; the fact that the policy isn't selected to maximize a signal is critical, and part of why I find approval-based methods so intriguing. There's a very specific kind of policy you need in order to pursue instrumental subgoals, which is reliably produced by maximization, but which otherwise seems to be vanishingly unlikely.
The contrast is the failing gracefully, not (necessarily) the specific problems.
In addition to the above (even if approval-directed agents have this problem, this doesn't mean disaster, just reduced performance), my understanding is that approval doesn't require actually locating the person, just imitating the output of their approval after reflection. This should be able to be trained in the normal fashion, right? (see the learning from examples section)
Suppose we train the predictor Approval using examples and high-powered ML. Then we have the agent take the action most highly rated by Approval at each time step. This seems to fail much more gracefully as the quality of
The AI is incentivized to get this right only in directions that increase approval. If the AI discovers something the human operator would disapprove of learning, it is incentivized to obscure that fact or act as if it didn't know it. (This works both for "oh, here's an easy way to kill all humans" and "oh, it turns out God isn't real.")
Yes, but its underlying model is still accurate, even if it doesn't reveal that to us? I wasn’t claiming that the AI would reveal to us all of the truths it learns.
Perhaps I misunderstand your point.
This depends on whether it thinks we would approve more of it having an accurate model and deceiving us or having an inaccurate model in the way we want its model to be less accurate. Some algorithmic bias work is of the form "the system shouldn't take in inputs X, or draw conclusions Y, because that violates a deontological rule, and simple accuracy-maximization doesn't incentivize following that rule."
My point is something like "the genius of approval-directed agency is that it grounds out every meta-level in 'approval,' but this is also (potentially) the drawback of approval-directed agency." Specifically, for any potentially good property the system might have (like epistemic accuracy) you need to check whether that actually in-all-cases for-all-users maximizes approval, because if it doesn't, then the approval-directed agent is incentivized to not have that property.
[The deeper philosophical question here is something like "does ethics backchain or forwardchain?", as we're either grounding things out in what we will believe or what we believe now, and approval-direction is more the latter, and CEV-like things are more the former.]
Note that I wasn’t talking about approval directed agents in the part you originally quoted. I was saying that normal maximizers will learn to build good models as part of capability generalization.
Oh! Sorry, I missed the "How does this compare with" line.