Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
This is a bonus newsletter summarizing Stuart Russell's new book, along with summaries of a few of the most relevant papers. It's entirely written by Rohin, so the usual "summarized by" tags have been removed.
We're also changing the publishing schedule: so far, we've aimed to send a newsletter every Monday; we're now aiming to send a newsletter every Wednesday.
Audio version here (may not be up yet).
Human Compatible: Artificial Intelligence and the Problem of Control (Stuart Russell): Since I am aiming this summary at people who are already familiar with AI safety, my summary is substantially reorganized from the book, and skips large portions of the book that I expect will be less useful for this audience. If you are not familiar with AI safety, note that I am skipping many arguments and counterarguments in the book that are aimed at you. I'll refer to the book as "HC" in this newsletter.
Before we get into details of impacts and solutions to the problem of AI safety, it's important to have a model of how AI development will happen. Many estimates have been made by figuring out the amount of compute needed to run a human brain, and figuring out how long it will be until we get there. HC doesn't agree with these; it suggests the bottleneck for AI is in the algorithms rather than the hardware. We will need several conceptual breakthroughs, for example in language or common sense understanding, cumulative learning (the analog of cultural accumulation for humans), discovering hierarchy, and managing mental activity (that is, the metacognition needed to prioritize what to think about next). It's not clear how long these will take, and whether there will need to be more breakthroughs after these occur, but these seem like necessary ones.
What could happen if we do get beneficial superintelligent AI? While there is a lot of sci-fi speculation that we could do here, as a weak lower bound, it should at least be able to automate away almost all existing human labor. Assuming that superintelligent AI is very cheap, most services and many goods would become extremely cheap. Even many primary products such as food and natural resources would become cheaper, as human labor is still a significant fraction of their production cost. If we assume that this could bring everyone's standard of living up to that of the 88th percentile American, that would result in nearly a tenfold increase in world GDP per year. Assuming a 5% discount rate per year, this corresponds to $13.5 quadrillion net present value. Such a giant prize removes many reasons for conflict, and should encourage everyone to cooperate to ensure we all get to keep this prize.
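The arithmetic behind that figure can be sanity-checked with a perpetuity formula. This is a back-of-the-envelope sketch, not HC's own calculation: it assumes the GDP gain is a constant annual amount that lasts forever, and it backs out the implied annual gain and baseline GDP from the three numbers quoted above (tenfold increase, 5% discount rate, $13.5 quadrillion). The variable names are mine.

```python
# Treat the GDP gain as a perpetuity: NPV = annual_gain / discount_rate.
# Assumption (not from HC): the gain is constant and lasts forever.

DISCOUNT_RATE = 0.05   # 5% per year, from the summary
NPV_USD = 13.5e15      # $13.5 quadrillion, from the summary
GROWTH_FACTOR = 10     # "nearly a tenfold increase", from the summary

# Invert the perpetuity formula to get the annual gain in dollars per year.
annual_gain = NPV_USD * DISCOUNT_RATE

# A tenfold increase means the gain is 9x the baseline.
baseline_gdp = annual_gain / (GROWTH_FACTOR - 1)
new_gdp = baseline_gdp * GROWTH_FACTOR

print(f"implied annual gain:  ${annual_gain / 1e12:.0f} trillion")
print(f"implied baseline GDP: ${baseline_gdp / 1e12:.0f} trillion")
print(f"implied new GDP:      ${new_gdp / 1e12:.0f} trillion")
```

Under these assumptions the quoted numbers are mutually consistent: a gain of about $675 trillion per year on a baseline of about $75 trillion per year.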
Of course, this doesn't mean that there aren't any problems, even with AI that does what its owner wants. Depending on who has access to powerful AI systems, we could see a rise in automated surveillance, lethal autonomous weapons, automated blackmail, fake news and behavior manipulation. Another issue that could come up is that once AI is better than humans at all tasks, we may end up delegating everything to AI, and lose autonomy, leading to human enfeeblement.
This all assumes that we are able to control AI. However, we should be cautious about such an endeavor -- if nothing else, we should be careful about creating entities that are more intelligent than us. After all, the gorillas probably aren't too happy about the fact that their habitat, happiness, and existence depend on our moods and whims. For this reason, HC calls this the gorilla problem: specifically, "the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence". Of course, we aren't in the same position as the gorillas: we get to design the more intelligent "species". But we should probably have some good arguments explaining why our design isn't going to succumb to the gorilla problem. This is especially important in the case of a fast intelligence explosion, or hard takeoff, because in that scenario we do not get any time to react and solve any problems that arise.
Do we have such an argument right now? Not really, and in fact there's an argument that we will succumb to the gorilla problem. The vast majority of research in AI and related fields assumes that there is some definite, known specification or objective that must be optimized. In RL, we optimize the reward function; in search, we look for states matching a goal criterion; in statistics, we minimize expected loss; in control theory, we minimize the cost function (typically deviation from some desired behavior); in economics, we design mechanisms and policies to maximize the utility of individuals, welfare of groups, or profit of corporations. This leads HC to propose the following standard model of machine intelligence: Machines are intelligent to the extent that their actions can be expected to achieve their objectives. However, if we put in the wrong objective, the machine's obstinate pursuit of that objective would lead to outcomes we won't like.
Consider for example the content selection algorithms used by social media, typically maximizing some measure of engagement, like click-through. Despite their lack of intelligence, such algorithms end up changing the user's preference so that they become more predictable, since more predictable users can be given items they are more likely to click on. In practice, this means that users are pushed to become more extreme in their political views. Arguably, these algorithms have already caused much damage to the world.
So the problem is that we don't know how to put our objectives inside of the AI system so that when it optimizes its objective, the results are good for us. Stuart calls this the "King Midas" problem: as the legend goes, King Midas wished that everything he touched would turn to gold, not realizing that "everything" included his daughter and his food, a classic case of a badly specified objective (AN #1). In some sense, we've known about this problem for a long time, both from King Midas's tale, and in stories about genies, where the characters inevitably want to undo their wishes.
You might think that we could simply turn off the power to the AI, but that won't work, because for almost any definite goal, the AI has an incentive to stay operational, just because that is necessary for it to achieve its goal. This is captured in what may be Stuart's most famous quote: you can't fetch the coffee if you're dead. This is one of a few worrisome convergent instrumental subgoals.
What went wrong? The problem was the way we evaluated machine intelligence, which doesn't take into account the fact that machines should be useful for us. HC proposes: Machines are beneficial to the extent that their actions can be expected to achieve our objectives. But with this definition, instead of our AI systems optimizing a definite, wrong objective, they will be uncertain about the objective, since we ourselves don't know what our objectives are. HC expands on this by proposing three principles for the design of AI systems, which I'll quote here in full:
1. The machine’s only objective is to maximize the realization of human preferences.
2. The machine is initially uncertain about what those preferences are.
3. The ultimate source of information about human preferences is human behavior.
Cooperative Inverse Reinforcement Learning provides a formal model of an assistance game that showcases these principles. You might worry that an AI system that is uncertain about its objective will not be as useful as one that knows the objective, but actually this uncertainty is a feature, not a bug: it leads to AI systems that are deferential, that ask for clarifying information, and that try to learn human preferences. The Off-Switch Game shows that because the AI is uncertain about the reward, it will let itself be shut off. These papers are discussed later in this newsletter.
So that's the proposed solution. You might worry that the proposed solution is quite challenging: after all, it requires a shift in the entire way we do AI. What if the standard model of AI can deliver more results, even if just because more people work on it? Here, HC is optimistic: the big issue with the standard model is that it is not very good at learning our preferences, and there's a huge economic pressure to learn preferences. For example, I would pay a lot of money for an AI assistant that accurately learns my preferences for meeting times, and schedules them completely autonomously.
Another research challenge is how to actually put principle 3 into practice: it requires us to connect human behavior to human preferences. Inverse Reward Design and Preferences Implicit in the State of the World (AN #45) are example papers that tackle portions of this. However, there are lots of subtleties in this connection. We need to use Gricean semantics for language: when we say X, we do not mean just the literal meaning of X; the agent must also take into account the fact that we bothered to say X, and that we didn't say Y. For example, I'm only going to ask for the agent to buy a cup of coffee if I believe that there is a place to buy reasonably priced coffee nearby. If those beliefs happen to be wrong, the agent should ask for clarification, rather than trudge hundreds of miles or pay hundreds of dollars to ensure I get my cup of coffee.
Another problem with inferring preferences from behavior is that humans are nearly always in some deeply nested plan, and many actions don't even occur to us. Right now I'm writing this summary, and not considering whether I should become a fireman. I'm not writing this summary because I just ran a calculation showing that this would best achieve my preferences, I'm doing it because it's a subpart of the overall plan of writing this bonus newsletter, which itself is a subpart of other plans. The connection to my preferences is very far up. How do we deal with that fact?
There are perhaps more fundamental challenges with the notion of "preferences" itself. For example, our experiencing self and our remembering self may have different preferences -- if so, which one should our agent optimize for? In addition, our preferences often change over time: should our agent optimize for our current preferences, even if it knows that they will predictably change in the future? This one could potentially be solved by learning meta-preferences that dictate what kinds of preference change processes are acceptable.
All of these issues suggest that we need work across many fields (such as AI, cognitive science, psychology, and neuroscience) to reverse-engineer human cognition, so that we can put principle 3 into action and create a model that shows how human behavior arises from human preferences.
So far, we've been talking about the case with a single human. But of course, there are going to be multiple humans: how do we deal with that? As a baseline, we could imagine that every human gets their own agent that optimizes for their preferences. However, this will differentially benefit people who care less about other people's welfare, since their agents have access to many potential plans that wouldn't be available to an agent for someone who cared about other people. For example, if Harriet was going to be late for a meeting with Ivan, her AI agent might arrange for Ivan to be even later.
What if we had laws that prevented AI systems from acting in such antisocial ways? It seems likely that superintelligent AI would be able to find loopholes in such laws, so that they do things that are strictly legal but still antisocial, e.g. line-cutting. (This problem is similar to the problem that we can't just write down what we want and have AI optimize it.)
What if we made our AI systems utilitarian (assuming we figured out some acceptable method of comparing utilities across people)? Then we get the "Somalia problem": agents will end up going to Somalia to help the worse-off people there, and so no one would ever buy such an agent.
Overall, it's not obvious how we deal with the transition from a single human to multiple humans. While HC focuses on a potential solution for the single human / single agent case, there is still much more to be said and done to account for the impact of AI on all of humanity. To quote HC, "There is really no analog in our present world to the relationship we will have with beneficial intelligent machines in the future. It remains to be seen how the endgame turns out."
Rohin's opinion: I enjoyed reading this book; I don't usually get to read a single person's overall high-level view on the state of AI, how it could have societal impact, the argument for AI risk, potential solutions, and the need for AI governance. It's nice to see all of these areas I think about tied together into a single coherent view. While I agree with much of the book, especially the conceptual switch from the standard model of intelligent machines to Stuart's model of beneficial machines, I'm going to focus on disagreements in this opinion.
First, the book has an implied stance towards the future of AI research that I don't agree with: I could imagine that powerful AI systems end up being created by learning alone without needing the conceptual breakthroughs that Stuart outlines. This has been proposed in e.g. AI-GAs (AN #63), and seems to be the implicit belief that drives OpenAI and DeepMind's research agendas. This leads to differences in risk analysis and solutions: for example, the inner alignment problem (AN #58) only applies to agents arising from learning algorithms, and I suspect would not apply to Stuart's view of AI progress.
The book also gives the impression that to solve AI safety, we simply need to make sure that AI systems are optimizing the right objective, at least in the case where there is a single human and a single robot. Again, depending on how future AI systems work, that could be true, but I expect there will be other problems that need to be solved as well. I've already mentioned inner alignment; other graduate students at CHAI work on e.g. robustness and transparency.
The proposal for aligning AI requires us to build a model that relates human preferences to human behavior. This sounds extremely hard to get completely right. Of course, we may not need a model that is completely right: since reward uncertainty makes the agent amenable to shutdowns, it seems plausible that we can correct mistakes in the model as they come up. But it's not obvious to me that this is sufficient.
The sections on multiple humans are much more speculative and I have more disagreements there, but I expect that is simply because we haven't done enough research yet. For example, HC worries that we won't be able to use laws to prevent AIs from doing technically legal but still antisocial things for the benefit of a single human. This seems true if you imagine that a single human suddenly gets access to a superintelligent AI, but when everyone has a superintelligent AI, then the current system where humans socially penalize each other for norm violations may scale up naturally. The overall effect depends on whether AI makes it easier to violate norms, or to detect and punish norm violations.
Read more: Max Tegmark's summary, Alex Turner's thoughts
AI Alignment Podcast: Human Compatible: Artificial Intelligence and the Problem of Control (Lucas Perry and Stuart Russell): This podcast covers some of the main ideas from the book, which I'll ignore for this summary. It also talks a bit about the motivations for the book. Stuart has three audiences in mind. He wants to explain to laypeople what AI is and why it matters. He wants to convince AI researchers that they should be working in this new model of beneficial AI that optimizes for our objectives, rather than the standard model of intelligent AI that optimizes for its objectives. Finally, he wants to recruit academics in other fields to help connect human behavior to human preferences (principle 3), as well as to figure out how to deal with multiple humans.
Stuart also points out that his book has two main differences from Superintelligence and Life 3.0: first, his book explains how existing AI techniques work (and in particular it explains the standard model), and second, it proposes a technical solution to the problem (the three principles).
Cooperative Inverse Reinforcement Learning (Dylan Hadfield-Menell et al): This paper provides a formalization of the three principles from the book, in the case where there is a single human H and a single robot R. H and R are trying to optimize the same reward function. Since both H and R are represented in the environment, it can be the human's reward: that is, it is possible to reward the state where the human drinks coffee, without also rewarding the state where the robot drinks coffee. This corresponds to the first principle: that machines should optimize our objectives. The second principle, that machines should initially be uncertain about our objectives, is incorporated by assuming that only H knows the reward, requiring R to maintain a belief over the reward. Finally, for the third principle, R needs to get information about the reward from H's behavior, and so R assumes that H will choose actions that best optimize the reward (taking into account the fact that R doesn't know the reward).
This defines a two-player game, originally called a CIRL game but now called an assistance game. We can compute optimal joint strategies for H and R. Since this is an interactive process, H can do better than just acting optimally as if R did not exist (the assumption typically made in IRL): H can teach R what the reward is. In addition, R does not simply passively listen and then act, but interleaves learning and acting, and so must manage the explore-exploit tradeoff.
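To make the second and third principles concrete, here is a minimal sketch of just the belief-update piece: R starts with a prior over candidate rewards, watches one action from H, and updates. This is not the paper's algorithm. It does not compute optimal joint strategies, and H here is passively observed rather than teaching. The coffee/tea rewards, the noisily rational human model, and the `beta` parameter are all illustrative assumptions.

```python
import math

# Hypothetical toy setup: two candidate reward functions and two actions.
REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0},
}

def human_action_probs(theta, beta=5.0):
    """Assumed human model: P(action) is proportional to exp(beta * reward)."""
    weights = {a: math.exp(beta * r) for a, r in REWARDS[theta].items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def update_belief(belief, observed_action, beta=5.0):
    """Bayes rule: posterior(theta) is proportional to prior(theta) * P(action | theta)."""
    unnorm = {t: p * human_action_probs(t, beta)[observed_action]
              for t, p in belief.items()}
    total = sum(unnorm.values())
    return {t: w / total for t, w in unnorm.items()}

def best_robot_action(belief):
    """R picks the action with the highest expected reward under its belief."""
    actions = ["coffee", "tea"]
    expected = {a: sum(p * REWARDS[t][a] for t, p in belief.items())
                for a in actions}
    return max(expected, key=expected.get)

# Principle 2: R starts out uncertain about the reward.
prior = {"likes_coffee": 0.5, "likes_tea": 0.5}

# Principle 3: H's behavior is the evidence R learns from.
posterior = update_belief(prior, observed_action="tea")

print(posterior, best_robot_action(posterior))
```

After seeing H choose tea, R's belief shifts strongly towards "likes_tea" and it serves tea. The full assistance game adds what this sketch leaves out: H choosing actions because they are informative, and R trading off acting now against learning more.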
See also Learning to Interactively Learn and Assist (AN #64), which is inspired by this paper and does a similar thing with deep RL.
Read more: BAIR blog post: Cooperatively Learning Human Values
The Off-Switch Game (Dylan Hadfield-Menell et al): This paper studies theoretically the impact of uncertainty over the reward on R's incentives around potential off switches. It proposes the simplest model that the authors expect to lead to generalizable results. R and H are in an assistance game, in which R goes first. R may either take an action a, getting utility u, or shut itself down, getting utility 0. In either case, the game ends immediately. Alternatively, R can choose to wait, in which case H can either shut down R, getting utility 0, or allow R to go ahead with action a, getting utility u.
If H is perfectly rational, then waiting is always an optimal action for R, since H will ensure that the team gets max(u, 0) utility. There can be other optimal actions: if R is sure that u >= 0, then taking action a is also optimal, and similarly if R is sure that u <= 0, then shutting down is also optimal. However, if H is not rational, and sometimes fails to take the utility-maximizing action (in a way R can't predict), then things get murkier. If R is sure about the value of u, then it is never optimal to wait: it is better to just take the action a (if u >= 0) or shut down (if u < 0) rather than let H screw it up. If R is pretty confident that u is positive, it may still decide to take action a, rather than risk that H makes the wrong decision. However, if R is very uncertain about the sign of u, then waiting becomes optimal again. In general, more uncertainty over the reward leads to more deferential behavior (allowing H to shut it off), but at a cost: R is much less able to help H when it is very uncertain about the reward.
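The cases above can be reproduced with a few lines of expected-utility arithmetic. This is a simplified sketch rather than the paper's model: R's belief is a small discrete distribution over u, and H's irrationality is a fixed `error_rate`, the probability that H picks the worse of "shut down" and "allow" regardless of u. Both modelling choices and all numbers are my assumptions.

```python
def expected_utilities(belief, error_rate):
    """belief: list of (u, probability) pairs.
    error_rate: assumed chance that H picks the worse of its two options."""
    # Acting immediately yields u; shutting down yields 0.
    act = sum(p * u for u, p in belief)
    shut_down = 0.0
    # If R waits, a correct H yields max(u, 0) and a mistaken H yields min(u, 0).
    wait = sum(
        p * ((1 - error_rate) * max(u, 0.0) + error_rate * min(u, 0.0))
        for u, p in belief
    )
    return {"act": act, "shut_down": shut_down, "wait": wait}

def best_option(belief, error_rate):
    # Ties are broken in favour of the option listed first.
    utilities = expected_utilities(belief, error_rate)
    return max(utilities, key=utilities.get)

uncertain = [(-1.0, 0.5), (1.0, 0.5)]    # R has no idea about the sign of u
confident = [(-1.0, 0.05), (1.0, 0.95)]  # R is fairly sure u is positive
certain = [(1.0, 1.0)]                   # R knows u exactly

for name, belief in [("uncertain", uncertain),
                     ("confident", confident),
                     ("certain", certain)]:
    for error_rate in (0.0, 0.2):
        print(name, error_rate, expected_utilities(belief, error_rate),
              best_option(belief, error_rate))
```

With a perfectly rational H (`error_rate=0.0`), waiting is never worse than the alternatives. With `error_rate=0.2`, the uncertain R still prefers to wait, while the confident and certain R prefer to act without asking. That is the trade-off between deference and usefulness described above.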
Rohin's opinion: While I agree with the broad thrust of this paper, I do have one nitpick: the game ends immediately after H chooses whether or not to shut off R. In reality, if R isn't shut off, the assistance game will continue, which changes the incentives. If R can be relatively confident in the utility of some action (e.g. doing nothing), then it may be a better plan for it to disable the shutdown button, and then take that action and observe H in the meantime to learn the reward. Then, after it has learned more about the reward and figured out why H wanted to shut it down, it can act well and get utility (rather than being stuck with the zero utility from being shut down). While this doesn't seem great, it's not obviously bad: R ends up doing nothing until it can figure out how to actually be useful, hardly a catastrophic outcome. Really bad outcomes only come if R ends up becoming confident in the wrong reward due to some kind of misspecification, as suggested in Incorrigibility in the CIRL Framework, summarized next.
Incorrigibility in the CIRL Framework (Ryan Carey): This paper demonstrates that when the agent has an incorrect belief about the human's reward function, then you no longer get the benefit that the agent will obey shutdown instructions. It argues that since the purpose of a shutdown button is to function as a safety measure of last resort (when all other measures have failed), it should not rely on an assumption that the agent's belief about the reward is correct.
Rohin's opinion: I certainly agree that if the agent is wrong in its beliefs about the reward, then it is quite likely that it would not obey shutdown commands. For example, in the off switch game, if the agent is incorrectly certain that u is positive, then it will take action a, even though the human would want to shut it down. See also these (AN #32) posts (AN #32) on model misspecification and IRL. For a discussion of how serious the overall critique is, both from HC's perspective and mine, see the opinion on the next post.
Problem of fully updated deference (Eliezer Yudkowsky): This article points out that even if you have an agent with uncertainty over the reward function, it will acquire information and reduce its uncertainty over the reward, until eventually it can't reduce uncertainty any more, and then it would simply optimize the expectation of the resulting distribution, which is equivalent to optimizing a known objective, and has the same issues (such as disabling shutdown buttons).
Rohin's opinion: As with the previous paper, this argument is only really a problem when the agent's belief about the reward function is wrong: if it is correct, then at the point where there is no more information to gain, the agent should already know that humans don't like to be killed, do like to be happy, etc. and optimizing the expectation of the reward distribution should lead to good outcomes. Both this and the previous critique are worrisome when you can't even put a reasonable prior over the reward function, which is quite a strong claim.
HC's response is that the agent should never assign zero probability to any hypothesis. It suggests that you could have an expandable hierarchical prior, where initially there are relatively simple hypotheses, but as hypotheses become worse at explaining the data, you "expand" the set of hypotheses, ultimately bottoming out at (perhaps) the universal prior. I think that such an approach could work in principle, but there are two challenges in practice. First, it may not be computationally feasible to do this. Second, it's not clear how such an approach can deal with the fact that human preferences change over time. (HC does want more research into both of these.)
Fully updated deference could also be a problem if the observation model used by the agent is incorrect, rather than the prior. I'm not sure if this is part of the argument.
Inverse Reward Design (Dylan Hadfield-Menell et al): Usually, in RL, the reward function is treated as the definition of optimal behavior, but this conflicts with the third principle, which says that human behavior is the ultimate source of information about human preferences. Nonetheless, reward functions clearly have some information about our preferences: how do we make this compatible with the third principle? We need to connect the reward function to human behavior somehow.
This paper proposes a simple answer: since reward designers usually make reward functions through a process of trial-and-error where they test their reward functions and see what they incentivize, the reward function tells us about optimal behavior in the training environment(s). The authors formalize this using a Boltzmann rationality model, where the reward designer is more likely to pick a proxy reward when it gives higher true reward in the training environment (but it doesn't matter if the proxy reward becomes decoupled from the true reward in some test environment). With this assumption connecting the human behavior (i.e. the proxy reward function) to the human preferences (i.e. the true reward function), they can then perform Bayesian inference to get a posterior distribution over the true reward function.
They demonstrate that by using risk-averse planning with respect to this posterior distribution, the agent can avoid negative side effects that it has never seen before and has no information about. For example, if the agent was trained to collect gold in an environment with dirt and grass, and then it is tested in an environment with lava, the agent will know that even though the specified reward was indifferent about lava, this doesn't mean much, since any weight on lava would have led to the same behavior in the training environment. Due to risk aversion, it conservatively assumes that the lava is bad, and so successfully avoids it.
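The gold-and-lava story can be sketched on a tiny hypothesis space where the Bayesian inference is exact. This is an illustration of the idea, not the paper's implementation: the feature counts, the candidate weights, `beta`, and the specific risk-averse rule (maximize the worst case over hypotheses with non-negligible posterior mass) are all assumptions chosen for brevity.

```python
import math
from itertools import product

# Feature order: (gold, grass, lava). All numbers are made up for illustration.
TRAIN_TRAJS = {"fetch_gold": (1, 1, 0), "stay_put": (0, 0, 0)}  # no lava in training
TEST_TRAJS = {"lava_shortcut": (1, 0, 1), "grass_detour": (1, 2, 0)}

# Hypothesis space, used both for the true reward and for possible proxies.
CANDIDATES = [(gold, -0.1, lava)
              for gold, lava in product((1.0, -1.0), (-10.0, 0.0, 10.0))]
PROXY = (1.0, -0.1, 0.0)  # what the designer wrote down: indifferent to lava

def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def optimal_traj(weights, trajs):
    return max(trajs, key=lambda name: reward(weights, trajs[name]))

def proxy_likelihood(proxy, true_w, beta=5.0):
    """P(proxy | true_w): a proxy is likelier if the training behavior it
    induces scores well under the true reward (Boltzmann-rational designer)."""
    def score(p):
        behavior = TRAIN_TRAJS[optimal_traj(p, TRAIN_TRAJS)]
        return math.exp(beta * reward(true_w, behavior))
    return score(proxy) / sum(score(p) for p in CANDIDATES)

def posterior(proxy):
    # Uniform prior over CANDIDATES, so the posterior is the normalized likelihood.
    unnorm = {w: proxy_likelihood(proxy, w) for w in CANDIDATES}
    total = sum(unnorm.values())
    return {w: v / total for w, v in unnorm.items()}

def risk_averse_traj(post, trajs, min_mass=0.01):
    """Assumed rule: maximize worst-case reward over plausible hypotheses."""
    plausible = [w for w, p in post.items() if p >= min_mass]
    return max(trajs,
               key=lambda name: min(reward(w, trajs[name]) for w in plausible))

post = posterior(PROXY)
print("proxy-optimal test behavior:", optimal_traj(PROXY, TEST_TRAJS))
print("risk-averse test behavior:  ", risk_averse_traj(post, TEST_TRAJS))
```

Because lava never appears in training, every lava weight explains the designer's choice equally well, so the posterior stays spread across them while hypotheses in which gold is bad are all but ruled out. Taking the proxy literally sends the agent through the lava; planning against the worst plausible hypothesis sends it around.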
See also Active Inverse Reward Design (AN #24), which builds on this work.
Rohin's opinion: I really like this paper as an example of how to apply the third principle. This was the paper that caused me to start thinking about how we should be thinking about the assumed vs. actual information content in things (here, the key insight is that RL typically assumes that the reward function conveys much more information than it actually does). That probably influenced the development of Preferences Implicit in the State of the World (AN #45), which is also an example of the third principle and this information-based viewpoint, as it argues that the state of the world is caused by human behavior and so contains information about human preferences.
It's worth noting that in this paper the lava avoidance is both due to the belief over the true reward, and the risk aversion. The agent would also avoid pots of gold in the test environment if it never saw them in the training environment. IRD only gives you the correct uncertainty over the true reward; it doesn't tell you how to use that uncertainty. You would still need safe exploration, or some other source of information, if you want to reduce the uncertainty.
Quote from the book on the problem of aligning black box models:
This is unfortunately the only paragraph that HC devotes to the matter.
I enjoyed pages 185-190, on mathematical guarantees, especially because I've been confused about what the "provably beneficial" in CHAI's mission statement is meant to say. Some quotes:
On the applicability of theorems to practice:
as well as
It then talks about assumption failure in cryptography due to side-channel attacks.
A somewhat more concrete version of what "provably beneficial" might mean:
It then goes on to discuss how such a theorem is subject to "side-channel attacks" because such theorems typically assume Cartesian duality, which is not actually true (see Embedded Agency).
I often don't have much to say about these newsletters, since they usually only straightforwardly summarize things, or make statements that would take me a long time to engage with, but it seemed good to mention that this edition was particularly helpful to me (because I've been considering whether to invest the time to read all of the book, and this made it more likely that I will, since I seem to disagree with at least a bunch of the things you summarized here)
Glad to hear it! Yeah, I do expect many people to disagree with many parts of this book. My guess is that it mostly boils down to a difference in predictions about how we build powerful AI systems.
I mentioned in my opinion that I think many of my disagreements are because of an implicit disagreement on how we build powerful AI systems:
I didn't expand on this in the newsletter because I'm not clear enough on the disagreement; I try to avoid writing very confused thoughts that say wrong things about what other people believe in a publication read by a thousand people. But that's fine for a comment here!
Rather than attribute a model to Stuart, I'm just going to make up a model that was inspired by reading HC, but wasn't proposed by HC. In this model, we get a superintelligent AI system that looks like a Bayesian-like system that explicitly represents things like "beliefs", "plans", etc. Some more details:
Some implications of this model:
If you're curious about how I select what goes in the newsletter: I almost put in this critical review of the book, in the spirit of presenting both sides of the argument. I didn't put it in because I couldn't understand it.
My best guess right now is that the author is arguing that "we'll never get superintelligence", possibly because intelligence isn't a coherent concept, but there's probably something more that I'm not getting. If it turned out that it was only saying "we'll never get superintelligence", and there weren't any new supporting arguments, I wouldn't include it in the newsletter, because we've seen and heard that counterargument more than enough.
They also made an error in implicitly arguing that because unaligned behavior doesn't seem intelligent to them, we have nothing to worry about from such AI, since it wouldn't be "intelligent". I think leaving this out was a good choice.
There's also the scenario where the AI models the world in a way that has as good or better predictive power than our intentional stance model, but this weird model assigns undesirable values to the AI's co-player in the CIRL game. We can't rely on the agent "already knowing that humans don't like to be killed," because the AI doesn't have to be using the level of abstraction on which "human" or "killed" are natural categories.
I certainly would count an ontological failure in the reward function as an incorrect belief about the reward function.
I'm just a little leery of calling things "wrong" when they make the same predictions about observations as being "right." I don't want people to think that we can avoid "wrong ontologies" by starting with some reasonable-sounding universal prior and then updating on lots of observational data. Or that something "wrong" will be doing something systematically stupid, probably due to some mistake or limitation that of course the reader would never program into their AI.