Alignment Newsletter #52


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Many approaches to AI safety involve modeling humans in some way, for example in order to correctly interpret their feedback. However, there are significant disadvantages to human modeling. First and most importantly, if we have AI systems do useful things without modeling humans, then we can use human approval as a "test set": we can check whether the AI's behavior is something we approve of, and this is an independent evaluation of the AI system. However, if the AI system had a human model, then it may have optimized its behavior for human approval, and so we cannot use approval as a "test set". Second, if our AI system has a catastrophic bug, it seems better if it doesn't have any human models. An AI system without human models will at worst optimize for some unrelated goal like paperclips, which at worst leads to it treating humans as obstacles and causing extinction. However, an AI system with human models with a catastrophic bug might optimize for human suffering, or having humans respond to email all day, etc. Thirdly, an AI system with human models might be simulating conscious beings that can suffer. Fourthly, since humans are agent-like, an AI system that models humans is likely to produce a subsystem that is agent-like and so dangerous.

The authors then discuss why it might be hard to avoid human models. Most notably, it is hard to see how to use a powerful AI system that avoids human models to produce a better future. In particular, human models could be particularly useful for interpreting specifications (in order to do what humans mean, as opposed to what we literally say) and for achieving performance given a specification (e.g. if we want to replicate aspects of human cognition). Another issue is that it is hard to avoid human modeling, since even "independent" tasks have some amount of information about human motivations in selecting that task.

Nevertheless, the authors would like to see more work on engineering-focused approaches to AI safety without human models, especially since this area is neglected, with very little such work currently. While MIRI does work on AI safety without human models, this is from a very theoretical perspective. In addition to technical work, we could also promote certain types of AI research that is less likely to develop human models "by default" (e.g. training AI systems in procedurally generated simulations, rather than on human-generated text and images).

Rohin's opinion: While I don't disagree with the reasoning, I disagree with the main thrust of this post. I wrote a long comment about it; the TL;DR is that since humans want very specific behavior out of AI systems, the AI system needs to get a lot of information from humans about what it should do, and if it understands all that information then it necessarily has a (maybe implicit) human model. In other words, if you require your AI system not to have human models, it will not be very useful, and people will use other techniques.

Technical AI alignment

Iterated amplification

AI Alignment Podcast: AI Alignment through Debate (Lucas Perry and Geoffrey Irving) (summarized by Richard): We want AI safety solutions to scale to very intelligent agents; debate is one scalability technique. It's formulated as a two player zero-sum perfect information game in which agents make arguments in natural language, to be evaluated by a human judge. Whether or not such debates are truth-conducive is an empirical question which we can try to evaluate experimentally; doing so will require both technical and social science expertise (as discussed in a previous post (AN #47)).

Richard's opinion: I think one of the key questions underlying Debate is how efficiently natural language can summarise reasoning about properties of the world. This question is subject to some disagreement (at one extreme, Facebook's roadmap towards machine intelligence describes a training environment which is "entirely linguistically defined") and probably deserves more public discussion in the context of safety.

Rohin's note: If you've read the previous posts on debate, the novel parts of this podcast are on the relation between iterated amplification and debate (which has been discussed before, but not in as much depth), and the reasons for optimism and pessimism about debate.

Agent foundations

Pavlov Generalizes (Abram Demski): In the iterated prisoner's dilemma, the Pavlov strategy is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.

Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents can get a lot of properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about logical time (AN #25), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.

Approval-directed agency and the decision theory of Newcomb-like problems (Caspar Oesterheld)

Learning human intent

Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Summarized in the highlights!


Algorithms for Verifying Deep Neural Networks (Changliu Liu et al): This is a survey paper about verification of properties of deep neural nets.


Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (Pushmeet Kohli et al): This post highlights three areas of current research towards making robust AI systems. First, we need better evaluation metrics: rather than just evaluating RL systems on the environments they were trained on, we need to actively search for situations in which they fail. Second, given a specification or constraint that we would like to ensure, we can develop new training techniques that can ensure that the specifications hold. Finally, given a specification, we can use formal verification techniques to ensure that the model obeys the specification on all possible inputs. The authors also list four areas of future research that they are excited about: leveraging AI capabilities for evaluation and verification, developing publicly available tools for evaluation and verification, broadening the scope of adversarial examples beyond the L-infinity norm ball, and learning specifications.

Rohin's opinion: The biggest challenge I see with this area of research, at least in its application to powerful and general AI systems, is how you get the specification in the first place, so I'm glad to see "learning specifications" as one of the areas of interest.

If I take the view from this post, it seems to me that techniques like domain randomization, and more generally training on a larger distribution of data, would count as an example of the second type of research: it is a change to the training procedure that allows us to meet the specification "the agent should achieve high reward in a broad variety of environments". Of course, this doesn't give us any provable guarantees, so I'm not sure if the authors of the post would include it in this category.


Historical economic growth trends (Katja Grace) (summarized by Richard): Data on historical economic growth "suggest that (proportional) rates of economic and population growth increase roughly linearly with the size of the world economy and population", at least from around 0 CE to 1950. However, this trend has not held since 1950 - in fact, growth rates have fallen since then.

Miscellaneous (Alignment)

Coherent behaviour in the real world is an incoherent concept (Richard Ngo): In a previous post (AN #35), I argued that coherence arguments (such as those based on VNM rationality) do not constrain the behavior of an intelligent agent. In this post, Richard delves further into the argument, and considers other ways that we could draw implications from coherence arguments.

I modeled the agent as having preferences over full trajectories, and objected that if you only look at observed behavior (rather than hypothetical behavior), you can always construct a utility function such that the observed behavior optimizes that utility function. Richard agrees that this objection is strong, but looks at another case: when the agent has preferences over states at a single point in time. This case leads to other objections. First, many reasonable preferences cannot be modeled via a reward function over states, such as the preference to sing a great song perfectly. Second, in the real world you are never in the same state more than once, since at the very least your memories will change, and so you can never infer a coherence violation by looking at observed behavior.

He also identifies further problems with applying coherence arguments to realistic agents. First, all behavior is optimal for the constant zero reward function. Second, any real agent will not have full information about the world, and will have to have beliefs over the world. Any definition of coherence will have to allow for multiple beliefs -- but if you allow all beliefs, then you can rationalize any behavior as based on some weird belief that the agent has. If you require the agent to be Bayesian, you can still rationalize any behavior by choosing a prior appropriately.

Rohin's opinion: I reject modeling agents as having preferences over states primarily for the first reason that Richard identified: there are many "reasonable" preferences that cannot be modeled with a reward function solely on states. However, I don't find the argument about beliefs as a free variable very convincing: I think it's reasonable to argue that a superintelligent AI system will on average have much better beliefs than us, and so anything that we could determine as a coherence violation with high confidence should be something the AI system can also determine as a coherence violation with high confidence.

Three ways that "Sufficiently optimized agents appear coherent" can be false (Wei Dai): This post talks about three ways that agents could not appear coherent, where here "coherent" means "optimizing for a reasonable goal". First, if due to distributional shift the agent is put into situations it has never encountered before, it may not act coherently. Second, we may want to "force" the agent to pretend as though compute is very expensive, even if this is not the case, in order to keep them bounded. Finally, we may explicitly try to keep the agent incoherent -- for example, population ethics has impossibility results that show that any coherent agent must bite some bullet that we don't want to bite, and so we may instead elect to keep the agent incoherent instead. (See Impossibility and Uncertainty Theorems in AI Value Alignment (AN #45).)

The Unavoidable Problem of Self-Improvement in AI and The Problem of Self-Referential Reasoning in Self-Improving AI (Jolene Creighton and Ramana Kumar): These articles introduce the thinking around AI self-improvement, and the problem of how to ensure that future, more intelligent versions of an AI system are just as safe as the original system. This cannot be easily done in the case of proof-based systems, due to Godel's incompleteness theorem. Some existing work on the problem: Botworld, Vingean reflection, and Logical induction.

Other progress in AI

Deep learning

The Lottery Ticket Hypothesis at Scale (Jonathan Frankle et al) (summarized by Richard): The lottery ticket hypothesis is the claim that "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations". This paper builds on previous work to show that winning tickets can also be found for larger networks (Resnet-50, not just Resnet-18), if those winning tickets are initialised not with their initial weights from the full network, but rather with their weights after a small amount of full-network training.

Richard's opinion: It's interesting that the lottery ticket hypothesis scales; however, this paper seems quite incremental overall.


OpenAI LP (OpenAI) (summarized by Richard): OpenAI is transitioning to a new structure, consisting of a capped-profit company (OpenAI LP) controlled by the original OpenAI nonprofit organisation. The nonprofit is still dedicated to its charter, which OpenAI LP has a legal duty to prioritise. All investors must agree that generating profits for them is a secondary goal, and that their overall returns will be capped at 100x their investment (with any excess going back to the nonprofit).

Richard's opinion: Given the high cost of salaries and compute for machine learning research, I don't find this a particularly surprising development. I'd also note that, in the context of investing in a startup, a 100x return over a timeframe of decades is not actually that high.