The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Better priors as a safety problem and Learning the prior (Paul Christiano) (summarized by Rohin): Any machine learning algorithm (including neural nets) has some inductive bias, which can be thought of as its “prior” over what the data it receives will look like. In the case of neural nets (and any other general ML algorithm to date), this prior is significantly worse than human priors, since it does not encode e.g. causal reasoning or logic. Even if we restrict ourselves to priors that do not depend on previously seen data, we would still want to update on facts like “I think therefore I am”. With a better prior, our ML models would be able to learn more sample-efficiently. While so far this is a capabilities problem, there are two main ways in which it affects alignment.

First, as argued in Inaccessible information (AN #104), the regular neural net prior will learn models which can predict accessible information. However, our goals depend on inaccessible information, and so we would have to do some “extra work” in order to extract the inaccessible information from the learned models in order to build agents that do what we want. This leads to a competitiveness hit, relative to agents whose goals depend only on accessible information, and so during training we might expect to consistently get agents whose goals depend on accessible information instead of the goals we actually want.

Second, since the regular neural net prior is so weak, there is an incentive to learn a better prior, and then have that better prior perform the task. This is effectively an incentive for the neural net to learn a mesa optimizer (AN #58), which need not be aligned with us, and so would generalize differently than we would, potentially catastrophically.

Let’s formalize this a bit more. We have some evidence about the world, given by a dataset D = {(x1, y1), (x2, y2), ...} (we assume that it’s a prediction task -- note that most self-supervised tasks can be written in this form). We will later need to make predictions on the dataset D' = {x1', x2', …}, which may be from a “different distribution” than D (e.g. D might be about the past, while D' is about the future). We would like to use D to learn some object Z that serves as a “prior”, such that we can then use Z to make good predictions on D'.

The standard approach, which we might call the “neural net prior”, is to train a model to predict y from x using the dataset D, and then apply that model directly to D', hoping that it transfers correctly. We can inject some human knowledge by finetuning the model using human predictions on D', that is, by training the model on {(x1', H(x1')), (x2', H(x2')), …}. However, this does not allow H to update their prior based on the dataset D. (We assume that H cannot simply read through all of D, since D is massive.)

What we’d really like is some way to get the predictions H would make if they could update on dataset D. For H, we’ll imagine that a prior Z is given by some text describing e.g. rules of logic, how to extrapolate trends, some background facts about the world, empirical estimates of key quantities, etc. I’m now going to talk about priors over the prior Z, so to avoid confusion I’ll now call an individual Z a “background model”.

The key idea here is to structure the reasoning in a particular way: H has a prior over background models Z, and then given Z, H’s predictions for any given x_i are independent of all of the other (x, y) pairs. In other words, once you’ve fixed your background model of the world, your prediction of y_i doesn’t depend on the value of y_j for some other x_j. Or to explain it a third way, this is like having a set of hypotheses {Z}, and then updating on each element of D one by one using Bayes Rule. In that case, the log posterior of a particular background model Z is given by log Prior(Z) + sum_i log P(y_i | x_i, Z) (neglecting a normalization constant).
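Written out in the summary's notation, this is just Bayes' rule under the conditional-independence assumption. The following is a sketch rather than a quote from the original posts; the constant is the log of the normalizer, which does not depend on Z and so does not affect which background model scores highest.

```latex
\begin{align*}
P(Z \mid D) &\propto \mathrm{Prior}(Z) \prod_i P(y_i \mid x_i, Z) \\
\log P(Z \mid D) &= \log \mathrm{Prior}(Z) + \sum_i \log P(y_i \mid x_i, Z) + \text{const}
\end{align*}
```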

The nice thing about this is that the individual terms Prior(Z) and P(y_i | x_i, Z) are all quantities that humans can estimate, since they don’t require the human to look at the entire dataset D. In particular, we can learn Prior(Z) by presenting humans with a background model, and having them evaluate how likely it is that the background model is accurate. Similarly, P(y_i | x_i, Z) simply requires us to have humans predict y_i under the assumption that the background facts in Z are accurate. So, we can learn models for both of these using neural nets. We can then find the best background model Z-best by optimizing the equation above, representing what H would think was the most likely background model after updating on all of D. We can then learn a model for P(y_i' | x_i', Z-best) by training on human predictions of y_i' given access to Z-best.
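As a toy illustration of the selection step, here is a minimal Python sketch that scores a handful of candidate background models with log Prior(Z) + sum_i log P(y_i | x_i, Z) and picks the best one. Everything concrete in it is an assumption made for illustration: the candidates are just numbers standing for "the probability that y is 1", and `toy_log_prior` and `toy_log_likelihood` are hypothetical stand-ins for the learned models of human judgments. In the actual proposal Z would be text and the search would be an optimization, not an enumeration.

```python
import math


def log_posterior(z, data, log_prior, log_likelihood):
    """Unnormalized log posterior: log Prior(Z) + sum_i log P(y_i | x_i, Z)."""
    return log_prior(z) + sum(log_likelihood(y, x, z) for x, y in data)


def best_background_model(candidates, data, log_prior, log_likelihood):
    """Return the candidate Z with the highest unnormalized log posterior."""
    return max(
        candidates,
        key=lambda z: log_posterior(z, data, log_prior, log_likelihood),
    )


# Hypothetical toy setup: each "background model" Z is a claimed
# probability that y == 1, independent of x.
CANDIDATES = [0.2, 0.5, 0.8]


def toy_log_prior(z):
    # Stand-in for a learned model of the human's Prior(Z): uniform here.
    return math.log(1.0 / len(CANDIDATES))


def toy_log_likelihood(y, x, z):
    # Stand-in for a learned model of the human's P(y | x, Z).
    return math.log(z if y == 1 else 1.0 - z)


# Toy dataset D of (x, y) pairs: four positive labels and one negative.
DATA = [(0, 1), (1, 1), (2, 0), (3, 1), (4, 1)]

Z_BEST = best_background_model(
    CANDIDATES, DATA, toy_log_prior, toy_log_likelihood
)
print(Z_BEST)
```

Each term is evaluated on a single (x, y) pair or on Z alone, which is the property that lets a human (or a model of one) supply it without reading all of D.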

This of course only gets us to human performance, which requires relatively small Z. If we want to have large background models allowing for superhuman performance, we can use iterated amplification and debate to learn Prior(Z) and P(y | x, Z). There is some subtlety about how to represent Z that I won’t go into here.

Rohin's opinion: It seems to me like solving this problem has two main benefits. First, the model our AI system learns from data (i.e. Z-best) is interpretable, and in particular we should be able to extract the previously inaccessible information that is relevant to our goals (which helps us build AI systems that actually pursue those goals). Second, AI systems built in this way are incentivized to generalize in the same way that humans do: in the scheme above, we learn from one distribution D, and then predict on a new distribution D', but every model learned with a neural net is only used on the same distribution it was trained on.

Of course, while the AI system is incentivized to generalize the way humans do, that does not mean it will generalize as humans do -- it is still possible that the AI system internally “wants” to gain power, and only instrumentally answers questions the way humans would answer them. So inner alignment is still a potential issue. It seems possible to me that whatever techniques we use for dealing with inner alignment will also deal with the problems of unsafe priors as a side effect, in which case we may not end up needing to implement human-like priors. (As the post notes, it may be much more difficult to use this approach than to do the standard “neural net prior” approach described above, so it would be nice to avoid it.)



Alignment proposals and complexity classes (Evan Hubinger) (summarized by Rohin): The original debate (AN #5) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) algorithm in P. Intuitively, this is an illustration of how the mechanism of debate can take a weak ability (the ability to solve arbitrary problems in P) and amplify it into a stronger ability (the ability to solve arbitrary problems in PSPACE). One would hope that similarly, debate would allow us to amplify a human’s problem-solving ability into a much stronger problem-solving ability.

This post applies this technique to several other alignment proposals. In particular, for each proposal, we assume that the “human” can be an arbitrary polynomial-time algorithm, and the AI models are optimal with respect to their loss functions, and we ask which problems we can solve using these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while market making (AN #108) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can look at objects of an arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of decidable problems).

Rohin's opinion: In practice our models are not going to reach the optimal loss, and humans won’t solve arbitrary polynomial-time problems, so these theorems won’t directly apply to reality. Nonetheless, this does seem like a worthwhile check to do -- it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won’t actually hold in reality. I have much more faith in a deep RL algorithm that started from one with a proof of convergence and then was modified based on empirical considerations.

How should AI debate be judged? (Abram Demski) (summarized by Rohin): Debate (AN #5) requires a human judge to decide which of two AI debaters should win the debate. How should the judge make this decision? The discussion on this page delves into this question in some depth.


What counts as defection? (Alex Turner) (summarized by Rohin): We often talk about cooperating and defecting in general-sum games. This post proposes that we say that a player P has defected against a coalition C (that includes P) currently playing a strategy S when P deviates from the strategy S in a way that increases his or her own personal utility, but decreases the (weighted) average utility of the coalition. It shows that this definition has several nice intuitive properties: it implies that defection cannot exist in common-payoff games, uniformly weighted constant-sum games, or arbitrary games with a Nash equilibrium strategy. A Pareto improvement can also never be defection. It then goes on to show the opportunity for defection can exist in the Prisoner’s dilemma, Stag hunt, and Chicken (whether it exists depends on the specific payoff matrices).
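The definition can be checked mechanically on small matrix games. The Python sketch below is an illustration under simplifying assumptions, not code from the post: the coalition is taken to be all players, weights default to uniform, and the payoff matrices (`PRISONERS_DILEMMA`, `COMMON_PAYOFF`) are hypothetical examples chosen for the demonstration.

```python
def is_defection(payoffs, profile, player, deviation, weights=None):
    """Check whether `player` switching to `deviation` counts as defection.

    payoffs maps a tuple of actions (one per player) to a tuple of utilities.
    The deviation is a defection if it raises the deviator's own utility
    while lowering the weighted average utility of the coalition
    (assumed here to be all players).
    """
    n = len(profile)
    if weights is None:
        weights = [1.0 / n] * n
    new_profile = tuple(
        deviation if i == player else action
        for i, action in enumerate(profile)
    )
    old_u, new_u = payoffs[profile], payoffs[new_profile]

    def weighted_average(utilities):
        return sum(w * u for w, u in zip(weights, utilities))

    gains_personally = new_u[player] > old_u[player]
    hurts_coalition = weighted_average(new_u) < weighted_average(old_u)
    return gains_personally and hurts_coalition


# Hypothetical payoff matrices; "C" = cooperate, "D" = defect.
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
COMMON_PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (1, 1),
    ("D", "C"): (1, 1),
    ("D", "D"): (0, 0),
}

print(is_defection(PRISONERS_DILEMMA, ("C", "C"), 0, "D"))
print(is_defection(COMMON_PAYOFF, ("C", "C"), 0, "D"))
```

In the common-payoff game every player's utility equals the average, so no deviation can raise one while lowering the other, which matches the property stated in the summary.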


Environments as a bottleneck in AGI development (Richard Ngo) (summarized by Rohin): Models built using deep learning are a function of the learning algorithm, the architecture, and the task / environment / dataset. While a lot of effort is spent on analyzing learning algorithms and architectures, not much is spent on the environment. This post asks how important it is to design a good environment in order to build AGI.

It considers two possibilities: the “easy paths hypothesis” that many environments would incentivize AGI, and the “hard paths hypothesis” that such environments are rare. (Note that “hard paths” can be true even if an AGI would be optimal for most environments: if AGI would be optimal, but there is no path in the loss landscape to AGI that is steeper than other paths in the loss landscape, then we probably wouldn’t find AGI in that environment.)

The main argument for “hard paths” is to look at the history of AI research, where we often trained agents on tasks that were “hallmarks of intelligence” (like chess) and then found that the resulting systems were narrowly good at the particular task, but were not generally intelligent. You might think that it can’t be too hard, since our environment led to the creation of general intelligence (us), but this is subject to anthropic bias: only worlds with general intelligence would ask whether environments incentivize general intelligence, so they will always observe that their environment is an example that incentivizes general intelligence. It can serve as a proof of existence, but not as an indicator that it is particularly likely.

Rohin's opinion: I think this is an important question for AI timelines, and the plausibility of “hard paths” is one of the central reasons that my timelines are longer than those of others who work on deep learning-based AGI. However, GPT-3 (AN #102) demonstrates quite a lot of generality, so recently I’ve started putting more weight on “actually, designing the environment won’t be too hard”, which has correspondingly shortened my timelines.


Talk: Key Issues In Near-Term AI Safety Research (Aryeh Englander) (summarized by Rohin): This talk points out synergies between long-term AI safety and the existing fields of assured autonomy, safety engineering, and testing, evaluation, verification and validation (TEV&V), primarily by showing how they fit into and expand DeepMind's framework of specification, robustness and assurance (AN #26).



Using Selective Attention in Reinforcement Learning Agents (Yujin Tang et al) (summarized by Sudhanshu): Recently winning a best paper award at GECCO 2020, this work marks a leap forward in the performance capabilities learned by small agents via evolutionary methods. Specifically, it shows that by jointly learning which small fraction of input to attend to, agents with only thousands of free parameters can be trained by an evolutionary strategy to achieve state-of-the-art performance in vision-based control tasks.

The key pieces include self-attention over input patches, non-differentiable top-K patch selection that effects 'inattentional blindness', and training via CMA-ES. By design, the agent is interpretable, as the top-K patches that are selected can be examined. Empirically, the agent has 1000x fewer weights than a competing neural architecture, and the method shows robustness to changes in task-irrelevant inputs, as the agent learns to focus only on task-relevant patches.
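To make the patch-selection step concrete, here is a rough pure-Python sketch. It is not the authors' implementation: it assumes the query and key vectors for each patch have already been computed (the names `queries` and `keys` and the toy features are hypothetical), and it assumes patch importance is obtained by summing each patch's column of the softmaxed attention matrix before keeping the K highest-scoring patches. The hard top-K step is what makes the selection non-differentiable, which is why a gradient-free optimizer such as CMA-ES is a natural fit.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_patches(queries, keys, k):
    """Return (indices of the k most attended patches, importance scores).

    queries and keys are lists of equal-length vectors, one per patch.
    Row i of the attention matrix is how strongly patch i attends to each
    patch; a patch's importance is the total attention it receives.
    """
    d = len(queries[0])
    scores = [
        [sum(qv * kv for qv, kv in zip(q, key)) / math.sqrt(d) for key in keys]
        for q in queries
    ]
    attention = [softmax(row) for row in scores]
    importance = [
        sum(row[j] for row in attention) for j in range(len(keys))
    ]
    ranked = sorted(
        range(len(keys)), key=lambda j: importance[j], reverse=True
    )
    return ranked[:k], importance


# Hypothetical toy input: four patches, two of which carry signal.
FEATURES = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
SELECTED, IMPORTANCE = top_k_patches(FEATURES, FEATURES, k=2)
print(sorted(SELECTED))
```

Only the selected patch indices would be passed on to the controller, so inspecting `SELECTED` shows directly what the agent is looking at.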

Read more: Paper: Neuroevolution of Self-Interpretable Agents

Sudhanshu's opinion: The parallelism afforded by evolutionary methods and genetic algorithms might be valuable in an environment where weak compute is plentiful, so it's exciting to see evidence of such methods besting GPU-hungry deep neural networks. However, I wonder how this would do on sparse reward tasks, where the fitness function is almost always uninformative. Finally, while it generalises to settings where there are task-irrelevant distractions, its deliberately sharp self-attention likely leaves it vulnerable to even simple adversarial attacks.

Improving Sample Efficiency in Model-Free Reinforcement Learning from Images (Denis Yarats et al) (summarized by Flo): Sample efficiency in RL can be improved by using off-policy methods that can reuse the same sample multiple times and by using self-supervised auxiliary losses that help with representation learning, especially when rewards are sparse. This work combines both approaches by proposing to learn a latent state representation using an autoencoder while jointly training an agent on that latent representation using SAC (AN #42). Previous work in the on-policy case shows a positive effect from propagating Actor-Critic gradients through the encoder to improve the usefulness of the encoding for policy learning. However, this destabilizes training in the off-policy case, as changing the encoding to facilitate the actor also changes the Q-function estimate, which in turn changes the actor's goal and can introduce nonstationarity. This problem is circumvented by only propagating the Q-network's gradients through the encoder while blocking the actor's gradients.

The method strongly outperforms SAC trained on pixels. It also matches the previous state-of-the-art set by model-based approaches on an image-based continuous control task and outperforms them for noisy observations (as these make dynamics models hard to learn). The authors also find that the learnt encodings generalize between tasks to some extent and that reconstructing the true environment state is easier using their latent representation than using a representation obtained by training SAC on pixels directly.

Flo's opinion: Methods like this that can benefit from seeing a lot of action-independent environment observations might be quite important for applying RL to the real world, as this type of data is a lot cheaper to generate. For example, we can easily generate a ton of observations from a factory by equipping workers with cameras, but state-action-next-state triples from a robot interacting with the factory are very costly to obtain.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
