[AN #149]: The newsletter's editorial policy

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

In the survey I ran about a month ago, a couple of people suggested that I should clarify my editorial policy, especially since it has drifted since the newsletter was created. Note that I don’t view what I’m writing here as a policy that I am committing to. This is more like a description of how I currently make editorial decisions in practice, and it may change in the future.

I generally try to only summarize “high quality” articles. Here, "high quality" means that the article presents some conceptually new thing not previously sent in the newsletter and there is decent evidence convincing me that this new thing is true / useful / worth considering. (Yes, novelty is one of my criteria. I could imagine sending e.g. a replication of some result if I wasn’t that confident of the original result, but I usually wouldn’t.)

Throughout the history of the newsletter, when deciding whether or not to summarize an article, I have also looked for some plausible pathway by which the new knowledge might be useful to an alignment researcher. Initially, there was a pretty small set of subfields that seemed particularly relevant (especially reward learning) and I tried to cover most high-quality work within those areas. (I cover progress in ML because it seems like a good model of ML / AGI development should be very useful for alignment research.)

However, over time as I learned more, I became more excited about a large variety of subfields. There’s basically no hope for me to keep up with all of the subfields, so now I rely a lot more on quick intuitive judgments about how exciting I expect a particular paper to be, and many high quality articles that are relevant to AI alignment never get summarized. I currently still try to cover almost every new high quality paper or post that directly talks about AI alignment (as opposed to just being relevant).

Highlights are different. The main question I ask myself when deciding whether or not to highlight an article is: “Does it seem useful for most technical alignment researchers to read this?” Note that this is very different from an evaluation of how impactful or high quality the article is: a paper that talks about all the tips and tricks you need to get learning from human feedback to work in practice could be very impactful and high quality, but probably still wouldn’t be highlighted because many technical researchers don’t work with systems that learn from human feedback, and so won’t read it. On the other hand, this editorial policy probably isn’t that impactful, but it seems particularly useful for my readers to read (so that you know what you are and aren’t getting with this newsletter).

A summary is where I say things that the authors would agree with. Usually, I strip out things that the authors said that I think are wrong. The exception is when the thing I believe is wrong is a central point of the article, in which case I will put it in the summary even though I don’t believe it. Typically I will then mention the disagreement in the opinion (though this doesn’t always happen, e.g. if I’ve mentioned the disagreement in previous newsletters, or if it would be very involved to explain why I disagree). I often give authors a chance to comment on the summaries + opinions, and usually authors are happy overall but might have some fairly specific nitpicks.

An opinion is where I say things that I believe that the authors may or may not believe.

TECHNICAL AI ALIGNMENT

PROBLEMS

Low-stakes alignment (Paul Christiano) (summarized by Rohin): We often split AI alignment into two parts: outer alignment, or "finding a good reward function", and inner alignment, or "robustly optimizing that reward function". However, these are not very precise terms, and they don't form clean subproblems. In particular, for outer alignment, how good does the reward function have to be? Does it need to incentivize good behavior in all possible situations? How do you handle the no free lunch theorem? Perhaps you only need to handle the inputs in the training set? But then what specifies the behavior of the agent on new inputs?

This post proposes an operationalization of outer alignment that admits a clean subproblem: low stakes alignment. Specifically, we are given as an assumption that we don't care much about any small number of decisions that the AI makes -- only a large number of decisions, in aggregate, can have a large impact on the world. This prevents things like quickly seizing control of resources before we have a chance to react. We do not expect this assumption to be true in practice: the point here is to solve an easy subproblem in the hopes that the solution is useful for solving the hard version of the problem.

The main power of this assumption is that we no longer have to worry about distributional shift. We can simply keep collecting new data online and training the model on the new data. Any decisions it makes in the interim period could be bad, but by the low-stakes assumption, they won't be catastrophic. Thus, the primary challenge is in obtaining a good reward function that incentivizes the right behavior after the model is trained. We might also worry about whether gradient descent will successfully find a model that optimizes the reward even on the training distribution -- after all, gradient descent has no guarantees for non-convex problems -- but it seems like, to the extent that gradient descent doesn't do this, it will probably affect aligned and unaligned models equally.

Note that this subproblem is still non-trivial, and existential catastrophes still seem possible if we fail to solve it. For example, one way that the low-stakes assumption could be made true is if we had a lot of bureaucracy and safeguards that the AI system had to go through before making any big changes to the world. It still seems possible for the AI system to cause lots of trouble if none of the bureaucracy or safeguards can understand what the AI system is doing.

Rohin's opinion: I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". Probably the most salient alternative is something along the lines of "assume that the AI system is trying to optimize the true reward function". The main way that low-stakes alignment is cleaner is that it uses an assumption on the environment (an input to the problem) rather than an assumption on the AI system (an output of the problem). This seems to be a lot nicer because it is harder to "unfairly" exploit a not-too-strong assumption on an input rather than on an output. See this comment thread for more discussion.

LEARNING HUMAN INTENT

Transfer Reinforcement Learning across Homotopy Classes (Zhangjie Cao, Minae Kwon et al) (summarized by Rohin): Suppose a robot walks past a person and it chooses to pass them on the right side. Imagine that we want to make the robot instead pass on the left side, and our tool for doing this is to keep nudging the robot's trajectory until it does what we want. In this case, we're screwed: there is no way to "nudge" the trajectory from passing on the right to passing on the left, without going through a trajectory that crashes straight into the person.

The core claim of this paper is that the same sort of situation applies to finetuning for RL agents. Suppose we train an agent for one task where there is lots of data, and then we want to finetune it to another task. Let's assume that the new task is in a different homotopy class than the original task, which roughly means that you can't nudge the trajectory from the old task to the new task without going through a very low reward trajectory (in our example, crashing into the person). However, finetuning uses gradient descent, which nudges model parameters; and intuitively, a nudge to model parameters would likely correspond to a nudge to the trajectory as well. Since the new task is in a different homotopy class, this means that gradient descent would have to go through a region in which the trajectory gets very low reward. This is not the sort of thing gradient descent is likely to do, and so we should expect finetuning to fail in this case.

The authors recommend that in such cases, we first train in a simulated version of the task in which the large negative reward is removed, allowing the finetuning to "cross the gap". Once this has been done, we can then reintroduce the large negative reward through a curriculum -- either by gradually increasing the magnitude of the negative reward, or by gradually increasing the number of states that have large negative reward. They run several robotics experiments demonstrating that this approach leads to significantly faster finetuning than other methods.
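As a toy illustration of the argument above (not the paper's algorithm or experiments), consider a single trajectory parameter `x`: the person stands at `x = 0`, the old behavior passes at `x = +1`, and the new task rewards passing at `x = -1`. The reward shape, `SIGMA`, the penalty magnitude, the step counts, and all function names are assumptions made up for this sketch.

```python
import math

SIGMA = 0.3  # assumed width of the collision region around the person at x = 0


def reward_gradient(x: float, penalty: float) -> float:
    """Gradient of R(x) = -(x + 1)^2 - penalty * exp(-x^2 / (2 * SIGMA^2))."""
    task_term = -2.0 * (x + 1.0)  # pulls x toward the new goal at -1
    barrier_term = penalty * x / SIGMA**2 * math.exp(-x**2 / (2 * SIGMA**2))
    return task_term + barrier_term  # barrier pushes x away from the person


def finetune(x: float, penalty: float, steps: int = 2000, lr: float = 0.01) -> float:
    """Plain gradient ascent on the reward with a fixed collision penalty."""
    for _ in range(steps):
        x += lr * reward_gradient(x, penalty)
    return x


def finetune_with_curriculum(x: float, final_penalty: float, stages: int = 10) -> float:
    """First remove the penalty to cross the gap, then reintroduce it gradually."""
    x = finetune(x, penalty=0.0)
    for stage in range(1, stages + 1):
        x = finetune(x, penalty=final_penalty * stage / stages)
    return x


START = 1.0           # old behavior: pass on the right
FINAL_PENALTY = 20.0  # assumed magnitude of the collision penalty

direct_x = finetune(START, FINAL_PENALTY)
curriculum_x = finetune_with_curriculum(START, FINAL_PENALTY)
print(f"direct finetuning ends at x = {direct_x:.2f}")
print(f"curriculum finetuning ends at x = {curriculum_x:.2f}")
```

In this toy, direct finetuning stalls on the right-hand side of the person, while the curriculum version ends on the left-hand side. Real finetuning updates policy parameters rather than the trajectory directly, so this only illustrates the intuition.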

Rohin's opinion: This seems like an interesting point to be thinking about. The part I'm most interested in is whether it is true that small changes in the neural net parameters must lead to small changes in the resulting trajectory. It seems plausible to me that this is true for small neural nets but ends up becoming less true as neural nets become larger and data becomes more diverse. In our running example, if the neural net was implementing some decision process that considered both left and right as options, and then "chose" to go right, then it seems plausible that a small change to the weights could cause it to choose to go left instead, allowing gradient descent to switch across trajectory homotopy classes with a small nudge to model parameters.

Learning What To Do by Simulating the Past (David Lindner et al) (summarized by Rohin): Since the state of the world has already been optimized for human preferences, it can be used to infer those preferences. For example, it isn’t a coincidence that vases tend to be intact and on tables. An agent with an understanding of physics can observe that humans haven’t yet broken a particular vase, and infer that they care about vases not being broken.

Previous work (AN #45) provides an algorithm, RLSP, that can perform this type of reasoning, but it is limited to small environments with known dynamics and features. In this paper (on which I am an author), we introduce a deep variant of the algorithm, called Deep RLSP, to move past these limitations. While RLSP assumes known features, Deep RLSP learns a feature function using self-supervised learning. While RLSP computes statistics for all possible past trajectories using dynamic programming, Deep RLSP learns an inverse dynamics model and inverse policy to simulate the most likely past trajectories, which serve as a good approximation for the necessary statistics.
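The backward simulation step can be sketched roughly as follows. This is a minimal illustration, not the Deep RLSP implementation: the learned inverse policy and inverse dynamics model are replaced by hand-written stand-ins on an integer line, and every name here (`simulate_past`, `toy_inverse_policy`, `toy_inverse_dynamics`) is hypothetical.

```python
from typing import Callable, List, Tuple

State = int
Action = int


def simulate_past(
    observed_state: State,
    inverse_policy: Callable[[State], Action],
    inverse_dynamics: Callable[[State, Action], State],
    horizon: int,
) -> List[Tuple[State, Action]]:
    """Roll backward from the observed state to a plausible past trajectory."""
    trajectory: List[Tuple[State, Action]] = []
    state = observed_state
    for _ in range(horizon):
        # Guess which action most likely led to the current state...
        action = inverse_policy(state)
        # ...then guess which state that action was taken in.
        previous_state = inverse_dynamics(state, action)
        trajectory.append((previous_state, action))
        state = previous_state
    trajectory.reverse()  # return in chronological order
    return trajectory


# Toy stand-ins (assumptions): the human always took action +1,
# and the dynamics are next_state = state + action.
def toy_inverse_policy(state: State) -> Action:
    return 1


def toy_inverse_dynamics(state: State, action: Action) -> State:
    return state - action


past = simulate_past(3, toy_inverse_policy, toy_inverse_dynamics, horizon=3)
print(past)  # [(0, 1), (1, 1), (2, 1)]
```

In the actual algorithm both models would be learned and stochastic, and the simulated trajectories would feed into the reward-learning statistics. The sketch only shows the order in which the two models are queried.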

We evaluate the resulting algorithm on a variety of MuJoCo tasks, with promising results. For example, given a single state of a HalfCheetah balancing on one leg, Deep RLSP is able to learn a (noisy) policy that somewhat mimics this balancing behavior. (These results can be seen here.)

Thesis: Extracting and Using Preference Information from the State of the World

MISCELLANEOUS (ALIGNMENT)

Mundane solutions to exotic problems (Paul Christiano) (summarized by Rohin): The author’s goal is to find “mundane” or simple algorithms that solve even “exotic” problems in AI alignment. Why should we expect this is possible? If an AI system is using powerful, exotic capabilities to evade detection, shouldn’t we need powerful, exotic algorithms to fight that? The key idea here is that we can instead have a mundane algorithm that leverages the exotic capabilities of the AI system to produce an exotic oversight process. For example, we could imagine that a mundane algorithm could be used to create a question-answerer that knows everything the model knows. We could then address gradient hacking (AN #71) by asking the question “what should the loss be?” In this case, our model has an exotic capability: very strong introspective access to its own reasoning and the training process that modifies it. (This is what is needed to successfully hack gradients.) As a result, our question-answerer should be able to leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients, even if our normal hardcoded loss would not do so.

OTHER PROGRESS IN AI

DEEP LEARNING

Scaling Scaling Laws with Board Games (Andrew L. Jones) (summarized by Rohin): While we've seen scaling laws (AN #87) for compute, data, and model size, we haven't yet seen scaling laws for problem size. This paper studies this case using the board game Hex, in which difficulty can be increased by scaling up the size of the board. The author applies AlphaZero to a variety of different board sizes, model sizes, RL samples, etc., and finds that performance tends to be a logistic function of compute / samples used. The function can be characterized as follows:

1) Slope: In the linearly-increasing regime, you will need about 2× as much compute as your opponent to beat them 2/3 of the time.

2) Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.

3) Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.

These curves fit the data quite well. If the curves are fit to data from small board sizes and then used to predict results for large board sizes, their error is small.
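To make the three rules of thumb concrete, here is a small sketch. The win-probability formula is an assumption: it is one simple model (logistic in the difference of log-compute) that reproduces "2× compute wins 2/3 of the time", and it is not necessarily the exact functional form fitted in the paper. The function and constant names are hypothetical.

```python
def win_probability(compute_a: float, compute_b: float) -> float:
    """Assumed model: chance that A beats B is proportional to A's share of compute."""
    return compute_a / (compute_a + compute_b)


def compute_multiplier(factor_per_increment: float, board_size_increase: int) -> float:
    """How much more compute is needed after increasing the board size."""
    return factor_per_increment ** board_size_increase


PERFECT_PLAY_FACTOR = 7.0  # from the summary: 7x per increment in board size
TAKEOFF_FACTOR = 4.0       # from the summary: 4x per increment in board size

print(win_probability(2.0, 1.0))                   # about 0.667
print(compute_multiplier(PERFECT_PLAY_FACTOR, 2))  # 49.0
print(compute_multiplier(TAKEOFF_FACTOR, 2))       # 16.0
```

Under these numbers, going up two board sizes costs roughly 49× more compute for perfect play but only 16× more to get off the ground, so the gap between "starts improving" and "plays perfectly" widens as the board grows.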

Recall that AlphaZero uses MCTS to amplify the neural net policy. The depth of this MCTS determines how much compute is spent on each decision, both at training time and test time. The author finds that a 10× increase in training-time compute allows you to cut test-time compute by a factor of about 15 while maintaining similar performance.

NEWS

BERI Seeking New University Collaborators (Sawyer Bernath) (summarized by Rohin): BERI is seeking applications for new collaborators. They offer free services to university groups. If you’re a member of a research group, or an individual researcher, working on long-termist projects, you can apply here. Applications are due June 20th.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.