Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Empirical Observations of Objective Robustness Failures (Jack Koch, Lauro Langosco et al) (summarized by Rohin): This paper presents empirical demonstrations of failures of objective robustness. We've seen objective robustness (AN #66) / inner alignment (AN #111) / mesa optimization (AN #58) before; if you aren't familiar with it, I recommend reading one of those articles (or their summaries) before continuing. This paper studies these failures in the context of deep reinforcement learning and shows these failures in three cases:

1. In CoinRun (AN #79), if you train an agent normally (where the rewarding coin is always at the rightmost end of the level), the agent learns to move to the right. If you randomize the coin location at test time, the agent will ignore it and instead run to the rightmost end of the level and jump. It still competently avoids obstacles and enemies: its capabilities are robust, but its objective is not. Using the interpretability tools from Understanding RL Vision (AN #128), we find that the policy and value function pay much more attention to the right wall than to the coin.

2. Consider an agent trained to navigate to a cheese that is always placed in the upper right corner of a maze. When the location of the cheese is randomized at test time, the agent continues to go to the upper right corner. Alternatively, if the agent is trained to go to a yellow gem during training time, and at test time it is presented with a yellow star or a red gem, it will navigate towards the yellow star.

3. In the keys and chest environment (AN #67), an agent trained in a setting where keys are rare will later collect too many keys once keys become commonplace.

Read more: Paper: Objective Robustness in Deep Reinforcement Learning

Rohin's opinion: I'm glad that these experiments have finally been run and we have actual empirical examples of the phenomenon -- I especially like the CoinRun example, since it is particularly clear that in this case the capabilities are robust but the objective is not.



Environmental Structure Can Cause Instrumental Convergence (Alex Turner) (summarized by Rohin): We have previously seen (AN #78) that if you are given an optimal policy for some reward function, but are very uncertain about that reward function (specifically, your belief assigns reward to states in an iid manner), you should expect that the optimal policy will navigate towards states with higher power in some but not all situations. This post generalizes this to non-iid reward distributions: specifically, that "at least half" of reward distributions will seek power (in particular circumstances).

The new results depend on the notion of environment symmetries, arising in states in which an action a2 leads to “more options” than another action a1 (we'll assume that a1 and a2 lead to different, disjoint parts of the state space). Specifically, a1 leads to a part of the state space that is isomorphic to a subgraph of the part of the state space that a2 leads to. For example, a1 might be going to a store where you can buy books or video games, and a2 might be going to a supermarket where you can buy food, plants, cleaning supplies, tools, etc. Then, one subgraph isomorphism would be the one that maps “local store” to “supermarket”, “books” to “food”, and “video games” to “plants”. Another such isomorphism would instead map “video games” to “tools”, while keeping the rest the same.

Now this alone doesn’t mean that an optimal policy is definitely going to take a2. Maybe you really want to buy books, so a1 is the optimal choice! But for every reward function for which a1 is optimal, we can construct another reward function for which a2 is optimal, by mapping it through the isomorphism. So, if your first reward function highly valued books, this would now construct a new reward function that highly values food, and now a2 will be optimal. Thus, at least half of the possible reward functions (or distributions over reward functions) will prefer a2 over a1. Thus, in cases where these isomorphisms exist, optimal policies will tend to seek more options (which in turn means they are seeking power).

If the agent optimizes average reward (i.e. gamma is 1), then we can extend this analysis out in time, to the final cycle that an agent ends up in. (It must end up in a cycle because by assumption the state space is finite.) Any given cycle would only count as one “option”, so ending up in any given cycle is not very likely (using a similar argument of constructing other rewards). If shutdown is modeled as a state with a single self-loop and no other actions, then this implies that optimal policies will tend to avoid entering the shutdown state.

We’ve been saying “we can construct this other reward function under which the power-seeking action is optimal”. An important caveat is that maybe we know that this other reward function is very unlikely. For example, maybe we really do just know that we’re going to like books and not care much about food, and so the argument “well, we can map the book-loving reward to a food-loving reward” isn’t that interesting, because we assign high probability to the first and low probability to the second. We can’t rule this out for what humans actually do in practice, but it isn’t as simple as “a simplicity prior would do the right thing” -- for any non-power-seeking reward function, we can create a power-seeking reward function with only slightly higher complexity by having a program that searches for a subgraph isomorphism and then applies it to the non-power-seeking reward function to create a power-seeking version.

Another major caveat is that this all relies on the existence of these isomorphisms / symmetries in the environment. It is still a matter of debate whether good models of the environment will exhibit such isomorphisms.


Discussion: Objective Robustness and Inner Alignment Terminology (Jack Koch and Lauro Langosco) (summarized by Rohin): Mesa optimization and inner alignment have become pretty important topics in AI alignment since the 2019 paper (AN #58) on it was published. However, there are two quite different interpretations of inner alignment concerns:

1. Objective-focused: This approach considers structural properties of the computation executed by the learned model. In particular, the risk argument is that sufficiently capable learned models will be executing some form of optimization algorithm (such as a search algorithm), guided by an explicit objective called the mesa-objective, and this mesa-objective may not be identical to the base objective (though it should incentivize similar behavior on the training distribution), which can then lead to bad behavior out of distribution.

The natural decomposition is then to separate alignment into two problems: first, how do we specify an outer (base) objective that incentivizes good behavior in all situations that the model will ever encounter; and second, how do we ensure that the mesa objective equals the base objective.

2. Generalization-focused: This approach instead talks about the behavior of the model out of distribution. The risk argument is that sufficiently capable learned models, when running out of distribution, will take actions that are still competent and high impact, but that are not targeted towards accomplishing what we want: in other words, their capabilities generalize, but their objectives do not.

Alignment can then be decomposed into two problems: first, how do we get the behavior that we want on the training distribution, and second, how do we ensure the model never behaves catastrophically on any input.

Rohin's opinion: I strongly prefer the second framing, though I’ll note that this is not independent evidence -- the description of the second framing in the post comes from some of my presentations and comments and conversations with the authors. The post describes some of the reasons for this; I recommend reading through it if you’re interested in inner alignment.


Frequent arguments about alignment (John Schulman) (summarized by Rohin): This post outlines three AI alignment skeptic positions and corresponding responses from an advocate. Note that while the author tends to agree with the advocate’s view, they also believe that the skeptic makes good points.

1. Skeptic's position:The alignment problem gets easier as models get smarter, since they start to learn the difference between, say, human smiles and human well-being. So all we need to do is to prompt them appropriately, e.g. by setting up a conversation with “a wise and benevolent AI advisor”.

Advocate's response: We can do a lot better than prompting: in fact, a recent paper (AN #152) showed that prompting is effectively (poor) finetuning, so we might as well finetune. Separately from prompting itself, alignment does get easier in some ways as models get smarter, but it also gets harder: for example, smarter models will game their reward functions in more unexpected and clever ways.

2. What’s the difference between alignment and capabilities anyway? Something like RL from human feedback for summarization (AN #116) could equally well have been motivated through a focus on AI products.

Response: While there’s certainly overlap, alignment research is usually not the lowest-hanging fruit for building products. So it’s useful to have alignment-focused teams that can champion the work even when it doesn’t provide the best near-term ROI.

3. We can’t make useful progress on aligning superhuman models until we actually have superhuman models to study. Why not wait until those are available?

Response: If we don’t start now, then in the short term, companies will deploy products that optimize simple objectives like revenue and engagement, which could be improved by alignment work. In the long term, it is plausible that alignment is very hard, such that we need many conceptual advances that we need to start on now to have them ready by the point that we feel obligated to use powerful AI systems. In addition, empirically there seem to be many alignment approaches that aren’t bottlenecked by the capabilities of models -- see for example this post (AN #141).

Rohin's opinion: I generally agree with these positions and responses, and in particular I’m especially happy about the arguments being specific to the actual models we use today, which grounds out the discussion a lot more and makes it easier to make progress. On the second point in particular, I’d also say that empirically, product-focused people don’t do e.g. RL for human feedback, even if it could be motivated that way.



Decision Transformer: Reinforcement Learning via Sequence Modeling (Lili Chen et al) (summarized by Zach): In this paper, the authors abstract reinforcement learning (RL) as a sequence modeling problem. The authors are inspired by the rise of powerful sequence models (i.e transformers) in natural language processing. Specifically, they hypothesize that when models are trained to predict the expected reward-to-go alongside state and action sequences, the transformer architecture can be used to do RL.

As an example, consider finding the shortest path between two vertices on a graph. We could start by recording random walks with their expected returns. Once we have enough data, we could condition on paths such that the expected return-to-go (length remaining) is low. This would effectively return shortest paths without the explicit need for optimization.

This framework works well in practice and is competitive with state-of-the-art model-free offline RL baselines on Atari and OpenAI gym. The authors also carry out ablation studies to determine if the sequence modeler is just doing imitation learning on a subset of the data with high returns. This turns out not to be the case, indicating that the approach effectively uses the entire dataset.

Zach's opinion: It's worth highlighting that this can be seen as an extension to Upside-Down RL (AN #83). In that report, the goal is also to produce actions consistent with the desired reward. This paper extends that line of work by using transformers to expand context beyond the immediate state which aids in long-term credit assignment. The authors claim this helps via self-attention, but it seems much more likely that this effect comes from using the return-to-go as in Upside-Down RL.

Reinforcement Learning as One Big Sequence Modeling Problem (Michael Janner et al) (summarized by Zach): Typically, RL is concerned with estimating policies that utilize immediate state information to produce high returns. However, we can also view RL as concerned with predicting sequences of actions that lead to high returns. From this perspective, it's natural to wonder if sequence modelers that work well in other domains, such as transformers in NLP, would work well for RL. This paper tests this hypothesis and demonstrates the utility of transformers in RL for a variety of problem settings.

As with the last paper, the authors train the model to predict the reward-to-go. In place of trajectory optimizers, the authors make use of beam search as a planning algorithm. To do RL, rather than maximize the log-probability of potential sequences, the authors replace the log-probability search heuristic with the reward-to-go. In experiments, transformers that maintain the log-probability can imitate expert policies to high fidelity. Visually, the resulting policies are indistinguishable from that of the expert.

The authors also show that their method is competitive on the standard OpenAI gym benchmarks. Finally, the authors look at the attention patterns of the trained models. They identify two patterns: the first links variables in a strictly Markovian fashion and the other links dimensions of the state-action variables across time. Interestingly, action variables are more strongly coupled to past actions than past state variables. This suggests a connection to action-smoothing proposed previously for deep-dynamics models.

Zach's opinion: It's fun to note that this paper came out of the same lab as Decision Transformer (the paper summrized above) only a day after. In contrast to Decision Transformer, this paper focuses more on the finer technical details of transformers by utilizing beam-search and studying the attentional patterns of the resulting models. The figures in this paper are informative. I feel I did learn something about why transformers seem to work in this domain. While the idea of goal-conditioned RL itself isn't novel, showing that 'off-the-shelf' transformers can do well in this domain is impressive.


You can now apply to EA Funds anytime! (LTFF & EAIF only) (Jonas Vollmer) (summarized by Rohin): The Long-Term Future Fund (LTFF) has funding available for people working on AI alignment. I’m told that the LTFF is constrained by high-quality applications, and that applying only takes a few hours, so it is probably best to err on the side of applying. The LTFF has removed its previous round-based system and now accepts applications anytime.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment