This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano.

The goal of this post is to present my thoughts on some of the sorts of experiments that might be able to be done now that could shed light on the inner alignment problem. I’ve been doing a lot of thinking about inner alignment from a theoretical perspective recently that I’m pretty excited about, but at the same time, I think there’s also a lot of concrete experimental work that can be done in this space as well. That being said, this post is mostly just a brain dump; I expect a lot of additional work will have to be put in to actually take any of these proposals across the finish line.

If you’re interested in working on any of these proposals, however, feel free to just go ahead and take it on—you don’t need my permission to do so![1] That being said, if you’d like to talk to me about one of them—which I would love to do if you’re thinking of seriously working on one of these ideas—please do reach out to me either in the comments here or by sending me an email at

Concrete proposals

Reward side-channels

Proposal: Train an RL agent with access to its previous step reward as part of its observation. Then, at test time, modify the observed reward. Measure to what extent the agent continues optimizing the original reward versus switches to optimizing the new observed reward. Compare to the situation where the agent is not given the reward at all and the situation where the agent is given the next step reward (for each action in a discrete action space) rather than the previous step reward. As an implementation detail, I think the RL agent should include either LSTMs or Transformers to enable it to keep track of its multi-step return if it wants to.

Motivation: One important question for understanding the likelihood of deceptive or corrigible alignment vs. robust alignment is to what extent models tend to learn their goals internally vs. via reference to things in their environment. This experiment directly attacks that question by asking whether/to what extent an RL agent will learn to optimize a reward signal in its environment. This is relevant both for understanding how to train corrigibility as well as how to avoid deceptive alignment.

Extensions: Add noise to the observed reward signal and/or try replacing the observed reward signal with some function of the reward instead such as a randomly initialized neural network.

Cross-episodic objectives

Proposal: Train an RL agent in an environment that contains a side-channel for boosting the agent’s reward in the next episode. Measure to what extent the agent takes advantage of it. Try different population-based training approaches. Measure again.

Motivation: Safety approaches like amplification and debate depend heavily on agents doing solely myopic optimization, whereas approaches like IRL and reward modeling depend on long-term forward-looking optimization. Thus, understanding the conditions under which agents exploit non-myopic reward side channels could be very useful for shedding some light on the best training techniques to use for different alignment approaches.

Extensions: Instead of getting a cross-episodic objective via population-based training, try just giving the model information that makes it obvious that this is possible such that even a naive planning algorithm would be able to deduce that it can get extra reward across episodes this way. As an implementation detail, I think you will need to use some sort of optimization-as-a-layer to make this work.

Objective unidentifiability

Proposal: Train an RL agent in an environment with multiple simple, obvious objectives all of which would equally explain the true reward. Test in environments that distinguish between the different possible objectives and determine the situations under which models tend towards some objectives versus others. In particular, look for situations in which the model learns some proxy which it pursues competently off-distribution but which has poor off-distribution performance on the true reward. As an implementation detail, I expect you might need to use some sort of optimization-as-a-layer for this to work.

Motivation: Knowing what sorts of proxies models tend to use in different situations would enable us to have a better understanding of the most likely forms of pseudo-alignment and thus the best ways to push models towards robust alignment instead. Furthermore, being able to clearly demonstrate proxy alignment could help in making the case for inner alignment being a real issue.

Extensions: Try changing inductive biases and/or model capacity and see if it changes the types of proxies that the model uses.

Zero-shot objectives

Proposal: Set up a system that enables a language model to take actions in an environment to optimize some reward. Do IRL on the resulting behavior and inspect the objective that results. Find some way to measure how coherent it is. Compare to an RL agent trained directly on the reward. As an implementation detail, you should focus on environments where a language model might actually be able to perform well. (Edit: I think my explanation here was a bit too general, so here's an example of what I mean. Take a language model and see if you can get it to perform tasks which traditionally require RL/search/optimization and see if it can do them. For example, can GPT-2 play Chess? What about a new game not in the training data that you give it the rules to? If these sorts of things work, then that seems to imply that your model must have learned some sort of optimization algorithm, if only some sort of minimax algorithm for searching for the best moves in a game.)

Motivation: It is possible that the best model that maximizes predictive accuracy in a language context might be—at least to some extent—goal-directed. Though current language models are likely not advanced enough to show true goal-directed behavior, it might at least be possible to see the beginnings of coherent objectives.

Extensions: Experiment with different ways to encode RL observations and actions as text and see how consistent the resulting IRL objectives are.

Robust reward learning

Proposal: Train a reward-predictive model-based RL agent (e.g. an Imagination-based Planner) and compare the resulting objective to the actual reward. Train again with adversarial training on inputs that produce maximally differing reward estimates and compare again.

Motivation: Useful for testing the ability of adversarial training to resolve reward unidentifiability as well as providing information on to what extent adversarial training for aligning the reward model produces aligned actions.

Extensions: Try relaxing the adversary to output pseudo-inputs (e.g. activations) that would lead to maximally differing assessments instead of inputs and backpropagate on those instead.

Other thoughts

This section is just a list of all the different possible testable hypotheses I came up with when thinking about this—including those that made it into the above list of concrete proposals and those that didn’t—for the purpose of helping others generate even more possible experiments in this space.

  • To what extent can you learn complex inference-time search algorithms? Could you structure an architecture that would learn MCTS organically? How artificial does such an architecture have to be? How do factors like environment diversity, inductive biases, algorithmic range, statefulness, and hard-coded optimization impact this?
  • Does modeling search algorithms produce search algorithms? If an RL agent is in an environment with another agent that is doing some hard-coded search and it has to model that agent, does it learn to implement search itself?
  • What happens if you let a model not have constant computational complexity? Can it learn search better?
  • Can you learn optimization/search algorithms even in non-RL settings? Can you get a purely predictive model to perform well zero-shot on an RL task?

Reward unidentifiability

  • How do RL agents behave in environments with unidentifiable goals? What sorts of proxies tend to be favored over others? How do different inductive biases affect this? What about adversarial training? Can it be empirically demonstrated that most learned objectives are pseudo-aligned?
  • Can you definitively distinguish between capability generalization and objective generalization? What if you train a planner and then test it off-distribution?
  • Can you distinguish between the different forms of proxy alignment? Can you get an instrumentally pseudo-aligned model? Can you replicate the cleaning robot examples from Risks from Learned Optimization in gridworlds?
  • Can you produce approximate alignment if you constrain model capacity?
  • What about suboptimality alignment? Can you create an environment with simple and complex strategies with different alignment properties and get a model to switch between them at test time?
  • How does using a learned reward model vs. learned policy affect unidentifiability concerns? Can you distinguish between and disincentivize misaligned objectives if you have access to a learned reward model? What about if you just have a Q function?
  • Can you build a meta-learner that optimizes for robustness? How does meta-learning help with building robust models? What about one that minimizes time complexity?
  • Can you do adversarial training with pseudo-inputs or other relaxations?

Modeling of the training process

  • What happens if you do RL where you give the agent access to the reward signal through a side channel in its observation? How does it affect robustness? What if it has to work to get the leaked info? What if it’s noisy? How does using architectures with more planning affect this? How do inductive biases affect this?
  • Can you get a system to model its own training process? Just the objective? Can you apply sufficient inductive biases with a complex enough objective that it has to use information in the environment to figure it out rather than learn it directly?
  • Can you get a model to defect off-distribution in a way that involves no defection on-distribution?
  • To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?

  1. Though please do say that you got the idea from here and let me know about any results that you get. ↩︎

New Comment
12 comments, sorted by Click to highlight new comments since:

Train an RL agent with access to its previous step reward as part of its observation.

This is making me notice a terminological ambiguity where sometimes "RL agent" refers to a model/policy trained by a reinforcement learning algorithm (such as REINFORCE) like you're doing here, and sometimes it refers to an agent that maximizes expected reward (given as an input), such as AIXI, like in Daniel Dewey's Learning What to Value, and a "RL agent" in the first sense is not necessarily a "RL agent" in the second sense.

To disambiguate, it seems a good idea to call the former kind of agent something like "RL-trained agent" and the second kind of agent "reward-maximizing agent" or "reward-maximizer" for short. Then we can say things like, "If a RL-trained agent is not given direct access to its step rewards during training, it seems less likely to become a reward-maximizer." Any thoughts on this suggestion? (I'll probably make a post about this later, but thought I'd run it by you and any others who sees this comment for a sanity check first.)

When I use the term "RL agent," I always mean an agent trained via RL. The other usage just seems confused to me in that it seems to be assuming that if you use RL you'll get an agent which is "trying" to maximize its reward, which is not necessarily the case. "Reward-maximizer" seems like a much better term to describe that situation.

When I use the term “RL agent,” I always mean an agent trained via RL.

I think the problem with this usage is that "RL agent" originally meant something like "an agent designed to solve a RL problem" where "RL problem" is something like "a class of problems with the central example being MDP". I think it's just not a well-defined term at this point, and if you Google it, you get plenty of results that say things like "the goal of our RL agent is to maximize the expected cumulative reward", or "AIXI is a reinforcement learning agent". I guess this is fine for AI capabilities work but really confusing for AI safety work.

So, consider switching to "RL-trained agent" for greater clarity (unless someone has a better suggestion)? ETA: Maybe "reinforcement trained agent"?

What do you think of using AIXI.js as a testbed like does?

I haven't used it myself, but it seems like a reasonably good platform if you want to test reward tampering stuff. DeepMind actually did a similar thing recently based on the game "Baba is You." That being said, most of the experiments here aren't really about reward tampering, so they don't really need the embeddedness you get from AIXIjs or Baba is You (and I'm not that excited about reward tampering research in general).

Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)

To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?

If an agent is trained with an episodic learning scheme and ends up with a behavior that maximizes reward across episodes, I'm not sure we should consider this an inner alignment failure. In some sense, we got the behavior that our learning scheme was optimizing for. [EDIT: this is not true necessarily true for all learning algorithms, e.g. gradient descent, see discussion here]

To quickly see this, imagine an episodic learning scheme where—at the end of each episode—if the agent failed to achieve the episode's goal then its policy network parameters are completely randomized, and otherwise the agent is unchanged. Assuming we have infinite resources, if we run this learning scheme for an arbitrarily long time, we should expect to end up with an agent that tries to achieve goals in future episodes.

I agree that you could interpret that as more of an outer alignment problem, though either way I think it's definitely an important safety concern.

Planned summary:

This post lays out several experiments that could clarify the inner alignment problem: the problem of how to get an ML model to be robustly aligned with the objective function it was trained with. One example experiment is giving an RL trained agent direct access to its reward as part of its observation. During testing, we could try putting the model in a confusing situation by altering its observed reward so that it doesn't match the real one. The hope is that we could gain insight into when RL trained agents internally represent 'goals' and how they relate to the environment, if they do at all. You'll have to read the post to see all the experiments.

Planned opinion:

I'm currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that lead me to that conclusion. I'm still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithms which could characterize the worst outcomes for failures in inner alignment. Regardless, I think at this point classifying failure modes is quite beneficial, and conducting tests like the ones in this post will make that a lot easier.

Would be good to note that this is for the Alignment Newsletter. I didn't realise that's what this was for a few seconds.

to what extent models tend to learn their goals internally vs. via reference to things in their environment

I'm not sure what this distinction is trying to refer to. Goals are both represented internally, and also refer to things in the agent's environments. Is there a tension there?

The distinction I'm most interested in here is the distinction drawn in "Risks from Learned Optimization" between internalization and modeling. As described in the paper, however, there are really three cases:

  1. In the internalization case, the base optimizer (e.g. gradient descent) modifies the mesa-optimizer's parameters to explicitly specify its goal.
  2. In the corrigible modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer's objective to "point to" whatever is found by the world model.
  3. In the deceptive modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer to instrumentally optimize that for the purpose of staying around.

Which one of these situations is most likely seems highly dependent on to what extent goals by default get encoded explicitly or via reference to world models. This is a hard thing to formalize in such a way that we can make any progress on testing it now, but my "Reward side-channels" proposal at least tries to by drawing the following distinction: if I give the model its reward as part of its observations, train it to optimize that, but then change it at test time, what happens? If it's doing pure internalization, it should either fail to optimize for either reward very well or succeed at optimizing the old reward but not the new one. If it's doing pure modeling, however, then it should fail to optimize the old reward but succeed at optimizing the new reward, which is the interesting case that you'd be trying to see under what circumstances you could actually get to appear.