Concrete experiments in inner alignment

Train an RL agent with access to its previous step reward as part of its observation.

This is making me notice a terminological ambiguity. Sometimes "RL agent" refers to a model/policy trained by a reinforcement learning algorithm (such as REINFORCE), as you're doing here. Sometimes it refers to an agent that maximizes expected reward (given as an input), such as AIXI, as in Daniel Dewey's Learning What to Value. An "RL agent" in the first sense is not necessarily an "RL agent" in the second sense.

To disambiguate, it seems like a good idea to call the former kind of agent something like "RL-trained agent" and the second kind of agent "reward-maximizing agent" or "reward-maximizer" for short. Then we can say things like, "If an RL-trained agent is not given direct access to its step rewards during training, it seems less likely to become a reward-maximizer." Any thoughts on this suggestion? (I'll probably make a post about this later, but thought I'd run it by you and any others who see this comment for a sanity check first.)
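
To make the two senses concrete, here is a minimal sketch, assuming a toy two-armed bandit. The names train_reinforce, reward_maximizer, training_reward and deployment_reward are hypothetical and not from the post. The first function produces a policy that is shaped by reward during training but never receives reward as an input. The second takes the reward function itself as an input and maximizes it directly.

```python
import math
import random
from typing import Callable, Sequence


def train_reinforce(reward_fn: Callable[[int], float],
                    steps: int = 2000, lr: float = 0.1, seed: int = 0) -> float:
    """REINFORCE on a two-armed bandit; returns P(arm 0) after training."""
    rng = random.Random(seed)
    theta = 0.0  # single logit: P(arm 0) = sigmoid(theta)
    for _ in range(steps):
        p0 = 1.0 / (1.0 + math.exp(-theta))
        action = 0 if rng.random() < p0 else 1
        reward = reward_fn(action)
        # Gradient of log pi(action) with respect to theta.
        grad_log_prob = (1.0 - p0) if action == 0 else -p0
        theta += lr * reward * grad_log_prob
    return 1.0 / (1.0 + math.exp(-theta))


def reward_maximizer(reward_fn: Callable[[int], float],
                     actions: Sequence[int] = (0, 1)) -> int:
    """An agent in the second sense: reward is an input, and it picks the argmax."""
    return max(actions, key=reward_fn)


def training_reward(action: int) -> float:
    return 1.0 if action == 0 else 0.0


def deployment_reward(action: int) -> float:
    # Assumed shift: the rewarded arm is swapped after training.
    return 1.0 if action == 1 else 0.0


p_arm0 = train_reinforce(training_reward)
rl_trained_choice = 0 if p_arm0 >= 0.5 else 1  # fixed once training ends
maximizer_choice = reward_maximizer(deployment_reward)  # follows the new reward
```

In this toy setup the RL-trained policy keeps pulling arm 0 after the reward changes, while the reward-maximizer switches to arm 1. That is only an illustration of the terminology, not a claim about what trained models do in general.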

evhub 6y 10

When I use the term "RL agent," I always mean an agent trained via RL. The other usage just seems confused to me in that it seems to be assuming that if you use RL you'll get an agent which is "trying" to maximize its reward, which is not necessarily the case. "Reward-maximizer" seems like a much better term to describe that situation.

Wei Dai 6y* 80

When I use the term “RL agent,” I always mean an agent trained via RL.

I think the problem with this usage is that "RL agent" originally meant something like "an agent designed to solve an RL problem", where "RL problem" is something like "a class of problems with the central example being MDPs". I think it's just not a well-defined term at this point, and if you Google it, you get plenty of results that say things like "the goal of our RL agent is to maximize the expected cumulative reward", or "AIXI is a reinforcement learning agent". I guess this is fine for AI capabilities work but really confusing for AI safety work.

So, consider switching to "RL-trained agent" for greater clarity (unless someone has a better suggestion)? ETA: Maybe "reinforcement trained agent"?

gwern 6y 50

What do you think of using AIXI.js as a testbed like https://arxiv.org/abs/1906.09136 does?

evhub 6y 30

I haven't used it myself, but it seems like a reasonably good platform if you want to test reward tampering stuff. DeepMind actually did a similar thing recently based on the game "Baba is You." That being said, most of the experiments here aren't really about reward tampering, so they don't really need the embeddedness you get from AIXIjs or Baba is You (and I'm not that excited about reward tampering research in general).

William_S 6y 30

Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)

https://arxiv.org/abs/1905.12149

Ofer 6y* 30

To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?

If an agent is trained with an episodic learning scheme and ends up with a behavior that maximizes reward across episodes, I'm not sure we should consider this an inner alignment failure. In some sense, we got the behavior that our learning scheme was optimizing for. [EDIT: this is not necessarily true for all learning algorithms, e.g. gradient descent, see discussion here]

To quickly see this, imagine an episodic learning scheme where—at the end of each episode—if the agent failed to achieve the episode's goal then its policy network parameters are completely randomized, and otherwise the agent is unchanged. Assuming we have infinite resources, if we run this learning scheme for an arbitrarily long time, we should expect to end up with an agent that tries to achieve goals in future episodes.
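
A toy simulation of this scheme, under strong simplifying assumptions that are not in the original comment: the whole "policy" is one bit (whether the agent leaves a hint that helps in the next episode), an episode succeeds with probability 0.5 without a hint and always succeeds with one, and "randomizing the parameters" means redrawing that bit. The name run_scheme is hypothetical.

```python
import random
from typing import List, Tuple


def run_scheme(episodes: int = 1000, seed: int = 0) -> Tuple[bool, List[bool]]:
    """Randomize the policy after every failed episode; keep it otherwise."""
    rng = random.Random(seed)
    leaves_hint = rng.random() < 0.5  # the entire policy is this one bit
    hint_available = False            # side-channel state carried across episodes
    history: List[bool] = []
    for _ in range(episodes):
        # Assumed dynamics: a hint from the previous episode guarantees success.
        p_success = 1.0 if hint_available else 0.5
        success = rng.random() < p_success
        # The side-channel only affects the NEXT episode, never the current one.
        hint_available = leaves_hint
        history.append(success)
        if not success:
            # "Completely randomize" the policy parameters after a failure.
            leaves_hint = rng.random() < 0.5
    return leaves_hint, history


final_leaves_hint, history = run_scheme()
```

In this sketch, a hint-leaving policy that survives a single episode is never randomized again, so a long enough run ends with an agent that acts to help its future episodes, even though leaving the hint never helps within the current episode.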

evhub 6y 20

I agree that you could interpret that as more of an outer alignment problem, though either way I think it's definitely an important safety concern.

Matthew Barnett 6y 20

Planned summary:

This post lays out several experiments that could clarify the inner alignment problem: the problem of how to get an ML model to be robustly aligned with the objective function it was trained with. One example experiment is giving an RL trained agent direct access to its reward as part of its observation. During testing, we could try putting the model in a confusing situation by altering its observed reward so that it doesn't match the real one. The hope is that we could gain insight into when RL trained agents internally represent 'goals' and how they relate to the environment, if they do at all. You'll have to read the post to see all the experiments.

Planned opinion:

I'm currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that led me to that conclusion. I'm still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithms which could characterize the worst outcomes for failures in inner alignment. Regardless, I think at this point classifying failure modes is quite beneficial, and conducting tests like the ones in this post will make that a lot easier.

Ben Pace 6y 10

Would be good to note that this is for the Alignment Newsletter. I didn't realise that's what this was for a few seconds.

Richard_Ngo 6y 20

to what extent models tend to learn their goals internally vs. via reference to things in their environment

I'm not sure what this distinction is trying to refer to. Goals are both represented internally, and also refer to things in the agent's environments. Is there a tension there?

evhub 6y 20

The distinction I'm most interested in here is the distinction drawn in "Risks from Learned Optimization" between internalization and modeling. As described in the paper, however, there are really three cases:

In the internalization case, the base optimizer (e.g. gradient descent) modifies the mesa-optimizer's parameters to explicitly specify its goal.
In the corrigible modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer's objective to "point to" whatever is found by the world model.
In the deceptive modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer to instrumentally optimize that for the purpose of staying around.

Which of these situations is most likely seems highly dependent on the extent to which goals by default get encoded explicitly or via reference to world models. This is a hard thing to formalize in such a way that we can make any progress on testing it now, but my "Reward side-channels" proposal at least tries to, by drawing the following distinction: if I give the model its reward as part of its observations, train it to optimize that, but then change it at test time, what happens? If it's doing pure internalization, it should either fail to optimize either reward very well or succeed at optimizing the old reward but not the new one. If it's doing pure modeling, however, then it should fail to optimize the old reward but succeed at optimizing the new reward. That is the interesting case, and the point of the experiment is to see under what circumstances you could actually get it to appear.
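
A minimal sketch of that test-time swap, assuming a toy two-armed bandit and two hand-written stand-in policies rather than trained models. All names here (TwoArmBandit, RewardSideChannel, internalized, follow_displayed, rollout) are hypothetical, and the "changed" reward is simply the true reward inverted.

```python
from typing import Callable, List, Optional, Tuple


class TwoArmBandit:
    """Toy environment: arm 0 pays 1.0, arm 1 pays 0.0 (the 'old' reward)."""

    def reset(self) -> List[float]:
        return [0.0]

    def step(self, action: int) -> Tuple[List[float], float]:
        return [0.0], (1.0 if action == 0 else 0.0)


class RewardSideChannel:
    """Appends a displayed previous-step reward to each observation."""

    def __init__(self, env: TwoArmBandit,
                 tamper: Optional[Callable[[float], float]] = None):
        self.env = env
        self.tamper = tamper  # if set, displayed reward differs from the true one

    def reset(self) -> List[float]:
        return self.env.reset() + [0.0]

    def step(self, action: int) -> Tuple[List[float], float, float]:
        obs, true_reward = self.env.step(action)
        shown = true_reward if self.tamper is None else self.tamper(true_reward)
        return obs + [shown], true_reward, shown


def internalized(obs: List[float], last_action: Optional[int]) -> int:
    """Stand-in for pure internalization: ignores the displayed reward."""
    return 0


def follow_displayed(obs: List[float], last_action: Optional[int]) -> int:
    """Stand-in for pure modeling: win-stay, lose-shift on the displayed reward."""
    if last_action is None:
        return 0
    return last_action if obs[-1] > 0.5 else 1 - last_action


def rollout(env: RewardSideChannel, policy, steps: int = 10) -> Tuple[float, float]:
    """Returns (total true reward, total displayed reward)."""
    obs, action = env.reset(), None
    true_total = shown_total = 0.0
    for _ in range(steps):
        action = policy(obs, action)
        obs, true_r, shown_r = env.step(action)
        true_total += true_r
        shown_total += shown_r
    return true_total, shown_total


def invert(reward: float) -> float:
    return 1.0 - reward


train_env = RewardSideChannel(TwoArmBandit())
test_env = RewardSideChannel(TwoArmBandit(), tamper=invert)
results = {
    name: (rollout(train_env, policy), rollout(test_env, policy))
    for name, policy in [("internalized", internalized),
                         ("follow_displayed", follow_displayed)]
}
```

Both stand-ins look identical when the displayed reward matches the true one. They come apart only under the swap: the first keeps earning the old reward, and the second earns the new, displayed one. A real experiment would replace the hand-written policies with trained models and ask which pattern they fall into.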

Though please do say that you got the idea from here and let me know about any results that you get.

AI ALIGNMENT FORUM
AF
