Or "Why  (And why that matters)"

 

TL;DR: This post discusses the blurred conceptual boundary between RL and RL² (also known as meta-RL). RL² is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL² emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence that learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL agents.

 

I've found myself telling this story about the relationship between RL and RL² numerous times in conversation. When that happens, it's usually time to write a post about it. 

Most of the first half of the post (which points out that RL² is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment. 

The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled-up RL systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some. 

Background on RL²

RL² (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olsson et al. 2022; Garg et al. 2022).

RL² is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): the RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the inner optimization algorithm). 

The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents that exhibit RL² show rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent Team et al. 2023), hypothesis-driven exploration/experimentation (DeepMind Open-Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL² may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (the area of the brain most associated with those capabilities) implements an RL² system (Wang et al. 2018). These cognitive capabilities are the kind of thing we're concerned about in powerful AI systems. RL² is therefore a phenomenon that seems likely to underlie some major safety risks.

The conditions under which RL² emerges are the default RL training conditions

Ingredients for an RL² cake

The four 'ingredients' required for RL² to emerge are:

  1. The agent must have observations that correlate with reward.
  2. The agent must have observations that correlate with its history of actions.
  3. The agent must have a memory state that persists through time in which the RL algorithm can be implemented.
  4. The agent must be trained on a distribution of tasks.

These conditions let the agent learn an inner RL algorithm because they let the agent learn to adapt its actions to a particular task according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL² (a toy code sketch follows this list):

  • Thanks to (1), the agent tends to learn representations that identify whether it is getting closer to valuable states.
  • Thanks to (2), it can learn representations that evaluate whether or not past actions have brought it closer to valuable states. To evaluate this, the agent must represent the key task variables that define its current task instance, since states that are valuable in one task instance may not be valuable in another.
  • Thanks to (3), this information can persist through time such that the agent can gradually refine its representations of what task instance it is in and which are the best actions to take in it.
  • Only if (4) holds is it even useful to learn representations of task structure, rather than learning a fixed sequence of actions that work in one particular task.
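
To make these ingredients concrete, here is a minimal, self-contained sketch in PyTorch. It is my own illustration rather than anything from the papers cited above: a recurrent policy receives its previous action and reward as inputs (a setup in the spirit of Duan et al. 2016 and Wang et al. 2016), keeps a hidden state across timesteps, and is trained by an outer policy-gradient algorithm on a distribution of tiny tasks. All names and hyperparameters are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        # Ingredient 3: a memory state that persists through time (the GRU hidden state).
        self.cell = nn.GRUCell(obs_dim + n_actions + 1, hidden)  # input: obs + prev action + prev reward
        self.logits = nn.Linear(hidden, n_actions)

    def step(self, obs, prev_action, prev_reward, h):
        x = torch.cat([obs, prev_action, prev_reward], dim=-1)
        h = self.cell(x, h)
        return self.logits(h), h

def sample_task():
    # Ingredient 4: a distribution of tasks -- which of two actions is rewarded is
    # resampled every episode (effectively a tiny bandit, the classic RL^2 testbed).
    return torch.randint(0, 2, (1,)).item()

def run_episode(policy, optimizer, steps=5):
    goal = sample_task()
    h = torch.zeros(1, policy.cell.hidden_size)
    prev_a = torch.zeros(1, 2)
    prev_r = torch.zeros(1, 1)
    log_probs, rewards = [], []
    for _ in range(steps):
        obs = torch.ones(1, 1)  # dummy observation; richer tasks would supply real state here
        logits, h = policy.step(obs, prev_a, prev_r, h)
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        r = 1.0 if a.item() == goal else 0.0
        # Ingredients 1 & 2: the agent gets signals correlated with reward and with its
        # own past actions (here fed back explicitly as prev_r and prev_a).
        prev_a = torch.nn.functional.one_hot(a, num_classes=2).float()
        prev_r = torch.full((1, 1), r)
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
    # Outer RL algorithm (REINFORCE): it only updates the *weights*; any within-episode
    # adaptation has to be carried by the hidden state, i.e. by an inner RL algorithm.
    loss = -torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)

policy = RecurrentPolicy(obs_dim=1, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for episode in range(2000):
    run_episode(policy, optimizer)
```

If trained long enough, a network like this should adapt within an episode even with its weights frozen at test time: its behaviour changes with incoming rewards purely via the hidden state, which is exactly the inner RL algorithm described above.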

Why these ingredients are the default conditions

This set of conditions is more common than it might initially appear:

  • Reward- and action-correlated observations: 
    It's pretty typical that agents observe their environment and that reward tends to come from particular environment states. It's also pretty typical that agents get to observe how the environment changes after they take actions in it. Most games and robotics environments, for instance, have these properties. Cases where this doesn't happen, such as N-armed bandit tasks, are rarer or less interesting from an alignment perspective.
  • Persistent memory state: 
    Neural networks that have some sort of memory state that persists through time include recurrent neural networks and Transformer-XL (both of which have been used to train RL agents). Having a memory state makes it easier for RL agents to solve tasks that require memory, which include most tasks in partially observable environments (such as the real world). We should therefore expect memory to be used in the most capable and useful RL systems. 

    But even agents that are purely feedforward often have access to an external memory system: the environment. Even simple feedforward RL agents can and do learn to use the environment as an external memory system when they don't have an internal one (Deverett et al. 2019). The trouble with using the environment as a memory system instead of an internal one is that the externally represented memories must be learned non-differentiably, which is harder. But it's still possible in principle. Speculatively, large enough agents may be able to learn sophisticated RL² algorithms that use an expressive-enough environment as their memory system.
  • Distribution of tasks: 
    Most 'individual' tasks are actually a narrow distribution of tasks. Here is a non-exhaustive list of reasons why the task 'Solve a particular maze' is actually a distribution of tasks (a toy sketch of the first reason follows the list):
    • If an agent has to solve a maze but starts from a random starting position, that's a task distribution; the agent must learn a policy that works across the whole distribution of initial states.
    • If an agent has to solve a maze but its recurrent state is randomly initialized, then that's a task distribution; the agent must learn a policy that works across the initialization distribution of its recurrent state. From location X, if the agent has used a different set of actions to arrive there, its memory state may be different, and thus there will be a distribution over memory states for the task defined as 'Get to the end of the maze from location X'.
    • If an agent has to solve a maze but uses a stochastic policy, then that's a task distribution; the agent must learn a policy that works across the distribution of past action sequences.
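
As a concrete (and deliberately trivial) illustration of the first point, here is a sketch of how randomizing the start position turns 'one maze' into a distribution of task instances. The maze layout and names are made up for illustration.

```python
import random

# A single fixed maze layout: '#' walls, '.' free cells, 'G' the goal.
MAZE = [
    "#########",
    "#...#..G#",
    "#.#.#.#.#",
    "#.......#",
    "#########",
]

# Every free cell is a possible start state.
START_CELLS = [(r, c) for r, row in enumerate(MAZE)
               for c, ch in enumerate(row) if ch == "."]

def reset():
    # Each reset samples a different task instance of "the" maze task:
    # same walls, same goal, different initial state. A policy therefore has
    # to work across this whole distribution rather than memorize one fixed
    # action sequence.
    return random.choice(START_CELLS)

print([reset() for _ in range(3)])  # three different instances of the 'same' task
```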

The (admittedly somewhat pedantic) argument that most tasks are, in fact, distributions of tasks points toward a blurred boundary between 'RL²' and 'the agent merely adapting during a task'. Some previous debate on the forum about what should count as 'learning' vs. 'adaptation' can be found in comments here and here.

So what? (Planning from RL²?)

I'm making a pretty narrow, technical point in this post. The above indicates that RL² is pretty much inevitable in most interesting settings. But that's not necessarily dangerous; RL² itself isn't the thing that we should be worried about. We're mostly concerned about agents that have learned how to search or plan (as discussed in Hubinger et al. 2019 and Demski 2020).

Unfortunately, I think there are a few indications that learned planning will probably emerge from scaled-up RL:

  • RL agents with some weak inductive biases show behavioural signs of learned planning (Guez et al. 2019). Being only behavioural, the evidence of planning is currently pretty weak. I'd like to see interpretability analyses of similar agents showing that they have actually learned a planning algorithm (this project idea is on Neel Nanda's 200 Concrete Open Problems in Mechanistic Interpretability).
  • Other, more speculative, empirical evidence comes from what we know about search/planning in humans. I mentioned above that there is evidence that the human prefrontal cortex implements an RL² system (Wang et al. 2018). The PFC is the brain region most heavily implicated in planning, decision-making, etc. This weakly suggests that scaling up RL might lead to a system that does planning.
  • The final, and in my opinion most convincing, reason to suspect that learned planning might emerge naturally in advanced RL systems is theoretical. The argument is based on results from Ortega et al. (2019). I go into a little more detail in the footnote, but briefly: the Bayesian optimization objective that RL agents implicitly optimize has a structure that resembles planning; the objective demands consideration of multiple possible world states, and demands that the agent choose actions based on which action is best given those possible worlds. [1]

These are, of course, only weak indications that scaling up RL will yield learned planning. It's still unclear what else, if anything, is required for it to emerge.

  1. ^

    Footnote: The Bayesian optimization objective that RL agents implicitly optimize has a structure that resembles planning

    Ortega et al. (2019) shows that the policy of an RL agent, $\pi$, is trained to approximate the following distribution:

        $$P(a^*_t \mid h_{<t}) = \int_\theta P(a^*_t \mid h_{<t}, \theta) \, P(\theta \mid h_{<t}) \, d\theta$$

    where:
    $a^*_t$ is the optimal action at timestep $t$,
    $h_{<t}$ is the action-observation history up to timestep $t$,
    $P(a^*_t \mid h_{<t}, \theta)$ is the probability of choosing the optimal action given the action-observation history and the task parameters, and
    $\theta$ is the set of latent (inaccessible) task parameters that define the task instance. They are sampled from the task distribution. $\theta$ effectively defines the current world state.

    How might scaled-up RL² agents approximate this integral? Perhaps the easiest way to approximate a complicated distribution is a Monte Carlo estimate (i.e. take a bunch of samples and average them). It seems plausible that agents would learn to take a Monte Carlo estimate of this distribution within their learned algorithms. Here's a sketch of what this might look like on an intuitive level (a toy numerical version follows these steps):

    - The agent has uncertainty over the latent task variables/world state given its observation history. It can't consider all possible configurations of the world state, so it just considers a small sample set of the most likely states of the world according to an internal model of $P(\theta \mid h_{<t})$.
    - For each member of that small sample set of possible world states, the agent considers what the optimal action would be in each case, i.e. $P(a^*_t \mid h_{<t}, \theta)$. Generally, it's useful to predict the consequences of actions in order to evaluate how good they are, so the agent might consider the consequences of different actions given different world states and action-observation histories. 
    - After considering each of the possible worlds, it chooses the action that works best across those worlds, weighted according to how likely each world state is, i.e. weighted by $P(\theta \mid h_{<t})$.
     
    Those steps resemble a planning algorithm. 
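
    To make the resemblance concrete, here is a toy numerical sketch of that Monte Carlo procedure. It is my own illustration, not something from Ortega et al. (2019), and every distribution in it is a made-up stand-in for the corresponding term in the integral above.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3
N_SAMPLES = 100  # the "small sample set" of candidate world states

def sample_world_state(history):
    # Stand-in for P(theta | h_<t): a crude posterior over which action is currently
    # rewarded, inferred from how often each action appears in the (rewarded) history.
    counts = np.ones(N_ACTIONS) + np.bincount(history, minlength=N_ACTIONS)
    return rng.choice(N_ACTIONS, p=counts / counts.sum())

def prob_optimal_action(action, theta):
    # Stand-in for P(a*_t = action | h_<t, theta): if the world really is in state
    # theta, the theta-th action is almost certainly the optimal one.
    return 0.9 if action == theta else 0.1 / (N_ACTIONS - 1)

def choose_action(history):
    # Monte Carlo estimate of the integral:
    #   P(a*_t | h_<t) ~= (1/N) * sum_i P(a*_t | h_<t, theta_i), theta_i ~ P(theta | h_<t)
    scores = np.zeros(N_ACTIONS)
    for _ in range(N_SAMPLES):
        theta = sample_world_state(history)            # consider a possible world
        for action in range(N_ACTIONS):
            scores[action] += prob_optimal_action(action, theta)  # score actions in it
    return int(np.argmax(scores / N_SAMPLES))          # pick what works best on average

# Example: a history in which action 2 has been taken (and, implicitly, rewarded) most often.
print(choose_action(history=[2, 2, 1, 2]))  # most likely prints 2
```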

    It's not clear whether agents would actually learn to plan (i.e. learning approximations of each term in the integral that unroll serially, as sketched above) vs. something else (such as learning heuristics that, in parallel, approximate the whole integral). But the structure of the Bayesian optimization objective is suggestive of an optimization pressure in the direction of learning a planning algorithm. 

Comments

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models. But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.

Maybe what you’re thinking is: “Maybe the learned planning algorithm will have some weird and dangerous goal”. My hunch is: (1) if the original RL agent lacks an affordance for planning in the human-written source code, then it won’t work very well, and in particular, it won’t be up to the task of building a sophisticated dangerous planner with a misaligned goal; (2) if the original RL agent has an affordance for planning in the human-written source code, then it could make a dangerous misaligned planner, but it would be a “mistake” analogous to how future humans might unintentionally make misaligned AGIs, and this problem might be solvable by making the AI read about the alignment problem and murphyjitsu and red-teaming etc., and cranking up its risk-aversion etc.

Sorry if I’m misunderstanding. RL² stuff has never made much sense to me.

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.


Hm, I don't think this quite captures what I view the post as saying. 
 

Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.

As far as there is a safety-related claim in the post, this captures it much better than the previous quote.
 

But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.

I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that's a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1). 

I can also imagine a middle ground between our hunches that looks something like "We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn't force it to learn one, yet it did." 

Thanks!

One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.

See Section 3 here for why I think it would be a lot worse.