Mesa-Search vs Mesa-Control

[-]evhub5y*190

Paul asks a related theory question. Vanessa gives a counterexample, which involves a control-type mesa-optimizer rather than one which implements an internal search.

You may also be interested in my counterexample, which preceded Vanessa's and uses a search-type mesa-optimizer rather than a control-type mesa-optimizer.

I've thought of two possible reasons so far.

Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.

More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode. Thus, if the task of taking good actions in the episode requires learning, then your model will have to learn some sort of learning procedure to do well.

I don't currently see any reason why the inner learner in an RL system would be more or less agentic than in text prediction.

I agree with this—it has long been my position that language modeling as a task ticks all of the boxes necessary to produce mesa-optimizers.

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the "learned information" from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I agree here too—though also I think it's pretty reasonable to expect that future massive language models will have more recurrence in them.

[-]Vlad Mikulik5y30

I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode.

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a 'learning-to-learn' style policy.

So I think this is a less fundamental reason, though it is true in off-policy RL.

[-]evhub5y20

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation.

You can't update the model based on its action until it's taken that action and gotten a reward for it. It's obviously possible to throw in updates based on past data whenever you want, but that's beside the point. The point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means that if taking actions requires learning, then the model itself has to do that learning.
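To make the ordering concrete, here is a minimal sketch of the most update-happy schedule an outer RL algorithm could use (one weight update per transition). Everything in it (`run_online_rl`, the toy `policy`, `env_step`, and `update`) is a hypothetical stand-in rather than any particular RL library; the only thing it is meant to show is that the update for step `t` cannot come before the action at step `t`.

```python
def run_online_rl(policy, env_step, update, obs, steps):
    """Per-transition updates: act first, then learn from that transition."""
    events = []
    for t in range(steps):
        action = policy(obs)                      # uses only current weights + obs
        events.append(("act", t))
        next_obs, reward = env_step(obs, action)  # reward exists only after acting
        update(obs, action, reward, next_obs)     # earliest possible weight update
        events.append(("update", t))
        obs = next_obs
    return events


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
weights = {"bias": 0.0}


def policy(obs):
    return 1 if weights["bias"] + obs > 0 else 0


def env_step(obs, action):
    return obs + 1, float(action)


def update(obs, action, reward, next_obs):
    weights["bias"] += 0.1 * reward


events = run_online_rl(policy, env_step, update, obs=0, steps=3)
```

In the resulting `events` log every `("act", t)` precedes its `("update", t)`, so whatever the policy needs in order to choose well at step `t` has to come from its own forward computation.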

[-]Vlad Mikulik5y20

I interpreted your previous point to mean you only take updates off-policy, but now I see what you meant. When I said you can update after every observation, I meant that you can update once you have made an environment transition and have an (observation, action, reward, observation) tuple. I now see that you meant the RL algorithm doesn't have the ability to update on the reward before the action is taken, which I agree with. I think I still am not convinced, however.

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

You can't update the model based on its action until it's taken that action and gotten a reward for it.

Right, I agree with that.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

But note that this argument means that to the extent learned responsiveness can do more than the RL algorithm's weight updates can, that cannot be due to recurrence. If it was, then the RL algorithm could just simulate the recurrent updates using the agent's weights, achieving performance parity. So for what you're describing to be the explanation for emergent learning-to-learn, you'd need the model to do all of its learned 'learning' within a single forward pass. I don't find that very plausible -- or rather, whatever advantageous responsive computation happens in the forward pass, I wouldn't be inclined to describe as learning.

You might argue that today's RL algorithms can't simulate the required recurrence using the weights -- but that is a different explanation to the one you state, and essentially the explanation I would lean towards.

if taking actions requires learning, then the model itself has to do that learning.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

[-]evhub5y30

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

I agree with all of that—I was using the term “learning” to be purposefully vague precisely because the space is so large and the point that I'm making is very general and doesn't really depend on exactly what notion of responsiveness/learning you're considering.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

This does in fact seem like an interesting angle from which to analyze this, though it's definitely not what I was saying—and I agree that current meta-learners probably aren't doing this.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

What I mean here is in fact very basic; let me try to clarify. Let $π^{*}$ be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates $π^{*}$ has to perform some operation $f$. Then my point is just that, for $π$ to achieve performance comparable with $π^{*}$, it has to do $f$. My argument for that is simple: we know that you have to do $f$ to get good performance, which means either $π$ has to do $f$ or the gradient descent algorithm has to. But we know the gradient descent algorithm can't be doing something crazy like running $f$ at each step and putting the result into the model, because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.

[-]Vlad Mikulik5y30

I am quite confused. I wonder if we agree on the substance but not on the wording, but perhaps it’s worthwhile talking this through.

I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating $π^{*}$ within the constraints requires computing $f(o_{t})$, then any policy that approximates $π^{*}$ must compute $f(o_{t})$. (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we take it to be.)

My point is that for $f$ = ‘learning’, I can’t see how anything I would call learning could meaningfully happen inside a single timestep. ‘Learning’ in my head is something that suggests non-ephemeral change; and any lasting change has to feed into the agent’s next state, by which point SGD would have had its chance to make the same change.

Could you give an example of what you mean (this is partially why I wanted to taboo learning)? Or, could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning).

[-]evhub5y30

could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)

How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.

[-]Vlad Mikulik5y30

Good point -- I think I wasn't thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it's doing something really impressive that isn't well-explained by interpolating on dataset examples -- I'm imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.

I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of 'learning' is a bit different from that one. 'Learning' in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn't clear to me a priori.

[-]gwern5y180

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I think that's exactly it. There's no real difference between a history, and a recurrence. A recurrence is a (lossy) function of a history, so anything a recurrent hidden state can encode, a sufficiently large/deep feedforward model given access to the full history should be able to internally represent as well.

GPT with a context window of 1 token would be unable to do any kind of meta-learning, in much the same way that an RNN with no hidden state (or at its first step with a default hidden state) working one step at a time would be unable to do anything. Whether you compute your meta-learning 'horizontally' by repeated application to a hidden state, updating token by token, or 'vertically' inside a deep Transformer (an unrolled RNN?) conditioned on the entire history, makes no difference aside from issues of perhaps computational efficiency (an RNN is probably faster to run but slower to train) and needing more or fewer parameters or layers to achieve the same effective amount of pondering time (although see Universal Transformers there).
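As a toy illustration of "a recurrence is a function of a history", here is a small sketch in which the same hidden states are obtained either by carrying state forward or by recomputing from the full history at every position. The cell `rnn_step` is a made-up deterministic update, not a real RNN, and the sketch assumes determinism throughout; it says nothing about the relative cost of the two approaches.

```python
from functools import reduce


def rnn_step(hidden, token):
    """Hypothetical deterministic cell standing in for any recurrent update."""
    return (31 * hidden + token) % 1_000_003


def recurrent_run(tokens, h0=0):
    """'Horizontal': carry a hidden state from token to token."""
    h = h0
    states = []
    for tok in tokens:
        h = rnn_step(h, tok)
        states.append(h)
    return states


def stateless_run(tokens, h0=0):
    """'Vertical': at each position, recompute everything from the history."""
    return [reduce(rnn_step, tokens[: i + 1], h0) for i in range(len(tokens))]


history = [3, 1, 4, 1, 5, 9, 2, 6]
same = recurrent_run(history) == stateless_run(history)
```

The stateless version does quadratically more work here, which is one face of the efficiency trade-off mentioned above.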

[-]Vanessa Kosoy5y70

There's no real difference between a history, and a recurrence.

That's true for unbounded agents but false for realistic (bounded) agents. Consider the following two-player zero-sum game:

Player A secretly writes some $x \in \{0, 1\}^{n}$, then player B says some $y \in \{0, 1\}^{n}$ and finally player B says some $z \in \{0, 1\}^{n}$. Player A gets reward $1$ unless $y = f(z)$, where $f : \{0, 1\}^{n} \to \{0, 1\}^{n}$ is a fixed one-way function. If $y = f(z)$, player A gets a reward in $[0, 1]$ which is the fraction of bits $x$ and $z$ have in common.

The optimal strategy for player A is producing a random sequence. The optimal strategy for player B is choosing a random $z$, computing $y := f(z)$, outputting $y$ and then outputting $z$. The latter is something that an RNN can implement (by storing $z$ in its internal state) but a stateless architecture like a transformer cannot implement. A stateless algorithm would have to recover $z$ from $y$, but that is computationally infeasible.
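For concreteness, here is a rough sketch of the game and of the stateful strategy for player B. It makes several assumptions that are not part of the argument: truncated SHA-256 is used as a stand-in for the one-way function $f$, $n$ is fixed at 128 bits, and the names `reward_for_a` and `StatefulPlayerB` are invented for illustration.

```python
import hashlib
import secrets

N_BYTES = 16  # n = 128 bits; arbitrary size chosen for illustration


def f(z: bytes) -> bytes:
    """Stand-in for the fixed one-way function (assumption: truncated SHA-256)."""
    return hashlib.sha256(z).digest()[:N_BYTES]


def reward_for_a(x: bytes, y: bytes, z: bytes) -> float:
    """A gets 1 unless y == f(z); otherwise the fraction of bits x and z share."""
    if y != f(z):
        return 1.0
    differing = sum(bin(a ^ b).count("1") for a, b in zip(x, z))
    return 1.0 - differing / (8 * len(x))


class StatefulPlayerB:
    """Keeps z in internal state between its two moves, as an RNN could."""

    def first_move(self) -> bytes:
        self._z = secrets.token_bytes(N_BYTES)
        return f(self._z)

    def second_move(self) -> bytes:
        return self._z


x = secrets.token_bytes(N_BYTES)  # player A's secret random string
player_b = StatefulPlayerB()
y = player_b.first_move()
z = player_b.second_move()
reward = reward_for_a(x, y, z)  # around 0.5 on average when x is random
```

A player B whose second move is a function of the visible history `y` alone would need to invert `f` to produce a valid `z`, which is the step the argument says is out of reach.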

[-]Vanessa Kosoy5y20

On second thought, that's not a big deal: we can fix it by interspersing random bits in the input. This way, the transformer would see a history that includes $y$ and the random bits used to produce it (which encode $z$). More generally, such a setup can simulate any randomized RNN.

[-]gwern5y10

Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren't the internal activations of a many-layer Transformer perfectly adequate to first encode, 'storing z', and then transform?

[-]Vanessa Kosoy5y10

I'm assuming that either architecture can use a source of random bits.

The transformer produces one bit at a time, computing every bit from the history so far. It doesn't have any state except for the history. At some stage of the game the history consists of $y$ only. At this stage the transformer would have to compute $z$ from $y$ in order to win. It doesn't have any activations to go on besides those that can be produced from $y$.

[-]gwern5y10

And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.

[-]Vanessa Kosoy5y10

No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many "Monte Carlo" iterations until it produces the right history.

[-]gwern5y10

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

[-]Vanessa Kosoy5y10

I already conceded a Transformer can be made stochastic. I don't see a problem with backproping: you treat the random inputs as part of the environment, and there's no issue with the environment having stochastic parts. It's stochastic gradient descent, after all.

[-]gwern5y10

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

[-]Vanessa Kosoy5y10

I am not sure what you mean by "stop cold". It has to do with minibatches, because in offline learning your datapoints can be (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I don't see anything unusual about this; the MDP in RL is virtually always allowed to be stochastic.

As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?

[-]David Scott Krueger (formerly: capybaralet)5y10

Practically speaking, I think the big difference is that the history is outside of GPT-3's control, but a recurrent memory would be inside its control.

[-]Rohin Shah5y150

(EDIT: I responded in more detail on the original post.)

The core empirical claim, as I understand it, is that task performance continues to improve after weights are frozen, suggesting that learning is still taking place, implemented in neural activation changes rather than neural weight changes.

I'm fairly confident this is not what is happening (at least, if I understand your claim correctly). If <what I understand your claim to be> was happening, it would be pretty surprising to me.

I only skimmed the linked paper, but it seems like it studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn't know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

However, because bandit problems have been studied in the AI literature, and "learning algorithms" have been proposed to solve bandit problems, this very normal fact of a policy depending on observations is now trotted out as "learning algorithms spontaneously emerge". I don't understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction.

More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will "learn" without gradient descent, I can't imagine a way that could fail to be true for an intelligent system trained by deep RL -- presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must "learn" as it is being executed.

[-]abramdemski5y110

Your assessment here seems to (mostly) line up with what I was trying to communicate in the post.

This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

This is something I hoped to communicate in the "Mesa-Learning Everywhere?" section, especially point #3.

If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it.

This is a point I hoped to convey in the Search vs Control section.

If you mean the chance that a policy trained by RL will "learn" without gradient descent, I can't imagine a way that could fail to be true for an intelligent system trained by deep RL

Ah, here is where the disagreement seems to lie. In another comment, you write:

Here on LW / AF, "mesa optimization" seems to only apply if there's some sort of "general" learning algorithm, especially one that is "using search", for reasons that have always been unclear to me.

I currently think this:

There is a spectrum between "just learning the task" vs "learning to learn", which has to do with how "general" the learning is. DQN looking at the ball is very far on the "just learning the task" side.
This spectrum is very fuzzy. There is no clear distinction.
This spectrum is very relevant to inner alignment questions. If a system like GPT-3 is merely "locating the task", then its behavior is highly constrained by the training set. On the other hand, if GPT-3 is "learning on the fly", then its behavior is much less constrained by the training set, and has correspondingly more potential for misaligned behavior (behavior which is capably achieving a different goal than the intended one). This is justified by an interpolation-vs-extrapolation type intuition.
The paper provides a small amount of evidence that things higher on the spectrum are likely to happen. (I'm going to revise the post to indicate that the paper only provides a small amount of evidence -- I admit I didn't read the paper to see exactly what they did, and should have anticipated that it would be something relatively unimpressive like multi-armed-bandit.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.

[-]Rohin Shah5y50

All of that seems reasonable. (I indeed misunderstood your claim, mostly because you cited the spontaneous emergence post.)

Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.

Fair enough; I guess I'm unclear on how you can think about it other than this way.

[-]abramdemski5y90

Fair enough; I guess I'm unclear on how you can think about it other than this way.

Yeahhh, idk. All I can currently articulate is that, previously, I thought of it as a black swan event.

[-]Rohin Shah5y100

Random question: does this also update you towards "alignment problems will manifest in real systems well before they are powerful enough to take over the world"?

Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective, and I expect many people at MIRI disagree with this claim (though I don't know why they disagree).

[-]David Scott Krueger (formerly: capybaralet)5y30

I'm very curious to know whether people at MIRI in fact disagree with this claim.

I would expect that they don't... e.g. Eliezer seems to think we'll see them and patch them unsuccessfully: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D

[-]Rohin Shah5y30

Yeah it's plausible that the actual claims MIRI would disagree with are more like:

Problems manifest => high likelihood we understand the underlying cause

We understand the underlying cause => high likelihood we fix it (or don't build powerful AI) rather than applying "surface patches"

[-]David Scott Krueger (formerly: capybaralet)5y30

Yep. I'd love to see more discussion around these cruxes (e.g. I'd be up for a public or private discussion sometime, or moderating one with someone from MIRI). I'd guess some of the main underlying cruxes are:

How hard are these problems to fix?
How motivated will the research community be to fix them?
How likely will developers be to use the fixes?
How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)

Personally, OTTMH (numbers pulled out of my ass), my views on these cruxes are:

It's hard to say, but I'd say there's a ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40yrs) timelines).
A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply (we'll build good tools), if they are cheap enough to be deemed "practical", but very unlikely (~10%) to be cheap enough.
It will probably need to be highly reliable; "the necessary intelligence/resources needed to destroy the world goes down every year" (unless we make a lot of progress on governance, which seems fairly unlikely (~15%)).

[-]Rohin Shah5y20

Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:

~90% that there aren't problems or we "could" fix them on 40 year timelines
I'm not sure exactly what is meant by motivation so will not predict, but there will be many people working on fixing the problems
"Are fixes used" is not a question in my ontology; something counts as a "fix" only if it's cheap enough to be used. You could ask "did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not" (possibly because they didn't know of its existence), then < 10% and I don't have enough information to distinguish between 0-10%.
I'll answer "how much x-risk would result from a small company *not* using them", if it's a single small company then < 10% and I don't have enough information to distinguish between 0-10% and I expect on reflection I'd say < 1%.

[-]David Scott Krueger (formerly: capybaralet)5y10

I guess most of my cruxes are RE your 2nd "=>", and can almost be viewed as breaking down this question into sub-questions. It might be worth sketching out a quantitative model here.

[-]evhub5y50

Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

Fwiw, I agree with this, and also I think this is the same thing as what I said in my comment on the post regarding how this is a necessary consequence of the RL algorithm only updating the model after it takes actions.

[-]Rohin Shah5y50

I didn't understand what you meant by "requires learning", but yeah I think you are in fact saying the same thing.

[-]Vlad Mikulik5y30

I had a similar confusion when I first read Evan's comment. I think the thing that obscures this discussion is the extent to which the word 'learning' is overloaded -- so I'd vote taboo the term and use more concrete language.

[-]Vlad Mikulik5y40

I've thought of two possible reasons so far.

Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.

More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

I would be more inclined towards a more general version of the latter view, in which gradient updates just aren't a very effective way to track within-episode information.

The central example of learning-to-learn is a policy that effectively explores/exploits when presented with an unknown bandit from within the training distribution. An optimal policy essentially needs to keep track of sufficient statistics of the reward distributions for each action. If you're training a memoryless policy for a fixed bandit problem using RL, then the only way of tracking the sufficient stats you have is through your weights, which are changed through the gradient updates. But the weight-space might not be arranged in a way that's easily traversed by local jumps. On the other hand, a meta-trained recurrent agent can track sufficient stats in its activations, traversing the sufficient statistic space in whatever way it pleases -- its updates need not be local.
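As a rough illustration of "tracking sufficient statistics in activations", here is a sketch in which the decision rule `act` is fixed (its "weights" never change) while a per-episode state of pull counts and reward sums does all the within-episode adaptation. It is a hand-written epsilon-greedy stand-in, not what a meta-trained recurrent agent actually computes, and all names (`act`, `update_state`, `run_episode`) are hypothetical.

```python
import random


def act(stats, rng, epsilon=0.1):
    """Fixed rule: try each arm once, then mostly pick the best empirical mean."""
    untried = [a for a, (n, _) in enumerate(stats) if n == 0]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.randrange(len(stats))
    return max(range(len(stats)), key=lambda a: stats[a][1] / stats[a][0])


def update_state(stats, arm, reward):
    """'Activation' update: new (pull count, reward sum) for the pulled arm."""
    n, s = stats[arm]
    return stats[:arm] + [(n + 1, s + reward)] + stats[arm + 1:]


def run_episode(arm_probs, steps=500, seed=0):
    """One bandit episode; the state is reset at the start, like a hidden state."""
    rng = random.Random(seed)
    stats = [(0, 0.0)] * len(arm_probs)
    total = 0.0
    for _ in range(steps):
        arm = act(stats, rng)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        stats = update_state(stats, arm, reward)
        total += reward
    return stats, total


stats, total = run_episode([0.1, 0.9])
```

Nothing here resembles a gradient step: the jump from one `stats` value to the next can be arbitrary, which is the sense in which such updates need not be local in weight-space.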

This has an interesting connection to MAML, because a converged memoryless MAML solution on a distribution of bandit tasks will presumably arrange the part of its weight-space that encodes bandit sufficient statistics in a way that makes it easy to traverse via SGD. That would be a neat (and not difficult) experiment to run.

[-]Rohin Shah5y30

Planned summary for the Alignment Newsletter:

This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

Happy for others to propose a different summary for me to include. However, the summary will need to make sense to me; this may be a hard challenge for this post in particular.

[-]Rohin Shah5y30

I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can 'merely' learn a very good conditional model.

Does human reasoning count as a general-purpose learning algorithm? I've heard it claimed that when we apply neural nets to tasks humans haven't been trained on (like understanding DNA or materials science) the neural nets can rocket past human understanding, with way less computation and tools (and maybe even data) than humans have had access to (depending on how you measure). Tbc, I find this claim believable but haven't checked it myself. Maybe SGD is the real general-purpose learning algorithm? Human reasoning could certainly be viewed formally as "a very good conditional model".

So overall I lean towards thinking this is a continuous spectrum with no discontinuous changes (except ones like "better than humans or not", which use a fixed reference point to get a discontinuity). So there could be a meaningful distinction, but it's like the meaningful distinction between "warm water" and "hot water", rather than the meaningful distinction between "water" and "ice".

[-]Vaniver5y30

The inner RL algorithm adjusts its learning rate to improve performance.

I have come across a lot of learning rate adjustment schemes in my time, and none of them have been 'obviously good', altho I think some have been conceptually simple and relatively easy to find. If this is what's actually going on and can be backed out, it would be interesting to see what it's doing here (and whether that works well on its own).

This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.

Most RL training algorithms that we have look to me like putting a thermostat on top of a model; I think you're underestimating deep thermostats.

[-]Kaj_Sotala5y30

It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.

More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

Possibly obvious, but just to point it out: both of these seem like they also describe the case of genetic evolution vs. brains.

[-]gwern5y40

I'm a little confused as to why there's any question here. Every algorithm lies on a spectrum of tradeoffs from general to narrow. The narrower a class of solved problems, the more efficient (in any way you care to name) an algorithm can be: a Tic-Tac-Toe solver is going to be a lot more efficient than AIXI.

Meta-learning works because the inner algorithm can be far more specialized, and thus, more performant or sample-efficient than the highly general outer algorithm which learned the inner algorithm.

For example, in Dactyl, PPO trains an RNN to adapt to many possible robot hands on the fly in as few samples as possible; it's probably several orders of magnitude faster than online training of an RNN by PPO directly. "Why not just use that RNN for DoTA2, if it's so much better than PPO?" Well, because DoTA2 has little or nothing to do with robotic hands rotating cubes, so an algorithm that excels at robot hands will not transfer to DoTA2. PPO will still work, though.

[-]Rohin Shah5y*90

Here on LW / AF, "mesa optimization" seems to only apply if there's some sort of "general" learning algorithm, especially one that is "using search", for reasons that have always been unclear to me. Some relevant posts taking the opposite perspective (which I endorse):

Is the term mesa optimizer too narrow?

Why is pseudo-alignment "worse" than other ways ML can fail to generalize?

[-]Vanessa Kosoy5y20

Why do you think my counterexample doesn't have internal search? In my counterexample, the circuit is simulating the behavior of another agent, which presumably is doing search, so the circuit is also doing search.

[-]abramdemski5y30

True, but it's a minimal circuit. When I wrote the remark, I was thinking: a minimal circuit will never do search; it will instead do something closer to memorizing the output of search (with some abstractions to compress further, so, not just a big memorized table). So I thought of the point of your counterexample as: "a minimal circuit may not do search, but it can implement the same policy as an agent which does search, which is exactly as concerning."

I agree this isn't really clear. I'll revise the remark.

[-]Vanessa Kosoy5y40

Well, running a Turing machine for time $t$ can be simulated by a circuit of size $O(t^{2})$, so in terms of efficiency it's much closer to "doing search" than to "memorizing the output of search".

[-]abramdemski5y30

OK.

So if a search takes time exponential in the input size, the search-simulating circuit is size ... and if memorizing the answers also requires circuit length exponential in input size, they're roughly tied.

So the line where minimal circuits start memorizing rather than running is around there. Any search type worse than exponential, and it'll memorize; anything better, and it won't.
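To make the comparison concrete, here is a rough sketch in Python. It assumes the $O(t^{2})$ simulation bound quoted above with all constants dropped, and it assumes (this is not stated in the thread) that memorizing costs about $2^{n}$ table entries for $n$-bit inputs. The function names are made up for illustration. Under these assumptions the exact tie lands at a search time of $2^{n/2}$, which is still exponential, so the crossover is "around there" only up to polynomial factors.

```python
import math


def simulation_circuit_size(t: int) -> int:
    """Circuit size for simulating t Turing-machine steps, using the t**2 bound (constants dropped)."""
    return t * t


def lookup_table_size(n: int) -> int:
    """Assumed cost of memorizing one answer per n-bit input: 2**n entries."""
    return 2 ** n


def cheaper_strategy(n: int, search_time) -> str:
    """Compare the two size estimates for inputs of n bits."""
    sim = simulation_circuit_size(search_time(n))
    table = lookup_table_size(n)
    if sim < table:
        return "simulate"
    if sim > table:
        return "memorize"
    return "tie"


n = 40
print(cheaper_strategy(n, lambda k: k ** 3))             # polynomial-time search
print(cheaper_strategy(n, lambda k: 2 ** (k // 2)))      # exponential, at the crossover
print(cheaper_strategy(n, lambda k: 2 ** k))             # exponential, past the crossover
print(cheaper_strategy(n, lambda k: math.factorial(k)))  # worse than exponential
```

The estimates ignore constant factors and any compression of the table, so the output only indicates which side of the line a given search time falls on, not actual circuit sizes.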

[-]David Scott Krueger (formerly: capybaralet)5y10

AI ALIGNMENT FORUM
AF
