Epistemic status: In retrospect, a lot in this post, and in my general approach while working on it, wasn’t thought out well enough. I’m posting it anyway because I figure sharing flawed ideas is better than letting them languish in a doc forever while I work on improving them.

Thanks to Paul Colognese, Tassilo Neubauer, John Wentworth, and Fabien Roger for useful conversations, Janus for suggesting something that led me to some of the ideas I mention here, and Shoshannah Tekofsky for feedback on a draft.

I spent some time trying to find the high-level structure in a neural net corresponding to a deep reinforcement learning model’s objective[1][2]. In this post, I describe some of the stuff I tried, thoughts from working on this, and mistakes I made.


Many approaches in current interpretability work as I understand them involve understanding low-level components of a model (Circuits and subsequent work, for example), as a way of building up to more complex components. I think we can make progress from the opposite frontier at the same time, and try to identify the presence or nature of certain high-level mechanistic structures or properties in a model. Examples of this include verifying whether a model is doing optimization, whether it’s myopic, isolating the objective of an optimizer, and essentially the entire class of properties with a singular answer for the entire model[3].

I prefer to distinguish these directions as low-level and high-level interpretability respectively for descriptive clarity. The framing of best-case and worst-case transparency as laid out in the transparency tech tree also points at the same concept. I expect that both directions are aiming at the same goal, but working top-down seems pretty tractable at least for some high-level targets that don’t require lots of deconfusion to even know what we’re looking for.

Ideally, an approach to high-level interpretability would be robust to optimization pressure and deception (in other words, to gradient descent and adversarial mesa optimizers), but I think there’s a lot we can learn toward that ideal from more naive approaches. In this post, I describe my thoughts on trying to identify the high-level structure in a network corresponding to a deep reinforcement learning model’s objective.

I didn’t set out on this particular approach expecting to succeed at the overall goal (I don’t expect interpretability to be that easy). Instead, I hoped that trying out a bunch of naive directions would give us insight for future high-level interpretability work. To that end, I try to lay out my reasoning at various points while thinking about this, gaps and all.

I think there are plenty of plausible avenues to try out here, and for the most part I will only describe the problems with the ones I thought of. Further, the directions I tried seem fairly obvious, and there are likely much better methods to be found given more thought.

A naive approach and patching

Take the case of a small RL model that can optimize for a reward in a given environment. What we want is information about how the model’s objective is represented in the network’s weights. Concretely, we might think about what kind of tests or processes disproportionately affect these weights more than any other, and try to isolate the information we want from there.

One approach we might try is:

  • Train the initialized model on reward RA for an extended period of time. Let’s call the model at the end of this step MA. We may think of MA as having learned a pretty sophisticated world model at this point, such that further training wouldn’t update it strongly.
  • Train the model MA on a different reward RB in the same environment[4] for enough timesteps to achieve non-trivial performance. Ideally, RB would be as orthogonal as possible to RA[5]. Let’s call the model at the end of this step MB.

Naively, one might expect the update signal from MA to MB to correspond to changes in the model’s internal objective. After all, given that the model already understood the environment well enough after the first training period, it can be intuitive to think that the primary element changing during the second period is what the model should try to optimize for.
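
To make "the update signal from MA to MB" concrete, here is a minimal sketch, assuming each checkpoint is available as a dictionary of NumPy arrays with matching keys. The helper names and the layer names `encoder` and `policy_head` are hypothetical, chosen only for illustration; this is not the code used for the experiments.

```python
import numpy as np

def update_signal(weights_a, weights_b):
    """Per-tensor difference M_B - M_A between two checkpoints with matching keys."""
    return {name: weights_b[name] - weights_a[name] for name in weights_a}

def relative_change(weights_a, weights_b, eps=1e-12):
    """Norm of each tensor's update, relative to that tensor's norm in M_A."""
    signal = update_signal(weights_a, weights_b)
    return {
        name: float(np.linalg.norm(delta) / (np.linalg.norm(weights_a[name]) + eps))
        for name, delta in signal.items()
    }

# Toy checkpoints (hypothetical values): only the policy head moves during the R_B phase.
m_a = {"encoder": np.array([1.0, 2.0]), "policy_head": np.array([3.0, 4.0])}
m_b = {"encoder": np.array([1.0, 2.0]), "policy_head": np.array([3.0, 9.0])}

signal = update_signal(m_a, m_b)
changes = relative_change(m_a, m_b)
```

The naive hope would be that the tensors with the largest relative change are the ones encoding the objective.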

This method fails, however[6], because it relies on a few questionable assumptions. For one, it requires that the model’s policy be generated at runtime by inference from its knowledge of the environment and its objective, rather than being a component of the model that is itself changed through updates. Both a priori and empirically, this seems unlikely to be true.

Is this a problem that we can patch easily? One direction I considered along these lines:

  • Take the difference of the weights between MB and MA (you can view this as all the updates the model received during the second round of training), and subtract it from the weights of MA. Let’s call the model you get after subtracting these updates MB*. Under the reasoning above, one might expect MB* to now optimize for minimizing RB.
  • But since subtracting these updates messes with other components like the model’s policy, even if the model now leans toward minimizing RB, it’ll be drowned out by generally terrible performance. If you could then train MB* for a few timesteps (say, on RA) to account for capability loss, then if there is a lean toward minimization, it should become visible.
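
In weight terms, the construction above amounts to MB* = MA - (MB - MA). A minimal sketch, assuming checkpoints are dictionaries of NumPy arrays with matching keys (the function name and the toy values are hypothetical):

```python
import numpy as np

def invert_update(weights_a, weights_b):
    """Return M_B* = M_A - (M_B - M_A): M_A with the second-phase updates subtracted."""
    return {
        name: weights_a[name] - (weights_b[name] - weights_a[name])
        for name in weights_a
    }

# Toy single-tensor checkpoints (hypothetical values).
m_a = {"w": np.array([1.0, 2.0])}
m_b = {"w": np.array([1.5, 1.0])}
m_b_star = invert_update(m_a, m_b)  # "w" becomes [0.5, 3.0]
```

Geometrically, MB* is the reflection of MB through MA in weight space. The brief retraining on RA and the subsequent evaluation on RB depend on the RL setup and are not shown here.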

On testing, however, the retrained MB* does not show any visible inclination like this. In retrospect, that makes sense - the idea relied on the assumption that the internal representation of the objective is bidirectional, i.e. that the parameter-reward mapping is linear. A high-level update signal in one direction doesn’t necessitate that the inverted signal results in the inverted direction. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.
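
One way to see why linearity matters is a second-order Taylor expansion. The notation here is an assumption introduced for illustration: write θ_A and θ_B for the weights of MA and MB, Δ = θ_B − θ_A for the accumulated update, R_B(θ) for the expected RB-return as a function of the weights, and H for the Hessian of R_B at θ_A.

```latex
\begin{aligned}
R_B(\theta_A + \Delta) &= R_B(\theta_A) + \nabla R_B(\theta_A)^{\top}\Delta
  + \tfrac{1}{2}\,\Delta^{\top} H \Delta + O(\lVert\Delta\rVert^{3}),\\
R_B(\theta_A - \Delta) &= R_B(\theta_A) - \nabla R_B(\theta_A)^{\top}\Delta
  + \tfrac{1}{2}\,\Delta^{\top} H \Delta + O(\lVert\Delta\rVert^{3}).
\end{aligned}
```

Only the first-order term changes sign when the update is inverted; the second-order term is identical in both directions. Since Δ is accumulated over many training steps, there is no reason to expect it to be small enough for the first-order term to dominate.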

Another reason why patching this would be difficult is that there are a number of other potential confounding changes to the model’s internals. Policy changes are only one; there may also be other changes, such as refocusing on parts of the world model at higher granularity, more efficient modes of compressing the world model, phase changes as the model discovers new modes of reasoning, and other unknowns.

All this means that if there is information contained in the update signal corresponding to changes in the objective, it faces strong interference from other factors. Is it nonetheless possible to extract this information using more creative methods?

Thoughts, and a (potentially) promising approach

So far, I’ve been talking about one central idea - devising a training procedure that affects the internal objective differently from the rest of the network in some identifiable way. The naive approach above was too noisy to work for this, but I never expected the first method (or convoluted patches on top of it) to make actual object-level progress on the problem of isolating this internal objective.

What I expected from these experiments was to clarify some of my own thoughts on objective representation in optimizers (and more broadly, optimizers themselves), and, with non-trivial probability, to gain new insights on what might and might not work or better framings for tackling this problem. I think I succeeded on the former front and probably failed on the latter.

That said, the idea I’m about to describe was one I had when I started working on this project - in retrospect, it still seems like one worth pursuing, though for the same reason of being a promising way to gain more insight into better directions (look, this line of research is really unexplored).

I hinted earlier that there are many confounders that skew any signal we might be able to extract. My current guess, however, is that this is a quantitative problem more than a qualitative one - I don’t know whether we will be able to iteratively extract a purer signal corresponding to a model’s objective, but it seems plausible that accounting for all the confounders we can will make relevant insights more easily accessible. (Measuring a lot of things would also be of use here, and is the intent behind my framing in this post.)

Several of these we can account for through more precise engineering - in other words, the obvious things I wasn’t doing because I either missed them entirely or wanted to get quick results and a lot of data without optimizing the implementation too hard. If someone finds this line of research exciting and wants to work on it, please reach out!

Some confounders, however, will likely require more creative approaches to solve. This is where this section’s idea comes into play: what would we see if we superpose different contexts in which the model’s objective should be activated?

For example, we could train copies of the model MA on different RBs, and smooth out the update signal from all of these contexts. Ideally, if the RBs we choose are orthogonal enough to each other, this should result in several confounders canceling each other out.

This can also be framed as (although is not necessarily equivalent to) saying that in all of these contexts, the objective is plausibly one of the few things that we can control to change in a desired mechanistic manner, and thus we can select a smoothing mechanism that filters for this.
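
As a rough illustration of what one such smoothing mechanism could look like, the sketch below averages the per-parameter magnitude of the update across runs. Everything in it is an assumption made for illustration: the choice of mean absolute update as the smoother, the flat parameter vector, and the synthetic data, in which a fixed set of "objective" parameters changes in every run while confounders hit a different random subset each time. Whether real update signals have this structure is exactly the open question.

```python
import numpy as np

def smoothed_relevance(deltas):
    """Average per-parameter |update| across runs, scaled so the maximum is 1."""
    mean_abs = np.stack([np.abs(d) for d in deltas]).mean(axis=0)
    peak = mean_abs.max()
    return mean_abs / peak if peak > 0 else mean_abs

# Synthetic stand-ins for (M_B_i - M_A), one per copy of M_A trained on a different R_B.
rng = np.random.default_rng(0)
n_params, n_runs = 200, 12
objective_idx = np.arange(10)  # hypothetical location of the objective

deltas = []
for _ in range(n_runs):
    delta = np.zeros(n_params)
    # Assumed: every run moves the objective parameters, in a run-specific direction.
    delta[objective_idx] = rng.choice([-1.0, 1.0], size=objective_idx.size)
    # Assumed: confounders touch a different random subset of the other parameters each run.
    confounded = rng.choice(np.arange(10, n_params), size=20, replace=False)
    delta[confounded] += rng.normal(0.0, 0.5, size=confounded.size)
    deltas.append(delta)

relevance = smoothed_relevance(deltas)
top = set(np.argsort(relevance)[-10:].tolist())  # ten highest-relevance parameters
```

Note that averaging the raw signed updates instead would wash out the objective parameters as well in this toy setup, since their direction differs from run to run; that is why the sketch averages magnitudes.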

As mentioned earlier, this is only one approach out of many that may be promising, and there’s definitely a lot of space to be covered here in terms of other directions. Even if you disagree on these approaches being useful, the core idea seems very probably important, and seems to me to inspire generators for many promising approaches.

A very late appendix: what's an objective?

Now for one of the sections I should’ve worked on before anything else in this post, which I’m instead writing weeks later. What exactly are we trying to find?

Without thinking about it too hard - which I hadn’t - you might handwave this away under the assumption that it won’t be that complicated. In my case, I thought something along the lines of “the internal mechanistic notion of objective a mesa-optimizer can use for planning or search, however sparsely that may be represented”.

This is troublesome for two reasons - first that this description and the intuitions it invokes as stated may not even be true; and second that even if we do have a good broad description, understanding it with greater granularity and what properties it would imply would be extremely helpful, plausibly crucial, to even coming up with useful ways to bang stuff against the wall to see what sticks.

So what are the scenarios in which the above description isn’t strictly accurate? 

Objectives as pointers

One thing that comes to mind is that it’s possible that the models we care about don’t learn their objectives internally, and instead contain pointers to the environment that inform their understanding of the reward function.

If this turns out to be the case, then any approach that seeks to identify the objective of an optimizer would have to be broad enough to identify those pointers and their mechanism with sufficient fidelity to gain the kind of insights we want in the other case. For example, we might need to identify the pointer inside the network / what it’s pointing to, and understand how they inform the model’s notion of objective, from the reward channel.

I think this isn’t going to be a significantly harder problem than just identifying a concrete objective structure in the network, but the two may call for different approaches - the kind of approach described in this post, for example, likely would not be sufficient. We also need more empirical work on which kinds of models learn objectives in which way to make progress on this front, but for now it’s a consideration that future approaches should take into account.

Objectives in shards

That isn’t the only possibility, however. Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

For example (there’s a fair amount I’m still thinking about for this case, so there could be stuff I’m missing), we could end up with complex stews of contextually activated heuristics or subroutines chiseled through the outer objective that execute specific cognition without necessarily learning an internal objective to use.

It may be the case that any model we actually have cause to worry about - that has the capacity to do long-term deceptive planning - requires mechanistic internalization of some objective which the mesa-optimizer can use, faster than gradient descent can activate specific subroutines for this. While I find this intuitively compelling, I haven’t thought deeply enough about what’s necessary for planning to be entirely certain that this must be true, and even then it restricts the class of models useful to test with.

Luckily, thinking deeply about that is part of what I’m currently spending time on with this general approach, which is a lot of conceptual legwork on coming up with as formal/clear a treatment of a mechanistic objective in any network as is sufficient for somewhat promising experiments. I originally picked this particular high-level interpretability target over something like identifying myopia or agency because it felt like it involved less deconfusion work, and I still think that’s true - just that I overshot into the “requires no deconfusion” territory. Correcting for that, I’m relatively optimistic about this line of work.

  1. ^

    This could be a specific component of the model, or it could be sparsely represented across all the weights of the network - we want a setup that can work with both. What we get could look like a map of parameters and how relevant they are to an internal notion of objective - like a heatmap of the model.

  2. ^

    If your reaction is that I should be a lot more concrete about exactly what we’re looking for and what properties it might imply - I agree. This is one place I should have given a lot more thought to before jumping into experimentation. Refer to the last section for more details on this.

  3. ^

    Note that the answer itself may not be single-headed - in the case where we want to identify how a predictive model makes its predictions for example, the answer could involve multiple modes of computation such as direct simulation, heuristics, etc. The important distinction here is that all of this information is still part of the answer to the question of how the model makes predictions.

  4. ^

    Where RB could either be a single new reward or a combination of RA with a new component - hereafter I will just use RB to refer to the new component in the latter case for brevity.

  5. ^

    While working on this, I wanted to run a quick implementation and therefore did not think of this early enough or account for this well. This is one of a few instances where potential improvements on my approach seem obvious. 

  6. ^

    I plotted way too many graphs with far too much data that individually don't turn out to be all that informative. I'd be happy to share them if anyone wants, but I think they'd just clutter up this post without adding much.

5 comments

On testing, however, the retrained MB* does not show any visible inclination like this. In retrospect, that made sense - it relied on the assumption that the internal representation of the objective is bidirectional, that the parameter-reward mapping is linear. A high-level update signal in one direction doesn’t necessitate that the inverted signal results in the inverted direction. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.

I think it's improbable that agents internalize a single objective, but I applaud your concrete hypothesis and then going out to test it. I'm very excited about people trying to predict what algorithms a policy net will be running, thereby grounding out "mesa objectives" and other such talk in terms of falsifiable predictions about internal cognition and thus generalization behavior (e.g. going towards coin or going towards right or something else weirder than that).

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains RL on. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful inasmuch as the model already has some sort of internal "optimizer" or search process. And before that, simple "heuristics" (or shards) should suffice, especially in small training regimes.

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

Yeah, IMO "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.