Alex Turner

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.


Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact


What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

In addition to my other comment, I'll further quote Behavioural statistics for a maze-solving agent:

We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a description of the network's decisions as coalitions between heuristic submodules whose voting-power varies based on context. While this is still an underdeveloped hypothesis, it's motivated by two lines of thinking.

First, we weakly suspect that the agent may be systematically dynamically inconsistent from a utility-theoretic perspective. That is, the effects of  and (potentially)  might turn out to call for a behavior model where the agent's priorities in a given maze change based on the agent's current location. 

Second, we suspect that if the agent is dynamically consistent, a shard-like description may allow for a more compact and natural statement of an otherwise very gerrymandered-sounding utility function that fixes the value of cheese and top-right in a maze based on a "strange" mixture of maze properties. It may be helpful to look at these properties in terms of similarities to the historical activation conditions of different submodules that favor different plans.

While we consider our evidence suggestive in these directions, it's possible that some simple but clever utility function will turn out to be predictively successful.  For example, consider our two strongly observed effects: and . We might explain these effects by stipulating that: 

  • On each turn, the agent receives value inverse to the agent's distance from the top-right, 
  • Sharing a square with the cheese adds constant value, 
  • The agent doesn't know that getting to the cheese ends the game early, and 
  • The agent time-discounts. 

We're somewhat skeptical that models of this kind will hold up once you crunch the numbers and look at scenario-predictions, but they deserve a fair shot. 

We hope to revisit these questions rigorously when our mechanistic understanding of the network has matured. 

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function.

Can you say more? Maybe give an example of what this looks like in the maze-solving regime?

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

This is a fair question, because I left a lot to the reader. I'll clarify now.

I was not claiming that you can't, after the fact, rationalize observed behavior using the extremely flexible reward-maximization framework. 

I was responding to the specific claim of assuming internal representation of a 'training-compatible' reward function. In evaluating this claim, we shouldn't just see whether this claim is technically compatible with empirical results, but we should instead reason probabilistically. How strongly does this claim predict observed data, relative to other models of policy formation?

In the maze setting, the cheese was always in the top-right 5x5 corner. The reward was sparse and only used to update the network when the mouse hit the cheese. The "training compatible goal set" is unconstrained on the test set. An example element might agree with the training reward on the training distribution, and then outside of the training distribution, assign 1 reward iff the mouse is on the bottom-left square.

The vast majority of such unconstrained functions will not involve pursuing cheese reliably across levels, and most of these reward functions will not be optimized by going to the top-right part of the maze. So this "training-compatible" hypothesis barely assigns any probability to the observed generalization of the network. 

However, other hypotheses -- like "the policy develops motivations related to obvious correlates of its historical reinforcement signals"[1] -- predict things like "the policy tends to go to the top-right 5x5, and searches for cheese more strongly once there." I registered such a prediction before seeing any of the generalization behavior. This hypothesis assigns high probability to the observed results.

So this paper's assumption is simply losing out in a predictive sense, and that's what I was critiquing. One can nearly always rationalize behavior as optimizing some reward function which you come up with after the fact. But if you want to predict generalization ahead of time, you shouldn't use this assumption in your reasoning.

Second, I think the network does not internally represent and optimize a reward function. I think that this representation claim is in some (but not total and undeniable) tension with our interpretability results. I am willing to take bets against you on the internal structure of the maze-solving nets. 

  1. ^

    You might respond "but this is informal." Yes. My answer is that it's better to be informal and right than to be formal and wrong. 

To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

I'm responding to this post, so why should I strike that out? 

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

The post is talking about internal representations.

Physiological events associated with pregnancy (mostly hormones) rewires the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them etc., something she has never done before.

Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water

I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).

"There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.

You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:

  • Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output. 
  • Layer 16: Noticeable tendency to mention French, and even talk in "French" a bit. 
  • Layer 20: Switches to French relatively quickly.

Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet. EDIT: Setting each activation addition to about .8 still works, but .5 doesn't. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema. 

However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models. 

In sum, we don't actually yet have a demonstration of "switches fluently to French and keeps making sense", but this schema seems very promising. Great work again.

You can look at what I did at this colab. It is a very short colab.

Your colab's "Check it can speak French" section seems to be a stub.

So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.

I think there's a way better third alternative: asking each reader to unilaterally switch to "policy." No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don't see a case for using "agent" in the mentioned cases. 

I added to the post:

Don't wait for everyone to coordinate on saying "policy." You can switch to "policy" right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I've enjoyed these benefits for a month. The switch didn't cause communication difficulties.

I think the embodiment distinction is interesting and hadn't thought of it before (note that I didn't understand your point until reading the replies to your comment). I'm not yet sure if I find this distinction worth making, though. I'd refer to the embodied system as a "trained system" or -- after reading your suggestion -- an "embodiment." Neither feels quite right to me, though.

I think you can be more generous in your interpretation of RL experts' words and read less error in.

What other, more favorable interpretations might I consider?

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.

I'm open to suggestions on how to phrase this differently when I next give this talk.

It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes. 

If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking.

For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence.

My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and agree it has strong benefits in the situations you highlight.

Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

I think that it's locally valid to point out "under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper." But I feel a tad queasy about the overall point, since I don't think alignment's difficulty has much to do with the difficulties pointed out by "Optimal Policies Tend to Seek Power." I feel better about saying "Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble." Maybe that's what you already communicate, though.

Load More