Alex Turner

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

Comments

Without being familiar with the literature, why should I buy that we can informally reason about what is "low-frequency" versus "high-frequency" behavior? I think reasoning about "simplicity" has historically gone astray, and worry that this kind of reasoning will as well.

That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive". 

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:

I don’t see a notion of “objective” that can confidently be claimed to be:

  1. Probable: there is a good argument that the systems we build will have an “objective”, and
  2. Predictive: If I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).

Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually: 

  1. The assumptions used in the post ("learns a randomly-selected training-compatible goal") assign low probability to experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization),
  2. Therefore, the assumptions become less probable, and
  3. Therefore, power-seeking becomes less probable (at least via these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large). A toy sketch of this update follows below.
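
To make the direction of this update concrete, here's a toy Bayes calculation. The specific numbers are made up purely for illustration; only the relative likelihoods matter.

```python
# Toy Bayesian update illustrating the argument above.
# All numbers are illustrative -- only the direction of the update matters.

# Prior credence in "the policy learned a randomly-selected training-compatible
# goal" versus some alternative story about generalization.
p_compatible = 0.5
p_alternative = 0.5

# How strongly each hypothesis predicted the observed test behavior
# (the policy tends toward the top-right 5x5 and seeks cheese once there).
likelihood_compatible = 0.01   # assigns little mass to the observation
likelihood_alternative = 0.50  # assigns much more

# Posterior via Bayes' rule, normalizing over the two hypotheses.
evidence = p_compatible * likelihood_compatible + p_alternative * likelihood_alternative
posterior_compatible = p_compatible * likelihood_compatible / evidence
posterior_alternative = p_alternative * likelihood_alternative / evidence

print(f"P(training-compatible goal | data) ~ {posterior_compatible:.3f}")  # ~0.020
print(f"P(alternative hypothesis | data)   ~ {posterior_alternative:.3f}")  # ~0.980
```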

I suspect that you agree that "learns a training-compatible goal" isn't very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the "can" in "Power-seeking can be probable and predictive." 

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims.

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made predictions roughly like "the policy tends to go to the top-right 5x5, and searches for cheese once there, because that's where the cheese-seeking computations were more strongly historically reinforced" and "the policy sometimes pursues cheese and sometimes navigates to the top-right 5x5 corner." These predictions are (informally) gradable, even if the underlying intuitions are informal. 

As it pertains to shard theory more broadly, though, I agree that more precision is needed. Increasing precision and formalism is the reason I proposed and executed the project underpinning Understanding and controlling a maze-solving policy network. I wanted to understand more about realistic motivational circuitry and model internals in the real world. I think the last few months have given me headway on a more mechanistic definition of a "shard-based agent."

RL creates agents, and RL seemed to be the way to AGI. In the 2010s, reinforcement learning was the dominant paradigm for those interested in AGI (e.g. OpenAI). RL lends itself naturally to creating agents that pursue rewards/utility/objectives. So there was reason to expect that agentic AI would be the first (and, by the theoretical arguments, last) form that superintelligence would take.

Why are you confident that RL creates agents? Is it the non-stochasticity of optimal policies for almost all reward functions? The on-policy data collection of PPO? I think there are a few valid reasons to suspect that, but this excerpt seems surprisingly confident. 

I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are!

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo to activation_additions.

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

In addition to my other comment, I'll further quote Behavioural statistics for a maze-solving agent:

We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a description of the network's decisions as coalitions between heuristic submodules whose voting-power varies based on context. While this is still an underdeveloped hypothesis, it's motivated by two lines of thinking.

First, we weakly suspect that the agent may be systematically dynamically inconsistent from a utility-theoretic perspective. That is, the effects of  and (potentially)  might turn out to call for a behavior model where the agent's priorities in a given maze change based on the agent's current location. 

Second, we suspect that even if the agent is dynamically consistent, a shard-like description may allow for a more compact and natural statement of an otherwise very gerrymandered-sounding utility function that fixes the value of cheese and top-right in a maze based on a "strange" mixture of maze properties. It may be helpful to look at these properties in terms of similarities to the historical activation conditions of different submodules that favor different plans.

While we consider our evidence suggestive in these directions, it's possible that some simple but clever utility function will turn out to be predictively successful.  For example, consider our two strongly observed effects: and . We might explain these effects by stipulating that: 

  • On each turn, the agent receives value inverse to the agent's distance from the top-right, 
  • Sharing a square with the cheese adds constant value, 
  • The agent doesn't know that getting to the cheese ends the game early, and 
  • The agent time-discounts. 

We're somewhat skeptical that models of this kind will hold up once you crunch the numbers and look at scenario-predictions, but they deserve a fair shot. 

We hope to revisit these questions rigorously when our mechanistic understanding of the network has matured. 
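
As a rough illustration of what such a "simple but clever utility function" could look like, here is a minimal sketch of the four-bullet model above. All constants, the discount rate, and the distance inputs are hypothetical placeholders, not quantities fitted to the actual network.

```python
# Minimal sketch of the utility model suggested by the bullets above.
# Constants and inputs are hypothetical, not fitted to the maze-solving network.

GAMMA = 0.95            # per-turn time discount
TOP_RIGHT_WEIGHT = 1.0  # weight on proximity to the top-right corner
CHEESE_BONUS = 5.0      # constant value for sharing a square with the cheese

def step_value(dist_to_top_right: float, on_cheese: bool) -> float:
    """Per-turn value: inverse distance to the top-right, plus a cheese bonus.

    The agent is modeled as not knowing that reaching the cheese ends the
    episode, so the cheese square is just another (highly valued) square.
    """
    proximity_term = TOP_RIGHT_WEIGHT / (1.0 + dist_to_top_right)
    cheese_term = CHEESE_BONUS if on_cheese else 0.0
    return proximity_term + cheese_term

def trajectory_utility(path) -> float:
    """Discounted sum of per-turn values over a candidate trajectory.

    `path` is a sequence of (dist_to_top_right, on_cheese) tuples.
    """
    return sum(GAMMA ** t * step_value(d, c) for t, (d, c) in enumerate(path))

# Example: compare a path that detours to the cheese against one that heads
# straight for the top-right corner. Whether such a model actually predicts
# the network's choices is exactly the "crunch the numbers" question above.
detour_to_cheese = [(6, False), (5, False), (4, True), (4, True)]
straight_to_corner = [(3, False), (2, False), (1, False), (0, False)]
print(trajectory_utility(detour_to_cheese), trajectory_utility(straight_to_corner))
```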

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function.

Can you say more? Maybe give an example of what this looks like in the maze-solving regime?

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

This is a fair question, because I left a lot to the reader. I'll clarify now.

I was not claiming that you can't, after the fact, rationalize observed behavior using the extremely flexible reward-maximization framework. 

I was responding to the specific assumption that the network internally represents a 'training-compatible' reward function. In evaluating this claim, we shouldn't just check whether it is technically compatible with empirical results; we should instead reason probabilistically: how strongly does this claim predict the observed data, relative to other models of policy formation?

In the maze setting, the cheese was always in the top-right 5x5 corner. The reward was sparse and only used to update the network when the mouse hit the cheese. The "training compatible goal set" is unconstrained on the test set. An example element might agree with the training reward on the training distribution, and then outside of the training distribution, assign 1 reward iff the mouse is on the bottom-left square.
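
To make "an example element" concrete, here is a toy sketch of one such training-compatible reward function. The state fields and region check are hypothetical simplifications of the actual environment, chosen only to illustrate the point.

```python
from dataclasses import dataclass

@dataclass
class MazeState:
    """Hypothetical maze observation; field names are illustrative only."""
    mouse_on_cheese: bool
    mouse_on_bottom_left: bool
    cheese_in_top_right_5x5: bool

def training_reward(state: MazeState) -> float:
    """Sparse training reward: 1 when the mouse reaches the cheese, else 0."""
    return 1.0 if state.mouse_on_cheese else 0.0

def a_training_compatible_reward(state: MazeState) -> float:
    """One element of the (huge) training-compatible goal set.

    It agrees with the training reward on the training distribution (cheese
    in the top-right 5x5), but off-distribution it instead rewards sitting on
    the bottom-left square -- one arbitrary choice among the many functions
    compatible with the training signal.
    """
    if state.cheese_in_top_right_5x5:  # situations resembling training
        return training_reward(state)
    return 1.0 if state.mouse_on_bottom_left else 0.0
```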

The vast majority of such unconstrained functions will not involve pursuing cheese reliably across levels, and most of these reward functions will not be optimized by going to the top-right part of the maze. So this "training-compatible" hypothesis barely assigns any probability to the observed generalization of the network. 

However, other hypotheses -- like "the policy develops motivations related to obvious correlates of its historical reinforcement signals"[1] -- predict things like "the policy tends to go to the top-right 5x5, and searches for cheese more strongly once there." I registered such a prediction before seeing any of the generalization behavior. This hypothesis assigns high probability to the observed results.

So this paper's assumption is simply losing out in a predictive sense, and that's what I was critiquing. One can nearly always rationalize behavior as optimizing some reward function which you come up with after the fact. But if you want to predict generalization ahead of time, you shouldn't use this assumption in your reasoning.

Second, I think the network does not internally represent and optimize a reward function. I think that this representation claim is in some (but not total and undeniable) tension with our interpretability results. I am willing to take bets against you on the internal structure of the maze-solving nets. 

  1. ^

    You might respond "but this is informal." Yes. My answer is that it's better to be informal and right than to be formal and wrong. 

To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes through if you strike out that talk.

I'm responding to this post, so why should I strike that out? 

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

The post is talking about internal representations.

Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them, etc., something she has never done before.

Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water.

I've been wondering about the latter for a while. These two results are less strongly predicted by shard-theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, which IMO point to other mechanisms (or I'm missing some implications of pure-RL+SL shard theory).
