Alex Turner

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.


Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact



Reward is not the optimization target

(Haven't checked out Agent57 in particular, but I expect it not to have the "actually optimizes reward" property in the cases I argue against in the post.)

Reward is not the optimization target

which we published three years ago and started writing four years ago, is extremely explicit that we don't know how to get an agent that is actually optimizing for a specified reward function.

That isn't the main point I had in mind. See my comment to Chris here.


This is precisely the point I make in “How do we become confident in the safety of a machine learning system?” btw.

Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?

Reward is not the optimization target

"Wireheading is improbable" is only half of the point of the essay. 

The other main point is "reward functions are not the same type of object as utility functions." I haven't reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as "objectives":

The particular type of robustness problem that mesa-optimization falls into
is the reward-result gap, the gap between the reward for which the system was
trained (the base objective) and the reward that can be reconstructed from it using
inverse reinforcement learning (the behavioral objective).


The assumption in that work is that a monotonic relationship between
the learned reward and true reward indicates alignment, whereas deviations from
that suggest misalignment. Building on this sort of research, better theoretical
measures of alignment might someday allow us to speak concretely in terms of
provable guarantees about the extent to which a mesa-optimizer is aligned with the
base optimizer that created it.

Which is reasonable parlance, given that everyone else uses it, but I don't find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an 'objective' at all.
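A minimal sketch of the "reward as a series of updates" framing, assuming a vanilla REINFORCE step on a softmax policy over three discrete actions. This is a textbook-style illustration, not code from the post or from RFLO; the names `softmax` and `reinforce_update` and the learning rate are hypothetical. What it shows is only that, in this kind of update rule, the reward appears as a scalar multiplying the gradient of the log-probability of the action that was actually taken. The policy never receives the reward as an input.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step for a softmax policy (illustrative assumption).

    The gradient of log pi(action) with respect to logit i is
    (1 if i == action else 0) - pi(i). The reward only scales that
    gradient; it is not something the policy observes or represents.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]

# Positive reward after taking action 1: action 1 becomes more likely.
updated = reinforce_update(logits, action=1, reward=1.0)

# Zero reward: no update at all, whatever action was taken.
unchanged = reinforce_update(logits, action=1, reward=0.0)
```

Under these assumptions, the reward's only role is to decide how strongly the computation that produced the sampled action gets reinforced. Whether the resulting policy ends up representing or pursuing reward is a separate question that this update rule does not settle.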

(You might have privately known about this distinction. Fine by me! But I can't back it out from a skim of RFLO, even already knowing the insight and looking for it.)

Reward is not the optimization target

I perceive you as saying "These statements can make sense." If so, the point isn't that they can't be viewed as correct in some sense—that no one sane could possibly emit such statements. The point is that these quotes are indicative of misunderstanding the points of this essay. That if someone says a point as quoted, that's unfavorable evidence on this question. 

This describes some possible goals, and I don't see why you think the goals listed are impossible (and don't think they are).

I wasn't implying they're impossible, I was implying that this is somewhat misguided. Animals learn to achieve goals like "optimizing... the expected sum of future rewards"? That's exactly what I'm arguing against as improbable.

TurnTrout's shortform feed

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).


  • I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don't cash out to being strictly about diamonds or people, even if the overall agent is mostly motivated in terms of diamonds or people. 
  • Agents might also "terminalize" instrumental subgoals by caching computations (e.g. cache the heuristic that dying is bad, without recalculating from first principles for every plan in which you might die).
  • Therefore, I expect this value-spread to be convergently hard to avoid.

Reward is not the optimization target

Actually, while I did recheck the Reward is Enough paper, I think I did misunderstand part of it in a way which wasn't obvious to me while I reread, which makes the paper much less egregious. I am updating that you are correct and I am not spending enough effort on favorably interpreting existing discourse. 

I still disagree with parts of that essay and still think Sutton & co don't understand the key points. I still think you underestimate how much people don't get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).

Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won't get embedded as a terminal goal, but the idea that it needs to be "magically spawned" is very strawmanny.

Agreed on both counts for your first sentence. 

The "and" in "reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts" is doing important work; "magically" is meant to apply to the conjunction of the clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added "reinforce those reward-focused thoughts into terminal values." Would that have been clearer? (I also have gone ahead and replaced "magically" with "automatically.")

Reward is not the optimization target

This specific point is why I said "relatively" little idea, and not zero idea. You have defended the common-sense version of "improving" a reward function (which I agree with: don't reward obviously bad things), but I perceive you to have originally made a much more aggressive and speculative claim, which is something like "'amplified' reward signals are improvements over non-'amplified' reward signals" (which might well be true, but how would we know?).

Reward is not the optimization target

These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent's easy reach, and the agent doesn't explore into the button early in training, by the time it's smart enough to model the effects of the distant reward button, the agent won't want to go mash the button as fast as possible.

Reward is not the optimization target

I think fewer other people were making this mistake than you expect (including people in the standard field of RL)

I think that few people understand these points already. If RL professionals did understand this point, there would be pushback on Reward is Enough from RL professionals pointing out that reward is not the optimization target. After 15 minutes of searching, I found no one making the counterpoint. I mean, that thesis is just so wrong, and it's by famous researchers, and no one points out the obvious error.

RL researchers don't get it.[1] It's not complicated to me. 

(Do you know of any instance at all of someone else (outside of alignment) making the points in this post?)

for reasons that Paul laid out above.

Currently not convinced by / properly understanding Paul's counterpoints.

  1. ^ Although I flag that we might be considering different kinds of "getting it", where by my lights, "getting it" means "not consistently emitting statements which contravene the points of this post", while you might consider "if pressed on the issue, will admit reward is not the optimization target" to be "getting it."

Reward is not the optimization target

When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way

I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.

I do in fact think that few people actually already deeply internalized the points I'm making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed. 

I did preface "Here are some major updates which I made:". The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya's post for disagreements, but I thought it'd be better to say "here's a new frame" and less "here's what I think you have been doing wrong." So I just stated the important downstream implication: Be very, very careful in analyzing prior alignment thinking on RL+DL.  

I now think that, even though there's some sense in which in theory "building good cognition within the agent" is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we'd like them to do - and we have very few other mechanisms for doing so.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent. Does an "amplified" reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it? 

I think it's easy to say "and we have improved the reward function", but this is true exactly to the extent to which the reward schedule actually produces more desirable cognition within the AI. Which comes back to my point: Build good cognition, and don't lose track of the fact that that's the ultimate goal. Find ways to better understand how reward schedules + data -> inner values.

(I agree with your excerpt, but I suspect it makes the case too mildly to correct the enormous mistakes I perceive to be made by substantial amounts of alignment thinking.)
