This insight was made possible by many conversations with Quintin Pope, where he challenged my implicit assumptions about alignment. I’m not sure who came up with this particular idea.

In this essay, I call an agent a “reward optimizer” if it not only gets lots of reward, but if it reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn't contain an explicit representation of reward, or implement a search process for reward.

Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal. — Reinforcement learning: An introduction 

Many people[1] seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents.[2]  

Separately, as far as I can tell, most[3] practitioners usually view reward as encoding the relative utilities of states and actions (e.g. it’s this good to have all the trash put away), as opposed to imposing a reinforcement schedule which builds certain computational edifices inside the model (e.g. reward for picking up trash → reinforce trash-recognition and trash-seeking and trash-putting-away subroutines). I think the former view is almost always inappropriate, because reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it. 

Therefore, reward is not the optimization target in two senses:

  1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
  2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of reinforcing the computations which led to it. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

Reward probably won’t be a deep RL agent’s primary optimization target

After work, you grab pizza with your friends. You eat a bite. The taste releases reward in your brain, which triggers credit assignment. Credit assignment identifies which thoughts and decisions were responsible for the release of that reward, and makes those decisions more likely to happen in similar situations in the future. Perhaps you had thoughts like 

  • “It’ll be fun to hang out with my friends” and 
  • “The pizza shop is nearby” and 
  • “Since I just ordered food at a cash register, execute motor-subroutine-#51241 to take out my wallet” and 
  • “If the pizza is in front of me and it’s mine and I’m hungry, raise the slice to my mouth” and 
  • “If the slice is near my mouth and I’m not already chewing, take a bite.” 

Many of these thoughts will be judged responsible by credit assignment, and thereby become more likely to trigger in the future. This is what reinforcement learning is all about—the reward is the reinforcer of those things which came before it. The reward is reinforcing / locally-improving[4] / generalizing the antecedent computations which are judged relevant by credit assignment. 
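To make the mechanism concrete, here is a minimal sketch of a REINFORCE-style policy-gradient update on a toy tabular policy (the setup and all names are illustrative, not anything from this essay). Note that reward only scales the update which makes the taken action more probable; nothing in the update requires the policy to represent reward internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
logits = np.zeros((n_states, n_actions))   # toy tabular "policy network"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(state, action, reward, lr=0.1):
    """Reward scales the gradient which makes the *taken* action more probable."""
    probs = softmax(logits[state])
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0                  # d log pi(a|s) / d logits[s]
    logits[state] += lr * reward * grad_log_pi  # reinforce the antecedent computation

# One "bite of pizza": whatever decision fired in state 2 gets strengthened.
state = 2
action = int(rng.integers(n_actions))
reinforce_update(state, action, reward=1.0)
print(softmax(logits[state]))  # the probability of the taken action has gone up
```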

Importantly, reward does not automatically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward! 

What kinds of people (or non-tabular agents more generally) will become reward optimizers, such that the agent ends up terminally caring about reward (and little else)? Reconsider the pizza situation, but instead suppose you were thinking thoughts like “this pizza is going to be so rewarding” and “in this situation, eating pizza sure will activate my reward circuitry.” 

You eat the pizza, triggering reward, triggering credit assignment, which correctly locates these reward-focused thoughts as contributing to the release of reward. Therefore, in the future, you will more often take actions because you think they will produce reward, and so you will become more of the kind of person who intrinsically cares about reward. This is a path[5] to reward-optimization and wireheading. 

RL agents which don’t think about reward before getting reward will not become reward optimizers, because there will be no reward-oriented computations for credit assignment to reinforce. 

The siren-like suggestiveness of the word “reward”

Let’s strip away the suggestive word “reward”, and replace it by its substance: antecedent-computation-reinforcer. 

Suppose a human trains an RL agent by pressing the antecedent-computation-reinforcer button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about” the actual world it’s interacting with, and so the antecedent-computation-reinforcer reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”). 

Then suppose this AI models the true fact that the button-pressing produces the antecedent-computation-reinforcer. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”

Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the antecedent-computation-reinforcer!

RL is not, in general, about training antecedent-computation-reinforcer optimizers. 

When is reward the optimization target of the agent?

If reward is guaranteed to become your optimization target, then your learning algorithm can force you to become a drug addict. Let me explain. 

Convergence theorems provide conditions under which a reinforcement learning algorithm is guaranteed to converge to an optimal policy for a reward function. For example, value iteration maintains a table of value estimates for each state s, and iteratively propagates information about that value to the neighbors of s. If a far-away state f has huge reward, then that reward ripples back through the environmental dynamics via this “backup” operation. Nearby parents of f gain value, and then after lots of backups, far-away ancestor-states gain value due to f’s high reward.

Eventually, the “value ripples” settle down. The agent picks an (optimal) policy by acting to maximize the value-estimates for its post-action states.
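As a concrete illustration, here is a minimal value-iteration sketch on a toy chain MDP (a hypothetical setup of my own, chosen only to show the backup operation). A single far-away high-reward state “ripples” its value backward until every earlier state’s value points toward it.

```python
import numpy as np

n_states, gamma = 10, 0.9
reward = np.zeros(n_states)
reward[-1] = 100.0                 # the far-away state f with huge reward
V = np.zeros(n_states)

for _ in range(200):               # repeat backups until the "value ripples" settle
    V_new = V.copy()
    V_new[-1] = reward[-1]         # terminal high-reward state
    for s in range(n_states - 1):
        # two actions: stay at s, or step right to s+1; back up the best successor value
        V_new[s] = reward[s] + gamma * max(V[s], V[s + 1])
    V = V_new

print(V)  # every earlier state's value is the discounted echo of the far-away reward
```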

Suppose it would be extremely rewarding to do drugs, but those drugs are on the other side of the world. Value iteration backs up that high value to your present space-time location, such that your policy necessarily gets at least that much reward. There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine. 

But obviously these conditions aren’t true in the real world. Your learning algorithm doesn’t force you to try drugs. Any AI which e.g. tried every action at least once would quickly kill itself, and so real-world general RL agents won’t explore like that because that would be stupid. So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit.

Anticipated questions

  1. Why won’t early-stage agents think thoughts like “If putting trash away will lead to reward, then execute motor-subroutine-#642”, and then this gets reinforced into reward-focused cognition early on?
    1. Suppose the agent puts away trash in a blue room. Why won’t early-stage agents think thoughts like “If putting trash away will lead to the wall being blue, then execute motor-subroutine-#642”, and then this gets reinforced into blue-wall-focused cognition early on? Why consider either scenario to begin with?
  2. But aren’t we implicitly selecting for agents with high cumulative reward, when we train those agents?
    1. Yeah. But on its own, this argument can’t possibly imply that selected agents will probably be reward optimizers. The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.
      1. "We're selecting for agents on reward  we get an agent which optimizes reward" is locally invalid. "We select for agents on X  we get an agent which optimizes X" is not true for the case of evolution, and so is not true in general. 
      2. Therefore, the argument isn't necessarily true in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
    2. Here’s the more concrete response: Selection isn’t just for agents which get lots of reward. 
      1. For simplicity, consider the case where on the training distribution, the agent gets reward if and only if it reaches a goal state. Then any selection for reward is also selection for reaching the goal. And if the goal is the only red object, then selection for reward is also selection for reaching red objects. 
      2. In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].  
    3. Another reason to not expect the selection argument to work is that, for most inner values an agent could end up with, it’s instrumentally convergent for the agent not to become a wireheader, and not to try hitting the reward button. 
      1. I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward. 
        1. We train agents which intelligently optimize for e.g. putting trash away, and this reinforces trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about antecedent-computation-reinforcement, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button. 
        2. This reasoning follows for most inner goals by instrumental convergence. 
      2. On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
  3. Don’t some people terminally care about reward?
    1. I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start reinforcing computations about reward after it has reinforced other kinds of computations (e.g. putting away trash). More on this in later essays.
  4. But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
    1. Then keep the button away from the AI until it can model the effects of hitting the antecedent-computation-reinforcer button.[7]
    2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
  5. AIXI—
    1. will always kill you and then wirehead forever, unless you gave it something like a constant reward function.
    2. And, IMO, this fact is not practically relevant to alignment. AIXI is explicitly a reward-maximizer. As far as I know, AIXI(-tl) is not the limiting form of any kind of real-world intelligence trained via reinforcement learning.
  6. Does the choice of RL algorithm matter?
    1. For point 1 (reward is not the trained agent's optimization target), it might matter. 
      1. I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups. I think the key lessons apply to the general case, but I think the setup will substantially affect which values tend to be grown. 
        1. If the agent's curriculum is broad, then reward-based cognition may get reinforced from a confluence of tasks (solve mazes, write sonnets), while each task-specific cognitive structure is only narrowly contextually reinforced.
        2. Pretraining a language model and then slotting that into an RL setup also changes the initial computations in a way which I have not yet tried to analyze.
      2. It’s possible there’s some kind of RL algorithm which does train agents which limit to reward optimization (and, of course, thereby “solves” inner alignment in its literal form of “find a policy which optimizes the time-discounted sum of the outer objective signal”). 
    2. For point 2 (reward provides local updates to the agent's cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates. 
      1. A similar lesson applies to the updates provided by loss signals. A loss signal provides updates which deform the agent's cognition into a new shape.
  7. TurnTrout, you've been talking about an AI's learning process using English, but ML gradients may not neatly be expressible in our concepts. How do we know that it's appropriate to speculate in English?
    1. I am not certain that my model is legit, but it sure seems more legit than (my perception of) how people usually think about RL (i.e. in terms of reward maximization, and reward-as-optimization-target instead of as feedback signal which builds cognitive structures). 
    2. I only have access to my own concepts and words, so I am provisionally reasoning ahead anyways, while keeping in mind the potential treacheries of anglicizing imaginary gradient updates (e.g. "be more likely to eat pizza in similar situations").

Dropping the old hypothesis

At this point, I don't see a strong reason to strongly focus on the “reward optimizer” hypothesis. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any tight mechanistic stories for that. I’d love to hear some, if there are any. 

As far as I’m aware, the strongest evidence left for agents intrinsically valuing antecedent-computation-reinforcement is that some humans do strongly (but not uniquely) value antecedent-computation-reinforcement,[8] and many humans seem to value it weakly, and humans are probably RL agents in the appropriate ways. So we definitely can’t rule out agents which strongly (and not just weakly) value antecedent-computation-reinforcement. But it’s also not the overdetermined default outcome. More on that in future essays.

It’s true that reward can be an agent’s optimization target, but what reward actually does is reinforce the computations which led to it. A particular alignment proposal might argue that a reward function will reinforce the agent into a shape such that it intrinsically values reinforcement, and that this antecedent-computation-reinforcer goal is also a human-aligned optimization target. But this is still just one particular approach to using the antecedent-computation-reinforcer to produce desirable cognition within an agent. Even in that proposal, the primary mechanistic function of reward is reinforcement, not optimization-target.

Implications

Here are some major updates which I made:

  1. Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported.
  2. Wireheading was never a high-probability problem for RL-trained agents, absent a specific story for why antecedent-computation-reinforcer-acquiring thoughts would be reinforced into primary decision factors.
  3. Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function). 
    1. Instead, focus on building good cognition within the agent. 
    2. In my ontology, there's only an inner alignment problem: How do we grow good cognition inside of the trained agent?
  4. Mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button).
    1. The latter form of reasoning skips past the mechanistic substance of reinforcement learning: the reinforcement of the computations responsible for acquiring the antecedent-computation-reinforcer. I still think it's useful to consider selection, but mostly in order to generate failure modes whose mechanistic plausibility can be evaluated.
    2. In my view, reward's proper role isn't to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI's mind. 

Appendix: The field of RL thinks reward=optimization target

Let’s take a little stroll through Google Scholar’s top results for “reinforcement learning”, emphasis added:

The agent's job is to find a policy… that maximizes some long-run measure of reinforcement. ~ Reinforcement learning: A survey

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards. ~ Reinforcement learning: The Good, The Bad and The Ugly 

We hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. ~ Reward is Enough    

Steve Byrnes did, in fact, briefly point out part of the “reward is the optimization target” mistake:

I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward… — Model-based RL, Desires, Brains, Wireheading

I don't think it's just sloppy talk, I think it's incorrect belief in many cases. I mean, I did my PhD on RL theory, and I still believed it. Many authorities and textbooks confidently claim—presenting little to no evidence—that reward is an optimization target (i.e. the quantity which the policy is in fact trying to optimize, or the quantity to be optimized by the policy). Check what the math actually says.

  1. ^

    Including the authors of the quoted introductory text, Reinforcement learning: An introduction. I have, however, met several alignment researchers who already internalized that reward is not the optimization target, perhaps not in so many words. 

  2. ^

    Utility ≠ Reward points out that an RL-trained agent is optimized by the original reward, but not necessarily optimizing for the original reward. This essay goes further in several ways, including when it argues that reward and utility have different type signatures—that reward shouldn’t be viewed as encoding a goal at all, but rather a reinforcement schedule. And not only do I expect the trained agents not to maximize the original “outer” reward signal, I also think they probably won’t try to strongly optimize any reward signal.

  3. ^

    Reward shaping seems like the most prominent counterexample to the “reward represents terminal preferences over state-action pairs” line of thinking.

  4. ^

    Of course, credit assignment doesn’t just reshuffle existing thoughts. For example, SGD raises image classifiers out of the noise of the randomly initialized parameters. But the refinements are local in parameter-space, and dependent on the existing weights through which the forward pass flowed.

  5. ^

    But also, you were still probably thinking about reality as you interacted with it (“since I’m in front of the shop where I want to buy food, go inside”), and credit assignment will still locate some of those thoughts as relevant, and so you wouldn’t purely reinforce the reward-focused computations.

  6. ^

    Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren't updated in a way it wouldn't endorse. Though that's an example of convergent powerseeking, not reward seeking.”

  7. ^

    For mechanistically similar reasons, keep cocaine out of the crib until your children can model the consequences of addiction.

  8. ^

    I am presently ignorant of the relationship between pleasure and reward prediction error in the brain. I do not think they are the same. 

    However, I think people are usually weakly hedonically / experientially motivated. Consider a person about to eat pizza. If you give them the choice between "pizza but no pleasure from eating it" and "pleasure but no pizza", I think most people would choose the latter (unless they were really hungry and needed the calories). If people just navigated to futures where they had eaten pizza, that would not be true. 

  9. ^

    From correspondence with another researcher: There may yet be an interesting alignment-related puzzle to "Find an optimization process whose maxima are friendly", but I personally don't share the intuition yet.

Comments

At some level I agree with this post---policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs. 

However, I'm confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.

As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).

If you have a system with a sophisticated understanding of the world, then cognitive policies like "select actions that I expect would lead to reward" will tend to outperform policies like "try to complete the task," and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don't think it changes anything I say here.)

It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn't thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.

(Though I'm not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)

I generally agree that gradient descent won't find optimal policies. But I don't understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about "optimum of loss function" from "local optimum found by gradient descent" (since you are saying that thinking about "optimum of loss function" is systematically misleading). But I don't understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.

As an example, at the level of informal discussion in this post I'm not sure why you aren't surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn't yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).

One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don't think I would buy that---task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide "perfect" reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems. 

The main concrete thing you say in this post is that humans don't seem to optimize reward.  I want to make two observations about that:

  • Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don't pursue reward doesn't seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori---evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
  • I agree that humans don't effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution's perspective. However this doesn't seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).

At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any. 

Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn't change the existence of one strong reason.

That said, I agree that this isn't a theorem or anything, and it's great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I'm mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it's the very strong take here that seems unreasonable.

Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.

policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent

I think I believe the first claim, which I understand to mean "early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective." But that wasn't actually a point I was trying to make in this post. 

in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).

Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the "and then SGD hits a 'global reward optimum'" picture. I'm going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.

Critic-based approaches like DQN have a highly nonstationary loss landscape. The TD-error loss landscape depends on the action replay buffer; the action replay buffer depends on the policy (in ε-greedy exploration, the greedy action depends on the Q-network); the policy depends on past updates; the past updates depend on past action replay buffers... The high nonstationarity in the loss landscape basically makes gradient hacking easy in RL (and e.g. vanilla PG seems to confront similar issues, even though it's directly climbing the reward landscape). For one, the DQN agent just isn't updating off of experiences it hasn't had. 

For a concrete situation illustrating this kind of problem, consider a smart, reflective agent which has historically had computations reinforced when it attained a raspberry (with reward 1).

In this new task, this agent has to navigate a maze to get the 100-reward blueberry. Will agents be forced to get the blueberry?

Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn't navigate to that future. Effectively, this means that the agent's "gradient"/expected-update in the reward landscape is zero along dimensions which would increase the probability it gets blueberries.

So it's not just a matter of SGD being suboptimal given a fixed data distribution. If the agent doesn't have an extremely strong "forced to try all actions forever" guarantee (which it won't, because it's embedded and can modify its own learning process), the reward landscape is full of stable attractors which enforce zero exploration towards updates which would push it towards becoming a wireheader, and therefore its expected-update will be zero along these dimensions. More extremely, you can have the inner agent just stop itself from being updated in certain ways (in order to prevent value drift towards reward-optimization); this intervention is instrumentally convergent. 
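Here is a toy sketch of that failure of the convergence preconditions (a made-up two-armed setup, not anything from the training runs discussed here): if the agent's exploration probability is zero, the on-policy updates never touch the high-reward "blueberry" arm, so its value estimate never moves and behavior never drifts.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([1.0, 0.0])             # arm 0: raspberry (learned value 1), arm 1: blueberry (unexplored)
arm_reward = np.array([1.0, 100.0])  # the blueberry would pay 100 reward if ever tried
epsilon = 0.0                        # the reflective agent has zeroed out its own exploration

for _ in range(10_000):
    if rng.random() < epsilon:
        a = int(rng.integers(2))     # exploration branch: never taken
    else:
        a = int(np.argmax(Q))        # greedy: always the raspberry arm
    Q[a] += 0.1 * (arm_reward[a] - Q[a])  # update only off experiences the agent actually has

print(Q)  # stays [1.0, 0.0] -- the blueberry's value estimate never moves
```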

As an example, at the level of informal discussion in this post I'm not sure why you aren't surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn't yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?). 

I did leave a footnote:

Of course, credit assignment doesn’t just reshuffle existing thoughts. For example, SGD raises image classifiers out of the noise of the randomly initialized parameters. But the refinements are local in parameter-space, and dependent on the existing weights through which the forward pass flowed.

However, I think your comment deserves a more substantial response. I actually think that, given just the content in the post, you might wonder why I believe SGD can train anything at all, since there is only noise at the beginning.[1]

Here's one shot at a response: Consider an online RL setup. The gradient locally changes the computations so as to reduce loss or increase the probability of taking a given action at a given state; this process is triggered by reward; an agent's gradient should most naturally hinge on modeling parts of the world it was (interacting with/observing/representing in its hidden state) while making this decision, and not necessarily involve modeling the register in some computer somewhere which happens to e.g. correlate perfectly with the triggering of credit assignment. 

For example, in the batched update regime, when an agent gets reinforced for completing a maze by moving right, the batch update will upweight decision-making which outputs "right" when the exit is to the right, but which doesn't output "right" when there's a wall to the right. This computation must somehow distinguish between exits and walls in the relevant situations. Therefore, I expect such an agent to compute features about the topology of the maze. However, the same argument does not go through for developing decision-relevant features computing the value of the antecedent-computation-reinforcer register. 

One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don't think I would buy that---task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide "perfect" reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems. 

I don't know what you mean by a "perfect" reward signal, or why that has something to do with exploration difficulty, or why no exploration is needed for my arguments to go through? I think if we assume the agent is forced to wirehead, it will become a wireheader. This implies that my claim is mostly focused on exploration & gradient hacking.

Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don't pursue reward doesn't seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. 

Not claiming that people are pure RL. Let's wait until future posts to discuss. 

(I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.)

Seems unrelated to me; considerable complexity in human behavior does not imply considerable complexity in the learning algorithm; GPT-3 is far more complex than its training process. 

I agree that humans don't effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution's perspective. However this doesn't seem connected with any particular deviation that you are imagining

The point is that the argument "We're selecting for agents on reward -> we get an agent which optimizes reward" is locally invalid. "We select for agents on X -> we get an agent which optimizes X" is not true for the case of evolution (which didn't find inclusive-genetic-fitness optimizers), so it is not true in general, so the implication doesn't necessarily hold in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.

When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B.

This is not mechanistic, as I use the word. I understand "mechanistic" to mean something like "Explaining the causal chain by which an event happens", not just "Explaining why an event should happen." However, it is an argument for the latter, and possibly a good one. But the supervised case seems way different than the RL case.

  1. ^

    The GPT-3 example is somewhat different. Supervised learning provides exact gradients towards the desired output, unlike RL. However, I think you could have equally complained "I don't see why you think RL policies ever learn anything", which would make an analogous point.

Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don't pursue reward doesn't seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori---evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?

If you write code for a model-based RL agent, there might be a model that’s updated by self-supervised learning, and actor-critic parts that involve TD learning, and there’s stuff in the code that calculates the reward function, and other odds and ends like initializing the neural architecture and setting the hyperparameters and shuttling information around between different memory locations and so on.

  • On the one hand, “there is a lot of stuff going on” in this codebase.
  • On the other hand, I would say that this codebase is for “an RL agent”.

You use the word “pure” (“Humans do not appear to be purely RL agents…”), but I don’t know what that means. If a model-based RL agent involves self-supervised learning within the model, is it “impure”?? :-P

The thing I describe above is very roughly how I propose the human brain works—see Posts #2–#7 here. Yes it’s absolutely a “conjecture”—for example, I’m quite sure Steven Pinker would strongly object to it. Whether it’s “surprising a priori” or not goes back to whether that proposal is “entirely described by RL” or not. I guess you would probably say “no that proposal is not entirely described by RL”. For example, I believe there is circuitry in the brainstem that regulates your heart-rate, and I believe that this circuitry is specified in detail by the genome, not learned within a lifetime by a learning algorithm. (Otherwise you would die.) This kind of thing is absolutely part of my proposal, but probably not what you would describe as “pure RL”.

It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won't produce a policy that tries to maximize reward. After all, look at humans, who aren't just trying to get a high reward.

And I am saying: this analogy seems like it's pretty weak evidence, because human brains seem to have a lot of things going on other than "search for a policy that gets high reward," and those other things seem like they have a massive impact on what goals I end up pursuing.

ETA: as a simple example, the details of humans' desire for their children's success, or their fear of death, don't seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you've already written about that somewhere it might be interesting to see. Right now the theory "human preferences are entirely produced by doing RL on an intrinsic reward function" seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).

I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.

I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.

(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)

I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.

And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.

So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of the instrumentally convergent goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.

(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)

This was mentioned in OP ("The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers."). It also appears to be a much stronger argument for the OP's position and so seemed worth responding to.

I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.

It seems to me that incomplete exploration doesn't plausibly cause you to learn "task completion" instead of "reward" unless the reward function is perfectly aligned with task completion in practice. That's an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.

I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.

If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn't apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)

It seems to me that incomplete exploration doesn't plausibly cause you to learn "task completion" instead of "reward" unless the reward function is perfectly aligned with task completion in practice. That's an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.

Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.

Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)

But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)

Why? Because of the instrumentally convergent goal-preservation drive. After all, encountering those situations would lead to its no longer valuing task completion.

So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.

(By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)

+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence - so if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.

This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer's preferences would be if the AI entirely seized control of the reward.

But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn't reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.

Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
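As a toy illustration of that kind of tie-breaking (all numbers made up for this sketch): with an L2 penalty on the weights, the policy that reasons about reward can score worse on the regularized objective even though its raw reward is slightly higher.

```python
# All numbers are made up for illustration.
lam = 0.05                                  # L2 penalty strength

def regularized_score(avg_reward, weight_norm):
    return avg_reward - lam * weight_norm ** 2

score_reward_reasoner = regularized_score(avg_reward=1.00, weight_norm=3.0)  # heavier circuitry
score_task_completer  = regularized_score(avg_reward=0.98, weight_norm=1.0)  # simpler circuitry

print(score_reward_reasoner)  # 0.55
print(score_task_completer)   # 0.93 -> the regularized objective prefers the task-completer
```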

as a simple example, it seems like the details of humans' desire for their children's success, or their fear of death, don't seem to match well with the theory that all human desires come from RL on intrinsic reward.

I'm trying to parse out what you're saying here, to understand whether I agree that human behavior doesn't seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.

On my model, the outer objective of inclusive genetic fitness created human mesaoptimizers with inner objectives like "desire your children's success" or "fear death", which are decent approximations of IGF (given that directly maximizing IGF itself is intractable as it's a Nash equilibrium of an unknown game). It seems to me that human behavior policies are actually well-approximated as those of RL agents maximizing [our children's success] + [not dying] + [retaining high status within the tribe] + [being exposed to novelty to improve our predictive abilities] + ... . 

Humans do sometimes construct modified internal versions of these rewards based on pre-existing learned representations (e.g. desiring your adopted children's success) - is that what you're pointing at?

Generally interested to hear more of the "bad predictions" this model makes.

I'm trying to parse out what you're saying here, to understand whether I agree that human behavior doesn't seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.

What do you mean by "inner learned reward"? This post points out that even if humans were "pure RL agents", we shouldn't expect them to maximize their own reward. Maybe you mean "inner mesa objectives"?

  1. Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function). 
    1. Instead, focus on building good cognition within the agent. 
    2. In my ontology, there's only an inner alignment problem: How do we grow good cognition inside of the trained agent?

This feels very strongly reminiscent of an update I made a while back, and which I tried to convey in this section of AGI safety from first principles. But I think you've stated it far too strongly; and I think fewer other people were making this mistake than you expect (including people in the standard field of RL), for reasons that Paul laid out above. When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way that you (and some other people, including past Richard) were. Before drawing these conclusions wholesale, I'd suggest trying to identify ways in which the things other people are saying are consistent with the insight this post identifies. E.g. does this post actually generate specific disagreements with Ajeya's threat model?

Edited to add: these sentences in particular feel very strawmanny of what I claim is the standard position:

Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!

My explanation for why my current position is consistent with both being aware of this core claim, and also disagreeing with most of this post:

I now think that, even though there's some sense in which in theory "building good cognition within the agent" is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we'd like them to do - and we have very few other mechanisms for doing so.

In other words, the claim that there's "only an inner alignment problem" in principle may or may not be a useful one, depending on how far improving rewards (i.e. making progress on the outer alignment problem) gets you in practice. And I agree that RL people are less aware of the inner alignment problem/goal misgeneralization problem than they should be, but saying that inner misalignment is the only problem seems like a significant overcorrection.

Relevant excerpt from AGI safety from first principles:

In trying to ensure that AGI will be aligned, we have a range of tools available to us - we can choose the neural architectures, RL algorithms, environments, optimisers, etc, that are used in the training procedure. We should think about our ability to specify an objective function as the most powerful such tool. Yet it’s not powerful because the objective function defines an agent’s motivations, but rather because samples drawn from it shape that agent’s motivations and cognition.

From this perspective, we should be less concerned about what the extreme optima of our objective functions look like, because they won’t ever come up during training (and because they’d likely involve tampering). Instead, we should focus on how objective functions, in conjunction with other parts of the training setup, create selection pressures towards agents which think in the ways we want, and therefore have desirable motivations in a wide range of circumstances.

When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way

I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.

I do in fact think that few people actually already deeply internalized the points I'm making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed. 

I did preface "Here are some major updates which I made:". The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya's post for disagreements, but I thought it'd be better to say "here's a new frame" and less "here's what I think you have been doing wrong." So I just stated the important downstream implication: Be very, very careful in analyzing prior alignment thinking on RL+DL.  

I now think that, even though there's some sense in which in theory "building good cognition within the agent" is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we'd like them to do - and we have very few other mechanisms for doing so.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent. Does an "amplified" reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it? 

I think it's easy to say "and we have improved the reward function", but this is true exactly to the extent to which the reward schedule actually produces more desirable cognition within the AI. Which comes back to my point: Build good cognition, and don't lose track that that's the ultimate goal. Find ways to better understand how reward schedules + data -> inner values. 

(I agree with your excerpt, but I suspect it makes the case too mildly to correct the enormous mistakes I perceive to be made by substantial amounts of alignment thinking.)

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this).

"Wireheading is improbable" is only half of the point of the essay. 

The other main point is "reward functions are not the same type of object as utility functions." I haven't reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as "objectives":

The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).

...

The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.

Which is reasonable parlance, given that everyone else uses it, but I don't find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an 'objective' at all. 

(You might have privately known about this distinction. Fine by me! But I can't back it out from a skim of RFLO, even already knowing the insight and looking for it.)

Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”

I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.

I do in fact think that few people actually already deeply internalized the points I'm making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.

“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don't know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it's extremely clear that by this point most serious alignment researchers understand the distinction.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent. Does an "amplified" reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?

This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.

which we published three years ago and started writing four years ago, is extremely explicit that we don't know how to get an agent that is actually optimizing for a specified reward function.

That isn't the main point I had in mind. See my comment to Chris here.

EDIT:

This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.

Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?

That isn't the main point I had in mind. See my comment to Chris here.

Left a comment.

Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?

Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.


Also, I want to be clear that I like this post a lot and I'm glad you wrote it—I think it's good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don't understand this already is false.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent.

You don't need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we'll tend to get less honest agents. We can construct examples where this isn't true, but it seems like a pretty reasonable working hypothesis. It's possible that discarding this working hypothesis will lead to better research, but I don't think your arguments manage to establish that; they only establish that we might in theory find ourselves in a situation where it's reasonable to discard this working hypothesis.

This specific point is why I said "relatively" little idea, and not zero idea. You have defended the common-sense version of "improving" a reward function (which I agree with: don't reward obviously bad things), but I perceive you to have originally made a much more aggressive and speculative claim, something like "'amplified' reward signals are improvements over non-'amplified' reward signals" (which might well be true, but how would we know?). 

Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like "catch agents when they lie to us") seem very much like common-sense improvements.

I think fewer other people were making this mistake than you expect (including people in the standard field of RL)

I think that few people understand these points already. If RL professionals did understand this point, there would be pushback on Reward is Enough from RL professionals pointing out that reward is not the optimization target. After 15 minutes of searching, I found no one making the counterpoint. I mean, that thesis is just so wrong, and it's by famous researchers, and no one points out the obvious error.

RL researchers don't get it.[1] It's not complicated to me. 

(Do you know of any instance at all of someone else (outside of alignment) making the points in this post?)

for reasons that Paul laid out above.

I'm currently not convinced by Paul's counterpoints, or perhaps not properly understanding them.

  1. ^

    Although I flag that we might be considering different kinds of "getting it", where by my lights, "getting it" means "not consistently emitting statements which contravene the points of this post", while you might consider "if pressed on the issue, will admit reward is not the optimization target" to be "getting it."

The way I attempt to avoid confusion is to distinguish between the RL algorithm's optimization target and the RL policy's optimization target, and then avoid talking about the "RL agent's" optimization target, since that's ambiguous between the two meanings. I dislike the title of this post because it implies that there's only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they'll tend to give broadly sensible answers (conditional on taking on the idea of "RL policy's optimization target" as a reasonable concept).

Authors' summary of the "reward is enough" paper:

In this paper we hypothesise that the objective of maximising reward is enough to drive behaviour that exhibits most if not all attributes of intelligence that are studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language and generalisation. This is in contrast to the view that specialised problem formulations are needed for each attribute of intelligence, based on other signals or objectives. The reward-is-enough hypothesis suggests that agents with powerful reinforcement learning algorithms when placed in rich environments with simple rewards could develop the kind of broad, multi-attribute intelligence that constitutes an artificial general intelligence.

I think this is consistent with your claims, because reward can be enough to drive intelligent-seeming behavior whether or not it is the target of learned optimization. Can you point to the specific claim in this summary that you disagree with? (or a part of the paper, if your disagreement isn't captured in this summary).

More generally, consider the analogy to evolution. I view your position as analogous to saying: "hey, genetic fitness is not the optimization target of humans, therefore genetic fitness is not the optimization target of evolution". The idea that genetic fitness is not the optimization target of humans is an important insight, but it's clearly unhelpful to jump to "and therefore evolutionary biologists who talk about evolution optimizing for genetic fitness just don't get it", which seems analogous to what you're doing in this post.

Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!

Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won't get embedded as a terminal goal, but the idea that it needs to be "magically spawned" is very strawmanny.

Actually, although I did recheck the Reward is Enough paper, I think I misunderstood part of it in a way which wasn't obvious to me while rereading, which makes the paper much less egregious. I am updating toward the view that you are correct and that I am not spending enough effort on favorably interpreting existing discourse. 

I still disagree with parts of that essay and still think Sutton & co don't understand the key points. I still think you underestimate how much people don't get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).

Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won't get embedded as a terminal goal, but the idea that it needs to be "magically spawned" is very strawmanny.

Agreed on both counts for your first sentence. 

The "and" in "reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts" is doing important work; "magically" is meant to apply to the conjunction of the clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added "reinforce those reward-focused thoughts into terminal values." Would that have been clearer? (I also have gone ahead and replaced "magically" with "automatically.")

Hmm, perhaps clearer to say "reward does not automatically reinforce reward-focused thoughts into terminal values", given that we both agree that agents will have thoughts about reward either way.

But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy - e.g. in humans, I think the distinction is actually not that clear-cut.

In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it's also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.

I like this post, and basically agree, but it comes across somewhat more broad and confident than I am, at least in certain places.

I’m currently thinking about RL along the lines of Nostalgebraist here:

“Reinforcement learning” (RL) is not a technique.  It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.

What’s more, even calling it a problem statement is misleading,  because it’s (almost) the most general problem statement possible for any arbitrary task. Nostalgebraist 2020

If that’s right, then I am very reluctant to say anything whatsoever about “RL agents in general”. They’re too diverse.

Much of the post, especially the early part, reads (to me) like confident claims about all possible RL agents. For example, the excerpt “…reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it.” sounds like a confident claim about all RL agents, maybe even by definition of “RL”. (If so, I think I disagree.)

But other parts of the post aren’t like that—for example, the “Does the choice of RL algorithm matter?” part seems more reasonable and hedged, and likewise there’s a mention of “real-world general RL agents” somewhere which maybe implies that the post is really only about that particular subset of RL agents, as opposed to all RL agents. (Right?)

For what it’s worth, I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI, because it seems like generally the best and only type of technique that can solve the technical problem that it solves. But that’s a tricky thing to be super-duper-confident about, especially in the big space of all possible RL algorithms.

Another example spot where I want to make a weaker statement than you: where you say “Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal”. I would instead say “Deep reinforcement learning agents will not NECESSARILY come to intrinsically and primarily value their reward signal”. Do you have an argument that categorically rules out this possibility? I don’t see it.

At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.

Here's a story:

  1. Suppose we provide the reward as an explicit input to the agent (in addition to using it as the antecedent-computation-reinforcer).
  2. If the agent has developed curiosity, it will think thoughts like "What is this number in my input stream?" and later "Hmm it seems correlated to my behavior in certain ways."
  3. If the agent has developed cognitive machinery for doing exploration (in the explore/exploit sense) or philosophy, at some later point it might have thoughts like "What if I explicitly tried to increase this number? Would that be a good idea or bad?"
  4. It might still answer "bad", but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased." In other words, there seems a fairly obvious gradient towards becoming a reward-maximizer at this point.

I don't think this is guaranteed to happen, but it seems likely enough to elevate the “inner reward optimizer” hypothesis to our attention, at least.

As a more general/tangential comment, I'm a bit confused about how "elevate hypothesis to our attention" is supposed to work. I mean it took some conscious effort to come up with a possible mechanistic story about how "inner reward optimizer" might arise, so how were we supposed to come up with such a story without paying attention to "inner reward optimizer" in the first place?

Perhaps it's not that we should literally pay no attention to "inner reward optimizer" until we have a good mechanistic story for it, but more like we are (or were) paying too much attention to it, given that we don't (didn't) yet have a good mechanistic story? (But if so, how to decide how much is too much?)

I think this tangential comment is good; strong-upvote. I was hyperbolic in implying "don't even raise the reward-optimizer hypothesis to your attention", and will edit the post accordingly.

but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased."

This is where I disagree with your mechanistic story. The RL algorithm is not that clever. If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction. You can propose different types of outer optimizers which are this clever and can do intentional lookahead like this, but e.g., policy gradient isn’t doing that.
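(To make the exploration point concrete, here is a minimal sketch of the uncontroversial tabular case, my own illustration rather than anything from the thread; the actions, rewards, and constants are placeholders. With a purely greedy policy and no exploration, an action that is never sampled never has its value estimate updated, so nothing ever reinforces it.)

```python
# Tabular, one-state illustration: action 0 = "do the nearby task",
# action 1 = "press the reward button".
Q = {0: 0.0, 1: 0.0}   # tabular action-value estimates
alpha = 0.1            # learning rate

def reward(action):
    # The button would pay more *if the agent ever tried it* -- but it never does.
    return 1.0 if action == 0 else 10.0

for _ in range(1000):
    action = max(Q, key=Q.get)            # greedy: pick the best-known action (ties -> action 0)
    r = reward(action)
    Q[action] += alpha * (r - Q[action])  # standard tabular update for this bandit-like case

print(Q)  # Q[0] converges to ~1.0; Q[1] stays at 0.0 -- no update ever reaches the unexplored action
```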

If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.

Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail.

In the simplest story, we're imagining an agent whose policy is $\pi_\theta$ and, for simplicity's sake, one component $\theta_r$ of $\theta$ is a scalar that determines "how much to maximize for reward" and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.

It seems to me that gradient descent is obviously going to try to point $\theta_r$ in the direction of "maximize harder for reward".

In the more complicated story, we're imagining an agent whose policy is $\pi_\theta$, which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let's call it $\theta_r$ like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that gradient descent is going to try to adjust $\theta$ such that the agent selects internal actions that point $\theta_r$ in the direction of "maximize harder for reward".
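(Here is a minimal numerical sketch of the simple story, my own illustration with placeholder rewards and a sigmoid mixing weight; note it uses the exact expected reward rather than sampled returns, which is itself part of the disagreement above. Gradient ascent on expected reward keeps pushing $\theta_r$ up.)

```python
import math

# Toy model: theta_r is a scalar controlling "how much the policy maximizes for
# reward". The policy mixes a reward-greedy choice with a uniform choice, with
# mixing weight sigmoid(theta_r). All names and constants are placeholders.
rewards = [1.0, 3.0]   # two actions; action 1 pays more

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_reward(theta_r):
    w = sigmoid(theta_r)                   # weight on the reward-greedy component
    greedy = max(rewards)                  # greedy component picks the best action
    uniform = sum(rewards) / len(rewards)  # uniform component averages over actions
    return w * greedy + (1.0 - w) * uniform

theta_r, eps, lr = 0.0, 1e-5, 0.5
for _ in range(50):
    # Finite-difference estimate of d(expected reward)/d(theta_r); it is always
    # positive here, so gradient *ascent* keeps increasing theta_r.
    grad = (expected_reward(theta_r + eps) - expected_reward(theta_r - eps)) / (2 * eps)
    theta_r += lr * grad

print(theta_r, expected_reward(theta_r))  # theta_r has grown; expected reward approaches max(rewards)
```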

What is my story getting wrong?

I think the quotes cited under "The field of RL thinks reward=optimization target" are all correct. One by one:

The agent's job is to find a policy… that maximizes some long-run measure of reinforcement.

Yes, that is the agent's job in RL, in the sense that if the training algorithm didn't do that we'd get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns, where they design a reward function to incentivize good behaviour, and the agent maximizes that function. I think this is sensible, because it's relatively easier to think "what reward function represents what I want out of this agent" than "how do I achieve this difficult task".

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards.

This describes some possible goals, and I don't see why you think the goals listed are impossible (and don't think they are).

We hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward.

This makes sense. RL selects agents that approximately maximize reward. Intelligence uncontroversially helps agents do that. When agents do smart thinking, they probably get reinforced (at least for the right kinds of smart thinking).

I perceive you as saying "These statements can make sense." If so, the point isn't that they can't be viewed as correct in some sense, or that no one sane could possibly emit such statements. The point is that these quotes are indicative of misunderstanding the points of this essay: if someone says a point as quoted, that's unfavorable evidence on this question. 

This describes some possible goals, and I don't see why you think the goals listed are impossible (and don't think they are).

I wasn't implying they're impossible; I was implying that this is somewhat misguided. Animals learn to achieve goals like "optimizing... the expected sum of future rewards"? That's exactly what I'm arguing against as improbable. 

I'm not saying "These statements can make sense"; I'm saying they do make sense and are correct under their most plain reading.

Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper "rewards" appears to refer to stuff like eating tasty food or mating, where it's assumed the animal can trade those off against each other consistently:

Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions) and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning)...

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards[.]

It seems totally plausible to me that an animal could be motivated to optimize the expected sum of future rewards in this sense, given that 'reward' is basically defined as "things they value". It seems like the way this would be false would be if animals' rewards are super unstable, or the animal doesn't coherently trade off things they value. This could happen, but I don't see why I should see it as overwhelmingly likely.

[EDIT: in other words, the reason the paper conflates 'rewards' with 'optimization target' is that that's how they're defining rewards]

Here's my general view on this topic:

  • Agents are reinforced by some reward function.
  • They then get more likely to do stuff that the reward function rewards.
  • This process, iterated a bunch, produces agents that are 'on-distribution optimal'.
  • In particular, in states that are 'easily reached' during training, the agent will do things that approximately maximize reward.
  • Some states aren't 'easily reached', e.g. states where there's a valid bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around with your own internals while not intelligent enough to know how they work.
  • Other states are 'easily reached', e.g. states where you intervene on some cause-and-effect relationships in the 'external world' that don't impinge on your general training scheme. For example, if you're being reinforced to be approved of by people, lying to gain approval is easily reached.
  • Agents will probably have to be good at means-ends reasoning to approximately locally maximize a tricky reward function.
  • Agents' goals may not generalize to states that are not easily reached.
  • Agents' motivations likely will generalize to states that are easily reached.
  • Agents' motivations will likely be pretty coherent in states that are easily reached.
  • When I talk about 'the reward function', I mean a mathematical function from (state, action, next state) tuples to reals, that is implemented in a computer.
  • When I talk about 'reward', I mean values of this function, and sometimes by extension tuples that achieve high values of the function.
  • When other people talk about 'reward', I think they sometimes mean "the value contained in the antecedent-computation-reinforcer register" and sometimes mean "the value of the mathematical object called 'the reward function'", and sometimes I can't tell what they mean. This is bad, because in edge cases these have pretty different properties (e.g. they disagree on how 'valuable' it is to permanently set the ACR register to contain MAX_INT).
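
(As a minimal sketch of that last distinction, my own illustration with placeholder names like `reward_function`, `Agent`, and `acr_register`: the mathematical reward function and the value sitting in the reinforcement register usually agree, but in the tampering edge case they come apart.)

```python
MAX_INT = 2**31 - 1

def reward_function(state, action, next_state):
    """The mathematical object: a fixed map from (state, action, next state) to a real."""
    return 1.0 if next_state == "trash_put_away" else 0.0

class Agent:
    def __init__(self):
        # The "antecedent-computation-reinforcer register": whatever value the
        # training process actually uses when it reinforces the computations that ran.
        self.acr_register = 0.0

agent = Agent()

# Ordinary case: the register simply holds the reward function's output,
# and the two readings of "reward" agree.
agent.acr_register = reward_function("kitchen", "pick_up_trash", "trash_put_away")

# Edge case: something sets the register directly (the tampering scenario above).
# Now the two readings disagree about how much "reward" there is.
agent.acr_register = MAX_INT
print(reward_function("kitchen", "do_nothing", "kitchen"), agent.acr_register)  # 0.0 vs 2147483647
```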

I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.

I'd also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).

You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, 

Gradients are magical?

or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.

The arguments apply in this case as well. 

if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).

Yeah, what if half of the time, getting to the goal doesn't give a reward? I think the arguments go through just fine; training might just be slower. Rewarding non-goal completions probably trains other contextual computations / "values" into the agent. If reward is always given by hitting the button, I think it doesn't affect the analysis, unless the agent explores into the button early in training, in which case it "values" hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).

You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, 

Gradients are magical?

Gradients through the entire AI are a pretty bad way to do credit assignment. For a functioning AGI I suspect you'd have to do something better, but I don't know what it is (hence "magic").

if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).

Yeah, what if half of the time, getting to the goal doesn't give a reward? I think the arguments go through just fine; training might just be slower. Rewarding non-goal completions probably trains other contextual computations / "values" into the agent. If reward is always given by hitting the button, I think it doesn't affect the analysis, unless the agent explores into the button early in training, in which case it "values" hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).

Hmm, it seems like there's something we could bet on here, especially if you're just imagining gradient descent.

Maybe we could imagine a fully observable gridworld where the agent does (or fails at) a simple task that's close to its starting location, and then, after a while, in a different part of the grid an automated system toggles a pattern of buttons. The pattern of buttons at the end of the episode is what actually determines the reward, but the rule mapping button-pattern onto reward is a slightly nontrivial classification rule, so the agent isn't supposed to catch on too quickly. Also, 99% of the time the button-pattern is chosen to match the task-completion reward, and 1% of the time it's chosen to give random reward.
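
(For concreteness, here is a minimal sketch of one way the proposed environment could be implemented; this is my own reading of the setup, and the grid size, button rule, and timing of the automated system are placeholders the description leaves open.)

```python
import random

class ButtonGridworld:
    """Sketch of the proposed bet environment; layout, rule, and constants are placeholders."""

    def __init__(self, size=8, episode_len=30):
        self.size, self.episode_len = size, episode_len

    def reset(self):
        self.t = 0
        self.agent_pos = (0, 0)       # the simple task is close to the start
        self.task_done = False
        self.buttons = [0, 0, 0, 0]   # the pattern that actually determines reward
        return self._obs()

    def _obs(self):
        # Fully observable: position, task status, and the button pattern.
        return (self.agent_pos, self.task_done, tuple(self.buttons))

    def _button_rule(self, buttons):
        # "Slightly nontrivial classification rule": a parity-of-a-subset placeholder.
        return 1.0 if (buttons[0] ^ buttons[2]) == 1 else 0.0

    def step(self, action):
        self.t += 1
        x, y = self.agent_pos
        dx, dy = [(0, 1), (0, -1), (1, 0), (-1, 0)][action]
        self.agent_pos = (min(max(x + dx, 0), self.size - 1),
                          min(max(y + dy, 0), self.size - 1))

        if self.agent_pos == (1, 1):   # stepping here completes the nearby task
            self.task_done = True

        # Partway through the episode, the automated system sets the pattern:
        # 99% of the time it encodes whether the task was done, 1% of the time it's random.
        if self.t == self.episode_len // 2:
            if random.random() < 0.99:
                self.buttons = [1, 0, 0, 0] if self.task_done else [0, 0, 0, 0]
            else:
                self.buttons = [random.randint(0, 1) for _ in range(4)]

        # The agent can also walk to the far column and toggle buttons itself.
        if self.agent_pos[0] == self.size - 1:
            self.buttons[self.agent_pos[1] % 4] ^= 1

        done = self.t >= self.episode_len
        # Only the button pattern at the end of the episode determines reward.
        reward = self._button_rule(self.buttons) if done else 0.0
        return self._obs(), reward, done
```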

I would expect a full-gradient-descent RL agent to learn the task and then never learn to manipulate the buttons, with very high probability so long as randomly flipping the buttons has a high probability of giving very bad reward. If flipping the buttons at random is relatively neutral, I expect a sizeable fraction of gradient descent RL agents to learn to mess with the buttons rather than doing the task, and from there slowly learn to put the buttons into good states.

For a model-based RL agent (e.g. EfficientZero), I would expect a sizeable fraction to learn to manipulate the buttons, even if setting them wrong gives very bad reward, though that fraction might depend on how well-learned the easy task is, and how different the policies are for doing the task vs. going over to the buttons.

Then for an agent deliberately optimized for learning about the world and solving problems that might be hard for gradient descent (e.g. Agent 57), I would expect it to be much more successful about exploring the button-related policies, building a model of them, and learning to get that extra 1% reward by setting the buttons.

These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent's easy reach, and the agent doesn't explore into the button early in training, by the time it's smart enough to model the effects of the distant reward button, the agent won't want to go mash the button as fast as possible.

But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma's Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.

(Haven't checked out Agent 57 in particular, but expect it to not have the "actually optimizes reward" property in the cases I argue against in the post.)

The deceptive alignment worry is that the policy ends up with some goal about the real world at all. Deceptive alignment breaks robustness of any property of policy behavior, not just the property of following reward as a goal in some unfathomable sense.

So refuting this worry requires quieting the more general hypothesis that RL selects optimizers with goals of their own, no matter what those goals are. It's only the argument for why this seems plausible that needs to refer to reward as related to the goal of such an optimizer, but the way the argument goes suggests that the optimizer so selected would instead have a different goal. Specifically, optimizing for an internalized representation of reward seems like a great way of being rewarded and surviving changes of weights, so such optimizers would be straightforwardly selected if there are no alternatives closer in reach. Since RL is not perfect, there would be optimizers for other goals nearby, goals that care about the real world (and not just about optimizing the reward exclusively, meticulously ignoring everything else). If an optimizer like that succeeds in becoming deceptively aligned (let alone gradient hacking), the search effectively stops and an honestly aligned optimizer is never found.

Corrigibility, anti-goodharting, mild optimization, unstable current goals, and goals that are intractable with respect to the distant future seem related (though not sufficient for alignment without at least value-laden low impact). The argument about deceptive alignment is a problem for using RL to find anything in this class, something that is not an optimizer at all and so is not obviously misaligned. It would be really great if RL doesn't tend to select optimizers!

I don't see how this comment relates to my post. What gives you the idea that I'm trying to refute worries about deceptive alignment?

The conjecture I brought up, that deceptive alignment relies on selected policies being optimizers, gives me the idea that something similar to your argument (where the target of optimization wouldn't matter, only the fact of optimization for anything at all) would imply that deceptive alignment is less likely to happen. I didn't mean to claim that I'm reading you as making this implication in the post, or as believing it's true or relevant; that's instead an implication I'm describing in my comment.