Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment. 

Popular Comments

Recent Discussion

gwern62

The Meta-Lesswrong Doomsday Argument predicts long AI timelines and that we can relax:

LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the 'Doomsday argument' prior to this one, and it is now 2025, at 2.75 mentions per year.

By the Doomsday argument, we medianly-expect mentions to stop in: after 44 additional mentions over 16 additional years or in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,1760 mentions, corresponding to late-2027 AD to 2665 AD.)

10Steve Byrnes
No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”. Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right? If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem. Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”. Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.) Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?

I agree people often aren't careful about this.

Anthropic says

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.

Similarly OpenAI suggests that cheating behavior is due to RL.

3cubefox
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have to say more about this question. Though I also wanted to just add a quick comment about this part: It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a "wrong" solution as they get for saying something like "I don't know". Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer "by accident" (for the wrong reasons). At least something like that seems to be suggested by this: Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Rant on "deceiving" AIs

tl;dr: Keep your promises to AIs; it's fine to do stuff like teaching them false facts or misleading them about their situation during testing and deployment; but if you wanna do cheap stuff to cause them to know that they might [have been taught false facts / be being tested / be being misled during deployment], sure, go for it.

Disclaimer: maybe more like explaining my position than justifying my position.


Sometimes we really want to deceive AIs (or at least make them uncertain about their situation). E.g.:
 

  1. Training them to beli
... (read more)

Every now and then, some AI luminaries

  • (1) propose that the future of powerful AI will be reinforcement learning agents—an algorithm class that in many ways has more in common with MuZero (2019) than with LLMs; and
  • (2) propose that the technical problem of making these powerful future AIs follow human commands and/or care about human welfare—as opposed to, y’know, the Terminator thing—is a straightforward problem that they already know how to solve, at least in broad outline.

I agree with (1) and strenuously disagree with (2).

The last time I saw something like this, I responded by writing: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.

Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by...

2Steve Byrnes
I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects. I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs. Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here. I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here. …But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here? If so, one reason for my surmise that a solution probably exists, is the fact that at least some humans seem to have good values, including some very smart and ambitious humans. And see also “The bio-determinist child-rearing rule of thumb” here which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But more wild variation in training environment, e.g. feral children, does seem to matter.] And also Heritability, Behaviorism, and Within-Lifetime RL
5Vladimir Slepnev
My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story". That our attachment to real world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level. So I'm not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as attractor, and it might be better to look at other plans. It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on. Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.

Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.

(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in th... (read more)

3David Manheim
In general, you can mostly solve Goodhart-like problems in the vast majority of the experienced range of actions, and have it fall apart only in more extreme cases. And reward hacking is similar. This is the default outcome I expect from prosaic alignment - we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn't.

I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.

This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency. 

Theories of intelligent agency

By a “theory of intelligent agency” I mean a...

My point is more that you wouldn't want to define individuals as companies, or to say that only companies but not individuals can have agency. And that the things you are trying to get out of coalitional agency could already be there within individual agency.

A notion of choosing to listen to computations (or of analyzing things that are not necessarily agents in terms of which computations influence them) keeps coming up in my own investigations as a way of formulating coordination, decision making under logical uncertainty, and learning/induction. It inci... (read more)

2Vladimir Slepnev
Here's maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there's a way to do a secure "handshake"). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest. A real world analogy is some of the nuclear precommitments mentioned in Schelling's book. Like when the US and Soviets knowingly refrained from catching some of each other's spies, because if a flock of geese triggers the warning radars or something, spies could provide their side with the crucial information that an attack isn't really happening and there's no need to retaliate.
4Richard Ngo
I think my thought process when I typed "risk-averse money-maximizer" was that an agent could be risk-averse (in which case it wouldn't be an EUM) and then separately be a money-maximizer. But I didn't explicitly think "the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility", so I appreciate the clarification.
4Richard Ngo
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2. In other words, your example rebuts the claim that an EUM can't prefer a probabilistic mixture of two options to the expectation of those two options. But that's not the claim I made.
Load More