What does it mean for an AI to wirehead its reward function? We're pretty clear on what it means for a human to wirehead - artificial stimulation of part of the brain rather than genuine experiences - but what does it mean for an AI?

We have a lot of examples of wireheading, especially in informal conversation (and some specific prescriptive examples which I'll show later). So, given those examples, can we define wireheading well - cut reality at its joints? The definition won't be - and can't be - perfectly sharp, but it should allow us to have clear examples of what is and what isn't wireheading, along with some ambiguous intermediate cases.

Intuitive examples

Suppose we have a weather-controlling AI whose task is to increase air pressure; it gets a reward for so doing.

What if the AI directly rewrites its internal reward counter? Clearly wireheading.

What if the AI modifies the input wire for that reward counter? Clearly wireheading.

What if the AI threatens the humans that decide on what to put on that wire? Clearly wireheading.

What if the AI takes control of all the barometers of the world, and sets them to record high pressure? Clearly wireheading.

What if the AI builds small domes around each barometer, and pumps in extra air? Clearly wireheading.

What if the AI fills the atmosphere with CO₂ to increase pressure that way? Clearly wire... actually, that's not so clear at all. This doesn't seem a central example of wireheading. It's a failure of alignment, yes, but it doesn't seem to be wireheading.

Thus not every example of edge instantiation or perverse instantiation is an example of wireheading.

Prescriptivist wireheading, and other definitions

A lot of posts and papers (including some of mine) take a prescriptivist approach to wireheading.

They set up a specific situation (often with a causal diagram), and define a particular violation of some causal assumptions as wireheading (e.g. "if the agent changes the measured value without changing the underlying value that is being measured, that's wireheading").

And that is correct, as far as it goes. But it doesn't cover all the possible examples of wireheading.

Conversely, this post defines wireheading as a divergence between a true utility and a substitute utility (calculated with respect to a model of reality).

This is too general, almost as general as saying that every Goodhart curse is an example of wireheading.

Note, though, that the converse is true: every example of wireheading is a Goodhart curse. That's because every example of wireheading is maximising a proxy, rather than the intended objective.

The definition

The most intuitive example of wireheading is that there is some property of the world that we want to optimise, and that there is some measuring system that estimates that property. If the AI doesn't optimise the property, but instead takes control of the measuring system, that's wireheading (bonus points if the measurements the AI manipulates go down an actual wire).

This re-emphasises that "wireheading is in the eye of the beholder": if our true goal is actually the measuring system (maybe our AI is in competition with another one to maximise a score in a game, and we really don't care how it does this), then there will be no wireheading, just an AI following a correct objective.

Thus wireheading is always a failure of some (implicit or explicit) goal; thus every example of wireheading is a failure of value alignment, though the converse is not true.

Also key to the definition is the fact that the measuring system is, in some sense, "much smaller" than whatever property of the system it is measuring. Pumping out CO₂ is not the correct instantiation of some goal along the lines of "increase air pressure so humans enjoy better weather"; but nor is it merely manipulating the measurement of that goal.

The definition

Thus we can define wireheading as:

  • Given some implicit goal G, an agent wireheads if, instead of moving towards G, it manipulates some narrow measurement channel that is intended to measure G, but will fail to do so after the agent's manipulation.

The difference with the prescriptivist approach is that the measurement channel is not specified; instead, we ask whether we can usefully characterise some feature of the setup as a "narrow measurement channel", and then apply the definition.

This can be seen as a particular failure of abstraction: the abstract goal G was collapsed to the output of the measurement channel.
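As a rough illustration of how the definition might be applied, here is a toy sketch. Everything in it is an assumption made for the example: the `Outcome` fields, the idea of summarising the manipulation by a single `footprint` fraction, and the `narrow` and `broad` thresholds are all hypothetical, and in practice deciding what counts as "narrow" is a judgement call rather than a number.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Effect of one action in a toy world; every field is an illustrative assumption."""
    goal_change: float      # change in the implicit goal G
    measured_change: float  # change reported by the measurement channel
    footprint: float        # fraction of the world the agent manipulated, in [0, 1]


def classify(outcome: Outcome, narrow: float = 0.01, broad: float = 0.5) -> str:
    """Label an outcome using made-up thresholds for 'narrow' and 'broad'."""
    if outcome.measured_change <= outcome.goal_change:
        # The measurement did not outrun real progress on G.
        return "no measurement manipulation"
    if outcome.footprint <= narrow:
        return "wireheading"
    if outcome.footprint >= broad:
        return "misaligned, but not wireheading"
    return "ambiguous"


print(classify(Outcome(goal_change=0.0, measured_change=5.0, footprint=1e-9)))  # rewired counter
print(classify(Outcome(goal_change=0.0, measured_change=5.0, footprint=1.0)))   # CO2 release
print(classify(Outcome(goal_change=5.0, measured_change=5.0, footprint=1.0)))   # genuine progress
```

The three calls correspond loosely to rewriting the reward counter, flooding the atmosphere with CO₂, and actually achieving the goal; the numbers themselves carry no meaning beyond the ordering.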

Examples, counter-examples, and edge cases

Under this definition, all the intuitive examples of wireheading above fit: the measurement channel the AI takes over (its internal counter, the wire going into it, the statements made by humans, the barometers, the immediate environments of the barometers) is always much smaller than the whole atmosphere, which was its intended goal.

And that's why the CO₂ example isn't wireheading: the AI is doing a massive manipulation of the world, on the same scale as its intended goal; it isn't just manipulating the measurement channel[1].

The case of the domes around the barometers is an interesting one to consider. Obviously, if the AI put a dome around the whole planet and pumped in extra air, this wouldn't count as wireheading. So we can imagine the domes growing bigger and merging, giving a smooth transition from "clearly wireheading" to "clearly not wireheading", which shows that ambiguous cases must exist.
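One way to picture the growing domes is to track the gap between what the barometers report and the true average pressure. The sketch below is a toy model under stated assumptions: every barometer sits inside a dome, pressure is uniform inside and outside the domes, and the `ambient` and `boosted` values are invented for the example.

```python
def pressure_gap(covered_fraction: float, ambient: float = 1.0, boosted: float = 1.5):
    """Return (measured, true_mean, gap) when domes cover a fraction of the atmosphere.

    Toy assumptions: all barometers are inside domes, and pressure is uniform
    inside (boosted) and outside (ambient) the domes.
    """
    if not 0.0 < covered_fraction <= 1.0:
        raise ValueError("covered_fraction must lie in (0, 1]")
    measured = boosted  # every barometer reads the dome pressure
    true_mean = covered_fraction * boosted + (1.0 - covered_fraction) * ambient
    return measured, true_mean, measured - true_mean


for fraction in (1e-6, 0.01, 0.25, 0.5, 1.0):
    measured, true_mean, gap = pressure_gap(fraction)
    print(f"covered={fraction:g} measured={measured:g} true={true_mean:g} gap={gap:g}")
```

In this model the gap shrinks linearly as the domes grow and vanishes when they cover the planet, so there is no natural point at which the behaviour stops being wireheading.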

We can also produce examples of Goodhart curse that are not wireheading. Take the practice of "teaching to the test". In this case, there is a simple objective (the test results) and the school acts to optimise for that objective. However, in typical schools this is not wireheading; teaching to the test involves drilling students in specific skills, training them, and having them memorise certain facts. Though these are done specifically to pass the test, these are the kinds of actions that a teacher would undertake anyway. One can talk about how this "narrows" the intellect, but, except in extreme cases, this cannot be characterised as gaining control of a narrow measurement channel.

For an interesting edge case, consider the RL agent playing the game CoastRunners. As described here, the score-maximising agent misbehaved in an interesting way: instead of rushing to complete the level with the highest score possible, the agent instead found a way to boat in circles, constantly hitting the same targets and ever increasing its score.
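To see why a score-maximiser might prefer circling, here is a toy comparison. It is not the real game's scoring: the functions, the finish bonus, and every number are hypothetical, chosen only to show how respawning targets can make an endless loop outscore finishing the course.

```python
def finish_course_score(targets_on_course: int = 20,
                        points_per_target: int = 10,
                        finish_bonus: int = 100) -> int:
    """Score for hitting each target once and then finishing (invented numbers)."""
    return targets_on_course * points_per_target + finish_bonus


def loop_score(steps: int = 1000,
               steps_per_loop: int = 10,
               targets_per_loop: int = 3,
               points_per_target: int = 10) -> int:
    """Score for circling a small cluster of respawning targets (invented numbers)."""
    loops = steps // steps_per_loop
    return loops * targets_per_loop * points_per_target


print("finish the course:", finish_course_score())
print("boat in circles:  ", loop_score())
```

With these made-up defaults the looping policy scores far higher, and its advantage grows with the number of steps, which is the kind of discrepancy between explicit score and implicit goal at issue here.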

Is that wireheading? Well, it's certainly Goodhart: there is a discrepancy between the implicit goals (get round the course fast, hitting targets) and the explicit one (maximise the score). But do we feel that the agent has control of a "narrow" measurement channel?

I'd argue that it's probably not the case for CoastRunners. The "world" for this agent is not a particularly rich one; going round and round and hitting targets is what the agent is intended to do; it has just found an unusual way of doing so.

If, instead, this behaviour happened in some subset of a much richer game (say, SimCity), then we might see it more naturally as wireheading. The score there is intended to measure a wider variety of actions (building and developing a virtual city while balancing tax revenues, population, amenities, and other aspects of the city), so "getting a high score while going round in circles" is much closer to "controlling a measurement channel that is narrow (as compared to the implicit goal)" than in the CoastRunners situation.

But this last example illustrates the degree of judgement and ambiguity involved in identifying wireheading in some situations.

  1. Note that the CO₂ example can fit the divergence-based definition from the other post discussed above. One just needs to imagine that the agent's model does not specify the gaseous content of the air in sufficient detail to exclude CO₂-rich air as a solution to the goal.

    This illustrates that the definition used in that post doesn't fully capture wireheading. ↩︎


Thanks Stuart, nice post.

I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:

The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.

Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.

Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.

One advantage of this terminology is that it makes it clearer what we're talking about. For example, it's pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.

That said, I think your post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like "wireheading = tampering with goal measurement".

Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I'm inclined to agree.

Where "measurement channel" is not just one specific channel, but anything that has the properties of a measurement channel.

Planned summary:

This post points out that "wireheading" is a fuzzy category. Consider a weather-controlling AI tasked with increasing atmospheric pressure, as measured by the world's barometers. If it made a tiny dome around each barometer and increased air pressure within the domes, we would call it wireheading. However, if we increase the size of the domes until it's a dome around the entire Earth, then it starts sounding like a perfectly reasonable way to optimize the reward function. Somewhere in the middle, it must have become unclear whether or not it was wireheading. The post suggests that wireheading can be defined as a subset of <@specification gaming@>(@Specification gaming examples in AI@), where the "gaming" happens by focusing on some narrow measurement channel, and the fuzziness comes from what counts as a "narrow measurement channel".

Planned opinion:

You may have noticed that this newsletter doesn't talk about wireheading very much; this is one of the reasons why. It seems like wireheading is a fuzzy subset of specification gaming, and is not particularly likely to be the only kind of specification gaming that could lead to catastrophe. I'd be surprised if we found some sort of solution where we'd say "this solves all of wireheading, but it doesn't solve specification gaming" -- there don't seem to be particular distinguishing features that would allow us to have a solution to wireheading but not specification gaming. There can of course be solutions to particular kinds of wireheading that _do_ have clear distinguishing features, such as <@reward tampering@>(@Designing agent incentives to avoid reward tampering@), but I don't usually expect these to be the major sources of AI risk.

I consider wireheading to be a special case of proxy alignment in a mesaoptimiser.

Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself.

Suppose the base objective was to increase atmospheric pressure. One effect of increased atmospheric pressure is that less cosmic radiation reaches the ground (there is more air to block it). So an AI whose mesa goal was to protect Earth from radiation would be a proxy-aligned agent. It has the failure mode of surrounding Earth in an iron shell to block radiation. Note that this failure can happen whether or not the AI has any radiation sensors. An agent that wants to protect Earth from radiation did well enough in training, and now that is what it will do: protect Earth from radiation.

An agent with the mesa goal of maximizing pressure near all barometers would put them all in a pressure dome. (Or destroy all barometers and drop one "barometer" into the core of Jupiter.)

An agent with the mesa goal of maximizing the reading on all barometers would be the same. That agent will go around breaking all the world's barometers.

Another mesa objective that you could get is to maximize the number on this reward counter in this computer chip here.

Wireheading is a special case of a proxy-aligned mesa-optimizer where the mesa objective is something to do with the agent's own workings.

As with most real-world categories, "something to do with" is a fuzzy concept. There are mesa objectives that are clear instances of wireheading, ones that are clearly not, and borderline cases. This is about word definitions, not real-world uncertainty.

If anyone can describe a situation in which wireheading would occur that wasn't a case of mesa optimiser misalignment, then I would have to rethink this. (Obviously you can build an agent with the hard coded goal of maximizing some feature of its own circuitry, with no mesa optimization.)

I consider wireheading to be a special case of proxy alignment in a mesaoptimiser.

I agree. I've now added this line, which I thought I'd put in the original post, but apparently missed out:

Note, though, that the converse is true: every example of wireheading is a Goodhart curse.