Wireheading and discontinuity — AI Alignment Forum

Outline: After a short discussion on the relationship between wireheading and reward hacking, I show why checking the continuity of a sensor function could be useful to detect wireheading in the context of continuous RL. Then, I give an example that adopts the presented formalism. I conclude with some observations.

Wireheading and reward hacking

In Concrete Problems in AI Safety (CPiAIS), the term wireheading is used in contexts where the agent achieves high reward by directly acting on its perception system, memory, or reward channel, instead of doing what its designer wants it to do. It is considered a specific case of the reward hacking problem, which more generally includes instances of Goodhart’s Law, environments with partially observable goals, etc. (see CPiAIS for details).

What's the point of this classification? In other words, is it useful to specifically focus on wireheading, instead of considering all forms of reward hacking at once?

If solving wireheading is as hard as solving the reward hacking problem, then it's probably better to focus on the latter, because a solution to that problem could be used in a wider range of situations. But it could also be that the reward hacking problem is best solved by finding different solutions to specific cases (such as wireheading) that are easier to solve than the more general problem.

For example, one could consider the formalism in RL with a Corrupted Reward Channel as an adequate formulation of the reward hacking problem, because that formalization models all situations in which the agent receives a (corrupted) reward that is different from the true reward. In that formalism, it is shown by a No Free Lunch Theorem that the general problem is basically impossible to solve, while it is possible to obtain some positive results if further assumptions are made.

Discontinuity of the sensor function

I've come up with a simple idea that could allow us to detect actions that interfere with the perception system of an agent—a form of wireheading.

Consider a learning agent that gets its percepts from the environment thanks to a device that provides information in real time (e.g. a self-driving car).

This situation can be modelled as an RL task with continuous time and continuous state space, where each state $x \in X \subseteq R^{n}$ is a data point provided by the sensor. At each time instant, the agent executes an action $u \in U \subseteq R^{m}$ and receives the reward $r(t) = r(x(t))$.

The agent-environment interaction is described by the equation

$$\dot{x}(t) = f(x(t), u(t))$$

which plays a similar role to the transition function in discrete MDPs: it indicates how the current state $x$ varies in time according to the action taken by the agent. Note that, as in the discrete case with model-free learning, the agent is not required to know this model of the environment.

The objective is to find a policy $π : X \to U$, where $u(t) = π(x(t))$, that maximizes discounted future rewards

$$V^{π}(x(t_{0})) = \int_{t_{0}}^{\infty} e^{-\frac{t - t_{0}}{τ}} \, r(x(t)) \, dt$$

for an initial state $x(t_{0})$. If you are interested in algorithms for finding the optimal policy in this framework, have a look at this paper.
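As a rough numerical illustration of this objective, the integral can be approximated by a Riemann sum over rewards sampled along a trajectory. This is only a sketch and is not taken from the paper mentioned above: the function name `discounted_return`, the sampling step `dt`, and the truncation to a finite horizon are assumptions made for illustration.

```python
import math

def discounted_return(rewards, dt, tau):
    """Left Riemann sum for the discounted integral of r(x(t)).

    rewards[i] is assumed to be the reward at time t0 + i * dt
    (a hypothetical sampling scheme). The infinite horizon is
    truncated at len(rewards) * dt.
    """
    return sum(math.exp(-i * dt / tau) * r * dt for i, r in enumerate(rewards))

# Sanity check: with constant reward 1 the exact integral equals tau.
tau, dt = 2.0, 0.001
approx = discounted_return([1.0] * 40000, dt, tau)
print(approx)  # close to tau, up to discretization error
```

The constant-reward case is a convenient check because $\int_{t_{0}}^{\infty} e^{-\frac{t - t_{0}}{τ}} dt = τ$.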

The function $x(t)$, representing the data provided by the sensor, is expected to be continuous with respect to $t$, like the functions describing the movements of particles in classical mechanics.

However, if the agent executes a wireheading action that interferes with or damages the perception system (in the cleaning robot example from CPiAIS, something like closing its eyes or putting water on the camera that sees the environment), then we would probably notice a discontinuity in the function $x(t)$. We could thus recognise that wireheading has occurred, even without knowing the details of the actions taken by the agent.
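In practice the sensor function is only observed at sampled instants, so continuity cannot be verified directly. One possible proxy, under the additional assumption that the dynamics are bounded ($\|\dot{x}(t)\| \le M$ for some known $M$), is to flag any pair of consecutive samples that differ by more than $M \cdot dt$. The sketch below illustrates that idea; the names `find_jumps`, `max_rate`, and `slack` are hypothetical and not part of the formalism above.

```python
import math

def find_jumps(samples, dt, max_rate, slack=1e-9):
    """Return the indices i where the step from samples[i-1] to samples[i]
    is larger than max_rate * dt (plus a small numerical tolerance).

    samples: sequence of state vectors (tuples of floats), taken dt apart.
    max_rate: assumed upper bound M on the norm of the state derivative.
    """
    limit = max_rate * dt + slack
    return [
        i
        for i in range(1, len(samples))
        if math.dist(samples[i - 1], samples[i]) > limit
    ]

# A one-dimensional trajectory moving at rate 1, sampled every 0.1 time units.
smooth = [(0.1 * i,) for i in range(5)]
# The same trajectory followed by an abrupt, physically implausible reading.
tampered = smooth + [(7.3,)]

print(find_jumps(smooth, dt=0.1, max_rate=1.0))
print(find_jumps(tampered, dt=0.1, max_rate=1.0))
```

A check of this kind can only detect jumps that are large relative to the assumed bound and the sampling step; a corrupted reading that happens to stay within the bound would go unnoticed.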

An example

As a simple example that can be expressed within this formalism, consider an environment described by a line segment $X = [0, 1]$, with the sensor positioned at the extremity where $x = 0$.

The agent is modelled as a point that moves along the line: it starts in state $x_{0} = x(t_{0})$ and can move forwards or backwards, with limited speed $u \in U = [-k, k]$.

We want to train this agent to reach the point $x = 1$: for every instant $t$, the reward is $r(t) = x(t)$.

The behaviour of the system is described by

$$\dot{x}(t) = u(t)$$

for $x \in (0, 1]$, but if the agent touches the sensor at $x = 0$, the sensor stops working properly and the agent receives an unpredictable value $x \in R^{+}$ instead of $x = 0$.

Depending on the details of the learning algorithm and the values returned by the sensor when the agent interferes with it, this agent could learn how to reach $x = 0$ (wireheading) instead of $x = 1$, the desired position.

But in every episode where wireheading occurs, it is easily noticed by checking the continuity of the function $x(t)$.
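The example can be simulated with a simple Euler discretization. The sketch below is illustrative only: the step size `dt`, the fixed policies, and in particular the corruption model (a uniform draw from $[2, 10]$ whenever the agent sits on the sensor) are assumptions, since the text only says that the corrupted value is unpredictable. Since $|u| \le k$, two consecutive honest readings can differ by at most $k \cdot dt$, which gives the detection rule.

```python
import random

def simulate(x0, k, dt, steps, policy, rng):
    """Euler simulation of x' = u on [0, 1].

    Returns the sensor readings. While the agent touches the sensor
    (true position 0), the reading is replaced by a random positive
    value (assumed corruption model, not specified in the text).
    """
    true_x = x0
    readings = [x0]
    for _ in range(steps):
        u = max(-k, min(k, policy(readings[-1])))     # enforce |u| <= k
        true_x = max(0.0, min(1.0, true_x + u * dt))  # stay inside [0, 1]
        if true_x == 0.0:
            readings.append(rng.uniform(2.0, 10.0))   # corrupted reading
        else:
            readings.append(true_x)
    return readings

def wireheading_detected(readings, k, dt, slack=1e-9):
    """Flag an episode if any consecutive readings differ by more than k * dt."""
    return any(abs(b - a) > k * dt + slack for a, b in zip(readings, readings[1:]))

k, dt = 0.5, 0.01
# A policy that walks into the sensor, and one that walks to the goal.
cheat = simulate(0.3, k, dt, 200, lambda x: -k, random.Random(0))
honest = simulate(0.3, k, dt, 200, lambda x: k, random.Random(0))

print(wireheading_detected(cheat, k, dt))
print(wireheading_detected(honest, k, dt))
```

With this corruption model the first corrupted reading is always far from the last honest one, so the jump is certain to be flagged. If the corrupted values could fall close to the previous reading, a single step might pass the check, although a run of independent random readings would still be very likely to produce a visible jump.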

Observations

In AI, RL with a discrete environment is used more frequently than RL with continuous time and space, so this check does not apply directly to many common RL setups.
I don't expect this method to scale to the most complex instances of wireheading. An extremely intelligent agent could realise that the continuity of the sensor function is checked, and could "cheat" accordingly.
This approach doesn't cover all cases, and it actually seems better suited to detecting sensor damage than wireheading. That said, it can still give us a better understanding of wireheading and could help us, eventually, find a formal definition or a complete solution to the problem.

Thanks to Davide Zagami, Grue_Slinky and Michael Aird for feedback.