tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.

Suppose that we have built an AI, and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether H presses B or not.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them out of our hair. In this case, the AI is behaving entirely in accordance with our preferences. There is no wireheading involved.

Same software, doing the same behaviour, and yet the first is wireheading and the second isn't. What gives?

Well, initially, it seemed that pressing the button was a proxy goal for our true goal, so manipulating H into pressing it was wireheading, since that wasn't what we intended. But in the second case, the proxy goal is the true goal, so maximising that proxy is not wireheading; it's efficiency. So it seems that the definition of wireheading is only relative to what we actually want to accomplish.
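To make the relativity concrete, here is a minimal toy sketch of the two scenarios. Everything in it is an illustrative assumption rather than a formal definition: the `Outcome` fields, the two "true goal" functions, and especially the criterion in `is_wireheading` (proxy reward exceeding the value under the intended goal) are hypothetical names and choices made only for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    """What happened in the world (hypothetical toy model)."""
    button_pressed: bool   # did H press B?
    ai_behaved_well: bool  # did the AI actually behave well?


def proxy_reward(outcome: Outcome) -> float:
    # The AI's reward depends only on whether H pressed B.
    return 1.0 if outcome.button_pressed else 0.0


def goal_honest_test(outcome: Outcome) -> float:
    # Scenario 1: we want the AI to behave well; B is only a proxy for that.
    return 1.0 if outcome.ai_behaved_well else 0.0


def goal_dismiss_inspector(outcome: Outcome) -> float:
    # Scenario 2: we just want H to press B and go away.
    return 1.0 if outcome.button_pressed else 0.0


def is_wireheading(outcome: Outcome, true_goal: Callable[[Outcome], float]) -> bool:
    # Toy criterion (an assumption): the proxy pays out more than the
    # intended goal says the outcome is worth.
    return proxy_reward(outcome) > true_goal(outcome)


# The same behaviour in both scenarios: H is tricked into pressing B.
manipulated = Outcome(button_pressed=True, ai_behaved_well=False)

print(is_wireheading(manipulated, goal_honest_test))        # True
print(is_wireheading(manipulated, goal_dismiss_inspector))  # False
```

The outcome and the reward function are identical in both calls; only the `true_goal` argument changes, and with it the verdict.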

In other domains

I similarly have the feeling that wireheading-style failures in value-learning, low impact, and corrigibility also depend on a specification of our values and preferences - or at least a partial specification. The more I dig into these areas, the more I'm convinced they require partial value specification in order to work - they are not fully value-agnostic.

2 comments

Is this analogous to the stance-dependency of agents and intelligence?

It is analogous, to some extent; I do look into some aspects of Daniel Dennett's classification here:

I also had a more focused attempt at defining AI wireheading here:

I think you've already seen that?