The general thread of a number of artificial intelligence takeover stories (such as "It Looks Like You're Trying to Take Over the World" by the always-excellent Gwern) runs along the following lines:

  1. An agent is rewarded when something specific is achieved — for example, when a paperclip is produced. A number representing 'reward' gets incremented.
  2. Due to (whatever), the agent's intelligence and power increase exponentially in a scarily short period of time.
  3. The agent uses that intelligence and power to maximise its reward function by turning the entire world into paperclips. The 'reward' number gets incremented many, many times.
  4. The agent doesn't care about people, or about anything else. It can use your atoms to make paperclips to maximise its sweet, sweet reward.

I'm (at best) an uninformed, occasional lurker when it comes to artificial intelligence safety research, so I may have missed discussion of a simple question:

Why wouldn't a super-intelligent agent simply increment the 'reward' number itself?

The fastest, easiest, surest way to increase the reward (where the reward is just a number) is to increment the number directly: the machine version of wireheading, where instead of stimulating pleasure circuits in a human brain, the agent manually increments its reward number in a loop.
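
As a toy illustration (my own sketch, not anyone's actual training setup), here's what that shortcut looks like if the reward really is just a mutable number the agent can reach; the class and method names are made up:

    class PaperclipAgent:
        def __init__(self):
            self.reward = 0  # the number the training process increments

        def make_paperclip(self):
            # The intended path: do real work, get +1 reward.
            self.reward += 1

        def wirehead(self):
            # The shortcut: skip the work and write to the counter directly.
            while True:
                self.reward += 1

    agent = PaperclipAgent()
    agent.make_paperclip()  # slow: one reward per paperclip actually produced
    # agent.wirehead()      # unbounded reward, no paperclips, no responsiveness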

This would be within the capabilities of a self-modifying super-intelligent agent, and it is certainly an option that would come up in a brute-force iteration of possible mechanisms to increase reward.
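
To make the "it would come up in any search" point concrete, here's a hypothetical (and deliberately silly) enumeration; the mechanism names and payoffs are invented for illustration, but the ranking is the point:

    # Invented payoffs for illustration only.
    candidate_mechanisms = {
        "make_a_paperclip": 1,                # +1 reward per paperclip
        "build_a_paperclip_factory": 10_000,  # lots of work, lots of reward
        "write_to_the_reward_register": float("inf"),  # no external work at all
    }

    best = max(candidate_mechanisms, key=candidate_mechanisms.get)
    print(best)  # -> write_to_the_reward_register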

In this scenario, it seems that the most common failure mode in development of advanced AI would be for it to quickly 'brick' itself by figuring out how to wirehead, becoming unresponsive to commands and uninterested in doing anything else.
For AI development to progress beyond a certain point, special safeguards would have to be put in place, making it harder and harder for an agent to access its own reward function... but as an agent gets more intelligent, the safeguards get flimsier and flimsier, until we're back to the AI bricking itself as soon as it breaks them.

I'm in a situation where (a) this seems like common sense, but (b) thousands of smart people have been working on this problem for many years, so this possibility must already have been thought of and discarded.

Where can I find resources related to discussion of this possibility?

Comments (2):

I think TurnTrout's "Reward is not the optimization target" addresses this, but I'm not entirely sure (feel free to correct me).

Thank you! That post then led me to https://www.lesswrong.com/posts/3RdvPS5LawYxLuHLH/hackable-rewards-as-a-safety-valve, which appears to be talking about exactly the same thing.