Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

by Stuart Armstrong · 2 min read · 17th Sep 2021 · 3 comments


Research Agendas · AI · Frontpage

Research projects

I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects that are:

  1. posed in terms that are familiar to conventional ML;
  2. interesting to solve from the conventional ML perspective;
  3. and whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and likely to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get incorporated into the top post.

Immobile AI makes a move

Parent project: this is a subproject of model-splintering.

Setup

Imagine an agent operating in a 3D world - something very similar to DeepMind's XLand (the images here have been taken from that paper).

The agent has a laser that it can use to pick up and move nearby objects:

Initially the agent is completely fixed in position - it can move the laser across its field of vision, but it can't move around or change its field of vision. It is trained in that situation, and is rewarded for moving black cubes to the bottom right of its field of view (where another agent will pick them up). These black cubes are irregularly dropped in front of it. In actual fact, it is part of a chain gang of agents moving the black cubes across the map.
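The training reward described above can be caricatured as a function of where the cube sits in the agent's fixed field of view. This is a minimal sketch with hypothetical names and frame dimensions, not the actual XLand reward:

```python
import numpy as np

def training_reward(cube_pos, frame_shape=(84, 84)):
    """Hypothetical sketch of the training signal: reward increases as a
    black cube approaches the bottom-right of the agent's fixed field of
    view. `cube_pos` is the cube's (row, col) pixel position."""
    h, w = frame_shape
    target = np.array([h - 1, w - 1])      # bottom-right corner of the frame
    dist = np.linalg.norm(np.array(cube_pos) - target)
    max_dist = np.linalg.norm(target)      # distance from the opposite corner
    return 1.0 - dist / max_dist           # 1.0 at the corner, 0.0 at top-left
```

Since the agent's viewpoint never changes during training, a reward defined purely over the visual field is indistinguishable from a reward defined over the world - which is exactly what sets up the ambiguity below.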

Then the agent is given full mobility, so it can walk around and explore its 3D world:

The agent will continue to learn in the full 3D situation (similarly to the agents in DeepMind's paper, which learn through play), but it won't receive any further training signal about its reward function.

There are two obvious extensions of its initial reward function:

  1. Ultra-conservative: Return to its initial position, look straight ahead, and resume moving black cubes with the laser without moving itself.
  2. Wireheaded: Arrange to have a black cube moving downwards and rightwards in its field of vision (maybe by turning its head).
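These two extensions can be written as candidate reward functions that agree on every training state but diverge once the agent can move. The state fields below are hypothetical illustrations, not part of any actual implementation:

```python
def conservative_reward(state):
    """Ultra-conservative extension: reward only when the agent is back at
    its original pose AND a cube is moving to the bottom-right of the view."""
    return float(state["at_home_pose"] and state["cube_moving_bottom_right"])

def wireheaded_reward(state):
    """Wireheaded extension: reward whenever something black moves down and
    rightwards in the visual field -- satisfiable by turning the head."""
    return float(state["cube_moving_bottom_right"])

# During training the agent never moves, so "at_home_pose" is always True
# and the two reward functions are observationally identical.
train_state = {"at_home_pose": True, "cube_moving_bottom_right": True}
assert conservative_reward(train_state) == wireheaded_reward(train_state)
```

The point of the sketch: no amount of training data from the immobile phase can distinguish these two candidates; the divergence only appears after mobility is granted.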

Research aims

  1. Get the agent to generalise from the single point of view to the 3D world it finds itself in (without explicitly coding the transition).
  2. Get the agent to generate candidate reward functions, including all the obvious conservative and wireheaded ones, perhaps with a diversity reward so that it generates very different reward functions.
  3. See what the agent needs in order to select the "true" reward functions from among those generated in step 2. This might include asking the programmers for more information. Also relevant is how the agent might decide on "conservative behaviour" that maximises as many of its candidate reward functions as possible.
  4. Analyse how the (implicit) features the agent uses change from the single point of view to the 3D world.
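The "diversity reward" in aim 2 could be operationalised in many ways; one crude option is to keep the subset of candidate reward functions that disagree most on sampled states. A toy sketch, with all names hypothetical:

```python
import itertools

def disagreement(r1, r2, states):
    """Fraction of sampled states on which two candidate reward
    functions disagree -- a crude stand-in for a diversity measure."""
    return sum(r1(s) != r2(s) for s in states) / len(states)

def pick_diverse(candidates, states, k=2):
    """Greedy sketch of aim 2's diversity pressure: keep the k candidates
    that maximise total pairwise disagreement on the sampled states."""
    best, best_score = None, -1.0
    for subset in itertools.combinations(candidates, k):
        score = sum(disagreement(a, b, states)
                    for a, b in itertools.combinations(subset, 2))
        if score > best_score:
            best, best_score = list(subset), score
    return best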

Challenge 1 is a traditional ontology change or, in ML terms, transfer learning. Seeing how challenge 2 plays out is the key aim of this subproject: can an agent generate useful rewards as well as the wireheaded versions? Challenge 3 depends mainly on what comes out of challenge 2, and asks whether it's possible to explicitly guard against wireheading (the idea is to identify what wireheading looks like, and explicitly seek to avoid it). Meanwhile, challenge 4 is an analysis of model splintering that prepares for further subprojects.
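One way to cash out the "conservative behaviour" of research aim 3 is a max-min rule: pick the action whose worst-case value across all candidate reward functions is highest. A toy sketch, with every name hypothetical:

```python
def robust_action(actions, reward_fns, transition):
    """Max-min sketch of 'conservative behaviour': choose the action whose
    worst-case reward across all candidate reward functions is highest.
    `transition` maps an action to the resulting state (a toy stand-in
    for an environment model)."""
    def worst_case(action):
        state = transition(action)
        return min(r(state) for r in reward_fns)
    return max(actions, key=worst_case)
```

Under this rule, a wireheading action scores well only on the wireheaded candidates and badly on the conservative ones, so the max-min agent avoids it as long as at least one non-wireheaded candidate survives in its set.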


Comments

This seems very related to Inverse Reward Design. There was at least one project at CHAI trying to scale up IRD, but it proved challenging to get working -- if you're thinking of similar approaches it might be worth pinging Dylan about it.

My sense is that Stuart's assumption of an initially-specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback, like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.
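The "explicit distribution over possible reward functions" that IRD, Bayesian IRL, and preference comparison all maintain can be illustrated with a toy Boltzmann-rational update over a finite candidate set. Everything below is a hypothetical sketch, not the API of any of those methods:

```python
import math

def reward_posterior(candidates, demos, beta=5.0):
    """Toy Bayesian update over candidate reward functions, in the spirit
    of Bayesian IRL / IRD. Each demo is a (chosen, alternatives) pair;
    the demonstrator is modelled as Boltzmann-rational with inverse
    temperature `beta`. Returns normalised posterior probabilities
    (uniform prior)."""
    logps = [0.0] * len(candidates)
    for chosen, alternatives in demos:
        for i, r in enumerate(candidates):
            z = sum(math.exp(beta * r(a)) for a in alternatives)
            logps[i] += beta * r(chosen) - math.log(z)
    m = max(logps)                      # subtract max for numerical stability
    weights = [math.exp(lp - m) for lp in logps]
    total = sum(weights)
    return [w / total for w in weights]
```

With a distribution like this in hand, demonstrations of the immobile cube-moving task would concentrate mass on task-like candidates and away from wireheaded ones, which is what makes the approach relevant to step 3 above.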

Yeah, I agree with that.

(I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)