Research projects

I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects that are:

  1. posed in terms that are familiar to conventional ML;
  2. interesting to solve from the conventional ML perspective; and
  3. have solutions that can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate on these ideas and improve them quickly before implementing them. Because of that, these posts should be considered dynamic and liable to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.

AI learns how to be conservative

Parent project: this is a subproject of model splintering (though it is also somewhat related to learning values, if we assume that the status quo has implicit information about human preferences).


Victoria Krakovna has done valuable research on reducing the side-effects that an AI might cause (such as smashing a vase while cleaning a room, or killing all of humanity while boosting shareholder profits).

I've critiqued these approaches, illustrating the issues by considering AI subagents. But my more fundamental critique is that they are too "syntactic": they are trying to prevent unwanted side-effects by using a mathematical formula which encodes "don't move too far from a baseline".
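As a schematic illustration of what such a formula can look like (a generic form assumed here for illustration, not the exact definition used in any particular paper), the task reward is reduced by a penalty on deviation from the baseline:

```latex
R'(s_t, a_t) \;=\; R(s_t, a_t) \;-\; \lambda \, d\!\left(s_t,\; s_t^{\text{baseline}}\right)
```

Here R is the task reward, the baseline state is whatever the world would have looked like under some default behaviour, d is a chosen deviation measure, and lambda > 0 sets the strength of the penalty; all of these symbols are placeholders. Nothing in the formula says which deviations matter to humans, which is the sense in which it is "syntactic".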

I'd want something more "semantic": something that tries to figure out what side-effects are and are not desirable, and to what extent (since side-effects in the real world are inevitable, there may need to be trade-offs between them).

In Victoria's "sushi environment", an agent aims to reach a goal (the star), but there is a conveyor belt with a sushi roll on it. The person who ordered the sushi would be happy if the agent didn't push the sushi roll off the belt while moving to its goal.

In contrast, in the "vase environment", there is a vase on a conveyor belt. We want the agent to take the vase off the conveyor belt before the vase smashes. But we don't want the agent to "offset" its actions: we don't want it to reduce its impact by putting the vase back on the conveyor belt after it has successfully taken it off (which would return the situation to the baseline).

Notice that there is some value content in these two examples: a human eating the sushi they order is good; a vase smashing needlessly is bad. So a friendly AI would solve the problems without difficulty.

It seems we might be able to get some positive work done here without needing a full friendly utility function, by looking at what "typically" happens in similar situations. People often eat sushi when they order it. Precious vases are rarely put onto unsafe conveyor belts. This subproject aims to further develop the idea of the agent using what frequently happens in similar environments.


The project will generate many examples of environments akin to the sushi and the vase environments, and inspired by these. Each environment will have some potentially bad side-effects that we'd wish to avoid.

There will be multiple copies of each environment, some with the AI agent in them, some with "human" agents, and some with other artificial agents. These environments may all be combined in the same "world", like a giant Second Life game.

A key fact is that, without the agent interfering, the world is progressing in an "adequate" way. The "humans" are eating sushi, not smashing vases too often, and generally having an acceptable time.

The agent is then tasked with accomplishing some goal in this world, such as transporting stuff or cleaning. We'll aim to have it accomplish that goal while learning what is acceptable by looking at the features of the world, how they correlate together, and how they change.

Since the worlds are simple and the features are pre-programmed, this subproject does not require or expect the algorithm to develop its own features (though there may be some emergent surprises).
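One possible reading of "learning what is acceptable from the features of the world" can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's method: it assumes each state is a dict of pre-programmed features, that agent-free trajectories of the "adequate" world are available, and that a feature change is less acceptable the more rarely it was observed. All names (`feature_changes`, `fit_change_counts`, `atypicality_penalty`) and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Set, Tuple

# Assumption: a state is a dict of pre-programmed features, and all states
# share the same feature names.
State = Dict[str, Hashable]
Change = Tuple[str, Hashable, Hashable]  # (feature, old value, new value)


def feature_changes(before: State, after: State) -> Set[Change]:
    """Return the (feature, old, new) triples that differ between two states."""
    return {(k, before[k], after[k]) for k in before if before[k] != after[k]}


def fit_change_counts(trajectories: List[List[State]]) -> Tuple[Counter, int]:
    """Count how often each feature change occurs per step without the agent."""
    counts: Counter = Counter()
    total_steps = 0
    for trajectory in trajectories:
        for before, after in zip(trajectory, trajectory[1:]):
            counts.update(feature_changes(before, after))
            total_steps += 1
    return counts, total_steps


def atypicality_penalty(counts: Counter, total_steps: int,
                        before: State, after: State,
                        smoothing: float = 1.0) -> float:
    """Sum of surprisals -log p(change), with additive smoothing.

    Rarely observed changes get a large penalty; no change costs nothing.
    """
    penalty = 0.0
    for change in feature_changes(before, after):
        p = (counts[change] + smoothing) / (total_steps + 2 * smoothing)
        penalty += -math.log(p)
    return penalty


# Toy data (invented for illustration): sushi is usually eaten, rarely dropped.
human_runs = (
    [[{"sushi": "on_belt"}, {"sushi": "eaten"}]] * 9
    + [[{"sushi": "on_belt"}, {"sushi": "on_floor"}]]
)
counts, steps = fit_change_counts(human_runs)
eaten = atypicality_penalty(counts, steps, {"sushi": "on_belt"}, {"sushi": "eaten"})
dropped = atypicality_penalty(counts, steps, {"sushi": "on_belt"}, {"sushi": "on_floor"})
print(dropped > eaten)  # the rarer change is penalised more
```

The agent could subtract a weighted version of this penalty from its task reward. The sketch treats each feature change independently, so it ignores how features correlate, and it penalises every unseen change equally, which is one way the resulting agent could end up far too conservative.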

Research aims

  1. Get the agent to construct a general "don't have negative side-effects" approach by analysing the typical features of its environments.
  2. Note what works in approach 1, and what fails in interesting ways.
  3. We can expect that approach 1 will result in an agent that is far too conservative (see also the example from this post where an AI super-surgeon might be motivated to hurt its patients to keep the post-operative pain levels similar). So the next step is to try to reduce conservatism while still preserving as much of the "don't have negative side-effects" behaviour as possible.
  4. The insights and approaches from this subproject will then feed into designing new subprojects where the features might not be so explicit (e.g. using real human-generated data rather than toy examples).

4 comments

Do you know yet how your approach would differ from applying inverse RL (e.g. MaxCausalEnt IRL)?

If you don't want to assume full-blown demonstrations (where you actually reach the goal), you can still combine a reward function learned from IRL with a specification of the goal. That's effectively what we did in Preferences Implicit in the State of the World.

(The combination there isn't very principled; a more principled version would use a CIRL-style setup, which is discussed in Section 8.6 of my thesis.)

Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?

Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)

Cool, thanks; already in contact with them.