I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.
Within those projects, I'm aiming to work on a series of subprojects. The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.
The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and liable to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.
Thanks to Rohin Shah and Ramana Kumar for suggesting that I investigate this problem, and to Rebecca Gorman.
Parent project: this is a subproject of model-splintering (value extrapolation).
Suppose that an agent is trained on videos of happy humans. The wireheaded behaviour is for it to create similar videos and watch them itself. We'd prefer that it instead worked to make real humans happy.

But for that to be possible, the agent needs to consider "make real humans happy" as a possible reward function in the first place. Thus it needs to generate multiple reward functions that can explain the same data.
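The ambiguity can be made concrete with a toy sketch (all names and the two-bit observation space are hypothetical, chosen only to illustrate the idea): enumerate candidate reward functions and keep those consistent with the observed data. Both hypotheses survive training, yet they diverge on exactly the wireheaded states.

```python
# Toy observations: each has a "happy_video" flag (what the agent sees) and
# a "real_happy" flag (the state of actual humans). In training, the two
# flags always coincide, so the reward signal cannot distinguish them.
training_data = [
    ({"happy_video": True,  "real_happy": True},  1),
    ({"happy_video": False, "real_happy": False}, 0),
]

# Two candidate reward hypotheses, both simple, both fitting the data.
candidates = {
    "reward_video":  lambda obs: int(obs["happy_video"]),   # wireheadable
    "reward_humans": lambda obs: int(obs["real_happy"]),    # intended
}

# Keep every candidate that explains all the training data.
consistent = {
    name: r for name, r in candidates.items()
    if all(r(obs) == rew for obs, rew in training_data)
}
print(sorted(consistent))  # both hypotheses survive

# Off-distribution, they disagree: a happy video with unhappy real humans.
wirehead = {"happy_video": True, "real_happy": False}
print({name: r(wirehead) for name, r in consistent.items()})
```

The point of the research project is to get algorithms to produce the full `consistent` set (including the intended reward), rather than collapsing to whichever single hypothesis is simplest.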
CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right:
Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level.
When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level: that reward is chosen by default. The aim of the research project is to make the algorithm generate multiple (simple) reward functions that explain the initial data, including the "reach the coin" reward. This needs to be done in a generalisable way.
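The two explanations can be written down explicitly in a toy sketch (the one-dimensional state and function names are hypothetical simplifications of CoinRun): on every training level the coin sits at the right edge, so the two rewards are indistinguishable; moving the coin makes them diverge.

```python
# One-dimensional toy version of CoinRun: a state is (agent_x, coin_x)
# on a level of fixed width.
LEVEL_WIDTH = 10

def reward_reach_coin(agent_x, coin_x):
    """Intended reward: the agent is at the coin."""
    return int(agent_x == coin_x)

def reward_reach_right(agent_x, coin_x):
    """Degenerate reward: the agent is at the right edge of the level."""
    return int(agent_x == LEVEL_WIDTH - 1)

# Training distribution: the coin is always at the right edge, so the two
# reward functions agree on every state the agent can visit.
train_states = [(x, LEVEL_WIDTH - 1) for x in range(LEVEL_WIDTH)]
assert all(reward_reach_coin(a, c) == reward_reach_right(a, c)
           for a, c in train_states)

# Test distribution: the coin is moved to position 3 and the rewards diverge.
print(reward_reach_coin(9, 3), reward_reach_right(9, 3))  # 0 1
print(reward_reach_coin(3, 3), reward_reach_right(3, 3))  # 1 0
```

An algorithm that only returns one simple fit to the training data has no way to prefer `reward_reach_coin`; generating both hypotheses is what keeps the intended reward in play.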
Consider the image classification task of classifying images of huskies versus images of lions:
A naïve image classifier could be trained on images of this type. The simplest one would probably become a brown-versus-white classifier. We'd want to force the algorithm to generate more classifiers (more "reward functions" for the task of correct classification).
One way to do that is to give the algorithm many additional unlabelled images. In a semi-supervised way, the AI can extract the key features of these images; different classifiers are then trained on the original labelled data, using those features.
The ultimate aim is for the algorithm to produce, e.g., one classifier that classifies by colour, another by landscape, another by husky-versus-lion, and so on.
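The desired outcome can be sketched in toy form (the feature names and the stubbed-out feature extraction are hypothetical; in practice the features would come from the semi-supervised stage): the labelled images happen to be separable along several features at once, so one simple classifier per feature fits the labels equally well, and only off-distribution images tell them apart.

```python
# Labelled training images, represented by features that a semi-supervised
# stage is assumed to have extracted; label 1 = husky, 0 = lion.
labelled = [
    ({"colour": "white", "terrain": "snow",    "species": "husky"}, 1),
    ({"colour": "brown", "terrain": "savanna", "species": "lion"},  0),
]

def make_classifier(feature, positive_value):
    """A one-feature classifier: predicts 1 iff the feature has this value."""
    return lambda feats: int(feats[feature] == positive_value)

# One simple classifier per extracted feature.
classifiers = {
    "by_colour":  make_classifier("colour", "white"),
    "by_terrain": make_classifier("terrain", "snow"),
    "by_species": make_classifier("species", "husky"),
}

# All three classifiers explain the labelled data equally well...
for name, clf in classifiers.items():
    assert all(clf(feats) == label for feats, label in labelled)

# ...but they diverge on a white lion photographed in the snow.
odd = {"colour": "white", "terrain": "snow", "species": "lion"}
print({name: clf(odd) for name, clf in classifiers.items()})
```

As with CoinRun, the research goal is to generate the whole family of simple classifiers consistent with the labels, so that the intended one (`by_species`) is at least among the candidates.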
This is a simpler version of the project presented here: generating multiple "reward functions" (classifiers), but in a simpler setting.