I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.
Within those projects, I'm aiming to work on a series of subprojects. The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.
The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and liable to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.
Thanks to Rohin Shah and Ramana Kumar for suggesting that I investigate this problem, and to Rebecca Gorman.
Parent project: this is a subproject of model-splintering (value extrapolation).
Suppose that an agent is trained on videos of happy humans. The wireheaded behaviour is for it to create similar videos and watch them itself. We'd prefer that it instead worked to make real humans happy.

But for that to be possible, the agent needs to consider "make real humans happy" as a possible reward function in the first place. Thus it needs to generate multiple reward functions that can explain the same data.
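The ambiguity can be made concrete with a toy sketch (all names and the two-bit observation space are hypothetical, chosen only to illustrate the idea): enumerate candidate reward functions and keep those consistent with the observed data. Both hypotheses survive training, yet they diverge on exactly the wireheaded states.

```python
# Toy observations: each has a "happy_video" flag (what the agent sees) and
# a "real_happy" flag (the state of actual humans). In training, the two
# flags always coincide, so the reward signal cannot distinguish them.
training_data = [
    ({"happy_video": True,  "real_happy": True},  1),
    ({"happy_video": False, "real_happy": False}, 0),
]

# Two candidate reward hypotheses, both simple, both fitting the data.
candidates = {
    "reward_video":  lambda obs: int(obs["happy_video"]),   # wireheadable
    "reward_humans": lambda obs: int(obs["real_happy"]),    # intended
}

# Keep every candidate that explains all the training data.
consistent = {
    name: r for name, r in candidates.items()
    if all(r(obs) == rew for obs, rew in training_data)
}
print(sorted(consistent))  # both hypotheses survive

# Off-distribution, they disagree: a happy video with unhappy real humans.
wirehead = {"happy_video": True, "real_happy": False}
print({name: r(wirehead) for name, r in consistent.items()})
```

The point of the research project is to get algorithms to produce the full `consistent` set (including the intended reward), rather than collapsing to whichever single hypothesis is simplest.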
CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right:
Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level.
When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level: that reward is chosen by default. The aim of the research project is to make the algorithm generate multiple (simple) reward functions that explain the initial data, including the "reach the coin" reward. This needs to be done in a generalisable way.
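The two explanations can be written down explicitly in a toy sketch (the one-dimensional state and function names are hypothetical simplifications of CoinRun): on every training level the coin sits at the right edge, so the two rewards are indistinguishable; moving the coin makes them diverge.

```python
# One-dimensional toy version of CoinRun: a state is (agent_x, coin_x)
# on a level of fixed width.
LEVEL_WIDTH = 10

def reward_reach_coin(agent_x, coin_x):
    """Intended reward: the agent is at the coin."""
    return int(agent_x == coin_x)

def reward_reach_right(agent_x, coin_x):
    """Degenerate reward: the agent is at the right edge of the level."""
    return int(agent_x == LEVEL_WIDTH - 1)

# Training distribution: the coin is always at the right edge, so the two
# reward functions agree on every state the agent can visit.
train_states = [(x, LEVEL_WIDTH - 1) for x in range(LEVEL_WIDTH)]
assert all(reward_reach_coin(a, c) == reward_reach_right(a, c)
           for a, c in train_states)

# Test distribution: the coin is moved to position 3 and the rewards diverge.
print(reward_reach_coin(9, 3), reward_reach_right(9, 3))  # 0 1
print(reward_reach_coin(3, 3), reward_reach_right(3, 3))  # 1 0
```

An algorithm that only returns one simple fit to the training data has no way to prefer `reward_reach_coin`; generating both hypotheses is what keeps the intended reward in play.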
Consider the image classification task of classifying images of huskies versus images of lions:
A naïve image classifier could be trained on images of this type. The simplest one would probably become a brown-versus-white classifier. We'd want to force the algorithm to generate more classifiers (more "reward functions" for the task of correct classification).
One way to do that is to give the algorithm many additional unlabelled images. In a semi-supervised way, the AI can extract the key features of these images; different classifiers are then trained on the original labelled data, using those features.
The ultimate aim is for the algorithm to produce, e.g., one classifier that classifies by colour, another by landscape, another by husky-versus-lion, and so on.
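The desired outcome can be sketched in toy form (the feature names and the stubbed-out feature extraction are hypothetical; in practice the features would come from the semi-supervised stage): the labelled images happen to be separable along several features at once, so one simple classifier per feature fits the labels equally well, and only off-distribution images tell them apart.

```python
# Labelled training images, represented by features that a semi-supervised
# stage is assumed to have extracted; label 1 = husky, 0 = lion.
labelled = [
    ({"colour": "white", "terrain": "snow",    "species": "husky"}, 1),
    ({"colour": "brown", "terrain": "savanna", "species": "lion"},  0),
]

def make_classifier(feature, positive_value):
    """A one-feature classifier: predicts 1 iff the feature has this value."""
    return lambda feats: int(feats[feature] == positive_value)

# One simple classifier per extracted feature.
classifiers = {
    "by_colour":  make_classifier("colour", "white"),
    "by_terrain": make_classifier("terrain", "snow"),
    "by_species": make_classifier("species", "husky"),
}

# All three classifiers explain the labelled data equally well...
for name, clf in classifiers.items():
    assert all(clf(feats) == label for feats, label in labelled)

# ...but they diverge on a white lion photographed in the snow.
odd = {"colour": "white", "terrain": "snow", "species": "lion"}
print({name: clf(odd) for name, clf in classifiers.items()})
```

As with CoinRun, the research goal is to generate the whole family of simple classifiers consistent with the labels, so that the intended one (`by_species`) is at least among the candidates.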
This is a simpler version of the project presented here: generating multiple "reward functions" (classifiers), but in a simpler setting.