In Stable Pointers to Value, I discussed various ways in which we can try to “robustly point at what we want” (ie, do value learning). I can tidy up the discussion there into three categories:
I want to point at an analogy to three categories of approach to the problem of generalizable environmental goals (as defined in the alignment for advanced machine learning agenda). It’s a fairly messy analogy, and there’s probably a better way of organizing the landscape, but FWIW.
Imagine you’re trying to teach a system to build bridges by showing it examples. You could learn a big neural network which distinguishes cases of “successfully building a bridge” from everything else, and then use this to drive the system.
If the agent is an RL or OU (observation-utility) agent, it is incentivised to “fool itself” by doing things like playing a video of bridge-building in front of its camera. You can try to train the classifier to notice this sort of thing, of course; you give it negative training examples in which someone puts a TV set in front of it and things thereafter appear as they do in one of the positive examples. However, you can’t figure out all the different negative training examples you need to give it ahead of time – especially if the rest of the system will continue to learn later on as the classifier remains fixed.
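Here is a toy sketch of that failure, assuming a world in which the camera cannot tell a real bridge from a TV showing one. The names `WorldState`, `camera`, and `observation_classifier` are hypothetical stand-ins for the learned components, not anything from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    bridge_built: bool
    tv_showing_bridge: bool  # a screen placed in front of the camera


def camera(state: WorldState) -> str:
    """What the agent observes; a TV showing a bridge looks like a bridge."""
    if state.bridge_built or state.tv_showing_bridge:
        return "bridge"
    return "no_bridge"


def observation_classifier(observation: str) -> bool:
    """Stand-in for the learned classifier: it only ever sees observations."""
    return observation == "bridge"


honest = WorldState(bridge_built=True, tv_showing_bridge=False)
fooled = WorldState(bridge_built=False, tv_showing_bridge=True)

# Both states produce the same observation, so both earn the same "success".
reward_honest = observation_classifier(camera(honest))
reward_fooled = observation_classifier(camera(fooled))
```

Because the two states are indistinguishable at the level of observations, no amount of cleverness inside `observation_classifier` can separate them in this toy world; only more negative examples (which change what `camera` effectively reveals) would help.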
To me, this feels closely analogous to trying to prevent RL systems from wireheading themselves by giving them strongly negative reward for trying to mess with their reward circuits. You don’t know ahead of time what all the things you need to punish are, but you would need to, since the system keeps getting smarter as the reward circuit remains the same. (Or, if humans are managing the reward button, they need to be able to recognize any attempts to mess with the hardware or take over control of the reward button or manipulate the humans.)
One way you might try to solve this: the AI is learning a model of the world in an unsupervised way, only trying to predict well, not thinking at all about its goals. Separately, the AI is learning a classifier representing the goals. This classifier takes the model state as its input, rather than the raw observations.
So, returning to the bridge-building example, the system is shown lots of examples of building bridges and not building bridges. It infers a physical model of what’s going on in those examples, plus a predicate on the physical situations which tells it whether the state of affairs corresponds to proper bridge-building.
As before, we can show it many negative training examples involving different methods of attempting to fool itself.
Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.
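A minimal sketch of the hopeful case, assuming the learned predicate really did latch onto the physical bridge. The plan names, `ModelState`, and the hand-written `PREDICTED_OUTCOME` table are all hypothetical; in the setup described above the world model would be learned rather than written down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelState:
    """Hypothetical internal state inferred by the world model."""
    bridge_physically_present: bool
    camera_sees_bridge: bool


# Hypothetical world model: the state each plan is predicted to lead to.
PREDICTED_OUTCOME = {
    "assemble_steel_span": ModelState(True, True),
    "do_nothing": ModelState(False, False),
    # Appeared as a negative training example:
    "put_tv_in_front_of_camera": ModelState(False, True),
    # Novel trick, never shown in training:
    "paint_bridge_on_lens": ModelState(False, True),
}


def goal_predicate(state: ModelState) -> bool:
    """Stand-in for the learned classifier; it reads model state, not pixels."""
    return state.bridge_physically_present


def acceptable_plans(plans):
    """Keep only plans whose predicted state satisfies the goal predicate."""
    return [p for p in plans if goal_predicate(PREDICTED_OUTCOME[p])]


chosen = acceptable_plans(PREDICTED_OUTCOME)
```

The novel plan `paint_bridge_on_lens` is rejected for the same reason as the TV trick: its predicted state has no physical bridge, even though the camera would see one.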
This can also deal with the problem of ontological crisis, even without new classifier data. As the physical model changes in response to new data, the classifier is simply re-learned so that it remains accurate on the original training examples.
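As a rough illustration of re-learning the classifier after the model changes, the sketch below keeps the recorded examples fixed and refits a very simple predicate against two different ontologies. Everything here is an assumption made for the example: the string observations, the two hand-written models, and the one-feature learner `fit_predicate`.

```python
from typing import Callable, Dict, List, Tuple

Observation = str
State = Dict[str, bool]

# Recorded training data: raw observations with human labels (hypothetical).
TRAINING: List[Tuple[Observation, bool]] = [
    ("steel span across river", True),
    ("wooden span across creek", True),
    ("empty riverbank", False),
    ("tv showing a bridge", False),
]


def old_model(obs: Observation) -> State:
    """Earlier ontology: coarse, object-level features."""
    return {"bridge_object": "span" in obs, "outdoors": "tv" not in obs}


def new_model(obs: Observation) -> State:
    """Later ontology: the world model now carves things up differently."""
    return {
        "banks_connected": "across" in obs,
        "made_of_steel": "steel" in obs,
        "screen_present": "tv" in obs,
    }


def fit_predicate(model: Callable[[Observation], State],
                  data: List[Tuple[Observation, bool]]) -> str:
    """Pick the single boolean feature that agrees with the most labels."""
    features = list(model(data[0][0]).keys())

    def agreement(feature: str) -> int:
        return sum(model(obs)[feature] == label for obs, label in data)

    return max(features, key=agreement)


# Same labeled examples, refit against each ontology.
old_feature = fit_predicate(old_model, TRAINING)
new_feature = fit_predicate(new_model, TRAINING)
```

No new labels are needed when the ontology shifts: the original examples are pushed through the new model and the predicate is fit again, here landing on `bridge_object` and then `banks_connected`.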
Unfortunately, this approach has serious problems.
Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean the hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.
This time, there’s no way to patch the problem with negative training examples. You can’t label an example as both positive and negative!
How can we avoid simple-but-wrong hypotheses like this?
Just as approval-directed agents put more work on the humans in the control loop, we can try and do the same here.
As in model-utility systems, we build a model of the environment through unsupervised learning, and also try to learn the utility in a supervised way.
However, this time the system gets feedback on the quality of hypotheses from humans, and also tries to anticipate such feedback in its model selection. I’m not sure exactly how this should work, but one version is: ask the humans to classify made-up examples. Such examples of bridge-building can be in imaginary worlds where there are no humans evaluating whether bridge-building is going on, so as to differentiate the pathological hypothesis mentioned above from the desired hypothesis.
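One way to read the made-up-examples idea is sketched below, as a toy and not a worked-out proposal. It assumes each situation records whether a bridge is physically present and whether, inside that world, a human recorded and labeled it as bridge-building. The class `Situation` and both hypothesis functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Situation:
    bridge_physically_present: bool
    # Inside this world, did a human record and label it "bridge-building"?
    labeled_positive_in_world: bool


def intended(s: Situation) -> bool:
    """Desired hypothesis: the goal is about the physical bridge."""
    return s.bridge_physically_present


def pathological(s: Situation) -> bool:
    """Simple-but-wrong hypothesis: the goal is about the human label."""
    return s.labeled_positive_in_world


def accuracy(h: Callable[[Situation], bool],
             data: List[Tuple[Situation, bool]]) -> float:
    return sum(h(s) == y for s, y in data) / len(data)


# Ordinary training data: every positive is, by construction, human-labeled.
training = [
    (Situation(True, True), True),
    (Situation(False, False), False),
    (Situation(False, False), False),  # e.g. TV in front of the camera
]

# Made-up examples: imagined worlds with no human evaluators present,
# judged by the real humans as hypotheticals.
imagined = [
    (Situation(True, False), True),   # bridge exists; nobody there to label it
    (Situation(False, False), False),
]

tie_on_training = accuracy(intended, training) == accuracy(pathological, training)
gap_on_imagined = accuracy(intended, imagined) - accuracy(pathological, imagined)
```

On the ordinary training data the two hypotheses are indistinguishable, which is why negative examples cannot help. The imagined worlds are the only place where they come apart, since there the bridge and the in-world label can differ.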
For this to work, though, we also have to solve the problem of providing human-understandable explanations of the AI’s learned models, which is its own Pandora’s box.
The overall point I’m trying to make here has similarities to the Reinforcement Learning with a Corrupted Reward Channel paper, particularly section 4.1: the way the system gets feedback matters a lot. The way humans get put into the loop can be very tricky; seemingly obvious answers lead to pathological behaviors for highly capable systems. Trying to fix this behavior can lead us down a rabbit-hole of trying patch after patch after patch, until a change in perspective like observation-utility learning eliminates the need for all those patches in one fell swoop (and then we find ourselves making entirely new patches on a higher level and about more important things…).