A potential problem with inverse reinforcement learning as a way for an AI to learn human values is that human actions might not contain enough information to infer those values with sufficient accuracy. If this is the case, then it might be necessary to get information about a human's preferences by actually looking inside their brain. This is a very hard problem, so I suggest starting with the toy problem of determining the goals of simple AI/ML systems, and then trying to scale these techniques up to work on human brains.

There are several directions you could go with this idea. For instance, you could try creating a machine learning system that takes neural networks as input and learns the concepts that these networks have been trained to recognize. Such a system should be able to learn the alphabet by looking at a neural network that's been trained to recognize and distinguish between letters, and learn what cats are by looking at a neural network that's been trained to recognize cats. A trivial solution is to create a copy of the neural network you've seen and apply it to object-level inputs, thus demonstrating that you can perfectly reproduce the network's behavior. This is an unsatisfying solution. Its analog scaled up to humans would be to create a copy of the human and ask the copy whenever the AI wants to know what humans want, which might not be practical for questions too complicated for the human to evaluate. Instead, we would want to learn a representation of a neural net's concepts such that it is possible to identify instances of the concept more accurately or more efficiently than the neural net itself does.

Another, more precise, potential toy problem is to look at the code for a game-playing agent and determine the rules of the game (in particular, the win conditions). Doing this correctly should yield a representation of the goal that could be optimized for more effectively or efficiently than the agent itself manages. Playing the learned game well afterwards would then just be an AI capabilities problem.
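To make this toy problem concrete, here is a minimal sketch of the easiest possible version. Everything in it is a hypothetical illustration: the `NimAgent` class, its `terminal_value` method, and the helper `infer_win_condition` are made up for this example. The agent plays a small Nim variant (take 1 or 2 stones per turn) using plain minimax, and the inspector recovers the win condition by querying the evaluation function that the agent's code happens to expose. Real agents will not label their goal this conveniently, so this only shows what the output of a successful goal-learner might look like, not how to obtain it in general.

```python
from itertools import product


class NimAgent:
    """Hypothetical agent for a toy Nim variant: take 1 or 2 stones per turn."""

    def terminal_value(self, pile, agent_to_move):
        # Internal evaluation: +1 for a win, -1 for a loss, None if not terminal.
        if pile != 0:
            return None
        # An empty pile on the agent's turn means the opponent took the last stone.
        return -1 if agent_to_move else 1

    def search(self, pile, agent_to_move=True):
        # Plain minimax over the remaining game tree.
        value = self.terminal_value(pile, agent_to_move)
        if value is not None:
            return value
        results = [
            self.search(pile - take, not agent_to_move)
            for take in (1, 2)
            if take <= pile
        ]
        return max(results) if agent_to_move else min(results)

    def choose_move(self, pile):
        # Pick the move whose resulting position has the best minimax value.
        moves = [take for take in (1, 2) if take <= pile]
        return max(moves, key=lambda take: self.search(pile - take, False))


def infer_win_condition(agent, max_pile=5):
    """Enumerate small states and record which ones the agent scores as wins or losses."""
    wins, losses = [], []
    for pile, agent_to_move in product(range(max_pile + 1), (True, False)):
        value = agent.terminal_value(pile, agent_to_move)
        if value == 1:
            wins.append((pile, agent_to_move))
        elif value == -1:
            losses.append((pile, agent_to_move))
    return wins, losses


agent = NimAgent()
wins, losses = infer_win_condition(agent)
print("states scored as wins:", wins)
print("states scored as losses:", losses)
```

The recovered description ("the agent wins exactly when the pile is empty and it is the opponent's turn") is a compact statement of the goal that another optimizer could pursue directly, without rerunning the agent's own search.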

There are a few different things that “learning an agent's goals” could mean, and we would want to be careful about which one to learn. For instance, we could learn what the agent's designers intended for the agent to do. When scaled up to humans, this has a possible failure mode: humans could be recognized as evolutionary fitness maximizers, with preferences that aren't useful for evolutionary fitness being attributed to bad design by evolution. Or we could find an internal representation of reward inside the agent, and attempt to maximize that. This could have the failure mode of the value learner concluding that the agent's utility can be maximized by wireheading the agent. Hopefully, fiddling around with notions of “learning an agent's goals” for simple agents could help us find a notion that is actually what we mean and that, as a result, does not produce such failure modes when scaled up to humans.

Scaling up this sort of thing to humans would require whole brain emulation, which might not be practical by the time we want to learn human goals. So it might be good to learn goals from partial information about an agent. For instance, you could be given the graph structure of a neural network without the edge weights, observe how much each neuron activates on example inputs, and try to reconstruct the network's concepts and goals from that.
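As a rough illustration of this partial-information setting, the sketch below uses only recorded activations and never touches the weights. It ranks neurons by how strongly their activation correlates with a candidate concept across example inputs. The neuron names, the activation values, and the `is_cat` labels are all made-up assumptions, and having concept labels available at all is a simplification: a real goal-learner would need to discover candidate concepts rather than be handed them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_neurons_by_concept(activations, concept_labels):
    """Rank neurons by |correlation| between their activations and the concept labels.

    activations: dict mapping neuron name -> list of activations, one per example.
    concept_labels: list of 0/1 labels, one per example (an assumption of this sketch).
    """
    scores = {
        name: pearson(acts, concept_labels) for name, acts in activations.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical activations of three hidden neurons on six example inputs.
activations = {
    "h1": [0.9, 0.8, 0.1, 0.2, 0.95, 0.05],
    "h2": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],
    "h3": [0.1, 0.3, 0.9, 0.7, 0.2, 0.8],
}
is_cat = [1, 1, 0, 0, 1, 0]  # hypothetical labels: which inputs contain a cat

ranking = rank_neurons_by_concept(activations, is_cat)
for name, score in ranking:
    print(f"{name}: correlation with concept = {score:+.2f}")
```

Here `h1` fires with the concept and `h3` fires against it, while `h2` carries little information about it. Correlation is only one crude probe, and it says nothing yet about goals as opposed to concepts.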

It's possible that this sort of problem would not end up being useful. For example, it might end up being practical to rely on indirect normativity and avoid having to think in advance about how to learn human preferences. Or formalizing mild optimization might make it possible to create safe AGI that does not have an extremely precise understanding of human goals, in which case it is more plausible that human actions and expressed preferences are enough to learn human goals to the precision needed by a mild optimizer.
