Three preference frameworks for goal-directed agents

jessicata

In the previous post, I discussed three applications of a value learning model. In this post, I go into more detail about value learning for goal-directed agents by discussing three possible preference frameworks for them.

These three proposals probably do not cover all useful goal-directed AI preference frameworks; they should be taken as three points in a larger design space. Still, looking at these three should be instructive for coming up with problems that apply across different AGI architectures. I plan to write an additional post for each of these three proposals.

Value-learning sovereign

One proposal is to create an agent that infers human values and optimizes for them in expectation. It mixes gathering information about human values with optimizing for them, with the balance depending on how valuable additional information about human values would be. The main existing proposal that falls into this category is the original coherent extrapolated volition idea.
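The trade-off between gathering information and optimizing can be illustrated with a toy value-of-information calculation. This is only a minimal sketch and not part of any existing proposal: the two value hypotheses, the three actions, the utilities, and the assumption that a query reveals the true hypothesis perfectly are all invented for illustration.

```python
# Toy value-of-information calculation. All names and numbers are hypothetical.
# The agent is uncertain which of two hypotheses about human values is correct.
prior = {"H1": 0.6, "H2": 0.4}

# utility[action][hypothesis]: how good each action is if that hypothesis is true.
utility = {
    "a": {"H1": 10.0, "H2": 0.0},
    "b": {"H1": 0.0, "H2": 8.0},
    "c": {"H1": 5.0, "H2": 5.0},
}


def expected_utility(action, belief):
    """Expected utility of one action under the agent's current belief."""
    return sum(belief[h] * utility[action][h] for h in belief)


def best_expected_utility(belief):
    """Expected utility of acting now, without learning anything more."""
    return max(expected_utility(a, belief) for a in utility)


def value_of_perfect_information(belief):
    """Expected gain from learning the true hypothesis before acting."""
    informed = sum(
        belief[h] * max(utility[a][h] for a in utility) for h in belief
    )
    return informed - best_expected_utility(belief)


def should_query(belief, query_cost):
    """Gather information only when it is worth more than it costs."""
    return value_of_perfect_information(belief) > query_cost
```

Under these made-up numbers, the agent would ask about human values when asking is cheap and act on its current best guess when asking is expensive. An agent that is already certain gains nothing from asking.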

Depending on the setup, an agent like this might learn either "more terminal" or "more instrumental" preferences. This is similar to the distinction between short-distance, medium-distance, and long-distance extrapolation in the coherent extrapolated volition document. If the agent is set to learn and optimize for terminal preferences (the kind discussed in the Arbital article on "value"), then this agent is quite brittle: if it learns the wrong preferences, then it will attempt to optimize the universe for these wrong preferences, possibly with catastrophic results. On the other hand, as Paul argues, it looks more tractable for an agent to infer instrumental preferences instead. I hope to clarify the difference between learning terminal and instrumental preferences in a future post.

It is my opinion (and the opinion of everyone at MIRI I have talked to about this) that the process of constructing a value-learning sovereign is probably too error-prone for this to be the first advanced AGI we build. This opinion may change if new information shows that this setup is unexpectedly easy compared to the alternatives.

Reduced-impact genie for concrete tasks

There are many somewhat concrete physical tasks that would be useful to have an AI accomplish. For example, an AI that could efficiently and safely create food would be quite useful. AIs that accomplish these concrete tasks are often called genies.

Since the genie's goal is not our own, it is vital that the genie be both reduced-impact and corrigible in pursuing its goal: it should avoid disturbing the world too much, and it should defer to humans (e.g. by shutting down when humans want it to). Reduced impact and corrigibility appear to be somewhat value-laden concepts, so it may be necessary for the genie to learn human values to a limited extent. Hopefully, the genie will be able to use strategies that are considered reduced-impact across a wide variety of possible human values, so it will be more robust to errors in the value-learning model than the value-learning sovereign would be.
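One way to picture "reduced-impact across a wide variety of possible human values" is as a worst-case filter over candidate value models. The sketch below is a toy illustration under strong assumptions, not a proposed design: the plan names, the three value models `m1` to `m3`, the impact scores, and the task scores are all hypothetical, and it assumes impact can be summarized as a single number per model.

```python
# Toy robust plan selection. All plans, models, and scores are hypothetical.
# impact[plan][value_model]: how disruptive the plan looks under that model
# of human values (higher means more disruptive).
impact = {
    "slow_farm": {"m1": 0.1, "m2": 0.2, "m3": 0.1},
    "nanotech_vat": {"m1": 0.2, "m2": 0.9, "m3": 0.3},
    "do_nothing": {"m1": 0.0, "m2": 0.0, "m3": 0.0},
}

# How well each plan accomplishes the concrete task (e.g. creating food).
task_score = {"slow_farm": 0.7, "nanotech_vat": 1.0, "do_nothing": 0.0}


def worst_case_impact(plan):
    """Impact of a plan under the least favorable value model."""
    return max(impact[plan].values())


def choose_plan(max_impact):
    """Best plan for the task among those every value model rates low-impact."""
    acceptable = [p for p in impact if worst_case_impact(p) <= max_impact]
    # "do_nothing" has zero impact, so this list is non-empty
    # for any non-negative threshold.
    return max(acceptable, key=lambda p: task_score[p])
```

With a threshold of 0.3, the filter rejects `nanotech_vat` because one value model rates it highly disruptive, even though the other two models do not. The plan is rejected without needing to know which model is correct, which is the sense in which this kind of selection tolerates errors in value learning.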

From a strategic perspective, the first powerful AGIs built should help us accomplish a positive pivotal achievement: they should help ensure that humanity survives and eventually takes advantage of the cosmic endowment. Some fairly concrete physical tasks would help with this. For example, if humans already have the source code for a brain emulation but need powerful computers to run it on, then a useful concrete task would be to safely create many powerful computers (e.g. using nanotechnology) and give humans a terminal to access them, so the humans can run brain emulations of AI researchers.

An example of a less useful, but simpler, concrete task is to create lots of diamond. Hopefully, the actual physical tasks we want performed are not much more complicated than maximizing diamond. If so, we can focus attention on this simpler problem.

One advantage of this system is that, for many useful applications, it is not necessary for it to model other minds in great detail (that is, it can be Butlerian). This partially avoids some problems with modelling minds, including operator manipulation, blackmail, the possibility that simulated minds in the model could intentionally change the system's functioning, and mind crime. However, it is not clear how easy it is to create a system that succeeds in modelling complex physical objects but not minds.

Automated philosopher

It would be desirable for an AI to help humans with philosophy. Such an AI could help humans to develop the correct theory for creating value-aligned AIs. This AI should infer things about the operator's mental state and help clarify their beliefs. For example, if humans who believed in Cartesian souls interacted with this AI, it should help them clarify their ideas by introducing them to the concept of a material mind. While this idea is quite vague at the moment, I will present some more concrete models for automated philosophy in a future post.

It is not clear whether or not practical automated philosopher AIs will be "goal-directed" in the same way that the other two AI designs are, but some ideas that I have in this space rely on some form of ontology identification and value learning, and so they will face similar problems to other goal-directed agents.
