Learning doesn't solve philosophy of ethics

Stuart_Armstrong

A putative new idea for AI control; index here.

This post will use the formalism of an earlier post to illustrate some well-known philosophical thought experiments and show why learning algorithms are not sufficient to solve them.

Examples

Death and life extension

A human consists of two agents, $A_{100}$ and $A_{1}$ . The agent $A_{100}$ is a long-term agent; it has the preference that the human not live longer than a century. The agent $A_{1}$ is a short-term agent; it prefers that the human survive for the coming year.

The human's meta-preferences $M$ are that $A_{100}$ and $A_{1}$ eventually be brought into compatibility with each other.

By observation and prediction, the AI knows that, under the normal course of events, $A_{1}$ will never sync with $A_{100}$ : the human will continue to believe that it shouldn't live another hundred years, but will never want to die that year.

The AI can trigger human introspection $I$ in two ways: the first will remove the long-term death preference in $A_{100}$ ; the second will remove the short-term death-avoidance in $A_{1}$ at some later point, so that the human will act consistently with its current $A_{100}$ (and thus die within the century).

Just based on this information, what are the human's preferences?
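To make the difficulty concrete, here is a minimal toy sketch of the setup above. It is not the formalism of the original post: the class `Preferences`, the function `satisfies_M`, and the two introspection functions are hypothetical names introduced purely for illustration, and the encoding of "compatibility" (the two agents conflict exactly when both preferences are present) is a simplifying assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Preferences:
    """Toy encoding of the two sub-agents (an assumption, not the post's formalism)."""
    die_within_century: bool  # long-term agent A_100
    survive_this_year: bool   # short-term agent A_1


def satisfies_M(p: Preferences) -> bool:
    # Meta-preference M: A_100 and A_1 are compatible.
    # Assumed toy rule: they conflict exactly when both preferences are held.
    return not (p.die_within_century and p.survive_this_year)


def introspection_1(p: Preferences) -> Preferences:
    # First way of triggering I: removes the long-term death preference.
    return replace(p, die_within_century=False)


def introspection_2(p: Preferences) -> Preferences:
    # Second way of triggering I: removes the short-term death-avoidance.
    return replace(p, survive_this_year=False)


current = Preferences(die_within_century=True, survive_this_year=True)
outcomes = {f.__name__: f(current) for f in (introspection_1, introspection_2)}

for name, prefs in outcomes.items():
    print(name, prefs, "satisfies M:", satisfies_M(prefs))
```

In this toy model, both introspection paths end in a state that satisfies $M$ , yet the two states are opposites. Nothing in the observed data ( $A_{100}$ , $A_{1}$ , $M$ ) picks one over the other; the choice is made by whichever path the AI triggers.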

Total utilitarianism

The human has the preference $r$ that humans not be reduced to a large population of barely-happy individuals. They also have the meta-preference $M$ that individual utility be additive.

The AI can trigger the human's awareness of the repugnant conclusion. It can do this in a differential or integral fashion, which will cause the human either to reject their current $r$ (and embrace the repugnant conclusion) or to reject $M$ (and reject the repugnant conclusion).

Just based on this information, what are the human's preferences?

The malarial drowning child

Peter Singer has an argument about a drowning child and our duty to them.

To model that contradiction in a human, let $r$ contain the preference to save a drowning child in front of them, and a preference not to send money to distant people dying of malaria. Let $M$ contain the desire that the human's preferences not differ across different ways of dying or across physical distance.

As before, the right presentation on the AI's part, within the usual bounds of how humans reason, can cause the human to emphasise their preferences or their meta-preferences.

Balconies with a view

The human is modelled as two agents $A_{1}$ (basically system 1) and $A_{2}$ (system 2).

The human travels a lot, and likes to go out on the balcony to look at various views. They have an instinctive fear ( $A_{1}$ ) of falling, but typically override this with reason ( $A_{2}$ ). However, $A_{1}$ 's fear varies in intensity: it wants to avoid wooden balconies with a (consciously imperceptible) faint smell of rot, and it also wants to avoid balconies around sunset.

Given that faint rot increases danger and sunsets don't, what are we to make of this agent's true preferences?

The big question: what's tolerable?

Now, the first three examples illustrate big differences in outcomes: the difference between being a total utilitarian and not is non-trivial, wanting life extension technology or not could make a huge difference in outcome, and so on.

However, all are within the scope of "tolerable outcomes", very broadly defined. None result in optimisation of the universe for money or paperclips or immediate human extinction. We could extend the models to get those situations (eg by having some of these agents in a position to make long term or large impact decisions).

But the key question remains: if we add more details to the model of human rationality, along with some principles for resolving these types of conflicts (principles which the AI can't simply "learn"), we will still likely end up with the AI's computed reward function being something unpredictable within a large class of functions. However, can we ensure that it's "tolerable", or does anything less than perfect modelling of human irrationality result in a disastrously optimised outcome?

How approximately can we input human irrationalities into a learning AI?