The last few posts have motivated an analysis of the human-AI system rather than an AI system in isolation. So far we’ve looked at the notion that the AI system should get feedback from the user and that it could use reward uncertainty for corrigibility. These are focused on the AI system, but what about the human? If we build a system that explicitly solicits feedback from the human, what do we have to say about the human policy, and how the human should provide feedback?

Interpreting human actions

One major free variable in any explicit interaction or feedback mechanism is what semantics the AI system should attach to the human feedback. The classic examples of AI risk are usually described in a way that makes this the central problem: when we provide a reward function that rewards paperclips, the AI system interprets it literally and maximizes paperclips, rather than interpreting it pragmatically as another human would.

(Aside: I suspect this was not the original point of the paperclip maximizer, but it has become a very popular retelling, so I’m using it anyway.)

Modeling this classic example as a human-AI system, we can see that the problem is that the human is offering a form of “feedback”, the reward function, and the AI system is not ascribing the correct semantics to it. The way it uses the reward function implies that the reward function encodes the optimal behavior of the AI system in all possible environments -- a moment’s thought is sufficient to see that this is not actually the case. There will definitely be many cases and environments that the human did not consider when designing the reward function, and we should not expect that the reward function incentivizes the right behavior in those cases.

So what can the AI system assume if the human provides it a reward function? Inverse Reward Design (IRD) offers one answer: the human is likely to provide a particular reward function if it leads to high true utility behavior in the training environment. So, in the boat race example, if we are given the reward “maximize score” on a training environment where this actually leads to winning the race, then “maximize score” and “win the race” are about equally likely reward functions, since they would both lead to the same behavior in the training environment. Once the AI system is deployed on the environment in the blog post, it would notice that the two likely reward functions incentivize very different behavior. At that point, it could get more feedback from humans, or it could do something that is good according to both reward functions. The paper takes the latter approach, using risk-averse planning to optimize the worst-case behavior.
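To make the boat race reasoning concrete, here is a minimal toy sketch. It is not the algorithm from the IRD paper: the two-feature trajectories, the three candidate reward functions, the `beta` rationality parameter, and the plausibility `threshold` are all assumptions I made up for illustration, and the worst case is taken over a small discrete set of rewards rather than computed with the paper's planning method.

```python
import math

# Hypothetical toy world: each trajectory is summarized by two features,
# (points_scored, race_finished). All numbers are made up for illustration.
TRAIN_TRAJECTORIES = {"finish_race": (10.0, 1.0), "idle": (0.0, 0.0)}
TEST_TRAJECTORIES = {
    "finish_race": (10.0, 1.0),
    "loop_for_points": (50.0, 0.0),  # only possible in the deployment environment
    "idle": (0.0, 0.0),
}

# Hypothetical candidate reward functions, as weights on the two features.
CANDIDATE_REWARDS = {
    "maximize_score": (1.0, 0.0),
    "win_race": (0.0, 10.0),
    "avoid_score": (-1.0, 0.0),
}


def reward(weights, features):
    """Linear reward: dot product of weights and trajectory features."""
    return sum(w * f for w, f in zip(weights, features))


def best_trajectory(weights, trajectories):
    """The trajectory an agent picks when it optimizes `weights` literally."""
    return max(trajectories, key=lambda name: reward(weights, trajectories[name]))


def proxy_likelihood(proxy, true, beta=1.0):
    """P(designer writes `proxy` | `true` is the real reward).

    A proxy is likely if optimizing it in the TRAINING environment
    yields behavior with high true reward.
    """
    def score(p):
        induced = best_trajectory(CANDIDATE_REWARDS[p], TRAIN_TRAJECTORIES)
        true_value = reward(CANDIDATE_REWARDS[true], TRAIN_TRAJECTORIES[induced])
        return math.exp(beta * true_value)

    return score(proxy) / sum(score(p) for p in CANDIDATE_REWARDS)


def posterior_over_true_rewards(proxy, beta=1.0):
    """Bayes' rule with a uniform prior over the candidate true rewards."""
    unnorm = {t: proxy_likelihood(proxy, t, beta) for t in CANDIDATE_REWARDS}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}


def risk_averse_plan(posterior, trajectories, threshold=0.01):
    """Pick the trajectory with the best worst-case reward over plausible rewards."""
    plausible = [CANDIDATE_REWARDS[t] for t, p in posterior.items() if p > threshold]
    return max(
        trajectories,
        key=lambda name: min(reward(w, trajectories[name]) for w in plausible),
    )


posterior = posterior_over_true_rewards("maximize_score")
literal_plan = best_trajectory(CANDIDATE_REWARDS["maximize_score"], TEST_TRAJECTORIES)
cautious_plan = risk_averse_plan(posterior, TEST_TRAJECTORIES)
print(posterior, literal_plan, cautious_plan)
```

In this toy setup, "maximize_score" and "win_race" end up with equal posterior weight because they induce the same training behavior, the literal optimizer picks the point-farming loop at deployment, and the risk-averse planner picks finishing the race. Note that a worst-case comparison like this is sensitive to how the candidate rewards are scaled, which is one reason to treat it only as a sketch.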

Similarly, with inverse reinforcement learning (IRL), or learning from preferences, we need to make some sort of assumption about the semantics of the human demonstrations or preferences. A typical assumption is Boltzmann rationality: the human is assumed to choose each action with probability proportional to the exponential of its value, so better actions are taken with higher probability. This effectively models all human biases and suboptimalities as noise. There are papers that account for biases rather than modeling them as noise. A major argument against the feasibility of ambitious value learning is that any assumption we make will be misspecified, and so we cannot infer the “one true utility function”. However, it seems plausible that we could have an assumption that would allow us to learn some values (at least to the level that humans are able to).
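As a rough illustration of what the Boltzmann rationality assumption buys us, here is a minimal sketch of a single Bayesian update over reward hypotheses after observing one human choice. The hypothesis names, the action values, and the `beta` parameter are hypothetical; real IRL algorithms work with sequential decisions and much larger hypothesis spaces.

```python
import math


def boltzmann_policy(action_values, beta=1.0):
    """P(action) proportional to exp(beta * value): better actions are more likely."""
    top = max(action_values.values())  # subtract the max for numerical stability
    unnorm = {a: math.exp(beta * (v - top)) for a, v in action_values.items()}
    z = sum(unnorm.values())
    return {a: u / z for a, u in unnorm.items()}


def update_belief(prior, values_by_hypothesis, observed_action, beta=1.0):
    """Posterior over reward hypotheses after seeing the human pick one action."""
    unnorm = {
        h: prior[h] * boltzmann_policy(values_by_hypothesis[h], beta)[observed_action]
        for h in prior
    }
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}


# Hypothetical example: two guesses about what the human values.
values_by_hypothesis = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}
prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
posterior = update_belief(prior, values_by_hypothesis, observed_action="coffee")
print(posterior)
```

A single observed choice shifts the belief toward "likes_coffee" without making it certain, since under this model the human sometimes picks the worse action. Larger `beta` treats the human as closer to optimal, so each observation counts for more; any systematic bias in the human's choices is simply absorbed into this noise term.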

The human policy

Another important aspect is how the human actually computes feedback. We could imagine training human overseers to provide feedback in the manner that the AI system expects. Currently we “train” AI researchers to provide reward functions that incentivize the right behavior in the AI systems. With IRD, we only need the human to extensively test their reward function in the training environment and make sure the resulting behavior is near optimal, without worrying too much about generalization to other environments. With IRL, the human needs to provide demonstrations that are optimal. And so on.

(Aside: This is very reminiscent of human-computer interaction, and indeed I think a useful frame is to view this as the problem of giving humans better, easier-to-use tools to control the behavior of the AI system. We started with direct programming, then improved upon that to reward functions, and are now trying to improve to comparisons, rankings, and demonstrations.)

We might also want to train humans to give more careful answers than they would have otherwise. For example, it seems really good if our AI systems learn to preserve option value in the face of uncertainty. We might want our overseers to think deeply about potential consequences, be risk-averse in their decision-making, and preserve option value with their choices, so that the AI system learns to do the same. (The details depend strongly on the particular narrow value learning algorithm -- the best human policy for IRL will be very different from the best human policy for CIRL.) We might hope that this requirement only lasts for a short amount of time, after which our AI systems have learnt the relevant concepts sufficiently well that we can be a bit more lax in our feedback.

Learning human reasoning

So far I’ve been analyzing AI systems where the feedback is given explicitly, and there is a dedicated algorithm for handling the feedback. Does the analysis also apply to systems which get feedback implicitly, like iterated amplification and debate?

Well, certainly these methods will need to get feedback somehow, but they may not face the problem of ascribing semantics to the feedback, since they may have learned the semantics implicitly. For example, a sufficiently powerful imitation learning algorithm will be able to do narrow value learning simply because humans are capable of narrow value learning, even though it has no explicit assumption of semantics of the feedback. Instead, it has internalized the semantics that we humans give to other humans’ speech.

Similarly, both iterated amplification and debate inherit the semantics from humans by learning how humans reason. So they do not have the problems listed above. Nevertheless, it probably still is valuable to train humans to be good overseers for other reasons. For example, in debate, the human judges are supposed to say which AI system provided the most true and useful information. It is crucial that the humans judge by this criterion, in order to provide the right incentives for the AI systems in the debate.


If we reify the interaction between the human and the AI system, then the AI system must make some assumption about the meaning of the human’s feedback. The human should also make sure to provide feedback that will be interpreted correctly by the AI system.

This led me to think... why do we even believe that human values are good? Perhaps the typical human behaviour amplified by possibilities of a super-intelligence would actually destroy the universe. I don't personally find this very likely (that's why I never posted it before), but, given that almost all AI safety is built around "how to check that AI's values are convergent with human values" one way or another, perhaps something else should be approached - like remodeling history (actual, human history) from a given starting point (say, Roman Principatus or 1945) with actors assigned values different from human values (but in similar relationship to each other, if applicable) and finding what leads to better results (and, in particular, in us not being destroyed by 2020). All with the usual sandbox precautions, of course.

(Addendum: Of course, pace "fragility of value". We should have some inheritance from metamorals. But we don't actually know how well our morals (and systems in "reliable inheritance" from them) are compatible with our metamorals, especially in an extreme environment such as superintelligence.)

"why do we even believe that human values are good?"

Because they constitute, by definition, our goodness criterion? It's not like we have two separate modules - one for "human values", and one for "is this good?". (ETA: or are you pointing out how our values might shift over time as we reflect on our meta-ethics?)

"Perhaps the typical human behaviour amplified by possibilities of a super-intelligence would actually destroy the universe."

If I understand correctly, this is "are human behaviors catastrophic?" - not "are human values catastrophic?".