I've shown that, even with simplicity priors, we can't figure out the preferences or rationality of a potentially irrational agent (such as a human H).

But we can get around that issue with 'normative assumptions'. These can allow us to zero in on a 'reasonable' reward function RH.

We should however note that:

Even if RH is highly complex, a normative assumption need not be complex to single it out.

This post gives an example of that for general agents, and discusses how a similar idea might apply to the human situation.

## Formalism

An agent takes actions (A) and gets observations (O), and together these form histories, with H the set of histories (in this section H denotes histories, not the human H above; I won't present all the details of the formalism here). The policies Π={π:H→A} are maps from histories to actions. The reward functions R={R:H→ℝ} are maps from histories to real numbers, and the planners P={p:R→Π} are maps from reward functions to policies.

By observing an agent, we can deduce (part of) their policy π. Then a planner-reward pair (p,R) is compatible with π if p(R)=π. Further observations cannot distinguish between different compatible pairs, since all of them predict exactly the same behaviour.
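To make compatibility concrete, here is a minimal sketch in Python. Everything in it is a toy assumption made for illustration and not part of the formalism above: the two-action set, the encoding of histories as tuples of actions, the greedy planners `rational` and `anti_rational`, and the reward `R_right`. It shows two quite different pairs that both produce the same observed policy.

```python
from typing import Callable, Iterable, Tuple

# Toy encodings (assumptions for illustration only).
History = Tuple[str, ...]
Action = str
Policy = Callable[[History], Action]
Reward = Callable[[History], float]
Planner = Callable[[Reward], Policy]

ACTIONS = ("left", "right")


def rational(R: Reward) -> Policy:
    """Planner that greedily picks the action maximising R on the extended history."""
    return lambda h: max(ACTIONS, key=lambda a: R(h + (a,)))


def anti_rational(R: Reward) -> Policy:
    """Planner that greedily picks the action minimising R on the extended history."""
    return lambda h: min(ACTIONS, key=lambda a: R(h + (a,)))


def R_right(h: History) -> float:
    """Toy reward: +1 for every 'right' in the history."""
    return float(h.count("right"))


def neg(R: Reward) -> Reward:
    """The reward function -R."""
    return lambda h: -R(h)


def compatible(p: Planner, R: Reward, pi: Policy, histories: Iterable[History]) -> bool:
    """Check p(R) = pi, restricted to a finite set of test histories."""
    policy = p(R)
    return all(policy(h) == pi(h) for h in histories)


def observed_pi(h: History) -> Action:
    """The observed policy: the agent always goes right."""
    return "right"


test_histories = [(), ("left",), ("right", "left")]

print(compatible(rational, R_right, observed_pi, test_histories))
print(compatible(anti_rational, neg(R_right), observed_pi, test_histories))
print(compatible(rational, neg(R_right), observed_pi, test_histories))
```

In this toy setting, `(rational, R_right)` and `(anti_rational, neg(R_right))` are both compatible with the observed policy, so no amount of watching the agent go right will tell us which of the two rewards it "really" has.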

Then a normative assumption α is something that distinguishes between compatible pairs. It could be a prior on P×R, or an assumption of full rationality (which removes all but the rational planner from P), or something that takes in more details about the agent or the situation.
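As a rough illustration, the first two kinds of normative assumption can be sketched as functions over a set of pairs that are already known to be compatible. The string labels for planners and rewards, the three example pairs, and the numbers in `toy_prior` are all hypothetical placeholders, chosen only to show the shape of the idea.

```python
# Hypothetical labels for planner-reward pairs assumed to be compatible
# with the same observed policy (placeholders, not derived from anything).
compatible_pairs = [
    ("rational", "R"),
    ("anti_rational", "-R"),
    ("indifferent", "0"),
]


def full_rationality(pairs):
    """Normative assumption: keep only pairs whose planner is the rational one."""
    return [(p, R) for (p, R) in pairs if p == "rational"]


def prior_based(pairs, prior):
    """Normative assumption as a prior on P x R: pick the pair with the highest weight."""
    return max(pairs, key=lambda pair: prior.get(pair, 0.0))


# Made-up prior weights, for illustration only.
toy_prior = {
    ("rational", "R"): 0.6,
    ("anti_rational", "-R"): 0.1,
    ("indifferent", "0"): 0.3,
}

print(full_rationality(compatible_pairs))
print(prior_based(compatible_pairs, toy_prior))
```

Neither function looks at the agent's behaviour: the observations have already done all they can by producing `compatible_pairs`, and the assumption does the rest.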

## Assumptions that use a lot of information

Assume that the agent's algorithm π is written in some code, as Cπ, and that α will have access to this. Then suppose that α scans Cπ, looking for the following: an object CR that takes a history as an input and has a real number as an output; an object Cp that takes CR and a history as inputs, and outputs an action; and a guarantee that Cπ chooses actions by running Cp on CR and the input history.

The α need not be very complex to do that job. Because of Rice's theorem and obfuscated code, it will be impossible for α to check those facts in general. But, for many examples of Cπ, it will be able to check that those things hold. In that case, let α return R; otherwise, let it return the trivial 0 reward.
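A minimal sketch of such an α in Python might look like the following. It sidesteps the hard part: instead of analysing arbitrary code, it only recognises agents built from a hypothetical `TransparentAgent` class, whose structure guarantees by construction that actions come from running the planner on the reward and the history. The class, the `greedy` planner, and `complicated_reward` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

History = Tuple[str, ...]
Reward = Callable[[History], float]


@dataclass(frozen=True)
class TransparentAgent:
    """Toy stand-in for code C_pi that visibly runs C_p on C_R and the history."""
    reward: Reward                              # plays the role of C_R
    planner: Callable[[Reward, History], str]   # plays the role of C_p

    def act(self, history: History) -> str:
        # The "guarantee": actions are chosen by running the planner on the reward.
        return self.planner(self.reward, history)


def zero_reward(history: History) -> float:
    """The trivial 0 reward."""
    return 0.0


def alpha(agent_code: object) -> Reward:
    """Return the agent's reward if its structure is recognised, else the 0 reward."""
    if isinstance(agent_code, TransparentAgent):
        return agent_code.reward
    return zero_reward


def greedy(R: Reward, h: History) -> str:
    """Toy planner: pick the action that maximises R on the extended history."""
    return max(("left", "right"), key=lambda a: R(h + (a,)))


def complicated_reward(h: History) -> float:
    """Stand-in for an arbitrarily complex reward; alpha never inspects its insides."""
    return sum((i + 1) * (1.0 if a == "right" else -0.5) for i, a in enumerate(h))


agent = TransparentAgent(reward=complicated_reward, planner=greedy)
opaque_agent = lambda h: "right"   # same kind of behaviour, but no visible structure

print(alpha(agent) is complicated_reward)
print(alpha(opaque_agent) is zero_reward)
```

Note that `alpha` is a few lines long regardless of how elaborate `complicated_reward` is: it only has to locate the reward, not describe it.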

So, for a large set S of possible algorithms, α can return a reasonable reward function estimate. Even if the complexity of Cπ and R is much, much higher than the complexity of α itself, there are still examples of these where α can successfully identify the reward function.

Of course, if we run α on a human brain, it would return 0. But what I am looking for is not α, but a more complicated αH that, when run on the set SH of human agents, will extract some 'reasonable' RH. It doesn't matter what αH does when run on non-human agents, so we can load it with assumptions about how humans work. When I talk about extracting preferences through looking at internal models, this is the kind of thing I had in mind (along with some method for synthesising those preferences into a coherent whole).

So, though my desired αH might be complex, there is no a priori reason to think that it need be as complex as the RH output.
