A simple way of thinking that I feel clarifies a lot of issues (related to Blue Minimising Robot):

Suppose you have an entity H that follows algorithm $al_H$. Then define:

What H does is its actions/outputs in the environment.

What H is is $al_H$.

What H wants is an interpretation of what H does (and possibly what it is), in order to construct a utility function or reward function corresponding with its preferences.

The interpretation part of "wants" is crucial, but it is often obscured in practice in value learning. That's because we often start with things like "H is a boundedly rational agent that maximises u...", or we lay out the agent in such a way that that's clearly the case.

What we're doing there is writing the entity as $al_H(u)$: an algorithm with a special variable $u$ that tracks what the entity wants. In the case of cooperative inverse reinforcement learning, this is explicit, as the human's values are given by a $\theta$, known to the human. Thus the human's true algorithm is $al_H(\cdot)$; the human observes $\theta$, meaning that $\theta$ is an objective fact about the universe, and then follows $al_H(\theta)$.
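A minimal sketch of this one-variable view, under some loud assumptions: the Python name `al_H`, the toy drink options, and the choice of exact maximisation (rather than bounded rationality) are all illustrative and not from the original setup. The point is only that the algorithm takes the preference variable as an explicit argument, and that fixing it yields the joint algorithm the human actually runs.

```python
from typing import Callable, Dict

Action = str
Utility = Dict[Action, float]   # hypothetical: utility assigned directly to actions
Policy = Callable[[], Action]   # a zero-variable algorithm: no free preference slot


def al_H(u: Utility) -> Policy:
    """One-variable algorithm: takes a utility u, returns the behaviour it induces."""
    def policy() -> Action:
        # Assumption for simplicity: the agent is an exact maximiser of u.
        return max(u, key=u.get)
    return policy


# theta is the value of the special variable that the human observes.
theta: Utility = {"tea": 1.0, "coffee": 3.0, "water": 0.5}

# Plugging theta in gives the joint algorithm the human follows.
human = al_H(theta)
print(human())  # coffee
```

Here `al_H` is the object with an explicit preference slot, while `human` is what an outside observer actually gets to watch.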

Note here that knowing what the human is in the one-variable sense (i.e. knowing $al_H(\cdot)$) helps with the correct deduction about what they want, while simply knowing the joint $al_H(\theta)$ does not.
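To illustrate the first half of that claim, here is a hedged sketch in which the observer already knows the one-variable algorithm and only has to find the value of its variable. The finite candidate set, the menus, and the exact-maximiser form of `al_H` are assumptions made for the example; with a known parametric form, observed choices can rule out candidate utilities.

```python
from typing import Callable, Dict, List, Tuple

Menu = Tuple[str, ...]
Utility = Dict[str, float]
Observation = Tuple[Menu, str]  # (options offered, option chosen)


def al_H(theta: Utility) -> Callable[[Menu], str]:
    """Known one-variable algorithm (assumed: exact maximisation over the menu)."""
    def policy(menu: Menu) -> str:
        return max(menu, key=lambda option: theta[option])
    return policy


def consistent_thetas(candidates: List[Utility],
                      observations: List[Observation]) -> List[Utility]:
    """Keep the candidates whose induced behaviour matches every observation."""
    return [
        theta for theta in candidates
        if all(al_H(theta)(menu) == choice for menu, choice in observations)
    ]


# Hypothetical candidate utilities; the second one is the human's actual theta.
candidates: List[Utility] = [
    {"tea": 3.0, "coffee": 2.0, "water": 1.0},
    {"tea": 1.0, "coffee": 3.0, "water": 2.0},
    {"tea": 2.0, "coffee": 3.0, "water": 1.0},
]
true_theta = candidates[1]
human = al_H(true_theta)

menus: List[Menu] = [("tea", "coffee"), ("tea", "water")]
observations = [(menu, human(menu)) for menu in menus]

remaining = consistent_thetas(candidates, observations)
print(remaining == [true_theta])  # True
```

The elimination step works only because `al_H` is given as a function of `theta`; the observations alone carry no such structure.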

In contrast, an interpretation starts with a zero-variable algorithm and attempts to construct a one-variable one. Therefore, given $al_H$, it constructs (one or more) $al_H^i(\cdot)$ and $u_i$ such that

$$al_H = al_H^i(u_i).$$

This illustrates the crucial role of interpretation, especially if $al_H$ is highly complex.
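As a toy illustration of why the choice of interpretation matters, the sketch below takes one fixed zero-variable behaviour and decomposes it in two different ways, each pairing a one-variable algorithm with its own utility. Everything named here (`maximiser`, `minimiser`, `u_1`, `u_2`, the drink options) is a hypothetical construction for the example, and equality is only checked behaviourally on the listed menus.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Menu = Tuple[str, ...]
Utility = Dict[str, float]
Algorithm = Callable[[Menu], str]

OPTIONS = ("coffee", "water", "tea")


def al_H(menu: Menu) -> str:
    """Zero-variable algorithm: fixed behaviour with no preference parameter."""
    # Picks whichever offered option comes first in this hard-coded order.
    return min(menu, key=OPTIONS.index)


def maximiser(u: Utility) -> Algorithm:
    """Interpretation 1: the entity picks the option with the highest u."""
    return lambda menu: max(menu, key=lambda option: u[option])


def minimiser(u: Utility) -> Algorithm:
    """Interpretation 2: the entity picks the option with the lowest u."""
    return lambda menu: min(menu, key=lambda option: u[option])


u_1: Utility = {"coffee": 3.0, "water": 2.0, "tea": 1.0}
u_2: Utility = {option: -value for option, value in u_1.items()}

# Every non-empty menu that can be built from the three options.
menus: List[Menu] = [
    menu for size in range(1, len(OPTIONS) + 1)
    for menu in combinations(OPTIONS, size)
]

# Both decompositions reproduce al_H on every menu, with opposite utilities.
agree = all(
    al_H(menu) == maximiser(u_1)(menu) == minimiser(u_2)(menu)
    for menu in menus
)
print(agree)  # True
```

The two interpretations attribute exactly opposite preferences to the same behaviour, so the behaviour alone cannot settle which utility the entity "wants".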

A putative new idea for AI control; index here.