Dynamic inconsistency of the inaction and initial state baseline

Planned summary for the Alignment Newsletter:

In a fixed, stationary environment, we would like our agents to be time-consistent: that is, they should not have a positive incentive to restrict their future choices. However, impact measures like <@AUP@>(@Towards a New Impact Measure@) calculate impact by looking at what the agent could have done otherwise. As a result, the agent has an incentive to change what this counterfactual is, in order to reduce the penalty it receives, and it might accomplish this by restricting its future choices. This is demonstrated concretely with a gridworld example.

Planned opinion:

It’s worth noting that measures like AUP do create a Markovian reward function, which typically leads to time-consistent agents. This doesn’t apply here because we’re assuming that the restriction of future choices is “external” to the environment and formalism, but nonetheless affects the penalty. If we instead put this restriction “inside” the environment, then we need to include a state variable specifying whether the action set is restricted. In that case, the impact measure would create a reward function that depends on that state variable. So another way of stating the problem: if you add to the environment the ability to restrict future actions, then the impact penalty leads to a reward function that depends on whether the action set is restricted, which intuitively we don’t want. (This point is also made in this followup post.)

Stuart_Armstrong (5y)

Good, cheers!

TurnTrout (5y)

Nice post! I think this notion of time-inconsistency points to a key problem in impact measurement, and if we could solve it (without backtracking on other problems, like interference/offsetting), we would be a lot closer to dealing with subagent issues.

I think the other baselines can also induce time-inconsistent behavior, for the same reason: if reaching the main goal has a side effect of allowing the agent to better achieve the auxiliary goal (compared to starting state / inaction / stepwise inaction), the agent is willing to pay a small amount to restrict its later capabilities. Sometimes this is even a good thing - the agent might "pay" by increasing its power in a very specialized and narrow manner, instead of gaining power in general, and we want that.

Here are some technical quibbles which don't affect the conclusion (yay).

> If using an inaction rollout of length $l$, just multiply that penalty by $\gamma^{l}$

I don't think so - the inaction rollout formulation (as I think of it) compares the optimal value after taking action $a$ and waiting for $N - 1$ steps, with the optimal value after $N$ steps of waiting. There's no additional discount there.
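In symbols, one way to write this comparison is sketched below. The notation is assumed for illustration and is not taken from the post: $V^{*}$ is the optimal value of the relevant reward, $\varnothing$ is the no-op, and transitions are taken to be deterministic for simplicity.

$$
\text{penalty}(s, a) = \left| V^{*}\!\left(s_{a,\, \varnothing^{N-1}}\right) - V^{*}\!\left(s_{\varnothing^{N}}\right) \right|
$$

Here $s_{a,\, \varnothing^{N-1}}$ is the state reached from $s$ by taking $a$ and then waiting $N - 1$ steps, and $s_{\varnothing^{N}}$ is the state reached by waiting $N$ steps. Both values are evaluated after the same total number of steps, so on this reading no factor of $\gamma^{l}$ appears in front of the difference.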

> Fortunately, when summing up the penalties, you sum terms like $\dots p | \gamma^{n - 1} - \gamma^{n} | + p | \gamma^{n} - \gamma^{n + 1} | \dots$, so a lot of the terms cancel.

Why do the absolute values cancel?

Stuart_Armstrong (5y)

> Why do the absolute values cancel?

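Because $\gamma^{n} > \gamma^{n + 1}$ (the discount factor satisfies $0 < \gamma < 1$), every difference inside the absolute values is positive, so you can remove the absolute values.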
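Spelled out, the cancellation is a telescoping sum. This is a sketch that assumes the same coefficient $p$ on every term and uses hypothetical summation limits $m \le M$, which do not come from the post:

$$
\sum_{n=m}^{M} p \left| \gamma^{n-1} - \gamma^{n} \right|
= p \sum_{n=m}^{M} \left( \gamma^{n-1} - \gamma^{n} \right)
= p \left( \gamma^{m-1} - \gamma^{M} \right)
$$

Each $-\gamma^{n}$ cancels against the $+\gamma^{n}$ at the start of the next term, leaving only the first and last powers of $\gamma$.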

If using an inaction rollout of length $l$, just multiply that penalty by $\gamma^{l}$.
The $\gamma^{n + 3}$ comes from the optimal policy for reaching the red button under this restriction: go to the square above the red button, wait till $S$ is available again, then go $S - S - S$.

AI ALIGNMENT FORUM
