I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities:

1. The impact penalty is inconsistent because it includes an optimisation process over the agent's possible policies (e.g. when defining the Q-values in attainable utility preservation).

2. The impact penalty is inconsistent because of how it's defined at each step (e.g. because the stepwise inaction baseline is reset every turn).

It turns out the first answer is the correct one. And indeed, we get:

If the impact penalty is not defined in terms of optimising over the agent's actions or policies, then it is kinda time-consistent.

What is the "kinda" doing there? Well, as we'll see, there is a subtle semantics vs syntax issue going on.

## Time-consistent rewards

In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.

For the initial state and the initial state inaction baselines, the state $s'_t$ is determined independently of anything the agent has actually done. So these baselines are given by a function $f$:

$$f(\mu, A, s_t, s'_t).$$

Here, $\mu$ is the environment and $A$ is the set of actions available to the agent. Since $s'_t$ is fixed, we can re-write this as:

$$f_{s'_t}(\mu, A, s_t).$$

Now, if the impact measure is a function of $s_t$ and $\mu$ only (that is, if it does not depend on $A$), then it is... a reward function, with $R(s_t) = f_{s'_t}(\mu, s_t)$. Thus, since this is just a reward function, the agent is time-consistent.
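To make this concrete, here is a minimal sketch of an impact measure that never refers to the action set. Everything in it is an illustrative assumption rather than something from the post: the grid-position states, the Manhattan-distance `impact` function, and the `make_reward` helper. The environment is left implicit, since this toy measure does not need it.

```python
from typing import Callable, Tuple

State = Tuple[int, int]  # hypothetical state: an (x, y) grid position


def impact(baseline: State, state: State) -> float:
    """Toy action-independent impact measure: Manhattan distance to the baseline."""
    return float(abs(state[0] - baseline[0]) + abs(state[1] - baseline[1]))


def make_reward(baseline: State) -> Callable[[State], float]:
    """Fix the counterfactual baseline state, leaving a plain reward R(s_t)."""

    def reward(state: State) -> float:
        return -impact(baseline, state)

    return reward


# The baseline is chosen once, independently of anything the agent does.
R = make_reward((0, 0))
print(R((2, -1)))
```

Once the baseline is fixed, `R` takes only the current state as input, so it is an ordinary reward function and the usual time-consistency argument for reward maximisers applies.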

Now let's look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout from the prior state $s_{t-1}$. So the impact measure is actually a function of:

$$f(\mu, A, s_t, s_{t-1}).$$

Again, if $f$ is in fact independent of $A$, the set of the agent's actions (including for the rollouts from $s_{t-1}$), then this is a reward function: one that is a function of the previous state and the current state, but that's quite common for reward functions.

So again, the agent has no interest in constraining its own future actions.
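The stepwise case can be sketched the same way. The one-dimensional world, the `mu` transition function and the distance-based penalty below are assumptions made for illustration. The point is only that the resulting reward takes the previous and current states as input, and never the action set.

```python
NOOP, NORTH, SOUTH = "noop", "north", "south"
STEP = {NOOP: 0, NORTH: 1, SOUTH: -1}


def mu(state: int, action: str) -> int:
    """Hypothetical deterministic environment: a position on a line."""
    return state + STEP[action]


def stepwise_penalty(prev_state: int, state: int) -> float:
    """Distance between s_t and the one-step inaction rollout from s_{t-1}."""
    baseline = mu(prev_state, NOOP)  # what would have happened under inaction
    return float(abs(state - baseline))


def reward(prev_state: int, state: int) -> float:
    """A reward on (previous state, current state) pairs; no action set appears."""
    return -stepwise_penalty(prev_state, state)


print(reward(3, mu(3, SOUTH)))  # the agent moved away from the baseline
print(reward(3, mu(3, NOOP)))   # the agent matched the baseline
```

Because the inaction rollout only ever uses the no-op action, removing other actions from the agent's repertoire cannot change the value of `reward` for any given pair of states.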

## Semantics vs syntax

Back to "kinda". The problem is that we've been assuming that actions and states are very distinct objects. Suppose that, as in the previous post, an agent at time $t-1$ wants to prevent itself from taking action $S$ (go south) at time $t$. Let $A$ be the agent's full set of actions, and $A_{-S}$ the same set without $S$.

So now the agent might be time-inconsistent, since it's possible that:

$$f(\mu, A, s_t, s_{t-1}) \neq f(\mu, A_{-S}, s_t, s_{t-1}).$$
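As an illustration of how such an inequality can arise, here is a toy penalty that does optimise over the action set, loosely in the spirit of attainable utility preservation. It is a simplified one-step sketch, not the actual AUP definition, and the auxiliary utility, the line world and all names are assumptions.

```python
NOOP, NORTH, SOUTH = "noop", "north", "south"
STEP = {NOOP: 0, NORTH: 1, SOUTH: -1}
FULL = (NOOP, NORTH, SOUTH)  # the full action set A
NO_SOUTH = (NOOP, NORTH)     # the same set without S


def mu(state: int, action: str) -> int:
    """Hypothetical deterministic environment: a position on a line."""
    return state + STEP[action]


def aux_utility(state: int) -> float:
    """Hypothetical auxiliary goal: be at position 0 or further south."""
    return 1.0 if state <= 0 else 0.0


def attainable(actions, state: int) -> float:
    """One-step attainable utility: an optimisation over the action set."""
    return max(aux_utility(mu(state, a)) for a in actions)


def penalty(actions, state: int, prev_state: int) -> float:
    """Change in attainable utility relative to the stepwise inaction baseline."""
    baseline = mu(prev_state, NOOP)
    return abs(attainable(actions, state) - attainable(actions, baseline))


# Same environment and same pair of states; only the action set differs.
print(penalty(FULL, 1, 2), penalty(NO_SOUTH, 1, 2))
```

With the full action set, moving from position 2 to position 1 brings the auxiliary goal within reach, so the penalty is positive. With south removed, the goal is unreachable from either state, and the penalty vanishes.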

But now, instead of denoting "can't go south" by reducing the action set, we could instead denote it by expanding the state set. So define $s_t^{-S}$ as the same state as $s_t$, except that taking the action $S$ is the same as taking the action $\emptyset$. Everything is (technically) independent of $A$, so the agent is "time-consistent".

But, of course, the two setups, restricted action set or extended state set, are almost completely isomorphic, even though, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self; instead, it would just put its future self in a state where some actions were in practice unobtainable.
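The near-isomorphism can be checked directly in a toy model. In the sketch below, which again assumes a line world and hypothetical names, the restriction is encoded either by dropping south from the action set or by adding a "south blocked" flag to the state, under which south behaves like the no-op.

```python
NOOP, NORTH, SOUTH = "noop", "north", "south"
STEP = {NOOP: 0, NORTH: 1, SOUTH: -1}
FULL = (NOOP, NORTH, SOUTH)
NO_SOUTH = (NOOP, NORTH)


def mu(pos: int, action: str) -> int:
    """Hypothetical base environment: a position on a line."""
    return pos + STEP[action]


def mu_ext(state, action: str):
    """Extended environment: the state carries a flag that turns south into a no-op."""
    pos, south_blocked = state
    if south_blocked and action == SOUTH:
        action = NOOP
    return (mu(pos, action), south_blocked)


def reachable_restricted(pos: int) -> set:
    """Successor positions when south is removed from the action set."""
    return {mu(pos, a) for a in NO_SOUTH}


def reachable_extended(pos: int) -> set:
    """Successor positions with the full action set, from the flagged state."""
    return {mu_ext((pos, True), a)[0] for a in FULL}


print(all(reachable_restricted(p) == reachable_extended(p) for p in range(-3, 4)))
```

In both encodings the agent's future self can reach exactly the same positions. The only difference is whether the restriction is written down as a property of the action set or as a property of the state.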

So it seems that, unfortunately, it's not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.
