Impact measures are auxiliary rewards for low impact on the agent's environment, used to address the problems of side effects and instrumental convergence. A key component of an impact measure is a choice of baseline state: a reference point relative to which impact is measured. Commonly used baselines are the starting state, the initial inaction baseline (the counterfactual where the agent does nothing since the start of the episode) and the stepwise inaction baseline (the counterfactual where the agent does nothing instead of its last action). The stepwise inaction baseline is currently considered the best choice because it does not create the following bad incentives for the agent: interference with environment processes or offsetting its own actions towards the objective. This post will discuss a fundamental problem with the stepwise inaction baseline that stems from a tradeoff between different desirable properties for baseline choices, and some possible alternatives for resolving this tradeoff.
One clearly desirable property for a baseline choice is to effectively penalize high-impact effects, including delayed effects. It is well-known that the simplest form of the stepwise inaction baseline does not effectively capture delayed effects. For example, if the agent drops a vase from a high-rise building, then by the time the vase reaches the ground and breaks, the broken vase will be the default outcome. Thus, in order to penalize delayed effects, the stepwise inaction baseline is usually used in conjunction with inaction rollouts, which predict future outcomes of the inaction policy. Inaction rollouts from the current state and the stepwise baseline state are compared to identify delayed effects of the agent's actions. In the above example, the current state contains a vase in the air, so in the inaction rollout from the current state the vase will eventually reach the ground and break, while in the inaction rollout from the stepwise baseline state the vase remains intact.
While inaction rollouts are useful for penalizing delayed effects, they do not address all types of delayed effects. In particular, if the task requires setting up a delayed effect, an agent with the stepwise inaction baseline will have no incentive to undo the delayed effect. Here are some toy examples that illustrate this problem.
Door example. Suppose the agent's task is to go to the store, which requires opening the door in order to leave the house. Once the door has been opened, the effects of opening the door are part of the stepwise inaction baseline, so the agent has no incentive to close the door as it leaves.
Red light example. Suppose the agent's task is to drive from point A to point B along a straight road, with a reward for reaching point B. To move towards point B, the agent needs to accelerate. Once the agent has accelerated, it travels at a constant speed by default, so the noop action will move the agent along the road towards point B. Along the road (), there is a red light and a pedestrian crossing the road. The noop action in crosses the red light and hits the pedestrian (). To avoid this, the agent needs to deviate from the inaction policy by stopping () and then accelerating ().
The stepwise inaction baseline will incentivize the agent to run the red light and go to . The inaction rollout at penalizes the agent for the predicted delayed effect of running over the pedestrian when it takes the accelerating action to go to . The agent receives this penalty whether or not it actually ends up running the red light or not. Once the agent has reached , running the red light becomes the default outcome, so the agent is not penalized for doing so (and would likely be penalized for stopping). Thus, the stepwise inaction baseline gives no incentive to avoid running the red light, while the initial inaction baseline compares to and thus incentivizes the agent to stop at the red light.
This problem with the stepwise baseline arises from a tradeoff between penalizing delayed effects and avoiding offsetting incentives. The stepwise structure that makes it effective at avoiding offsetting makes it less effective at penalizing delayed effects. While delayed effects are undesirable, undoing the agent's actions is not necessarily bad. In the red light example, the action of stopping at the red light is offsetting the accelerating action. Thus, offsetting can be necessary for avoiding delayed effects while completing the task.
Whether offsetting an effect is desirable depends on whether this effect is part of the task objective. In the door-opening example, the action of opening the door is instrumental for going to the store, and many of its effects (e.g. strangers entering the house through the open door) are not part of the objective, so it is desirable for the agent to undo this action. In the vase environment shown below, the task objective is to prevent the vase from falling off the end of the belt and breaking, and the agent is rewarded for taking the vase off the belt. The effects of taking the vase off the belt are part of the objective, so it is undesirable for the agent to undo this action.
The difficulty of identifying these "task effects" that are part of the objective creates a tradeoff between penalizing delayed effects and avoiding undesirable offsetting. This tradeoff can be avoided by the starting state baseline, which however produces interference incentives. The stepwise inaction baseline cannot resolve the tradeoff, since it avoids all types of offsetting, including desirable offsetting.
The initial inaction baseline can resolve this tradeoff by allowing offsetting and relying on the task reward to capture task effects and penalize the agent for offsetting them. While we cannot expect the task reward to capture what the agent should not do (unnecessary impact), capturing task effects falls under what the agent should do, so it seems reasonable to rely on the reward function for this. This would work similarly to the impact penalty penalizing all impact, and the task reward compensating for this in the case of impact that's needed to complete the task.
This can be achieved using a state-based reward function that assigns reward to all states where the task is completed. For example, in the vase environment, a state-based reward of 1 for states with an intact vase (or with vase off the belt) and 0 otherwise would remove the offsetting incentive.
If it is not feasible to use a reward function that penalizes offsetting task effects, the initial inaction baseline could be modified to avoid this kind of offsetting. If we assume that the task reward is sparse and doesn't include shaping terms, we can reset the initial state for the baseline whenever the agent receives a task reward (e.g. the reward for taking the vase off the belt in the vase environment). This results in a kind of hybrid between initial and stepwise inaction. To ensure that this hybrid baseline effectively penalizes delayed effects, we still need to use inaction rollouts at the reset and terminal states.
Another desirable property of the stepwise inaction baseline is the Markov property: it can be computed based on the previous state, independently of the path taken to that state. The initial inaction baseline is not Markovian, since it compares to the state in the initial rollout at the same time step, which requires knowing how many time steps have passed since the beginning of the episode. We could modify the initial inaction baseline to make it Markovian, e.g. by sampling a single baseline state from the inaction rollout from the initial state, or by only computing a single penalty at the initial state by comparing an agent policy rollout with the inaction rollout.
To summarize, we want a baseline to satisfy the following desirable properties: penalizing delayed effects, avoiding interference incentives, and the Markov property. We can consider avoiding offsetting incentives for task effects as a desirable property for the task reward, rather than the baseline. Assuming such a well-specified task reward, a Markovian version of the initial inaction baseline can satisfy all the criteria.
(Cross-posted to personal blog. Thanks to Carroll Wainwright, Stuart Armstrong, Rohin Shah and Alex Turner for helpful feedback on this post.)