Previously: Attainable Utility Preservation: Empirical Results; summarized in AN #105

Our most recent AUP paper was accepted to NeurIPS 2020 as a spotlight presentation:
Reward function specification can be difficult, even in simple environments. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoided side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway’s Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead while leading the agent to complete the specified task and avoid side effects.
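As a rough illustration of the idea in the abstract, here is a minimal sketch of a reward that subtracts a penalty for shifts in the agent's ability to achieve auxiliary goals. Everything named here is an assumption made for illustration and is not taken from the paper: the function `aup_reward`, the penalty weight `lam`, the choice to average absolute shifts, and the choice to measure shifts against a do-nothing baseline. The attainable utilities are assumed to be given as plain lists of numbers, one entry per auxiliary reward function.

```python
def aup_reward(task_reward, q_aux_action, q_aux_noop, lam=0.01):
    """Task reward minus a scaled penalty for shifts in auxiliary attainable utility.

    All names and the exact scaling are hypothetical (illustration only).
    q_aux_action[i]: value attainable for auxiliary goal i after the chosen action.
    q_aux_noop[i]:   value attainable for auxiliary goal i after doing nothing.
    """
    if not q_aux_action or len(q_aux_action) != len(q_aux_noop):
        raise ValueError("need one matching pair of values per auxiliary goal")

    # Absolute change in the ability to achieve each auxiliary goal.
    shifts = [abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop)]

    # Average over the auxiliary goals (a single goal is the one-element case).
    penalty = sum(shifts) / len(shifts)

    return task_reward - lam * penalty


# One auxiliary goal: the action lowers attainable value from 0.75 to 0.25.
print(aup_reward(1.0, [0.25], [0.75], lam=1.0))
```

With a single auxiliary goal, as in the scaled-up setting the abstract describes, the lists have one element each. An action that leaves every auxiliary value unchanged incurs no penalty, so the agent simply receives the task reward.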
Here are some slides from our spotlight talk (publicly available; it...
Reframing Impact has focused on supplying the right intuitions and framing. Now we can see how these intuitions about power and the AU landscape both predict and explain AUP's empirical success thus far.
Let's start with the known and the easy: avoiding side effects[1] in the small AI safety gridworlds (for the full writeup on these experiments, see Conservative Agency). The point isn't to get too into the weeds, but rather to see how the weeds still add up to the normalcy predicted by our AU landscape reasoning.
In the following MDP levels, the agent can move in the cardinal directions or do nothing. We give the agent a reward function which partially encodes what we want, and also an auxiliary reward function ...