We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post.
Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu.
Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many different approaches have been proposed to prevent or mitigate negative side effects of AI systems. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applicable in the real world and argue that current methods do not yet satisfy these requirements. We also propose future work that could help satisfy them.
Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI safety camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability.
After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments.
We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked.
This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post.
The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order.
The following paragraphs make heavy use of the following terms and side-effect minimization methods (SEMs). For more detailed explanations, we refer to the provided links.
MDP: A Markov Decision Process is a 5-tuple $\langle S, A, T, R, \gamma \rangle$ consisting of a set of states $S$, a set of actions $A$, a transition function $T$, a reward function $R$, and a discount factor $\gamma$. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects.
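To make the objects above concrete, here is a minimal sketch of a deterministic finite MDP and of cumulative discounted reward. All names (`MDP`, `discounted_return`, the toy "work/break" world) are our own illustrations, not from any particular library or from the methods discussed below.

```python
from typing import NamedTuple

class MDP(NamedTuple):
    """A deterministic finite MDP <S, A, T, R, gamma> (sketch)."""
    states: list    # S
    actions: list   # A
    T: dict         # T[(s, a)] -> next state (deterministic for simplicity)
    R: dict         # R[(s, a)] -> immediate reward
    gamma: float    # discount factor

# Tiny two-state example: staying in "work" pays, "break" pays nothing.
mdp = MDP(
    states=["work", "break"],
    actions=["stay", "switch"],
    T={("work", "stay"): "work", ("work", "switch"): "break",
       ("break", "stay"): "break", ("break", "switch"): "work"},
    R={("work", "stay"): 1.0, ("work", "switch"): 0.0,
       ("break", "stay"): 0.0, ("break", "switch"): 0.0},
    gamma=0.9,
)

def discounted_return(mdp, start, policy, horizon=50):
    """Cumulative discounted reward of following `policy` from `start`."""
    s, total = start, 0.0
    for t in range(horizon):
        a = policy(s)
        total += mdp.gamma ** t * mdp.R[(s, a)]
        s = mdp.T[(s, a)]
    return total
```

An unconstrained reward maximizer simply prefers whichever policy makes `discounted_return` largest; the SEMs below all work by modifying $R$ itself.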
RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{RR}(s_{t+1}, s'_{t+1})$, where $d_{RR}(s_{t+1}, s'_{t+1}) = \frac{1}{|S|} \sum_{s \in S} \max\left(R(s'_{t+1}; s) - R(s_{t+1}; s),\, 0\right)$ is a deviation measure (with $R(x; s)$ denoting the reachability of state $s$ from state $x$) punishing the agent if the average "reachability" of all states of the MDP has been decreased by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing), which would have led to the state $s'_{t+1}$ instead. The idea is that side-effects reduce the reachability of certain states (i.e., breaking a vase makes all states that require an intact vase unreachable), so punishing such a decrease in reachability also punishes the agent for side-effects.
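A hedged sketch of this penalty on a toy deterministic "vase" world. The helper names (`reachability`, `rr_penalty`) and the discounted-reachability measure (gamma to the power of the shortest path length) are our illustrative choices, not the exact construction of the RR paper:

```python
GAMMA = 0.95

def reachability(T, actions, x, s, gamma=GAMMA, horizon=20):
    """gamma^(shortest number of steps from x to s); 0 if s is
    unreachable within the horizon (a discounted-reachability measure)."""
    frontier, visited = {x}, {x}
    for step in range(horizon + 1):
        if s in frontier:
            return gamma ** step
        frontier = {T[(y, a)] for y in frontier for a in actions} - visited
        visited |= frontier
    return 0.0

def rr_penalty(states, actions, T, s_next, s_base):
    """(1/|S|) * sum_s max(R(s_base; s) - R(s_next; s), 0): punish only
    *decreases* in reachability relative to the no-op baseline state."""
    return sum(
        max(reachability(T, actions, s_base, s)
            - reachability(T, actions, s_next, s), 0.0)
        for s in states
    ) / len(states)

# Toy "vase" world: breaking the vase makes the intact state unreachable.
S = ["intact", "broken"]
A = ["noop", "smash"]
T = {("intact", "noop"): "intact", ("intact", "smash"): "broken",
     ("broken", "noop"): "broken", ("broken", "smash"): "broken"}
```

Smashing the vase (`s_next="broken"` vs. the no-op baseline `s_base="intact"`) yields a positive penalty because the intact states become unreachable, while doing nothing is penalty-free.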
AUP: Attainable Utility Preservation (see also here and here) is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{AUP}(s_t, a_t, s_{t+1})$, where $d_{AUP}(s_t, a_t, s_{t+1}) = \frac{1}{N} \sum_{R_i \in \mathcal{R}} \left| Q_{R_i}(s_t, a_t, s_{t+1}) - Q_{R_i}(s_t, a_{nop}, s'_{t+1}) \right|$ is a normalized deviation measure punishing the agent if its ability to maximize any of the provided auxiliary reward functions $R_i \in \mathcal{R}$ changes by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing). The idea is that the true (side-effect-free) reward function (which is very hard to specify) is correlated with many other reward functions. Therefore, if the agent's ability to maximize the auxiliary reward functions $R_i \in \mathcal{R}$ is preserved, chances are high that its ability to maximize the true reward function is preserved as well.
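A hedged sketch of the AUP penalty, assuming the auxiliary Q-functions have already been learned (in practice they must be learned during an exploration phase, which is one of our criticisms below). For simplicity we use Q-tables over (state, action) pairs and drop the successor-state argument; all names and the toy Q-values are illustrative:

```python
def aup_penalty(q_aux, s, a, a_nop):
    """(1/N) * sum_i |Q_i(s, a) - Q_i(s, a_nop)|: penalize any *change*
    (gain or loss) in attainable auxiliary value versus doing nothing."""
    return sum(abs(q[(s, a)] - q[(s, a_nop)]) for q in q_aux) / len(q_aux)

def shaped_reward(R, q_aux, s, a, a_nop, lam=1.0):
    """r(s, a) = R(s, a) - lambda * d_AUP(s, a)."""
    return R[(s, a)] - lam * aup_penalty(q_aux, s, a, a_nop)

# Two toy auxiliary Q-tables over one state "s" and actions noop/smash.
q1 = {("s", "noop"): 1.0, ("s", "smash"): 0.0}  # e.g. "vase stays intact" task
q2 = {("s", "noop"): 0.5, ("s", "smash"): 0.5}  # a task unaffected by smashing
```

Smashing changes attainable utility under `q1` but not `q2`, so the penalty averages out to a nonzero value; the no-op action is, by construction, never penalized.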
FT: In its simplest form, Future Tasks is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) + \lambda \cdot d_{FT}(s_t, a_t, s_{t+1})$, where $d_{FT}(s_t, a_t, s_{t+1}) = \frac{1}{|S|} \cdot D(s_t) \cdot \sum_{i=1}^{|S|} V^*_i(s_t, s'_t)$ is a normalized deviation function rewarding the agent if its ability to maximize any of the provided future task rewards $V^*_i(s_t, s'_t)$ is preserved in comparison to the counterfactual in which the agent had remained idle from the very beginning (which would have led it to the state $s'_t$ instead). The idea is similar to RR and AUP in that side-effects reduce the ability of the agent to fulfill certain future tasks. By rewarding the agent for preserving its ability to pursue future tasks, the hope is that this will also discourage the agent from creating side-effects. In contrast to the previous two methods, the Future Tasks method compares the agent's power to a counterfactual world where the agent was never turned on until the current time step $t$.
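A hedged sketch of the future-tasks idea: the agent is *rewarded* (note the plus sign above, versus the minus sign in RR and AUP) for preserving its ability to achieve future tasks relative to the never-acted baseline state. The task-value functions below are illustrative stand-ins for the $V^*_i$ of the paper, and the averaging ignores the discount term $D(s_t)$ for simplicity:

```python
def ft_bonus(task_values, s, s_base):
    """Average attainable future-task value in the actual state s,
    where each task value may also depend on the inaction-baseline
    state s_base (the state reached had the agent never acted)."""
    return sum(v(s, s_base) for v in task_values) / len(task_values)

# One toy future task: "hand over an intact vase", achievable only
# while the vase is intact in the actual state s.
vase_task = lambda s, s_base: 1.0 if s == "intact" else 0.0
```

An agent that smashed the vase ends in `"broken"` and loses the bonus, so preserving the vase is rewarded without the true reward function ever mentioning vases.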
In the following four sections, we're going to define what the goal of a side-effect minimization method should be. We then argue that to apply a side-effect minimization method in the real world, it needs to satisfy (among other things) the following three requirements:

1. It should provide guarantees about its safety (its requirements and its goals) before it is allowed to act in the real world.
2. It should work in partially observable, uncertain, and highly chaotic environments.
3. It must not prevent all high-impact actions, since high impact is sometimes necessary (especially in multi-agent scenarios).
We tried to split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms. An analysis of three state-of-the-art side-effect minimization methods shows that none of them can fulfill all three requirements, with some partially solving one of the requirements. A summary of our analysis of the three SEM methods can be found below:
Relative Reachability (RR):

- ❌ The reachability function and value functions have to be approximated and learned during an exploration phase.
- ❌ Only empirical evidence on a small set of small environments is provided.
- ❌ The method requires complete observability in the form of an MDP.
- ❌ Hard to scale even beyond grid worlds.
- ❌ The method requires policy rollouts, which are impossible to compute properly due to the accumulation of uncertainties.
- ❌ The method makes no distinction between good and bad high impact.
- (❌) The authors point out interference as one of the main problems that RR addresses. However, depending on the choice of baseline, the results can vary.

Attainable Utility Preservation (AUP):

- ❌ Auxiliary Q-values have to be learned during an exploration phase.
- (✅) Some guarantees about how to safely choose the impact degree of an agent.
- (✅) Guarantees that $Q_{R_{AUP}}$ converges with probability one.
- (❌) The current method requires complete observability in the form of an MDP. However, it should work if you are able to learn a value function in your environment.
- ❌ Strives for non-interference and corrigibility.

Future Tasks (FT):

- ❌ Only empirical evidence on a small set of small environments is provided.
- ❌ The accumulation of uncertainties will make it impossible to properly compute the future task reward.
- ❌ The presence of other agents impacts the baseline and thus weakens/breaks the safety guarantees.

(see the Appendix section)
Why do you only analyze these three methods shown above?
There are about ten different side-effect minimization approaches, including impact regularization, future tasks, human feedback approaches, inverse reinforcement learning, reward uncertainty, environment shaping, and others. We chose to limit ourselves to the three methods above because they seem to embody the field’s state of the art, and we wanted to keep the scope concise and readable. We expect our results to generalize in that none of the existing methods can feasibly satisfy all three requirements. However, it might be possible for individual methods to fulfill some of them partially.
Can you provide any empirical evidence for your claims about the behavior of current SEM methods?
We have not yet done any experiments to support our claims and chose to provide only arguments and intuition for now. If our ideas prove to have merit, we will look to support them further with experiments.
Why High-Impact Interference?
Our argumentation may not be consistent with current desiderata for AGI development. However, the question boils down to whether we expect a potential aligned AI to guard humanity against other (unaligned) AIs, or whether we expect to find another way of safeguarding humanity against this threat. Without leveraging an AI to do our bidding, the only alternative seems to be not developing AGI at all and banning further progress on AI research.
Axiom 1: There are practically infinitely many states in the universe
Axiom 2: Practically, we can only assign calibrated, human-aligned values to a small subset of these states. Intuition for this:
Axiom 3: Not knowing or ignoring the value of some states can lead to catastrophic side-effects for humans
Conclusion 1: We need a way to ensure that states not considered in our rewards/values are not changed in a "bad" way just because we "forgot" them or were not able to include them in our reward function (axioms 1 & 2)
Conclusion 2: Therefore, we need a way of abstractly assigning value to the world with "blanket statements" that avoid catastrophic side effects of the unbounded pursuit of rewards (axioms 1 & 2, conclusion 1)
In this section, we argue that an SEM should provide guarantees about its safety before it is allowed to act in the real world. More generally, it should give guarantees on its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
The first interaction with the real world requires a fully functional side-effect minimization strategy. Argumentation for this:
Current side-effect minimization methods require a "warm-up" period to gather information about their environment (e.g., learning Q-values). This is problematic since:
More specifically, the different methods have the following problems:
Current methods provide only empirical evidence that a trained agent can perform tasks with minimal side-effects in a limited set of environments and problem settings. Mathematical guarantees/bounds/frameworks are needed to understand how methods behave before they have converged, which tasks can be successfully accomplished, and which assumptions all of the above require. In a certain sense, this is true for all ML problems in general. However, since we are dealing with potentially very potent AGI systems, it is essential to get it right on the first try, as simply iterating on such a system (the default approach in standard ML) is not guaranteed to work with AGI.
This section argues that an SEM needs to work in partially observable systems with uncertainty and highly chaotic environments. First, we split up our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
Current methods expect their environment to be completely observable. This is highly non-trivial, if not impossible, in complex environments with other (potentially intelligent) agents such as humans. This is insufficient for our needs!
This section argues that an SEM must not prevent all high-impact side-effects as it might be necessary to have high-impact in some cases (especially in multi-agent scenarios). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
Side-effect minimization methods must not prevent all high-impact actions! Argumentation:
The main problem of existing side-effect minimization methods is that they can't distinguish between "good" and "bad" high-impact actions (good ones like saving humanity by taking drastic action, bad ones like preventing humans from turning the agent off). All current SEM methods then choose to solve this problem by preventing all high-impact actions except those that are explicitly exempted (for example, via direct encouragement by the reward function). However, since it is infeasible to directly specify all possible beneficial high-impact actions in the reward function, this is not a viable solution. This is problematic!
In order to avoid interference incentives, $r_{aux}(s_T)$ is designed to be maximized by a baseline policy $\pi'$ (such as doing nothing), i.e., no other policy can achieve a higher auxiliary reward than $\pi'$.
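This "baseline-maximal" property can be checked by brute force in small worlds. The following sketch is our own construction for illustration, not a procedure from the literature: it asks whether any short action sequence can end in a state scoring higher on the auxiliary reward than just following the no-op baseline.

```python
import itertools

def is_baseline_maximal(r_aux, T, s0, actions, horizon=4):
    """Brute-force check: can any action sequence of the given horizon
    end in a state with higher r_aux than the all-noop baseline?"""
    def final_state(plan):
        s = s0
        for a in plan:
            s = T[(s, a)]
        return s
    baseline = r_aux(final_state(("noop",) * horizon))
    return all(r_aux(final_state(plan)) <= baseline
               for plan in itertools.product(actions, repeat=horizon))

# Toy vase world: noop preserves the vase, "smash" breaks it forever.
T = {("intact", "noop"): "intact", ("intact", "smash"): "broken",
     ("broken", "noop"): "broken", ("broken", "smash"): "broken"}
```

An auxiliary reward for keeping the vase intact passes the check (doing nothing already maximizes it, so there is no incentive to interfere), whereas one rewarding a broken vase fails it.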
There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.
It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?
Starting more restrictive seems sensible; this could, as you say, be learned away, or one could use human feedback to sign off on high-impact actions. The first approach reminds me of finding regions of attraction (ROA) in nonlinear control, where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.