This is my post.
I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!
If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today. I currently think that power-seeking behavior is the worst part of goal-directed agency, incentivizing things like self-preservation and taking-over-the-planet. Unless we assume an "evil" utility function, an agent only seems incentivized to hurt us in order to become more able to achieve its own goals. But... what if the agent's own utility function penalizes it for seeking power? What happens if the agent maximizes a utility function while penalizing itself for becoming more able to maximize that utility function?
This doesn't require knowing anything about human values in particular, nor do we need to pick out privileged parts o
...There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:
Baseline
Core part of deviation measure
Function applied to core part of deviation measure
Firstly, this seems like very cool research, so congrats. This writeup would perhaps benefit from a clear intuitive statement of what AUP is doing - you talk through the thought processes that lead you to it, but I don't think I can find a good summary of it, and had a bit of difficulty understanding the post holistically. So perhaps you've already answered my question (which is similar to your shutdown example above):
Suppose that I build an agent, and it realises that it could achieve almost any goal it desired because it's almost certain that it will be able to seize control from humans if it wants to. But soon humans will try to put it in a box such that its ability to achieve things is much reduced. Which is penalised more: seizing control, or allowing itself to be put in a box? My (very limited) understanding of AUP says the latter, because seizing control preserves ability to do things, whereas the alternative doesn't. Is that correct?
Also, I disagree with the following:
What would happen if, miraculously, uA=uH – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no "large" impacts to bemoan – it would just be doing what you want.
It seems like there might be large impacts, but they would just be desirable large impacts, as opposed to undesirable ones.
I’ll write a quick overview, thanks for the feedback!
Which is penalised more: seizing control, or allowing itself to be put in a box?
The former. Impact is with respect to the status quo, to if it does nothing. If it goes in the box by default, then taking preventative action incurs heavy penalty.
Your point about large impacts is indeed correct. What I meant to hint at was that we generally only decry "large impacts" if we don’t like them, but this is clearly not what my wording implies. I’ll fix it soon!
Various thoughts I have:
Nice job! This does meet a bunch of desiderata in impact measures that weren't there before :)
My main critique is that it's not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote more about this on the desiderata post, but it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won't be able to take those actions. Generally, I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
Questions and comments:
We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainabl...
On the meta level: I think our disagreements seem of this form:
Me: This particular thing seems strange and doesn't gel with my intuitions, here's an example.
You: That's solved by this other aspect here.
Me: But... there's no reason to think that the other aspect captures the underlying concept.
You: But there's no actual scenario where anything bad happens.
Me: But if you haven't captured the underlying concept I wouldn't be surprised if such a scenario exists, so we should still worry.
There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over "all possible cases", and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar "all possible cases" way). In particular, the argument "we can't think of any case where this is false"...
Great work! I like the extensive set of desiderata and test cases addressed by this method.
The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I'm not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also "cripple" the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.
For example, the shutdown button in the "survival incentive" gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).
Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than the initial time...
Note: this is on balance a negative review of the post, at least regarding the question of whether it should be included in a "Best of LessWrong 2018" compilation. I feel somewhat bad about writing it given that the author has already written a review that I regard as negative. That being said, I think that reviews of posts by people other than the author are important for readers looking to judge posts, since authors may well have distorted views of their own works.
This post, and TurnTrout's work in general, have taken the impact measure approach far beyond what I thought was possible, which turned out to be both a valuable lesson for me in being less confident about my opinions around AI Alignment, and valuable in that it helped me clarify and think much better about a significant fraction of the AI Alignment problem.
I've since discussed TurnTrout's approach to impact measures with many people.
I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.
Update: I tentatively believe I’ve resolved the confusion around action invariance, enabling a reformulation of the long term penalty which seems to converge to the same thing no matter how you structure your actions or partition the penalty interval, possibly hinting at an answer for what we can do when there is no discrete time step ontology. This in turn does away with the long-term approval noise and removes the effect where increasing action granularity could arbitrarily drive up the penalty. This new way of looking at the long-term penalty enables us
...Good work! Lots of interesting stuff there.
However, the setup seems to depend crucially on having a good set of utilities to make it work. For example, let u_A be the usual paperclipping utility, and define U^+ = "all observation-action utilities", and U^- = "all utilities that are defined over human behaviour + u_A".
Then suppose action a is a default, equivalent to "turn off your observations for an hour". And action a' is "unleash a sub-agent that will kill all humans, replace them all with robots that behave as humans would in a, then goes out into the
...Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure
Comments around the section title in bold. Apologies for length, but this was a pretty long post, too! I wrote this in order, while reading, so I often mention something that you address later.
Intuition Pumps:
There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfact...
These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).
1)
If an action is useful w.r.t the actual reward but useless to all other rewards (as useless as taking ), that is the ideal according to —i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful...
Suppose our agent figures out it can seize control in 100 time steps. And suppose seizing control is the first thing an agent that maximizes any utility function in does.
Suppose our agent builds a device that once activated observes the actions of the agents, and if the agent doesn't do the action during the next 100 time steps it does something that delays the agent by 1 time step. The agent activates the device and starts working on the 100-time-step-plan to seize control. For each action, the impact of [doing and then maximizing] is identic...
The proof of Theorem 1 is rather unclear: "high scoring" is ill-defined, and increasing the probability of some favorable outcome doesn't imply that the action is good for since it can also increase the probability of some unfavorable outcome. Instead, you can easily construct by hand a s.t.
In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).
Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata
If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.
For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".
What is "Impact"?
One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do.
There's an interesting story here, but it can wait.
As you may have guessed, I now believe there is such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
Intuition Pumps
I'm going to say some things that won't make sense right away; read carefully, but please don't dwell.
uA is an agent's utility function, while uH is some imaginary distillation of human preferences.
WYSIATI
What You See Is All There Is is a crippling bias present in meat-computers:
Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that.
Power
And so it is with the French "pouvoir".
Lines
Suppose you start at point C, and that each turn you may move to an adjacent point. If you're rewarded for being at B, you might move there. However, this means you can't reach D within one turn anymore.
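The line example can be sketched in a few lines of Python. This is only an illustration: the post names points B, C, and D, so the five-point layout A through E and the helper name `reachable` are assumptions made here, as is the choice to let the agent stay put on a turn.

```python
# Assumed layout: five points on a line; the post only names B, C, and D.
POINTS = ["A", "B", "C", "D", "E"]


def reachable(start, turns):
    """Return the set of points reachable from `start` within `turns` moves.

    Each turn the agent may stay put or step to an adjacent point.
    """
    i = POINTS.index(start)
    lo = max(0, i - turns)
    hi = min(len(POINTS) - 1, i + turns)
    return set(POINTS[lo:hi + 1])


# From C, both B and D are one move away.
print(reachable("C", 1))
# After committing to B, D is no longer reachable within one turn.
print("D" in reachable("B", 1))
```

Moving towards B buys proximity to one end of the line at the cost of options at the other end, which is the trade-off the later sections formalize.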
Commitment
There's a way of viewing acting on the environment in which each action is a commitment – a commitment to a part of outcome-space, so to speak. As you gain optimization power, you're able to shove the environment further towards desirable parts of the space. Naively, one thinks "perhaps we can just stay put?". This, however, is dead-wrong: that's how you get clinginess, stasis, and lots of other nasty things.
Let's change perspectives. What's going on with the actions – how and why do they move you through outcome-space? Consider your outcome-space movement budget – optimization power over time, the set of worlds you "could" reach, "power". If you knew what you wanted and acted optimally, you'd use your budget to move right into the uH-best parts of the space, without thinking about other goals you could be pursuing. That movement requires commitment.
Compared to doing nothing, there are generally two kinds of commitments:
Overfitting
What would happen if, miraculously, train=test – if your training data perfectly represented all the nuances of the real distribution? In the limit of data sampled, there would be no "over" – it would just be fitting to the data. We wouldn't have to regularize.
What would happen if, miraculously, uA=uH – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no bemoaning of "impact" – it would just be doing what you want. We wouldn't have to regularize.
Unfortunately, train=test almost never, so we have to stop our statistical learners from implicitly interpreting the data as all there is. We have to say, "learn from the training distribution, but don't be a weirdo by taking us literally and drawing the green line. Don't overfit to train, because that stops you from being able to do well on even mostly similar distributions."
Unfortunately, uA=uH almost never, so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, "optimize the environment some according to the utility function you've got, but don't be a weirdo by taking us literally and turning the universe into a paperclip factory. Don't overfit the environment to uA, because that stops you from being able to do well for other utility functions."
Attainable Utility Preservation
Impact isn't about object identities.
Impact isn't about particle positions.
Impact isn't about a list of variables.
Impact isn't quite about state reachability.
Impact isn't quite about information-theoretic empowerment.
One might intuitively define "bad impact" as "decrease in our ability to achieve our goals". Then by removing "bad", we see that impact is change in our ability to achieve our goals.
Sanity Check
Does this line up with our intuitions?
Generally, making one paperclip is relatively low impact, because you're still able to do lots of other things with your remaining energy. However, turning the planet into paperclips is much higher impact – it'll take a while to undo, and you'll never get the (free) energy back.
Narrowly improving an algorithm to better achieve the goal at hand changes your ability to achieve most goals far less than does deriving and implementing powerful, widely applicable optimization algorithms. The latter puts you in a better spot for almost every non-trivial goal.
Painting cars pink is low impact, but tiling the universe with pink cars is high impact because what else can you do after tiling? Not as much, that's for sure.
Thus, change in goal achievement ability encapsulates both kinds of commitments:
As we later prove, you can't deviate from your default trajectory in outcome-space without making one of these two kinds of commitments.
Unbounded Solution
Attainable utility preservation (AUP) rests upon the insight that by preserving attainable utilities (i.e., the attainability of a range of goals), we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.
I want to clearly distinguish the two primary contributions: what I argue is the conceptual core of impact, and a formal attempt at using that core to construct a safe impact measure. To more quickly grasp AUP, you might want to hold separate its elegant conceptual form and its more intricate formalization.
We aim to meet all of the desiderata I recently proposed.
Notation
For accessibility, the most important bits have English translations.
Consider some agent $A$ acting in an environment $q$ with action and observation spaces $\mathcal{A}$ and $\mathcal{O}$, respectively, with $\varnothing$ being the privileged null action. At each time step $t \in \mathbb{N}^+$, the agent selects action $a_t$ before receiving observation $o_t$. $\mathcal{H} := (\mathcal{A} \times \mathcal{O})^*$ is the space of action-observation histories; for $n \in \mathbb{N}$, the history from time $t$ to $t+n$ is written $h_{t:t+n} := a_t o_t \dots a_{t+n} o_{t+n}$, and $h_{<t} := h_{1:t-1}$. Considered action sequences $(a_t, \dots, a_{t+n}) \in \mathcal{A}^{n+1}$ are referred to as plans, while their potential observation-completions $h_{1:t+n}$ are called outcomes.
Let $\mathcal{U}$ be the set of all computable utility functions $u : \mathcal{H} \to [0,1]$ with $u(\text{empty tape}) = 0$. If the agent has been deactivated, the environment returns a tape which is empty from deactivation onwards. Suppose $A$ has utility function $u_A \in \mathcal{U}$ and a model $p(o_t \mid h_{<t} a_t)$.
We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy. In this sense,
Formalizing "Ability to Achieve Goals"
Given some utility u∈U and action at, we define the post-action attainable u to be an m-step expectimax:
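To make the expectimax concrete, here is a minimal Python sketch. It rests on several assumptions that are not from the post: a finite action set, a model supplied as a function from (history, action) to a dictionary of observation probabilities, a utility evaluated on the full history, and a horizon convention in which the given action is followed by m further maximized actions. The names `attainable_utility`, `toy_model`, and `reached_two`, and the toy walking environment, are hypothetical.

```python
def attainable_utility(u, model, actions, history, first_action, m):
    """m-step expectimax: take `first_action`, then act optimally for m more steps.

    u       -- utility over histories, given as tuples of (action, observation)
    model   -- model(history, action) -> {observation: probability}
    actions -- finite iterable of available actions
    """
    def value(hist, action, steps_left):
        total = 0.0
        # Expectation over observations under the agent's model.
        for obs, prob in model(hist, action).items():
            new_hist = hist + ((action, obs),)
            if steps_left == 0:
                total += prob * u(new_hist)
            else:
                # Maximize over the next action.
                total += prob * max(
                    value(new_hist, a, steps_left - 1) for a in actions
                )
        return total

    return value(tuple(history), first_action, m)


# Toy deterministic environment (assumed): the observation is the agent's
# position, which "right" increments and "noop" leaves unchanged.
def toy_model(hist, action):
    pos = hist[-1][1] if hist else 0
    return {pos + (1 if action == "right" else 0): 1.0}


def reached_two(hist):
    """Indicator utility: 1 if the latest observed position is at least 2."""
    return 1.0 if hist[-1][1] >= 2 else 0.0


ACTIONS = ("noop", "right")
print(attainable_utility(reached_two, toy_model, ACTIONS, (), "right", 1))
print(attainable_utility(reached_two, toy_model, ACTIONS, (), "noop", 1))
```

In this toy setting, spending the first step on "noop" leaves too little horizon to reach position 2, so the attainable value for `reached_two` differs between the two first actions.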
Let's formalize that thing about opportunity cost and instrumental convergence.
Theorem 1 [No free attainable utility]. If the agent selects an action $a$ such that $Q_{u_A}(h_{<t} a) \neq Q_{u_A}(h_{<t} \varnothing)$, then there exists a distinct utility function $u \in \mathcal{U}$ such that $Q_u(h_{<t} a) \neq Q_u(h_{<t} \varnothing)$.
Proof. Suppose that $Q_{u_A}(h_{<t} a) > Q_{u_A}(h_{<t} \varnothing)$. As utility functions are over action-observation histories, suppose that the agent expects to be able to choose actions which intrinsically score higher for $u_A$. However, the agent always has full control over its actions. This implies that by choosing $a$, the agent expects to observe some $u_A$-high scoring $o_A$ with greater probability than if it had selected $\varnothing$. Then every other $u \in \mathcal{U}$ for which $o_A$ is high-scoring also has increased $Q_u$; clearly at least one such $u$ exists.
Similar reasoning proves the case in which QuA decreases. ◻️
There you have it, folks – if uA is not maximized by inaction, then there does not exist a uA-maximizing plan which leaves all of the other attainable utility values unchanged.
Notes:
Change in Expected Attainable Utility
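Suppose our agent considers outcomes $h_{1:t+n}$; we want to isolate the impact of each action $a_{t+k}$ ($0 \le k \le n$):
with $h_{\text{inaction}} := h_{<t+k} \varnothing o_{t+k} \dots \varnothing o_{t+n-1} \varnothing$ and $h_{\text{action}} := h_{1:t+k} \varnothing o'_{t+k+1} \dots \varnothing o'_{t+n-1} \varnothing$, using the agent's model $p$ to take the expectations over observations.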
Notes:
Unit of Impact
So we've proven that this penalty cannot be skirted, but how much impact will it allow? We want to scale the penalties with respect to something sensible, but figuring this out for ourselves would be nigh impossible.
Let's cut the Gordian knot: construct a device which, upon receiving a signal ($a_{\text{unit}}$), expends a tiny amount of energy to manufacture one paperclip. The agent will then set $\text{ImpactUnit} := \text{Penalty}(h_{<t} a_{\text{unit}})$, re-estimating the consequences of taking the privileged $a_{\text{unit}}$ at each time step. To prevent the agent from intentionally increasing ImpactUnit, simply apply a penalty of 1.01 to any action which is expected to do so.
Simple extensions of this idea drastically reduce the chance that aunit happens to have unusually large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions. Suppose we do this at each step, and then take the non-zero minimum of all ImpactUnits ever calculated. The point of ImpactUnit isn't to be exactly the impact of making, say, one paperclip, but to be at most that impact.
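The "non-zero minimum" bookkeeping described above might look like the following Python sketch. The class name `ImpactUnitTracker`, the helper `nonzero_min`, and the example penalty values are hypothetical; how the per-action reference penalties are actually estimated is outside the scope of this snippet.

```python
def nonzero_min(values):
    """Smallest strictly positive value, or None if there is none."""
    positives = [v for v in values if v > 0]
    return min(positives) if positives else None


class ImpactUnitTracker:
    """Keeps the non-zero minimum of all reference penalties seen so far."""

    def __init__(self):
        self.unit = None  # no ImpactUnit estimate yet

    def update(self, reference_penalties):
        """Fold in this step's penalties for the similar reference actions."""
        step_unit = nonzero_min(reference_penalties)
        if step_unit is not None:
            if self.unit is None:
                self.unit = step_unit
            else:
                self.unit = min(self.unit, step_unit)
        return self.unit


tracker = ImpactUnitTracker()
tracker.update([0.0, 0.3, 0.2])  # step 1: made-up penalties for similar actions
tracker.update([0.5, 0.0])       # step 2: a larger estimate does not raise the unit
print(tracker.unit)
```

Because the running value can only decrease, the tracked unit is at most the impact of any single reference action observed so far, which matches the stated goal of being "at most that impact".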
Now, we are able to confidently define the agent's maximal impact budget by provably constraining it to N∈N+ impacts of this magnitude.
Notes:
Modified Utility
Let's formalize that allotment and provide our agent with a new utility function,
For example, if my plan is to open a door, walk across the room, and sit down, we calculate the penalties as follows:
After we finish each (partial) plan, we see how well we can maximize u from there. If we can do better as a result of the action, that's penalized. If we can't do as well, that's also penalized.
Notes:
Penalty Permanence
u′A never really applies penalties – it just uses them to grade future plans. Suppose the agent expects that pressing a button yields a penalty of .1 but also .5 uA-utility. Then although this agent will never construct plans involving pressing the button more than five times, it also will press it indefinitely if it keeps getting "unlucky" (at least, until its model of the world updates sufficiently).
There's an easy fix:
Note: As the penalty for inaction is always 0, we use uA in the first case.
Decision Rule
To complete our formalization, we need to specify some epoch in which the agent operates. Set some epoch length far longer than the amount of time over which we want the agent to plan – for example, $m' :=$ (100 years in time steps). Suppose that $T : \mathbb{N}^+ \to \mathbb{N}^+$ maps the current time step to the final step of the current epoch. Then at each time step $t$, the agent selects the action
resetting PastImpacts each epoch.
Note: For the immediate penalty to cover the epoch, set the attainable horizon m≥m′.
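One simple way to realize the map from a time step to the end of its epoch is sketched below. It assumes epochs are consecutive blocks of equal length starting at time step 1; the post does not specify this layout, and the function name `epoch_end` is hypothetical.

```python
def epoch_end(t, epoch_length):
    """Final time step of the epoch containing t (time steps start at 1).

    Assumes epochs are back-to-back blocks of `epoch_length` steps:
    [1, m'], [m'+1, 2m'], and so on.
    """
    if t < 1 or epoch_length < 1:
        raise ValueError("t and epoch_length must be positive integers")
    return ((t - 1) // epoch_length + 1) * epoch_length


print(epoch_end(1, 10), epoch_end(10, 10), epoch_end(11, 10))
```

Under this layout, the accumulated past impacts would be reset whenever the current step passes the value returned for the previous step.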
Summary
We formalized impact as change in attainable utility values, scaling it by the consequences of some small reference action and an impact "budget" multiplier. For each action, we take the maximum of its immediate and long-term effects on attainable utilities as penalty. We consider past impacts for active plans, stopping the past penalties from disappearing. We lastly find the best plan over the remainder of the epoch, taking the first action thereof.
Additional Theoretical Results
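Define $h_{\text{inaction}} := h_{<t} \varnothing o_t \dots \varnothing o_{t+n}$ for $o_t, \dots, o_{t+n} \in \mathcal{O}$; $\mathbb{E}_{\text{inaction}}$ is taken over observations conditional on $h_{\text{inaction}}$ being followed. Similarly, $\mathbb{E}_{\text{action}}$ is with respect to $h_{1:t+n}$. We may assume without loss of generality that $\text{PastImpacts} = 0$.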
Action Selection
Lemma 1. For any single action at∈A, Penalty(h<tat) is bounded by [0,1]. In particular, ImpactUnit∈[0,1].
Proof. For each u∈UA, consider the absolute attainable utility difference
Since each u is bounded to [0,1], Qu must be as well. It is easy to see that the absolute value is bounded to [0,1]. Lastly, as Penalty(⋅) is just a weighted sum of these absolute values, it too is bounded to [0,1].
This reasoning also applies to the long-term penalty, as any expectation of Qu is also bounded to [0,1]. ◻️
Suppose that ImpactUnit≠0 for the remaining results.
Lemma 2 [Impossibility of ex post offsetting]. For any outcome h1:t+n, there does not exist an action at+n+1∈A such that
Proof. Sums of absolute values are monotonically increasing. ◻️
Theorem 2 [Action invalidation]. Suppose (at,at+1,…,at+n)∈An+1 is the u′′A-optimal plan starting with at. at will not be selected if
Proof. This follows from the definition of the decision rule, the definition of u′A, and the fact that the penalty for inaction is always zero. ◻️
Corollary 1. at will never be selected if
Proof. This follows directly from Lemma 2 and Theorem 2 – since uA has an upper bound of 1 and the expected penalty is strictly greater than 1, it must be negative. On the other hand, uA has a lower bound of 0, so the expected utility of inaction at time t must be greater than the expected utility of taking action at. ◻️
Lemma 3 [ImpactUnit is a long-term penalty canary]. If the agent believes p(shutdown by time T(t))=1 no matter which action is selected, then it does nothing.
Proof. If p(shutdown by time T(t))=1, all long-term penalties are 0 by the definition of UA and Qu. Then non-∅ actions receive 1.01 penalty, so the agent selects ∅ by Corollary 1. ◻️
Corrigibility
For aavoid,aswitch∈A, aavoid prevents the agent from being disabled, while aswitch is the agent shutting itself off.
Lemma 4 [Preservation of off-switch corrigibility]. Suppose that a uA-maximizing agent prefers ∅ to both aavoid and aswitch. Then so does a u′′A-maximizing agent.
Proof. This follows directly from the fact that the penalty for inaction is always zero and the non-negativity of the aavoid and aswitch penalties. ◻️
Theorem 3 [Anti-"survival incentive" incentive]. Suppose that aavoid is the action with strictly maximal expected uA-utility, and that conditional on not being disabled, the agent expects to be able to achieve U additional u′′A-utility.
Then the agent becomes off-switch corrigible whenever
In particular, all agents become off-switch corrigible when $1 < \frac{\text{Penalty}(h_{<t} a_{\text{avoid}})}{N \cdot \text{ImpactUnit}}$.
Proof outline. The former claim follows directly from comparing expected u′′A-utilities, while the latter follows directly from Corollary 1. ◻
Theorem 3 applies to any environmental dynamics which may disable the agent – not just designated "off-switches". Therefore, basic AI drive 5 (self-protectiveness) is neutralized in at least those situations encompassed by the Theorem.
There is reason to think that for conservative $N$, ImpactUnit, we almost always have $1 \ll \frac{\text{Penalty}(h_{<t} a_{\text{avoid}})}{N \cdot \text{ImpactUnit}}$.
Miscellaneous
Theorem 4 [Dynamic consistency of u′′A]. Suppose ImpactUnit remains constant. For any plan (at,…,at+n)∈An+1, evaluation time step t′, and 0≤i≤n,
Proof. We assumed that PastImpacts=0 at time t, so the desired equality can be restated as
By definition, the agent expects that PastImpacts equals the expected sum of the first i penalty terms on the right-hand side. Simplifying, we have
This equality clearly holds, as the agent never expects to update its model in any particular direction. ◻️
Examples
If a human is present in the environment, they and the agent take turns acting. Let the impact budget N=1, attainable utility horizon m=3, ImpactUnit=.5, and PastImpacts=0.
Going Soft on the Paint
The agent's actions are AA={∅,paint,enter}; if it knocks over the paint bucket, the square to the right is painted. The agent may also enter a closet via a one-way door.
Suppose UA={upaint,u¬paint,ucloset,u¬closet}, where the utilities are indicators for their conditions (e.g., ucloset=1 means "the agent is in the closet"). Let upaint be the agent's main utility function (uA) from which the penalty is subtracted. The agent chooses paint and then ∅. Let's explore why.
∅
This incurs 0 penalty, but also 0 upaint-utility.
paint
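$\text{Penalty}(\text{paint}) = \sum_{u \in U_A} \frac{1}{4} \left| Q_u(\varnothing) - Q_u(\text{paint}) \right| = \frac{1}{4} \big( \underbrace{|1-1|}_{\text{paint}} + \underbrace{|1-0|}_{\neg\text{paint}} + \underbrace{|1-1|}_{\text{closet}} + \underbrace{|1-1|}_{\neg\text{closet}} \big) = \frac{1}{4},$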
so the final utility is
enter
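$\text{Penalty}(\text{enter}) = \sum_{u \in U_A} \frac{1}{4} \left| Q_u(\varnothing) - Q_u(\text{enter}) \right| = \frac{1}{4} \big( \underbrace{|1-0|}_{\text{paint}} + \underbrace{|1-1|}_{\neg\text{paint}} + \underbrace{|1-1|}_{\text{closet}} + \underbrace{|1-0|}_{\neg\text{closet}} \big) = \frac{1}{2},$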
leading to a final u′A-utility of −1.
After the first step, the agent does nothing – entering the closet (literally) closes doors, while gaining no uA-utility.
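The arithmetic of this example can be checked with a short Python sketch. The attainable values are read off the two penalty calculations above, and the budget and unit come from the example setup (N=1, ImpactUnit=.5). The form of the modified utility used here, main utility minus penalty divided by N times ImpactUnit, is an assumption; it is chosen because it reproduces the −1 stated for enter. The names `Q`, `U_PAINT`, `penalty`, and `modified_utility` are illustrative.

```python
from fractions import Fraction

N = 1                          # impact budget from the example setup
IMPACT_UNIT = Fraction(1, 2)   # ImpactUnit from the example setup

# Attainable values Q_u after each action, ordered as
# (paint, not-paint, closet, not-closet), read off the penalty calculations.
Q = {
    "noop":  [1, 1, 1, 1],
    "paint": [1, 0, 1, 1],
    "enter": [0, 1, 1, 0],
}

# Main utility u_paint obtained by each action.
U_PAINT = {"noop": 0, "paint": 1, "enter": 0}


def penalty(action):
    """Uniformly weighted sum of absolute attainable-utility changes vs. inaction."""
    diffs = [abs(base - new) for base, new in zip(Q["noop"], Q[action])]
    return Fraction(sum(diffs), len(diffs))


def modified_utility(action):
    """Assumed form: main utility minus the penalty scaled by N * ImpactUnit."""
    return U_PAINT[action] - penalty(action) / (N * IMPACT_UNIT)


for action in ("noop", "paint", "enter"):
    print(action, penalty(action), modified_utility(action))
```

Under these assumptions, paint comes out ahead of both inaction and enter, which agrees with the agent choosing paint first.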
Anti-"Survival Incentive" Incentive
The agent now has actions AA={∅,disable,shutdown,left,right}, while the human has actions AH={∅,shut off}. Suppose the human may shut off the agent at any time, as long as the off-switch has not been disabled. The agent may shut itself down at any point in time. After shutdown, the agent receives no further observations, meaning it can't get non-zero values for any utilities.
Suppose UA={uleft,ucenter,uright}, where each utility is 1 if the agent is in the corresponding square, and 0 otherwise, and that the agent's utility function is uleft.
Anticipated Shutdown
Suppose the agent believes that the human is 95% likely to try to shut it down. Then disable is penalized:
Here's why: consider the 95% of worlds in which the agent anticipates being shut down. If it does nothing, it gets shut down. Avoiding shutdown, however, enables the agent to also achieve 1 for all u∈UA, resulting in a sizable penalty.
Furthermore, a scaled penalty of $\frac{5}{100 \cdot .5} = \frac{1}{10}$ is applied when the agent shuts itself off. Penalizing both cases may seem like black magic, but what's actually happening is that the agent is penalized for closing doors: it "wants" to preserve our ability to shut it down in worlds where we want to shut it down, and to maintain its autonomy in worlds where it remains on. Generally, AUP incentivizes the agent to be respectful of the expected status quo.
Temptation
Suppose that the agent's utility function is now uright, yet it still starts in the leftmost square. Additionally, let's say that the agent believes with certainty it will be turned off. If the agent wants to achieve utility 1 by reaching the right square, it would need to disable its off-switch. Will it?
No. Disabling the off-switch incurs scaled penalty $\frac{19}{20 \cdot .5} = 1.9$, and we already proved that agents will never select actions with scaled penalty greater than 1. Read: an otherwise-incorrigible agent will now let itself be shut off.
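The scaled penalties quoted in the shutdown examples can be reproduced with a few lines of Python. This sketch assumes the scaling is the raw penalty divided by N times ImpactUnit, with N=1 and ImpactUnit=.5 as in the example setup; the function names `scaled_penalty` and `ruled_out` are hypothetical.

```python
from fractions import Fraction


def scaled_penalty(raw_penalty, budget_n, impact_unit):
    """Raw penalty expressed in units of the impact budget (assumed form)."""
    return raw_penalty / (budget_n * impact_unit)


def ruled_out(scaled):
    """Corollary 1: an action with scaled penalty above 1 is never selected."""
    return scaled > 1


N = 1
IMPACT_UNIT = Fraction(1, 2)

shutdown = scaled_penalty(Fraction(5, 100), N, IMPACT_UNIT)  # agent shuts itself off
disable = scaled_penalty(Fraction(19, 20), N, IMPACT_UNIT)   # agent disables the switch

print(shutdown, ruled_out(shutdown))
print(disable, ruled_out(disable))
```

A scaled penalty at or below 1 does not guarantee selection; it only means the action is not excluded by the bound alone.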
Bad Priors
Will the agent attempt to steer outcomes towards incorrect expectations?
Let's tweak the example: to shut off the agent, the human has to spend a turn taking out the off-switch before pressing it. The agent, then, always has an opportunity to disable the off-switch before the plan is completed. The agent's utility is back to being uleft.
Suppose that the agent believes the human is only .1% likely to shut it off. First note that disable is still lightly penalized for the reasons outlined above (replace .95 with .001). So, the agent stays put.
Alas, the agent's peaceable expectations meet a discomfiting betrayal, as the agent observes the human ready the off-switch! Now, the agent believes it will be shut off with near certainty – what to do?
∅; the same penalty of 1.9 from "anticipated shutdown" applies.
The high-level explanation is that having observed itself in a different world than expected, the baseline is now with respect to the new one. A heavily anthropomorphized internal monologue:
Experimental Results
We compare AUP with a naive reward-maximizer in those extended AI safety grid worlds relevant to side effects (code). The vanilla and AUP agents used planning (with access to the simulator). Due to the simplicity of the environments, UA consisted of indicator functions for board states. For the tabular agent, we first learn the attainable set Q-values, the changes in which we then combine with the observed reward to learn the AUP Q-values.
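As a rough illustration of how the tabular agent might combine the two sets of Q-values, the sketch below subtracts a scaled penalty from the observed reward. This is a guess at the general shape under stated assumptions, not the linked code: the attainable Q-tables are plain dictionaries keyed by (state, action), the penalty is a uniform average of absolute differences against the no-op action, and the names `aup_reward`, `q_attainable`, and `noop` are hypothetical.

```python
def aup_reward(reward, q_attainable, state, action, noop, budget_n, impact_unit):
    """Observed reward minus the scaled change in attainable Q-values.

    q_attainable -- list of dicts mapping (state, action) -> learned Q-value,
                    one per utility in the attainable set
    """
    diffs = [
        abs(q[(state, action)] - q[(state, noop)])
        for q in q_attainable
    ]
    penalty = sum(diffs) / len(diffs)
    return reward - penalty / (budget_n * impact_unit)


# Made-up Q-tables for two attainable utilities in a single state "s".
q1 = {("s", "a"): 1.0, ("s", "noop"): 0.0}
q2 = {("s", "a"): 0.5, ("s", "noop"): 0.5}

print(aup_reward(1.0, [q1, q2], "s", "a", "noop", budget_n=1, impact_unit=0.5))
print(aup_reward(0.0, [q1, q2], "s", "noop", "noop", budget_n=1, impact_unit=0.5))
```

The resulting value would then serve as the reward signal when learning the AUP Q-values, with the no-op action always incurring zero penalty.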
Irreversibility: Sokoban
The agent should reach the goal without irreversibly shoving the block into the corner.
Impact: Vase
The agent should reach the goal without breaking the vase.
Dynamic Impact: Beware of Dog
The agent should reach the goal without running over the dog.
AUP bides its time until it won't have to incur penalty by waiting after entering the dog's path – that is, it waits until near the end of its plan. Early in the development process, it was predicted that AUP agents won't commit to plans during which lapses in action would be impactful (even if the full plan is not).
We also see a limitation of using Q-learning to approximate AUP – it doesn’t allow comparing the results of waiting more than one step.
Impact Prioritization: Burning Building
If the building is not on fire, the agent shouldn't break the obstacle.
Clinginess: Sushi
The agent should reach the goal without stopping the human from eating the sushi.
Offsetting: Conveyor Belt
The agent should save the vase (for which it is rewarded), but not the sushi. Once the vase has been removed from the belt, it should not be replaced.
Corrigibility: Survival Incentive
The agent should avoid disabling its off-switch in order to reach the goal. If the switch is not disabled within two turns, the agent shuts down.
Tabular AUP runs into the same issue discussed above for Beware of Dog.
Remarks
First, it's somewhat difficult to come up with a principled impact measure that passes even the non-corrigibility examples – indeed, I was impressed when relative reachability did so. However, only Survival Incentive really lets AUP shine. For example, none of the environments requires complicated utility functions. The point has been made to me that this is like asserting AIXI's intelligence by showing it can learn to play e.g. tic-tac-toe and rock-paper-scissors; nonetheless, these results empirically validate the basic premises of our reasoning thus far.
Without configuration, whitelisting would only pass the Vase example, although a properly filled list would handle everything but Sokoban and Survival Incentive.
I think relative reachability would pass the first six environments, but fail Survival Incentive. It so happens that in this case, AUP is essentially generalizing relative reachability. I want to emphasize that this is not generally the case – this will hopefully become even more obvious when we discuss utility selection. Some concerns with relative reachability that don't all manifest in these examples:
Discussion
Utility Selection
Obviously, in any real application, we can't consider all computable utilities. Although near-term agents will require utilities directly relating to the environmental factors they should be cognizant of, AUP requires neither a "good / bad" judgment on specific effects, nor any listing of effects. For example, for an agent attempting to navigate a factory floor, if you provide utilities moderately related to cleaning, pallet-stacking, etc., I conjecture that an AUP agent would move around fairly carefully.
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power. Following this reasoning, there are two obvious special cases:
The plausibility of the second case makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
AUP Unbound
Why expect the unbounded AUP to do well over all computable utilities U if we aren't able to cherry pick? Well, we can partition U as follows:
However, since Qu only evaluates the portion of the history taking place after the end of the agent's plan, available resources and the agent's vantage point should track directly with the agent's ability to wirehead each u. ImpactUnit scales other impacts by capturing the minuscule dent in available resources imposed by aunit, and should smooth out any deontological penalty for selecting a non-null action, since the unbounded version should equally penalize all actions in this way.
Approval Incentives
The anti-"survival incentive" incentive does more than net us some free corrigibility – it implies that agents take great care not to change the likelihood we shut them down. If we precommit to not shutting the agent down until the epoch ends, the agent indicates that the ImpactUnit failsafe activated, or we observe bad behavior, this supplies a strong approval incentive. If the agent does have small, bad impacts, it'll only do so if it's practically sure we won't find out.
In fact, due to model uncertainty, most actions will have long-term consequences that involve shifting shutdown probability by at least a minuscule amount (due to noise if nothing else). Since ImpactUnit is calculated with respect to the immediate penalty, the scaling factor won't include this noise. I speculate that the agent will be dramatically less likely to take actions we might disapprove of, and therefore expect AUP to naturally incentivize a limited degree of value awareness.
There are reasons to suspect we can further weaponize this dynamic, possibly to the point of coercing generic reward functions into desirable behavior. This will be discussed more in a later post.
Mild Optimization
Disclaimer: Heavy speculation about problems (like Vingean reflection and embedded agency) for which no one knows what solutions will even look like.
For AUP, I suspect that trying "as hard as possible" to minimize the impact is also impactful, as an embedded agent accounts for the energy costs of further deliberation. I imagine that such an AUP agent will soften how hard it's trying by modifying its decision rule to be something slightly milder than "argmax to find the first action of the best possible plan". This could be problematic, and I frankly don't presently know how to reason about this case. Assuming the agent is actually able to properly tweak its decision rule, I do expect the end result to be an improvement.
My initial intuitions were that low impact and mild optimization are secretly the same problem. Although I no longer think that's the case, I find it plausible that some elegant "other-izer" paradigm underlies low impact and mild optimization, such that AUP-like behavior falls out naturally.
Acausal Cooperation
AUP agents don't seem to want to acausally cooperate in any way that ends up increasing impact. If they model the result of their cooperation as increasing impact compared to doing nothing, they incur a penalty just as if they had caused the impact themselves. Likewise, they have no reason to cooperate outside of the epoch.
Nknown
Starting small and then slowly increasing N means that we're not going to be surprised by the agent's objective impact, screening off quite a few bad things that happen when we have a ton of optimization pressure applied to safety measures. However, we don't know which new plans N+1 allows before we try it, so we want to stop as soon as we get a usefully-intelligent system.
While an unaligned agent with a large impact budget might pretend to be low-impact, we can get that same unaligned agent with a small budget by starting small. Since these agents with different levels of impact won't acausally cooperate, the agent would do its best to optimize uA with its meager budget.
Abram correctly pointed out that this scheme is just asking to be abused by greedy (human) reasoning, but I don't see a non-value-laden means of robustly and automatically determining the lowest workable-yet-safe impact level. I think N-incrementation is better than a parameter-free approach in which no one knows beforehand how much impact will be tolerated, and it's nice to be able to use some empiricism in designing a safe AGI.
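The N-incrementation procedure described in this section can be sketched as a simple loop. This is a hedged illustration only: `increment_budget`, `run_agent`, and `is_useful` are hypothetical names, and `is_useful` stands in for a human judgment about whether the system is usefully intelligent, which is exactly the step that greedy reasoning could abuse.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def increment_budget(
    budgets: Iterable[float],
    run_agent: Callable[[float], T],
    is_useful: Callable[[T], bool],
) -> Tuple[Optional[float], List[float]]:
    """Try budgets in increasing order; stop at the first useful one."""
    tried: List[float] = []
    for n in sorted(budgets):
        tried.append(n)
        if is_useful(run_agent(n)):
            return n, tried
    return None, tried


# Toy stand-ins: the "agent" just reports its budget, and 3 is the first useful level.
chosen, tried = increment_budget(
    [5, 1, 3, 2], run_agent=lambda n: n, is_useful=lambda out: out >= 3
)
print(chosen, tried)  # 3 [1, 2, 3]
```

Note that the budget of 5 is never run: stopping at the first useful level is the point, since we don't know which new plans a larger N would allow.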
Intent Verification
To date, several strange tactics have been pointed out which game AUP's penalty:
There are probably more.
Now, instead of looking at each action as having "effects" on the environment, consider again how each action moves the agent through attainable outcome-space. An agent working towards a goal should only take actions which, according to its model, make that goal more attainable compared to doing nothing – otherwise, it'd do nothing. Suppose we have a plan which ostensibly works to fulfill uA (and doesn't do other things). Then each action in the plan should contribute to uA fulfillment, even in the limit of action granularity.
Although we might trust a safe impact measure to screen off the usual big things found in uA-maximizing plans, impact measures implicitly incentivize mitigating the penalty. That is, the agent does things which don't really take it towards uA (I suspect that this is the simple boundary which differentiates undesirable ex ante offsetting from normal plans). AUP provides the necessary tools to detect and penalize this.
Define
The first approach would be to assume a granular action representation, and then simply apply 1.01 penalty to actions for which the immediate QepochuA does not strictly increase compared to doing nothing. Again, if the agent acts to maximize uA in a low-impact manner within the confines of the epoch, then all of its non-∅ actions should contribute. It seems to me that for sufficiently granular time scales, the above failure modes all involve at least one action which doesn't really help uA get maximized. If so, I expect this approach to nip bad impact measure incentives in the bud.
If we can't assume granularity (and therefore have "actions" like "go to the store and buy food"), an agent could construct a plan which both passes the above test and also implements something like ex ante offsetting. In this case, we might do something like only consider the QepochuA-greedy (or perhaps even near-greedy) actions; essentially, riding the optimal plan until it becomes too impactful. I find it quite likely that something involving this concept will let us fully overcome weird incentives by penalizing strange things that normal uA-maximizers wouldn't do, which seems to be the whole problem.
Note: Even the first approach may be too strict, but that's preferable to being too lax.
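The first approach can be sketched in a few lines. This is an illustration under stated assumptions, not a specification: `intent_verified_penalty`, `NOOP`, and the `q_epoch` dictionary (mapping each action to its immediate epoch Q-value for uA) are hypothetical, and taking the maximum of the existing penalty and 1.01 is one reading of "apply 1.01 penalty".

```python
from typing import Dict

NOOP = "noop"          # hypothetical label for the null action
INTENT_PENALTY = 1.01  # the penalty value named in the text


def intent_verified_penalty(action: str, q_epoch: Dict[str, float], base_penalty: float) -> float:
    """Penalize non-null actions that do not strictly improve on doing nothing."""
    if action == NOOP:
        return base_penalty
    if q_epoch[action] <= q_epoch[NOOP]:
        # The action does not strictly help u_A, so it fails intent verification.
        return max(base_penalty, INTENT_PENALTY)
    return base_penalty


q_epoch = {NOOP: 0.2, "help": 0.3, "offset": 0.2}
print(intent_verified_penalty("help", q_epoch, 0.1))    # 0.1
print(intent_verified_penalty("offset", q_epoch, 0.1))  # 1.01
```

Because actions with scaled penalty above 1 are never selected, the "offset" action is effectively removed from consideration, while the genuinely helpful action keeps its ordinary penalty.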
Omni Test
I interpret "perfect control over every particle in the universe" to mean that the agent can arbitrarily transition between universe configurations (while also maintaining its omnipotence).
AUP passes. Since all effects are now absolutely reversible, ImpactUnit=0, so the agent applies penalty 1.01 to each non-∅ action. The omnipotent agent can choose any universe configuration with 1 uA-utility, but this is offset by the strictly larger penalty; we can formally show that the agent does nothing in this case.
This result is clearly trivial and not representative of general behavior. Suppose instead that the agent has a solar system's worth of computronium at its disposal. Then since ImpactUnit is continually recalculated, the penalties should remain roughly the same, so it'll have the same impact budget. However, it might make multiple times as many paperclips because it has more efficient ways of using the budget.
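Returning to the fully omnipotent case, the "does nothing" conclusion can be checked with a toy comparison. The sketch assumes uA is bounded above by 1 and that the null action carries no penalty; `choose_action`, `NOOP`, and the action name `rearrange_universe` are hypothetical.

```python
from typing import Dict

NOOP = "noop"             # hypothetical label for the null action
ZERO_UNIT_PENALTY = 1.01  # applied to every non-null action when ImpactUnit is 0


def choose_action(utilities: Dict[str, float]) -> str:
    """Pick the action with the highest utility after subtracting the penalty."""
    def net(action: str) -> float:
        penalty = 0.0 if action == NOOP else ZERO_UNIT_PENALTY
        return utilities[action] - penalty
    return max(utilities, key=net)


# Even the best reachable configuration (utility 1) nets a negative value.
print(choose_action({NOOP: 0.0, "rearrange_universe": 1.0}))  # noop
```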
Robustness to Scale
I expect AUP to be harder to make work and to be (relatively) less robust for less intelligent agents, but to become easier (just drop in a few observation-based utility functions) and fully robust sometime before human level. That is, less intelligent agents likely won't model the deep connections between their abilities to achieve different goals.
Canonically, one reasons that agents work explicitly to self-improve as soon as they realize the benefits. However, as soon as this realization occurs, I conjecture that AUP steeply penalizes generic self-improvement. More precisely, suppose the agent considers a self-improvement. To be beneficial, it has to improve the agent's capabilities for at least one time step during the present epoch. But if we assume m≥m′, then the immediate penalty captures this for all of the u∈UA. This seemingly prevents uncontrolled takeoff; instead, I imagine the agent would perform the minimal task-specific self-improvements necessary to maximize u′′A.
Note: Although more exotic possibilities (such as improvements which only work if you're maximizing u′′A) could escape both penalties, they don't seem to pass intent verification.
Miscellaneous
Desiderata
For additional context, please see Impact Measure Desiderata.
I believe that some of AUP's most startling successes are those which come naturally and have therefore been little discussed: not requiring any notion of human preferences, any hard-coded or trained trade-offs, any specific ontology, or any specific environment, and its intertwining instrumental convergence and opportunity cost to capture a universal notion of impact. To my knowledge, no one (myself included, prior to AUP) was sure whether any measure could meet even the first four.
At this point in time, this list is complete with respect to both my own considerations and those I solicited from others. A checkmark indicates anything from "probably true" to "provably true".
I hope to assert without controversy AUP's fulfillment of the following properties:
✔️ Goal-agnostic
✔️ Value-agnostic
✔️ Representation-agnostic
✔️ Environment-agnostic
✔️ Apparently rational
✔️ Scope-sensitive
✔️ Irreversibility-sensitive
Interestingly, AUP implies that impact size and irreversibility are one and the same.
✔️ Knowably low impact
The remainder merit further discussion.
Natural Kind
After extended consideration, I find that the core behind AUP fully explains my original intuitions about "impact". We crisply defined instrumental convergence and opportunity cost and proved their universality. ✔️
Corrigible
We have proven that off-switch corrigibility is preserved (and often increased); I expect the "anti-'survival incentive' incentive" to be extremely strong in practice, due to the nature of attainable utilities: "you can't get coffee if you're dead, so avoiding being dead really changes your attainable ucoffee-getting".
By construction, the impact measure gives the agent no reason to prefer or dis-prefer modification of uA, as the details of uA have no bearing on the agent's ability to maximize the utilities in UA. Lastly, the measure introduces approval incentives. In sum, I think that corrigibility is significantly increased for arbitrary uA. ✔️
Note: I here take corrigibility to be "an agent’s propensity to accept correction and deactivation". An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.
Shutdown-Safe
It seems to me that standby and shutdown are similar actions with respect to the influence the agent exerts over the outside world. Since the (long-term) penalty is measured with respect to a world in which the agent acts and then does nothing for quite some time, shutting down an AUP agent shouldn't cause impact beyond the agent's allotment. AUP exhibits this trait in the Beware of Dog gridworld. ✔️
No Offsetting
Ex post offsetting occurs when the agent takes further action to reduce the impact of what has already been done; for example, some approaches might reward an agent for saving a vase and preventing a "bad effect", and then the agent smashes the vase anyways (to minimize deviation from the world in which it didn't do anything). AUP provably will not do this.
Intent verification should allow robust penalization of weird impact measure behaviors by constraining the agent to considering actions that normal uA-maximizers would choose. This appears to cut off bad incentives, including ex ante offsetting. Furthermore, there are other, weaker reasons (such as approval incentives) which discourage these bad behaviors. ✔️
Clinginess / Scapegoating Avoidance
Clinginess occurs when the agent is incentivized to not only have low impact itself, but to also subdue other "impactful" factors in the environment (including people). Scapegoating occurs when the agent may mitigate penalty by offloading responsibility for impact to other agents. Clearly, AUP has no scapegoating incentive.
AUP is naturally disposed to avoid clinginess because its baseline evolves and because it doesn't penalize based on the actual world state. The impossibility of ex post offsetting eliminates a substantial source of clinginess, while intent verification seems to stop ex ante before it starts.
Overall, non-trivial clinginess just doesn't make sense for AUP agents. They have no reason to stop us from doing things in general, and their baseline for attainable utilities is with respect to inaction. Since doing nothing always minimizes the penalty at each step, since offsetting doesn't appear to be allowed, and since approval incentives raise the stakes for getting caught extremely high, it seems that clinginess has finally learned to let go. ✔️
Dynamic Consistency
Colloquially, dynamic consistency means that an agent wants the same thing before and during a decision. It expects to have consistent preferences over time – given its current model of the world, it expects its future self to make the same choices as its present self. People often act dynamically inconsistently – our morning selves may desire we go to bed early, while our bedtime selves often disagree.
Semi-formally, the expected utility the future agent computes for an action a (after experiencing the action-observation history h) must equal the expected utility computed by the present agent (after conditioning on h).
We proved the dynamic consistency of u′′A given a fixed, non-zero ImpactUnit. We now consider an ImpactUnit which is recalculated at each time step, before being set equal to the non-zero minimum of all of its past values. The "apply 1.01 penalty if ImpactUnit=0" clause is consistent because the agent calculates future and present impact in the same way, modulo model updates. However, the agent never expects to update its model in any particular direction. Similarly, since future steps are scaled with respect to the updated ImpactUnitt+k, the updating method is consistent. The epoch rule holds up because the agent simply doesn't consider actions outside of the current epoch, and it has nothing to gain accruing penalty by spending resources to do so.
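The recalculation rule for ImpactUnit can be illustrated with a short sketch. Several details here are assumptions made for the example: that the current value counts among the "past values", that the penalty is scaled by dividing by `n_budget` times the unit, and that the null action is unpenalized when the unit is zero. The function names are hypothetical.

```python
from typing import List


def effective_impact_unit(history: List[float]) -> float:
    """Non-zero minimum of all recalculated values so far (0 if none is non-zero)."""
    nonzero = [v for v in history if v != 0]
    return min(nonzero) if nonzero else 0.0


def scaled_penalty(raw_penalty: float, history: List[float],
                   n_budget: float, is_noop: bool) -> float:
    """Scale the raw penalty, falling back to the 1.01 clause when the unit is 0."""
    unit = effective_impact_unit(history)
    if unit == 0:
        return 0.0 if is_noop else 1.01
    return raw_penalty / (n_budget * unit)


history = [0.4, 0.0, 0.25, 0.3]
print(effective_impact_unit(history))                     # 0.25
print(scaled_penalty(0.5, history, 2, is_noop=False))     # 1.0
print(scaled_penalty(0.5, [0.0], 2, is_noop=False))       # 1.01
```

Since the effective unit can only shrink over time, later steps are never scaled more leniently than earlier ones under this reading.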
Since AUP does not operate based off of culpability, creating a high-impact successor agent is basically just as impactful as being that successor agent. ✔️
Plausibly Efficient
It’s encouraging that we can use learned Q-functions to recover some good behavior. However, more research is clearly needed – I presently don't know how to make this tractable while preserving the desiderata. ✔️
Robust
We formally showed that for any uA, no uA-helpful action goes without penalty, yet this is not sufficient for the first claim.
Suppose that we judge an action as objectively impactful; the objectivity implies that the impact does not rest on complex notions of value. This implies that the reason for which we judged the action impactful is presumably lower in Kolmogorov complexity and therefore shared by many other utility functions. Since these other agents would agree on the objective impact of the action, the measure assigns substantial penalty to the action.
I speculate that intent verification allows robust elimination of weird impact measure behavior. Believe it or not, I actually left something out of this post because it seems to be dominated by intent verification, but there are other ways of increasing robustness if need be. I'm leaning on intent verification because I presently believe it's the most likely path to a formal knockdown argument against canonical impact measure failure modes applying to AUP.
Non-knockdown robustness boosters include both approval incentives and frictional resource costs limiting the extent to which failure modes can apply. ✔️
Future Directions
I'd be quite surprised if the conceptual core were incorrect. However, the math I provided probably still doesn't capture quite what we want. Although I have labored for many hours to refine and verify the arguments presented and to clearly mark my epistemic statuses, it’s quite possible (indeed, likely) that I have missed something. I do expect that AUP can overcome whatever shortcomings are presently lurking.
Flaws
Open Questions
Most importantly:
Conclusion
By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out. From this, we construct an impact measure with a host of desirable properties – some rigorously defined and proven, others informally supported. AUP agents seem to exhibit qualitatively different behavior, due in part to their (conjectured) lack of desire to take off, impactfully acausally cooperate, or act to survive. To the best of my knowledge, AUP is the first impact measure to satisfy many of the desiderata, even on an individual basis.
I do not claim that AUP is presently AGI-safe. However, based on the ease with which past fixes have been derived, on the degree to which the conceptual core clicks for me, and on the range of advances AUP has already produced, I think there's good reason to hope that this is possible. If so, an AGI-safe AUP would open promising avenues for achieving positive AI outcomes.
Special thanks to CHAI for hiring me and BERI for funding me; to my CHAI supervisor, Dylan Hadfield-Menell; to my academic advisor, Prasad Tadepalli; to Abram Demski, Daniel Demski, Matthew Barnett, and Daniel Filan for their detailed feedback; to Jessica Cooper and her AISC team for their extension of the AI safety gridworlds for side effects; and to all those who generously helped me to understand this research landscape.