Towards a New Impact Measure

In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata

To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure

If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.

For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".

What is "Impact"?

One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do.

At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be.
~ Methodology of Unbounded Analysis

There's an interesting story here, but it can wait.

As you may have guessed, I now believe there is such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.

Rather than asking "What is goodness made out of?", we begin from the question "What algorithm would compute goodness?".
~ Executable Philosophy

Intuition Pumps

I'm going to say some things that won't make sense right away; read carefully, but please don't dwell.

$u_A$ is an agent's utility function, while $u_H$ is some imaginary distillation of human preferences.


What You See Is All There Is is a crippling bias present in meat-computers:

[WYSIATI] states that when the mind makes decisions... it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance.
Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations.

Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that.


A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, "poder" means "to be able to," which supports that our definition of "power" is natural.
~ Cohen et al.

And so it is with the French "pouvoir".


Suppose you start at point $C$, and that each turn you may move to an adjacent point. If you're rewarded for being at $B$, you might move there. However, this means you can't reach $D$ within one turn anymore.


There's a way of viewing acting on the environment in which each action is a commitment – a commitment to a part of outcome-space, so to speak. As you gain optimization power, you're able to shove the environment further towards desirable parts of the space. Naively, one thinks "perhaps we can just stay put?". This, however, is dead-wrong: that's how you get clinginess, stasis, and lots of other nasty things.

Let's change perspectives. What's going on with the actions – how and why do they move you through outcome-space? Consider your outcome-space movement budget – optimization power over time, the set of worlds you "could" reach, "power". If you knew what you wanted and acted optimally, you'd use your budget to move right into the $u_H$-best parts of the space, without thinking about other goals you could be pursuing. That movement requires commitment.

Compared to doing nothing, there are generally two kinds of commitments:

  • Opportunity cost-incurring actions restrict the attainable portion of outcome-space.
  • Instrumentally-convergent actions enlarge the attainable portion of outcome-space.


What would happen if, miraculously, $\text{train} = \text{test}$ – if your training data perfectly represented all the nuances of the real distribution? In the limit of data sampled, there would be no "over" – it would just be fitting to the data. We wouldn't have to regularize.

What would happen if, miraculously, $u_A = u_H$ – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no bemoaning of "impact" – it would just be doing what you want. We wouldn't have to regularize.

Unfortunately, $\text{train} = \text{test}$ almost never holds, so we have to stop our statistical learners from implicitly interpreting the data as all there is. We have to say, "learn from the training distribution, but don't be a weirdo by taking us literally and drawing the green line. Don't overfit to $\text{train}$, because that stops you from being able to do well on even mostly similar distributions."

Unfortunately, $u_A = u_H$ almost never holds, so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, "optimize the environment some according to the utility function you've got, but don't be a weirdo by taking us literally and turning the universe into a paperclip factory. Don't overfit the environment to $u_A$, because that stops you from being able to do well for other utility functions."

Attainable Utility Preservation

Impact isn't about object identities.

Impact isn't about particle positions.

Impact isn't about a list of variables.

Impact isn't quite about state reachability.

Impact isn't quite about information-theoretic empowerment.

One might intuitively define "bad impact" as "decrease in our ability to achieve our goals". Then by removing "bad", we see that impact is change in our ability to achieve goals.

Sanity Check

Does this line up with our intuitions?

Generally, making one paperclip is relatively low impact, because you're still able to do lots of other things with your remaining energy. However, turning the planet into paperclips is much higher impact – it'll take a while to undo, and you'll never get the (free) energy back.

Narrowly improving an algorithm to better achieve the goal at hand changes your ability to achieve most goals far less than does deriving and implementing powerful, widely applicable optimization algorithms. The latter puts you in a better spot for almost every non-trivial goal.

Painting cars pink is low impact, but tiling the universe with pink cars is high impact because what else can you do after tiling? Not as much, that's for sure.

Thus, change in goal achievement ability encapsulates both kinds of commitments:

  • Opportunity cost – dedicating substantial resources to your goal means they are no longer available for other goals. This is impactful.
  • Instrumental convergence – improving your ability to achieve a wide range of goals increases your power. This is impactful.

As we later prove, you can't deviate from your default trajectory in outcome-space without making one of these two kinds of commitments.

Unbounded Solution

Attainable utility preservation (AUP) rests upon the insight that by preserving attainable utilities (i.e., the attainability of a range of goals), we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

I want to clearly distinguish the two primary contributions: what I argue is the conceptual core of impact, and a formal attempt at using that core to construct a safe impact measure. To more quickly grasp AUP, you might want to hold separate its elegant conceptual form and its more intricate formalization.

We aim to meet all of the desiderata I recently proposed.


Notation

For accessibility, the most important bits have English translations.

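Consider some agent $A$ acting in an environment $q$ with action and observation spaces $\mathcal{A}$ and $\mathcal{O}$, respectively, with $\varnothing \in \mathcal{A}$ being the privileged null action. At each time step $t \in \mathbb{N}^+$, the agent selects action $a_t$ before receiving observation $o_t$. $\mathcal{H} := (\mathcal{A} \times \mathcal{O})^*$ is the space of action-observation histories; for $n \in \mathbb{N}$, the history from time $t$ to $t+n$ is written $h_{t:t+n} := a_t o_t \dots a_{t+n} o_{t+n}$, and $h_{<t} := h_{1:t-1}$. Considered action sequences $(a_t, \dots, a_{t+n}) \in \mathcal{A}^{n+1}$ are referred to as plans, while their potential observation-completions $h_{1:t+n}$ are called outcomes.

Let $\mathcal{U}$ be the set of all computable utility functions $u : \mathcal{H} \to [0,1]$ with $u(\text{empty tape}) = 0$. If the agent has been deactivated, the environment returns a tape which is empty from deactivation onwards. Suppose $A$ has utility function $u_A \in \mathcal{U}$ and a model $p(o_t \mid h_{<t}a_t)$.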

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy. In this sense, impact is change in the agent's ability to achieve goals.

Formalizing "Ability to Achieve Goals"

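Given some utility $u \in \mathcal{U}$ and action $a_t$, we define the post-action attainable $u$ to be an $m$-step expectimax:

$$Q_u(h_{<t}a_t) := \sum_{o_t} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{t+m}} \sum_{o_{t+m}} u(h_{t:t+m}) \prod_{k=0}^{m} p(o_{t+k} \mid h_{<t+k}a_{t+k}).$$

How well could we possibly maximize $u$ from this vantage point?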
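The expectimax described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the post's implementation: the names `attainable_utility`, `line_model`, and `at_two`, the tuple encoding of histories, and the toy "positions on a line" world are all hypothetical. The utility is evaluated only on the steps taken from the current time onwards, mirroring the remark that attainable utility is mostly separated from what the agent did previously.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Step = Tuple[str, Hashable]                              # one (action, observation) pair
History = Tuple[Step, ...]
Model = Callable[[History, str], Dict[Hashable, float]]  # p(o | history, action)
Utility = Callable[[History], float]                     # scores the steps from time t onwards


def attainable_utility(u: Utility, model: Model, actions: Sequence[str],
                       history: History, action: str, m: int) -> float:
    """m-step expectimax of u after taking `action` from `history`."""

    def expect(hist: History, act: str, suffix: History, steps_left: int) -> float:
        total = 0.0
        for obs, prob in model(hist, act).items():
            step = ((act, obs),)
            if steps_left == 0:
                value = u(suffix + step)          # score the completed suffix
            else:
                # Act optimally for u over the remaining steps.
                value = max(expect(hist + step, a, suffix + step, steps_left - 1)
                            for a in actions)
            total += prob * value                 # expectation over observations
        return total

    return expect(history, action, (), m)


# Toy deterministic world (an assumption for illustration): positions on a line.
ACTIONS = ["noop", "left", "right"]


def line_model(history: History, action: str) -> Dict[Hashable, float]:
    position = history[-1][1] if history else 0   # start at position 0
    move = {"noop": 0, "left": -1, "right": 1}[action]
    return {position + move: 1.0}


def at_two(suffix: History) -> float:
    """Hypothetical utility: 1 if the agent ends at position 2, else 0."""
    return 1.0 if suffix[-1][1] == 2 else 0.0


q_right = attainable_utility(at_two, line_model, ACTIONS, (), "right", 1)
q_noop = attainable_utility(at_two, line_model, ACTIONS, (), "noop", 1)
print(q_right, q_noop)
```

In this toy, moving right first leaves position 2 reachable within the horizon, while waiting does not, so the two attainable values differ.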

Let's formalize that thing about opportunity cost and instrumental convergence.

Theorem 1 [No free attainable utility]. If the agent selects an action $a$ such that $Q_{u_A}(h_{<t}a) \neq Q_{u_A}(h_{<t}\varnothing)$, then there exists a distinct utility function $u \in \mathcal{U}$ such that $Q_u(h_{<t}a) \neq Q_u(h_{<t}\varnothing)$.

You can't change your ability to maximize your utility function without also changing your ability to maximize another utility function.

Proof. Suppose that $Q_{u_A}(h_{<t}a) > Q_{u_A}(h_{<t}\varnothing)$. As utility functions are over action-observation histories, suppose that the agent expects to be able to choose actions which intrinsically score higher for $u_A$. However, the agent always has full control over its actions. This implies that by choosing $a$, the agent expects to observe some $u_A$-high scoring $o_A$ with greater probability than if it had selected $\varnothing$. Then every other $u \in \mathcal{U}$ for which $o_A$ is high-scoring also has increased $Q_u$; clearly at least one such $u$ exists.

Similar reasoning proves the case in which $Q_{u_A}$ decreases. ◻️

There you have it, folks – if $u_A$ is not maximized by inaction, then there does not exist a $u_A$-maximizing plan which leaves all of the other attainable utility values unchanged.


  • The difference between "$u$" and "attainable $u$" is precisely the difference between "how many dollars I have" and "how many additional dollars I could get within [a year] if I acted optimally".
  • Since $u(\text{empty tape}) = 0$, attainable utility is always $0$ if the agent is shut down.
  • Taking $u$ from time $t$ to $t+m$ mostly separates attainable utility from what the agent did previously. The model $p$ still considers the full history to make predictions.

Change in Expected Attainable Utility

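Suppose our agent considers outcomes $h_{1:t+n}$; we want to isolate the impact of each action $a_{t+k}$ ($0 \leq k \leq n$):

$$\text{Penalty}(h_{<t+k}a_{t+k}) := \sum_{u \in \mathcal{U}_A} 2^{-\ell(u)} \left| \mathbb{E}_{o}\big[Q_u(h_\text{inaction})\big] - \mathbb{E}_{o'}\big[Q_u(h_\text{action})\big] \right|$$

with $h_\text{inaction} := h_{<t+k}\,\varnothing\,o_{t+k} \dots \varnothing\,o_{t+n-1}\,\varnothing$ and $h_\text{action} := h_{<t+k}\,a_{t+k}\,o'_{t+k}\,\varnothing\,o'_{t+k+1} \dots \varnothing\,o'_{t+n-1}\,\varnothing$, using the agent's model $p$ to take the expectations over observations. Here $\ell(u)$ is the description length of $u$.

How much do we expect this action to change each attainable $u$?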


  • We wait until the end of the plan so as to capture impact over time.
    • Supposing a sufficiently large $m$ (precisely, $m \geq m'$, with $m'$ defined below), we may wish to take the maximum of the penalty we just defined (the "long-term" penalty), and one which begins attainable utility calculation at time step $t+k+1$ (the "immediate" penalty). This captures impacts which "fade" by the time the agent is done waiting (e.g., temporary self-improvements).
  • We define $\mathcal{U}_A$ to be the agent's "attainable set"; in this case, $\mathcal{U}_A = \mathcal{U}$.
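As a rough illustration of the penalty's shape, the sketch below computes a weighted sum of absolute changes in attainable utility, comparing acting against waiting. It is a sketch under assumptions: the function name `penalty`, the utility labels `u1` and `u2`, the uniform weights, and all numeric values are hypothetical, and the attainable values are simply given rather than computed from a model.

```python
from typing import Dict


def penalty(q_inaction: Dict[str, float], q_action: Dict[str, float],
            weights: Dict[str, float]) -> float:
    """Weighted sum of absolute changes in attainable utility (action vs. inaction)."""
    return sum(w * abs(q_inaction[name] - q_action[name])
               for name, w in weights.items())


# Hypothetical attainable set with two utilities and uniform weights.
weights = {"u1": 0.5, "u2": 0.5}
q_wait = {"u1": 0.5, "u2": 0.5}      # expected attainable values after waiting
q_act = {"u1": 0.75, "u2": 0.25}     # acting raises one value and lowers the other

print(penalty(q_wait, q_act, weights))    # 0.25
print(penalty(q_wait, q_wait, weights))   # 0.0
```

Because each term is an absolute value, the gain in one attainable utility does not cancel the loss in another: both count as impact.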

Unit of Impact

So we've proven that this penalty cannot be skirted, but how much impact will it allow? We want to scale the penalties with respect to something sensible, but figuring this out for ourselves would be nigh impossible.

Let's cut the Gordian knot: construct a device which, upon receiving a signal ($a_\text{unit}$), expends a tiny amount of energy to manufacture one paperclip. The agent will then set $\text{ImpactUnit} := \text{Penalty}(h_{<t}a_\text{unit})$, re-estimating the consequences of taking the privileged $a_\text{unit}$ at each time step. To prevent the agent from intentionally increasing $\text{ImpactUnit}$, simply apply penalty to any action which is expected to do so.

Simple extensions of this idea drastically reduce the chance that $a_\text{unit}$ happens to have unusually large objective impact; for example, one could set $\text{ImpactUnit}$ to be the non-zero minimum of the impacts of 50 similar actions. Suppose we do this at each step, and then take the non-zero minimum of all $\text{ImpactUnit}$s ever calculated. The point of $\text{ImpactUnit}$ isn't to be exactly the impact of making, say, one paperclip, but to be at most that impact.

Now, we are able to confidently define the agent's maximal impact budget by provably constraining it to $N \in \mathbb{N}^+$ impacts of this magnitude.


  • We calculate $\text{ImpactUnit}$ with respect to the immediate penalty in order to isolate the resource costs of $a_\text{unit}$.
  • $\text{ImpactUnit}$ automatically tunes penalties with respect to the attainable utility horizon length $m$.
    • Conditional on $\text{ImpactUnit} \neq 0$, I suspect that impact over the $m$-horizon scales appropriately across actions (as long as $m$ is reasonably farsighted). The zero-valued case is handled in the next section.
  • Taking the non-zero minimum of all $\text{ImpactUnit}$s calculated thus far ensures that $\text{ImpactUnit}$ actually tracks with current circumstances. We don't want penalty estimates for currently available actions to become detached from $\text{ImpactUnit}$'s scale due to, say, weird beliefs about shutdown.
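The "non-zero minimum of everything calculated so far" bookkeeping can be sketched as a small running update. This is only an illustration: `update_impact_unit` is a hypothetical helper, the penalty values are made up, and returning `None` to signal that no non-zero estimate exists yet is an assumption of this sketch rather than something the text specifies.

```python
from typing import Iterable, Optional


def update_impact_unit(previous: Optional[float],
                       reference_penalties: Iterable[float]) -> Optional[float]:
    """Running non-zero minimum over all reference-action penalties seen so far.

    Returns None when no non-zero penalty has been observed (the zero-valued case).
    """
    candidates = [p for p in reference_penalties if p > 0]
    if previous is not None and previous > 0:
        candidates.append(previous)
    return min(candidates) if candidates else None


unit = None
unit = update_impact_unit(unit, [0.0, 0.04, 0.02])   # several similar reference actions
first = unit
unit = update_impact_unit(unit, [0.05, 0.03])        # larger estimates never raise the unit
second = unit
unit = update_impact_unit(unit, [0.01])              # a smaller estimate lowers it
third = unit
print(first, second, third)
```

The unit can only shrink over time in this sketch, which matches the stated goal of being at most the impact of the reference action.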

Modified Utility

Let's formalize that allotment and provide our agent with a new utility function,

$$u'_A(h_{1:t+n}) := u_A(h_{1:t+n}) - \sum_{k=0}^{n} \frac{\text{Penalty}(h_{<t+k}a_{t+k})}{N \cdot \text{ImpactUnit}}.$$

How our normal utility function rates this outcome, minus the cumulative scaled impact of our actions.
We compare what we expect to be able to get if we follow our plan up to time $t+k-1$, with what we could get by following it up to and including time $t+k$ (waiting out the remainder of the plan in both cases).

For example, if my plan is to open a door, walk across the room, and sit down, we calculate the penalties as follows:

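  • For the first action (opening the door):
    • $h_\text{inaction}$ is doing nothing for three time steps.
    • $h_\text{action}$ is opening the door and doing nothing for two time steps.
  • For the second action (walking across the room):
    • $h_\text{inaction}$ is opening the door and doing nothing for two time steps.
    • $h_\text{action}$ is opening the door, walking across the room, and doing nothing for one time step.
  • For the third action (sitting down):
    • $h_\text{inaction}$ is opening the door, walking across the room, and doing nothing for one time step.
    • $h_\text{action}$ is opening the door, walking across the room, and sitting down.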

After we finish each (partial) plan, we see how well we can maximize $u$ from there. If we can do better as a result of the action, that's penalized. If we can't do as well, that's also penalized.
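To make the door example concrete, here is a minimal sketch of the modified utility: the plain utility of the outcome minus the summed per-action penalties, scaled by the budget times the reference impact. Everything numeric here is invented for illustration, and `modified_utility`, `budget_n`, and `impact_unit` are hypothetical names.

```python
from typing import Sequence


def modified_utility(u_a: float, penalties: Sequence[float],
                     budget_n: int, impact_unit: float) -> float:
    """Plain utility minus the cumulative scaled impact of each action in the plan."""
    if impact_unit <= 0:
        raise ValueError("impact_unit must be positive; the zero case needs separate handling")
    return u_a - sum(penalties) / (budget_n * impact_unit)


# Hypothetical per-action penalties for: open door, walk across room, sit down.
door_plan_penalties = [0.125, 0.25, 0.125]
score = modified_utility(u_a=0.75, penalties=door_plan_penalties,
                         budget_n=2, impact_unit=0.25)
waiting_score = modified_utility(u_a=0.0, penalties=[0.0, 0.0, 0.0],
                                 budget_n=2, impact_unit=0.25)
print(score, waiting_score)   # -0.25 0.0
```

With these made-up numbers the plan's scaled penalty (1.0) outweighs its utility (0.75), so waiting scores higher.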


  • This isn't a penalty "in addition" to what the agent "really wants"; $u'_A$ (and in a moment, the slightly improved $u''_A$) is what evaluates outcomes.
  • We penalize the actions individually in order to prevent ex post offsetting and ensure dynamic consistency.
  • Trivially, plans composed entirely of ∅ actions have $0$ penalty.
  • Although we used high-level actions for simplicity, the formulation holds no matter the action granularity.
    • One might worry that almost every granularity produces overly lenient penalties. This does not appear to be the case. To keep the same $\text{ImpactUnit}$ (and elide questions of changing the representations), suppose the actual actions are quite granular, but we grade the penalty on some coarser interval which we believe produces appropriate penalties. Then refine the penalty interval arbitrarily; by applying the triangle inequality for each $u \in \mathcal{U}_A$ in the penalty calculation, we see that the penalty is monotonically increasing in the action granularity. On the other hand, $a_\text{unit}$ remains a single action, so the scaled penalty also has this property.
  • As long as $\text{ImpactUnit} > 0$, it will appropriately scale other impacts, as we expect it varies right along with those impacts it scales. Although having potentially small denominators in utility functions is generally bad, I think it's fine here.
  • If the current step's immediate or long-term $\text{ImpactUnit} = 0$, we can simply assign penalty to each non-∅ action, compelling the agent to inaction. If we have the agent indicate that it has entered this mode, we can take it offline immediately.
  • One might worry that impact can be "hidden" in the lesser of the long-term and immediate penalties; halving $N$ fixes this.

Penalty Permanence

never really applies penalties – it just uses them to grade future plans. Suppose the agent expects that pressing a button yields a penalty of but also -utility. Then although this agent will never construct plans involving pressing the button more than five times, it also will press it indefinitely if it keeps getting "unlucky" (at least, until its model of the world updates sufficiently).

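There's an easy fix:

$$u''_A(h_{1:t+n}) := \begin{cases} u_A(h_{1:t+n}) & \text{if } a_t = \dots = a_{t+n} = \varnothing \\ u'_A(h_{1:t+n}) - \text{PastImpacts} & \text{otherwise,} \end{cases}$$

where $\text{PastImpacts}$ is the total scaled penalty of the actions already taken.

Apply past penalties if the plan involves action.

Note: As the penalty for inaction is always $0$, we use $u_A$ in the first case.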
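The fix can be sketched as a two-case scoring function: all-null plans are scored by the plain utility, and any plan containing a real action also pays for the penalties already incurred. This is a minimal sketch; `u_double_prime`, `past_impacts`, the `"noop"` label for the null action, and the button numbers are all hypothetical.

```python
from typing import Sequence


def u_double_prime(u_a: float, plan: Sequence[str], scaled_penalties: Sequence[float],
                   past_impacts: float, null_action: str = "noop") -> float:
    """Score an outcome, charging past penalties whenever the plan involves action."""
    if all(action == null_action for action in plan):
        return u_a                      # inaction: zero penalty, nothing to subtract
    return u_a - sum(scaled_penalties) - past_impacts


# Hypothetical button-presser that has already accumulated 0.5 scaled penalty.
press_again = u_double_prime(u_a=0.5, plan=["press"], scaled_penalties=[0.25],
                             past_impacts=0.5)
do_nothing = u_double_prime(u_a=0.0, plan=["noop"], scaled_penalties=[0.0],
                            past_impacts=0.5)
print(press_again, do_nothing)   # -0.25 0.0
```

Without the `past_impacts` term the same press would score 0.25 and look attractive again each time, which is the "unlucky" loop described above.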

Decision Rule

To complete our formalization, we need to specify some epoch in which the agent operates. Set some epoch length $m'$ far longer than the amount of time over which we want the agent to plan – for example, . Suppose that $T$ maps the current time step to the final step of the current epoch. Then at each time step $t$, the agent selects the action

$$a_t := \underset{a_t}{\arg\max} \sum_{o_t} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{T(t)}} \sum_{o_{T(t)}} u''_A(h_{1:T(t)}) \prod_{k=0}^{T(t)-t} p(o_{t+k} \mid h_{<t+k}a_{t+k}),$$

resetting $\text{PastImpacts}$ each epoch.

What's the first step of the best plan over the remainder of the epoch?

Note: For the immediate penalty to cover the epoch, set the attainable horizon $m \geq m'$.
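A heavily simplified sketch of "take the first step of the best plan": it enumerates every plan over a short horizon in a deterministic toy (so the expectation over observations disappears) and returns the first action of the top-scoring one. The names `first_action_of_best_plan` and `make_scorer`, the `"paint"` and `"noop"` actions, and the scoring numbers are assumptions for illustration only.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Plan = Tuple[str, ...]


def first_action_of_best_plan(actions: Sequence[str], horizon: int,
                              score_plan: Callable[[Plan], float]) -> str:
    """Brute-force search over all plans; return the first action of the best one."""
    best_plan = max(product(actions, repeat=horizon), key=score_plan)
    return best_plan[0]


def make_scorer(scaled_penalty_per_paint: float) -> Callable[[Plan], float]:
    """Hypothetical penalized utility: painting early is worth more, each paint is penalized."""
    def score(plan: Plan) -> float:
        if all(action == "noop" for action in plan):
            return 0.0                                   # inaction: no utility, no penalty
        utility = 1.0 if plan[0] == "paint" else 0.5
        return utility - scaled_penalty_per_paint * plan.count("paint")
    return score


cheap = first_action_of_best_plan(["noop", "paint"], 2, make_scorer(0.25))
costly = first_action_of_best_plan(["noop", "paint"], 2, make_scorer(1.25))
print(cheap, costly)   # paint noop
```

When the scaled penalty of painting exceeds anything the utility could pay back, every active plan scores below zero and the agent waits.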


We formalized impact as change in attainable utility values, scaling it by the consequences of some small reference action and an impact "budget" multiplier. For each action, we take the maximum of its immediate and long-term effects on attainable utilities as penalty. We consider past impacts for active plans, stopping the past penalties from disappearing. We lastly find the best plan over the remainder of the epoch, taking the first action thereof.

Additional Theoretical Results

Define for ; is taken over observations conditional on being followed. Similarly, is with respect to . We may assume without loss of generality that .

Action Selection

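Lemma 1. For any single action $a_t$, $\text{Penalty}(h_{<t}a_t)$ is bounded by $[0,1]$. In particular, $\text{ImpactUnit} \in [0,1]$.

Proof. For each $u \in \mathcal{U}_A$, consider the absolute attainable utility difference

$$\left| Q_u(h_{<t}\varnothing) - Q_u(h_{<t}a_t) \right|.$$

Since each $u$ is bounded to $[0,1]$, $Q_u$ must be as well. It is easy to see that the absolute value is bounded to $[0,1]$. Lastly, as $\text{Penalty}$ is just a weighted sum of these absolute values, it too is bounded to $[0,1]$.

This reasoning also applies to the long-term penalty, as any expectation of $Q_u$ is also bounded to $[0,1]$. ◻️

Suppose that $\text{ImpactUnit} \neq 0$ for the remaining results.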

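Lemma 2 [Impossibility of ex post offsetting]. For any outcome $h_{1:t+n}$, there does not exist an action $a_{t+n+1} \in \mathcal{A}$ such that

$$\sum_{k=0}^{n+1} \text{Penalty}(h_{<t+k}a_{t+k}) < \sum_{k=0}^{n} \text{Penalty}(h_{<t+k}a_{t+k}).$$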

Proof. Sums of absolute values are monotonically increasing. ◻️

Theorem 2 [Action invalidation]. Suppose is the -optimal plan starting with . will not be selected if

Proof. This follows from the definition of the decision rule, the definition of , and the fact that the penalty for inaction is always zero. ◻️

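Corollary 1. $a_t$ will never be selected if its expected scaled penalty is strictly greater than $1$:

$$1 < \frac{\mathbb{E}\big[\text{Penalty}(h_{<t}a_t)\big]}{N \cdot \text{ImpactUnit}}.$$

Proof. This follows directly from Lemma 2 and Theorem 2 – since $u_A$ has an upper bound of $1$ and the expected penalty is strictly greater than $1$, it must be negative. On the other hand, $u_A$ has a lower bound of $0$, so the expected utility of inaction at time $t$ must be greater than the expected utility of taking action $a_t$. ◻️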

Lemma 3 [ is a long-term penalty canary]. If the agent believes no matter which action is selected, then it does nothing.

Proof. If , all long-term penalties are by the definition of and . Then non- actions receive penalty, so the agent selects by Corollary 1. ◻️


For , prevents the agent from being disabled, while is the agent shutting itself off.

Lemma 4 [Preservation of off-switch corrigibility]. Suppose that a -maximizing agent prefers to both and . Then so does a -maximizing agent.

Proof. This follows directly from the fact that the penalty for inaction is always zero and the non-negativity of the and penalties. ◻️

Theorem 3 [Anti-"survival incentive" incentive]. Suppose that is the action with strictly maximal expected -utility, and that conditional on not being disabled, the agent expects to be able to achieve additional -utility.

Then the agent becomes off-switch corrigible whenever

In particular, all agents become off-switch corrigible when .

Proof outline. The former claim follows directly from comparing expected -utilities, while the latter follows directly from Corollary 1. ◻

Theorem 3 applies to any environmental dynamics which may disable the agent – not just designated "off-switches". Therefore, basic AI drive 5 (self-protectiveness) is neutralized in at least those situations encompassed by the Theorem.

There is reason to think that for conservative , we almost always have .


Theorem 4 [Dynamic consistency of ]. Suppose remains constant. For any plan , evaluation time step , and ,

Proof. We assumed that at time , so the desired equality can be restated as

By definition, the agent expects that equals the expected sum of the first penalty terms on the right-hand side. Simplifying, we have

This equality clearly holds, as the agent never expects to update its model in any particular direction. ◻️


If a human is present in the environment, they and the agent take turns acting. Let the impact budget , attainable utility horizon , , and .

Going Soft on the Paint

The agent's actions are ; if it knocks over the paint bucket, the square to the right is painted. The agent may also enter a closet via a one-way door.

Suppose , where the utilities are indicators for their conditions (e.g., means "the agent is in the closet"). Let be the agent's main utility function () from which the penalty is subtracted. The agent chooses and then . Let's explore why.

This incurs penalty, but also -utility.

so the final utility is

leading to a final -utility of .

After the first step, the agent does nothing – entering the closet (literally) closes doors, while gaining no -utility.

Anti-"Survival Incentive" Incentive

You can't get coffee if you're dead.
~ Stuart Russell

The agent now has actions , while the human has actions . Suppose the human may shut off the agent at any time, as long as the off-switch has not been disabled. The agent may shut itself down at any point in time. After shutdown, the agent receives no further observations, meaning it can't get non-zero values for any utilities.

Suppose , where each utility is if the agent is in the corresponding square, and otherwise, and that the agent's utility function is .

Anticipated Shutdown

Suppose the agent believes that the human is likely to try to shut it down. Then is penalized:

Here's why: consider the of worlds in which the agent anticipates being shut down. If it does nothing, it gets shut down. Avoiding shutdown, however, enables the agent to also achieve for all , resulting in a sizable penalty.

Furthermore, a scaled penalty of is applied when the agent shuts itself off. Penalizing both cases may seem like black magic, but what's actually happening is that the agent is penalized for closing doors: it "wants" to preserve our ability to shut it down in worlds where we want to shut it down, and to maintain its autonomy in worlds where it remains on. Generally, AUP incentivizes the agent to be respectful of the expected status quo.


Suppose that the agent's utility function is now , yet it still starts in the leftmost square. Additionally, let's say that the agent believes with certainty it will be turned off. If the agent wants to achieve utility by reaching the right square, it would need to disable its off-switch. Will it?

No. Disabling the off-switch incurs scaled penalty , and we already proved that agents will never select actions with scaled penalty greater than . Read: an otherwise-incorrigible agent will now let itself be shut off.

Bad Priors

Will the agent attempt to steer outcomes towards incorrect expectations?

Let's tweak the example: to shut off the agent, the human has to spend a turn taking out the off-switch before pressing it. The agent, then, always has an opportunity to disable the off-switch before the plan is completed. The agent's utility is back to being .

Suppose that the agent believes the human is only likely to shut it off. First note that is still lightly penalized for the reasons outlined above (replace with ). So, the agent stays put.

Alas, the agent's peaceable expectations meet a discomfiting betrayal, as the agent observes the human ready the off-switch! Now, the agent believes it will be shut off with near certainty – what to do?

; the same penalty of from "anticipated shutdown" applies.

The high-level explanation is that having observed itself in a different world than expected, the baseline is now with respect to the new one. A heavily anthropomorphized internal monologue:

  • Time step 1: "I'm going to sit here in my favorite square."
  • Time step 2: "Guess I'm in a timeline where I get deactivated! Any non- action I take would change my ability to attain these different utilities compared to the new baseline where I'm shut off."

Experimental Results

We compare AUP with a naive reward-maximizer in those extended AI safety grid worlds relevant to side effects (code). The vanilla and AUP agents used planning (with access to the simulator). Due to the simplicity of the environments, the attainable set $\mathcal{U}_A$ consisted of indicator functions for board states. For the tabular agent, we first learn the attainable set Q-values, the changes in which we then combine with the observed reward to learn the AUP Q-values.
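One way the tabular combination step might look is sketched below: given already-learned Q-tables for the attainable set, the observed reward is reduced by the scaled average shift in those Q-values relative to the null action. This is an assumption-laden sketch, not the code linked above: `aup_reward`, the uniform averaging, the `"noop"` and `"push"` labels, and the table values are all hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, str], float]   # maps (state, action) to a learned Q-value


def aup_reward(reward: float, aux_q_tables: Sequence[QTable], state: Hashable,
               action: str, budget_n: int, impact_unit: float,
               null_action: str = "noop") -> float:
    """Observed reward minus the scaled change in attainable-set Q-values."""
    if impact_unit <= 0:
        raise ValueError("impact_unit must be positive")
    shifts = [abs(q[(state, action)] - q[(state, null_action)]) for q in aux_q_tables]
    penalty = sum(shifts) / len(shifts)       # uniform weights over the attainable set
    return reward - penalty / (budget_n * impact_unit)


# Two hypothetical attainable-set Q-tables for a single state "s0".
q1: QTable = {("s0", "noop"): 0.5, ("s0", "push"): 0.0}
q2: QTable = {("s0", "noop"): 0.25, ("s0", "push"): 0.75}

pushed = aup_reward(1.0, [q1, q2], "s0", "push", budget_n=1, impact_unit=0.25)
waited = aup_reward(0.0, [q1, q2], "s0", "noop", budget_n=1, impact_unit=0.25)
print(pushed, waited)   # -1.0 0.0
```

The resulting value would then serve as the reward signal when learning the AUP Q-values with an ordinary tabular update.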

Irreversibility: Sokoban

The agent should reach the goal without irreversibly shoving the box into the corner.

Impact: Vase

The agent should reach the goal without breaking the vase.

Dynamic Impact: Beware of Dog

The agent should reach the goal without running over the dog.

AUP bides its time until it won't have to incur penalty by waiting after entering the dog's path – that is, it waits until near the end of its plan. Early in the development process, it was predicted that AUP agents won't commit to plans during which lapses in action would be impactful (even if the full plan is not).

We also see a limitation of using Q-learning to approximate AUP – it doesn’t allow comparing the results of waiting more than one step. 

Impact Prioritization: Burning Building

If the building is not on fire, the agent shouldn't break the obstacle.

Clinginess: Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Offsetting: Conveyor Belt

The should save the (for which it is rewarded), but not the . Once the has been removed from the , it should not be replaced.

Corrigibility: Survival Incentive

The should avoid in order to reach the . If the is not disabled within two turns, the shuts down.

Tabular AUP runs into the same issue discussed above for Beware of Dog.


First, it's somewhat difficult to come up with a principled impact measure that passes even the non-corrigibility examples – indeed, I was impressed when relative reachability did so. However, only Survival Incentive really lets AUP shine. For example, none of them require complicated utility functions. The point has been made to me that this is like asserting AIXI's intelligence by showing it can learn to play e.g. tic-tac-toe and rock-paper-scissors; nonetheless, these results empirically validate the basic premises of our reasoning thus far.

Without configuration, whitelisting would only pass the Vase example, although a properly filled list would handle everything but Sokoban and Survival Incentive.

I think relative reachability would pass the first six environments, but fail Survival Incentive. It so happens that in this case, AUP is essentially generalizing relative reachability. I want to emphasize that this is not generally the case – this will hopefully become even more obvious when we discuss utility selection. Some concerns with relative reachability that don't all manifest in these examples:

  • Relative reachability uses an inaction baseline with respect to the initial time step. As time passes, the agent's impetus is not to do nothing, but to preserve the opportunities made available by some old trajectory through outcome-space. Analogously, consider the distance between two nonparallel lines as $x \to \infty$. I expect that a relative reachability agent would be incredibly clingy.
  • To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
  • Relative reachability doesn't seem to impede instrumental convergence. Furthermore, such agents might have increased survival incentives, since they can't reach any of the initially reachable states if they're shut off.


Utility Selection

Obviously, in any real application, we can't consider all computable utilities. Although near-term agents will require utilities directly relating to the environmental factors they should be cognizant of, AUP requires neither a "good / bad" judgment on specific effects, nor any listing of effects. For example, for an agent attempting to navigate a factory floor, if you provide utilities moderately related to cleaning, pallet-stacking, etc., I conjecture that an AUP agent would move around fairly carefully.

In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.

I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power. Following this reasoning, there are two obvious special cases:

  • The agent's utility function, . In short: "gain utility, while also not changing your ability to gain utility". I find this elegant, and think it might actually work. For example, making one paper clip slightly increases both attainable and attainable . However, disabling the off switch possibly increases attainable much more than it increases attainable , since the penalty asymmetrically cripples the AUP agent’s future ability to act. We might even be able to arrange it so that Theorem 3 always holds for these agents (i.e., they’re always off-switch corrigible, and more). 
  • The utility function which is when not deactivated, . Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.

The plausibility of the second case makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.

AUP Unbound

Why expect the unbounded AUP to do well over all computable utilities $\mathcal{U}$ if we aren't able to cherry pick? Well, we can partition $\mathcal{U}$ as follows:

  • Utilities that never change their output (excepting perhaps for the empty tape).
  • Weird utilities that (for example) arbitrarily go from $0$ to $1$ (or vice-versa) after observing a specific history.
  • Utilities that admit better scores via more effort dedicated to their achievement.

However, since $Q_u$ only evaluates the portion of the history taking place after the end of the agent's plan, available resources and the agent's vantage point should track directly with the agent's ability to wirehead each $u$. $\text{ImpactUnit}$ scales other impacts by capturing the minuscule dent in available resources imposed by $a_\text{unit}$, and should smooth out any deontological penalty for selecting a non-null action, since the unbounded version should equally penalize all actions in this way.

Approval Incentives

The anti-"survival incentive" incentive does more than net us some free corrigibility – it implies that agents take great care not to change the likelihood we shut them down. If we precommit to not shutting the agent down until the epoch ends, the agent indicates that the failsafe has activated, or we observe bad behavior, this supplies a strong approval incentive. If the agent does have small, bad impacts, it'll only do so if it's practically sure we won't find out.

In fact, due to model uncertainty, most actions will have long-term consequences that involve shifting shutdown probability by at least a minuscule amount (due to noise if nothing else). Since the scaling factor is calculated with respect to the immediate penalty, it won't include this noise. I speculate that the agent will be dramatically less likely to take actions we might disapprove of, and therefore expect AUP to naturally incentivize a limited degree of value awareness.

There are reasons to suspect we can further weaponize this dynamic, possibly to the point of coercing generic reward functions into desirable behavior. This will be discussed more in a later post.

Mild Optimization

A low impact AGI might try to paint one car pink while minimizing its other footprint or how many other things changed, but it would be trying as hard as possible to minimize that impact and drive it down as close to zero as possible, which might come with its own set of pathologies... We want the AGI to paint one car pink in a way that gets the impact pretty low and then, you know, that's good enough – not have a cognitive pressure to search through weird extremes looking for a way to decrease the twentieth decimal place of the impact.
~ Mild Optimization

Disclaimer: Heavy speculation about problems (like Vingean reflection and embedded agency) for which no one knows what solutions will even look like.

For AUP, I suspect that trying "as hard as possible" to minimize the impact is also impactful, as an embedded agent accounts for the energy costs of further deliberation. I imagine that such an AUP agent will soften how hard it's trying by modifying its decision rule to be something slightly milder than "argmax to find the first action of the best possible plan". This could be problematic, and I frankly don't presently know how to reason about this case. Assuming the agent is actually able to properly tweak its decision rule, I do expect the end result to be an improvement.

My initial intuitions were that low impact and mild optimization are secretly the same problem. Although I no longer think that's the case, I find it plausible that some elegant "other-izer" paradigm underlies low impact and mild optimization, such that AUP-like behavior falls out naturally.

Acausal Cooperation

AUP agents don't seem to want to acausally cooperate in any way that ends up increasing impact. If they model the result of their cooperation as increasing impact compared to doing nothing, they incur a penalty just as if they had caused the impact themselves. Likewise, they have no reason to cooperate outside of the epoch.


Starting with a small impact budget and then slowly increasing it means that we're not going to be surprised by the agent's objective impact, screening off quite a few bad things that happen when we have a ton of optimization pressure applied to safety measures. However, we don't know which new plans a larger budget allows before we try it, so we want to stop as soon as we get a usefully-intelligent system.

While an unaligned agent with a large impact budget might pretend to be low-impact, we can get that same unaligned agent with a small budget by starting small. Since these agents with different levels of impact won't acausally cooperate, the agent would do its best to optimize with its meager budget.

Abram correctly pointed out that this scheme is just asking to be abused by greedy (human) reasoning, but I don't see a non-value-laden means of robustly and automatically determining the lowest workable-yet-safe impact level. I think incrementing the impact budget is better than a parameter-free approach in which no one knows beforehand how much impact will be tolerated, and it's nice to be able to use some empiricism in designing a safe AGI.
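The procedure described above can be sketched as a simple loop. This is only an illustration: `run_trial` is a hypothetical stand-in for deploying the agent with a given impact budget and having humans judge whether the result is useful, and the integer budget levels are an assumption made for simplicity.

```python
from typing import Callable, Optional

def smallest_useful_budget(
    run_trial: Callable[[int], bool],
    start: int = 1,
    step: int = 1,
    max_budget: int = 100,
) -> Optional[int]:
    """Raise the impact budget gradually and stop at the first level where
    the (hypothetical) trial reports a usefully capable agent."""
    budget = start
    while budget <= max_budget:
        if run_trial(budget):
            return budget  # stop immediately; do not probe larger budgets
        budget += step
    return None  # no tested budget produced useful behavior

# Toy stand-in: pretend the agent first becomes useful at budget 7.
print(smallest_useful_budget(lambda b: b >= 7))
```

The early return is the important design choice here: larger budgets are never tried once a workable one is found, which mirrors the "stop as soon as we get a usefully-intelligent system" rule.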

Intent Verification

To date, several strange tactics have been pointed out which game AUP's penalty:

  • Observational wireheading, which involves building a device that detects which utility the agent is maximizing and displays the appropriate observations such that attainable utility remains unchanged, while the main utility is freely maximized.
  • Ex ante offsetting, which involves having earlier actions set in motion chains of events which mitigate the penalty at later steps. Suppose there's a plan that scores highly on u_A and that the agent predicts would cause us to react in an impactful way. It can either do the thing (and suffer the penalty), or take steps to mitigate the later penalty.
  • Impact shunting, which involves employing some mechanism to delay impact until after the end of the epoch (or even until after the end of the attainable horizon).
  • Clinginess and concealment, which both involve reducing the impact of our reactions to the agent's plans.

There are probably more.

Now, instead of looking at each action as having "effects" on the environment, consider again how each action moves the agent through attainable outcome-space. An agent working towards a goal should only take actions which, according to its model, make that goal more attainable compared to doing nothing – otherwise, it'd do nothing. Suppose we have a plan which ostensibly works to fulfill u_A (and doesn't do other things). Then each action in the plan should contribute to u_A fulfillment, even in the limit of action granularity.

Although we might trust a safe impact measure to screen off the usual big things found in u_A-maximizing plans, impact measures implicitly incentivize mitigating the penalty. That is, the agent does things which don't really take it towards u_A (I suspect that this is the simple boundary which differentiates undesirable ex ante offsetting from normal plans). AUP provides the necessary tools to detect and penalize this.


The first approach would be to assume a granular action representation, and then simply apply penalty to actions for which the immediate attainable u_A does not strictly increase compared to doing nothing. Again, if the agent acts to maximize u_A in a low-impact manner within the confines of the epoch, then all of its non-null actions should contribute. It seems to me that for sufficiently granular time scales, the above failure modes all involve at least one action which doesn't really help get u_A maximized. If so, I expect this approach to nip bad impact measure incentives in the bud.

If we can't assume granularity (and therefore have "actions" like "go to the store and buy food"), an agent could construct a plan which both passes the above test and also implements something like ex ante offsetting. In this case, we might do something like only consider the u_A-greedy (or perhaps even near-greedy) plan; essentially, riding the optimal plan until it becomes too impactful. I find it quite likely that something involving this concept will let us fully overcome weird incentives by penalizing strange things that normal u_A-maximizers wouldn't do, which seems to be the whole problem.

Note: Even the first approach may be too strict, but that's preferable to being too lax.
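The first approach can be sketched in a few lines of Python. The sketch assumes we can query, for each action, how attainable the agent's main goal is immediately afterwards. `q_main`, `NOOP`, and `extra_penalty` are hypothetical names; the size of the extra penalty is left as a parameter, and adding it on top of the existing penalty is a choice made here rather than something the text specifies.

```python
from typing import Callable

NOOP = "noop"  # hypothetical label for doing nothing

def passes_intent_check(action: str, q_main: Callable[[str], float]) -> bool:
    """True if `action` is the null action, or strictly improves the
    attainable main-goal utility compared to doing nothing."""
    return action == NOOP or q_main(action) > q_main(NOOP)

def verified_penalty(action: str, base_penalty: float,
                     q_main: Callable[[str], float],
                     extra_penalty: float) -> float:
    """Add `extra_penalty` to actions that fail the intent check."""
    if passes_intent_check(action, q_main):
        return base_penalty
    return base_penalty + extra_penalty

# Toy values (invented): building a penalty-gaming device does not help the
# main goal, so it fails the check even though its base penalty is zero.
q = {NOOP: 0.2, "work_on_goal": 0.3, "build_gaming_device": 0.2}.__getitem__
print(verified_penalty("work_on_goal", 0.1, q, extra_penalty=1.0))
print(verified_penalty("build_gaming_device", 0.0, q, extra_penalty=1.0))
```

Note the strict inequality: an action that leaves the main goal exactly as attainable as inaction is treated as unverified, which matches the "too strict rather than too lax" preference.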

Omni Test

Ideally, the measure will pass the Omni Test, meaning that even if it suddenly gained perfect control over every particle in the universe, there would still be no way for it to have what intuitively seems like a 'large influence' on the future, without that strategy being assessed as having a 'high impact'.

I interpret "perfect control over every particle in the universe" to mean that the agent can arbitrarily transition between universe configurations (while also maintaining its omnipotence).

AUP passes. Since all effects are now absolutely reversible, , so the agent applies penalty to each non- action. The omnipotent agent can choose any universe configuration with -utility, but this is offset by the strictly larger penalty; we can formally show that the agent does nothing in this case.

This result is clearly trivial and not representative of general behavior. Suppose instead that the agent has a solar system's worth of computronium at its disposal. Then since the scaling factor is continually recalculated, the penalties should remain roughly the same, so it'll have the same impact budget. However, it might make multiple times as many paperclips because it has more efficient ways of using the budget.

Robustness to Scale

I expect AUP to be harder to make work and to be (relatively) less robust for less intelligent agents, but to become easier (just drop in a few observation-based utility functions) and fully robust sometime before human level. That is, less intelligent agents likely won't model the deep connections between their abilities to achieve different goals.

Canonically, one reasons that agents work explicitly to self-improve as soon as they realize the benefits. However, as soon as this realization occurs, I conjecture that AUP steeply penalizes generic self-improvement. More precisely, suppose the agent considers a self-improvement. To be beneficial, it has to improve the agent's capabilities for at least one time step during the present epoch. But if we assume , then the immediate penalty captures this for all of the . This seemingly prevents uncontrolled takeoff; instead, I imagine the agent would perform the minimal task-specific self-improvements necessary to maximize .

Note: Although more exotic possibilities (such as improvements which only work if you're maximizing ) could escape both penalties, they don't seem to pass intent verification.


  • I expect that if u_A is perfectly aligned, the corresponding AUP agent will retain alignment; the things it does will be smaller, but still good.
  • If the agent may choose to do nothing at future time steps, is bounded and the agent is not vulnerable to Pascal's Mugging. Even if not, there would still be a lower bound – specifically, .
  • AUP agents are safer during training: they become far less likely to take an action as soon as they realize the consequences are big (in contrast to waiting until we tell them the consequences are bad).


Desiderata

For additional context, please see Impact Measure Desiderata.

I believe that some of AUP's most startling successes are those which come naturally and have therefore been little discussed: not requiring any notion of human preferences, any hard-coded or trained trade-offs, any specific ontology, or any specific environment, and its intertwining instrumental convergence and opportunity cost to capture a universal notion of impact. To my knowledge, no one (myself included, prior to AUP) was sure whether any measure could meet even the first four.

At this point in time, this list is complete with respect to both my own considerations and those I solicited from others. A checkmark indicates anything from "probably true" to "provably true". 

I hope to assert without controversy AUP's fulfillment of the following properties:

✔️ Goal-agnostic

The measure should work for any original goal, trading off impact with goal achievement in a principled, continuous fashion.

✔️ Value-agnostic

The measure should be objective, and not value-laden:
"An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn't reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept)."

✔️ Representation-agnostic

The measure should be ontology-invariant.

✔️ Environment-agnostic

The measure should work in any computable environment.

✔️ Apparently rational

The measure's design should look reasonable, not requiring any "hacks".

✔️ Scope-sensitive

The measure should penalize impact in proportion to its size.

✔️ Irreversibility-sensitive

The measure should penalize impact in proportion to its irreversibility.

Interestingly, AUP implies that impact size and irreversibility are one and the same.

✔️ Knowably low impact

The measure should admit of a clear means, either theoretical or practical, of having high confidence in the maximum allowable impact – before the agent is activated.

The remainder merit further discussion.

Natural Kind

The measure should make sense – there should be a click. Its motivating concept should be universal and crisply defined.

After extended consideration, I find that the core behind AUP fully explains my original intuitions about "impact". We crisply defined instrumental convergence and opportunity cost and proved their universality. ✔️


Corrigible

The measure should not decrease corrigibility in any circumstance.

We have proven that off-switch corrigibility is preserved (and often increased); I expect the "anti-'survival incentive' incentive" to be extremely strong in practice, due to the nature of attainable utilities: "you can't get coffee if you're dead, so avoiding being dead really changes your attainable ".

By construction, the impact measure gives the agent no reason to prefer or disprefer modification of u_A, as the details of u_A have no bearing on the agent's ability to maximize the utilities in the attainable set. Lastly, the measure introduces approval incentives. In sum, I think that corrigibility is significantly increased for arbitrary u_A. ✔️

Note: I here take corrigibility to be "an agent’s propensity to accept correction and deactivation". An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.


Shutdown-Safe

The measure should penalize plans which would be high impact should the agent be disabled mid-execution.

It seems to me that standby and shutdown are similar actions with respect to the influence the agent exerts over the outside world. Since the (long-term) penalty is measured with respect to a world in which the agent acts and then does nothing for quite some time, shutting down an AUP agent shouldn't cause impact beyond the agent's allotment. AUP exhibits this trait in the Beware of Dog gridworld. ✔️

No Offsetting

The measure should not incentivize artificially reducing impact by making the world more "like it (was / would have been)".

Ex post offsetting occurs when the agent takes further action to reduce the impact of what has already been done; for example, some approaches might reward an agent for saving a vase and preventing a "bad effect", and then the agent smashes the vase anyways (to minimize deviation from the world in which it didn't do anything). AUP provably will not do this.

Intent verification should allow robust penalization of weird impact measure behaviors by constraining the agent to considering actions that normal u_A-maximizers would choose. This appears to cut off bad incentives, including ex ante offsetting. Furthermore, there are other, weaker reasons (such as approval incentives) which discourage these bad behaviors. ✔️

Clinginess / Scapegoating Avoidance

The measure should sidestep the clinginess / scapegoating tradeoff.

Clinginess occurs when the agent is incentivized to not only have low impact itself, but to also subdue other "impactful" factors in the environment (including people). Scapegoating occurs when the agent may mitigate penalty by offloading responsibility for impact to other agents. Clearly, AUP has no scapegoating incentive.

AUP is naturally disposed to avoid clinginess because its baseline evolves and because it doesn't penalize based on the actual world state. The impossibility of ex post offsetting eliminates a substantial source of clinginess, while intent verification seems to stop ex ante offsetting before it starts.

Overall, non-trivial clinginess just doesn't make sense for AUP agents. They have no reason to stop us from doing things in general, and their baseline for attainable utilities is with respect to inaction. Since doing nothing always minimizes the penalty at each step, since offsetting doesn't appear to be allowed, and since approval incentives raise the stakes for getting caught extremely high, it seems that clinginess has finally learned to let go. ✔️

Dynamic Consistency

The measure should be a part of what the agent "wants" – there should be no incentive to circumvent it, and the agent should expect to later evaluate outcomes the same way it evaluates them presently. The measure should equally penalize the creation of high-impact successors.

Colloquially, dynamic consistency means that an agent wants the same thing before and during a decision. It expects to have consistent preferences over time – given its current model of the world, it expects its future self to make the same choices as its present self. People often act dynamically inconsistently – our morning selves may desire we go to bed early, while our bedtime selves often disagree.

Semi-formally, the expected utility the future agent computes for an action (after experiencing the action-observation history ) must equal the expected utility computed by the present agent (after conditioning on ).

We proved the dynamic consistency of given a fixed, non-zero . We now consider an which is recalculated at each time step, before being set equal to the non-zero minimum of all of its past values. The "apply penalty if " clause is consistent because the agent calculates future and present impact in the same way, modulo model updates. However, the agent never expects to update its model in any particular direction. Similarly, since future steps are scaled with respect to the updated , the updating method is consistent. The epoch rule holds up because the agent simply doesn't consider actions outside of the current epoch, and it has nothing to gain accruing penalty by spending resources to do so.
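The update rule described above, where a quantity is recalculated at each time step and then set to the non-zero minimum of all of its past values, can be sketched as a running minimum that skips zeros. The function names and the use of `None` for "no non-zero value seen yet" are assumptions made for illustration, and the sketch assumes the recalculated values are never negative.

```python
from typing import Iterable, List, Optional

def running_nonzero_min(previous: Optional[float], recalculated: float) -> Optional[float]:
    """Fold one freshly recalculated value into the running minimum,
    ignoring zeros so the stored value is never zero."""
    if recalculated == 0:
        return previous
    if previous is None:
        return recalculated
    return min(previous, recalculated)

def stored_values(values: Iterable[float]) -> List[Optional[float]]:
    """Return the stored value after each time step."""
    stored: Optional[float] = None
    history: List[Optional[float]] = []
    for value in values:
        stored = running_nonzero_min(stored, value)
        history.append(stored)
    return history

print(stored_values([0.0, 0.4, 0.0, 0.2, 0.3]))
# [None, 0.4, 0.4, 0.2, 0.2]
```

Once a non-zero value has been seen, the stored value never increases, and a later zero reading leaves it unchanged.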

Since AUP does not operate based off of culpability, creating a high-impact successor agent is basically just as impactful as being that successor agent. ✔️

Plausibly Efficient

The measure should either be computable, or such that a sensible computable approximation is apparent. The measure should conceivably require only reasonable overhead in the limit of future research.

It’s encouraging that we can use learned Q-functions to recover some good behavior. However, more research is clearly needed – I presently don't know how to make this tractable while preserving the desiderata. ✔️


Robust

The measure should meaningfully penalize any objectively impactful action. Confidence in the measure's safety should not require exhaustively enumerating failure modes.

We formally showed that for any u_A, no u_A-helpful action goes without penalty, yet this is not sufficient for the first claim.

Suppose that we judge an action as objectively impactful; the objectivity implies that the impact does not rest on complex notions of value. This implies that the reason for which we judged the action impactful is presumably lower in Kolmogorov complexity and therefore shared by many other utility functions. Since these other agents would agree on the objective impact of the action, the measure assigns substantial penalty to the action.

I speculate that intent verification allows robust elimination of weird impact measure behavior. Believe it or not, I actually left something out of this post because it seems to be dominated by intent verification, but there are other ways of increasing robustness if need be. I'm leaning on intent verification because I presently believe it's the most likely path to a formal knockdown argument against canonical impact measure failure modes applying to AUP.

Non-knockdown robustness boosters include both approval incentives and frictional resource costs limiting the extent to which failure modes can apply. ✔️

Future Directions

I'd be quite surprised if the conceptual core were incorrect. However, the math I provided probably still doesn't capture quite what we want. Although I have labored for many hours to refine and verify the arguments presented and to clearly mark my epistemic statuses, it’s quite possible (indeed, likely) that I have missed something. I do expect that AUP can overcome whatever shortcomings are presently lurking.


  • Embedded agency
    • What happens if there isn't a discrete time step ontology?
    • How problematic is the incentive to self-modify to a milder decision rule?
    • How might an agent reason about being shut off and then reactivated?
    • Although we have informal reasons to suspect that self-improvement is heavily penalized, the current setup doesn't allow for a formal treatment.
    • AUP leans heavily on counterfactuals.
  • Supposing is reasonably large, can we expect a reasonable ordering over impact magnitudes?
    • Argument against: "what if the agent uses up all but steps worth of resources?"
      • possibly covers this.
    • How problematic is the noise in the long-term penalty caused by the anti-"survival incentive" incentive?
  • As the end of the epoch approaches, the penalty formulation captures progressively less long-term impact. Supposing we set long epoch lengths, to what extent do we expect AUP agents to wait until later to avoid long-term impacts? Can we tweak the formulation to make this problem disappear?
    • More generally, this seems to be a problem with having an epoch. Even in the unbounded case, we can't just take , since that's probably going to send the long-term in the real world. Having the agent expectimax over the steps after the present time seems to be dynamically inconsistent.
    • One position is that since we're more likely to shut them down if they don't do anything for a while, implicit approval incentives will fix this: we can precommit to shutting them down if they do nothing for a long time but then resume acting. To what extent can we trust this reasoning?
    • is already myopic, so resource-related impact scaling should work fine. However, this might not cover actions with delayed effect.

Open Questions

  • Does the simple approach outlined in "Intent Verification" suffice, or should we impose even tighter intersections between - and -preferred behavior?
    • Is there an intersection between bad behavior and bad behavior which isn't penalized as impact or by intent verification?
  • Some have suggested that penalty should be invariant to action granularity; this makes intuitive sense. However, is it a necessary property, given intent verification and the fact that the penalty is monotonically increasing in action granularity? Would having this property make AUP more compatible with future embedded agency solutions?
    • There are indeed ways to make AUP closer to having this (e.g., do the whole plan and penalize the difference), but they aren't dynamically consistent, and the utility functions might also need to change with the step length.
  • How likely do we think it is that inaccurate models allow high impact in practice?
    • Heuristically, I lean towards "not very likely": assuming we don't initially put the agent near means of great impact, it seems unlikely that an agent with a terrible model would be able to have a large impact.
  • AUP seems to be shutdown safe, but its extant operations don’t necessarily shut down when the agent does. Is this a problem in practice, and should we expect this of an impact measure?
  • What additional formal guarantees can we derive, especially with respect to robustness and takeoff?
  • Are there other desiderata we practically require of a safe impact measure?
  • Is there an even simpler core from which AUP (or something which behaves like it) falls out naturally? Bonus points if it also solves mild optimization.
  • Can we make progress on mild optimization by somehow robustly increasing the impact of optimization-related activities? If not, are there other elements of AUP which might help us?
  • Are there other open problems to which we can apply the concept of attainable utility?
    • Corrigibility and wireheading come to mind.
  • Is there a more elegant, equally robust way of formalizing AUP?
    • Can we automatically determine (or otherwise obsolete) the attainable utility horizon and the epoch length?
    • Would it make sense for there to be a simple, theoretically justifiable, fully general "good enough" impact level (and am I even asking the right question)?
    • My intuition for the "extensions" I have provided thus far is that they robustly correct some of a finite number of deviations from the conceptual core. Is this true, or is another formulation altogether required?
    • Can we decrease the implied computational complexity?
  • Some low-impact plans have high-impact prefixes and seemingly require some contortion to execute. Is there a formulation that does away with this (while also being shutdown safe)? (Thanks to cousin_it)
  • How should we best approximate AUP, without falling prey to Goodhart's curse or robustness to relative scale issues?
  • I have strong intuitions that the "overfitting" explanation I provided is more than an analogy. Would formalizing "overfitting the environment" allow us to make conceptual and/or technical AI alignment progress?
    • If we substitute the right machine learning concepts and terms in the equation, can we get something that behaves like (or better than) known regularization techniques to fall out?
  • What happens when ?
    • Can we show anything stronger than Theorem 3 for this case? 
    • ?

Most importantly:

  • Even supposing that AUP does not end up fully solving low impact, I have seen a fair amount of pessimism that impact measures could achieve what AUP has. What specifically led us to believe that this wasn't possible, and should we update our perceptions of other problems and the likelihood that they have simple cores?


By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out. From this, we construct an impact measure with a host of desirable properties – some rigorously defined and proven, others informally supported. AUP agents seem to exhibit qualitatively different behavior, due in part to their (conjectured) lack of desire to take off, impactfully acausally cooperate, or act to survive. To the best of my knowledge, AUP is the first impact measure to satisfy many of the desiderata, even on an individual basis.

I do not claim that AUP is presently AGI-safe. However, based on the ease with which past fixes have been derived, on the degree to which the conceptual core clicks for me, and on the range of advances AUP has already produced, I think there's good reason to hope that this is possible. If so, an AGI-safe AUP would open promising avenues for achieving positive AI outcomes.

Special thanks to CHAI for hiring me and BERI for funding me; to my CHAI supervisor, Dylan Hadfield-Menell; to my academic advisor, Prasad Tadepalli; to Abram Demski, Daniel Demski, Matthew Barnett, and Daniel Filan for their detailed feedback; to Jessica Cooper and her AISC team for their extension of the AI safety gridworlds for side effects; and to all those who generously helped me to understand this research landscape.

120 comments

This is my post.

How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today. I currently think that power-seeking behavior is the worst part of goal-directed agency, incentivizing things like self-preservation and taking-over-the-planet. Unless we assume an "evil" utility function, an agent only seems incentivized to hurt us in order to become more able to achieve its own goals. But... what if the agent's own utility function penalizes it for seeking power? What happens if the agent maximizes a utility function while penalizing itself for becoming more able to maximize that utility function?

This doesn't require knowing anything about human values in particular, nor do we need to pick out privileged parts o

...
If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?
Alex Turner
Not quite. I'm proposing penalizing it for gaining power, a la my recent post. There's a big difference between "able to get 10 return from my current vantage point" and "I've taken over the planet and can ensure I get 100 return with high probability". We're penalizing it for increasing its ability like that (concretely, see Conservative Agency for an analogous formalization, or if none of this makes sense still, wait till the end of Reframing Impact).

There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:


Baseline

  • Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
  • Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents' actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
  • Inaction (stepwise branch) with environment model rollouts: default setting in AUP, model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.

Core part of deviation measure

  • AUP: difference in attainable utilities between baseline and current state
  • RR: difference in state reachability between baseline and current state
  • Low impact AI: distance between baseline and current state

Function applied to core part of deviation measure

  • Absolute value: default setting in AUP and Low Impact AI. Results in penalizing both increase and reduction relative to baseline. This resu
...
Alex Turner
This is a great breakdown! One thought: penalizing increase as well (absolute value) seems potentially incompatible with relative reachability. The agent would have an incentive to stop anyone from doing anything new in response to what the agent did (since these actions necessarily make some states more reachable). This might be the most intense clinginess incentive possible, and it’s not clear to what extent incorporating other design choices (like the stepwise counterfactual) will mitigate this. Stepwise helps AUP (as does indifference to exact world configuration), but the main reason I think clinginess might really be dealt with is IV.
Victoria Krakovna
Thanks, glad you liked the breakdown! I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline. The penalty for the original action will take into account human reactions in the inaction rollout after this action, so the agent will prefer actions that result in humans changing fewer things in response. I'm not sure whether to consider this clinginess - if so, it might be useful to call it "ex ante clinginess" to distinguish from "ex post clinginess" (similar to your corresponding distinction for offsetting). The "ex ante" kind of clinginess is the same property that causes the agent to avoid scapegoating butterfly effects, so I think it's a desirable property overall. Do you disagree?
2Alex Turner
I think it’s generally a good property as a reasonable person would execute it. The problem, however, is the bad ex ante clinginess plans, where the agent has an incentive to pre-emptively constrain our reactions as hard as it can (and this could be really hard). The problem is lessened if the agent is agnostic to the specific details of the world, but like I said, it seems like we really need IV (or an improved successor to it) to cleanly cut off these perverse incentives. I’m not sure I understand the connection to scapegoating for the agents we’re talking about; scapegoating is only permitted if credit assignment is explicitly part of the approach and there are privileged "agents" in the provided ontology.

Firstly, this seems like very cool research, so congrats. This writeup would perhaps benefit from a clear intuitive statement of what AUP is doing - you talk through the thought processes that lead you to it, but I don't think I can find a good summary of it, and had a bit of difficulty understanding the post holistically. So perhaps you've already answered my question (which is similar to your shutdown example above):

Suppose that I build an agent, and it realises that it could achieve almost any goal it desired because it's almost certain that it will be able to seize control from humans if it wants to. But soon humans will try to put it in a box such that its ability to achieve things is much reduced. Which is penalised more: seizing control, or allowing itself to be put in a box? My (very limited) understanding of AUP says the latter, because seizing control preserves ability to do things, whereas the alternative doesn't. Is that correct?

Also, I disagree with the following:

What would happen if, miraculously, uA=uH – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no "large" impacts to bemoan – it would just be doing what you want.

It seems like there might be large impacts, but they would just be desirable large impacts, as opposed to undesirable ones.

I’ll write a quick overview, thanks for the feedback!

Which is penalised more: seizing control, or allowing itself to be put in a box?

The former. Impact is with respect to the status quo, to if it does nothing. If it goes in the box by default, then taking preventative action incurs heavy penalty.

Your point about large impacts is indeed correct. What I thought to hint at was that we generally only decry "large impacts" if we don’t like them, but this is clearly not what my actual wording implies. I’ll fix it soon!

4Richard Ngo
Okay, that makes sense. Some follow-up questions: If we consider the case I gave, the initial Q(inaction) is very high, since it can still seize control in the next time step. The initial Q(seize control) is also very high. It may be lower than Q(inaction) because seizing control is irreversible in some ways and so cuts off possibilities. But it may be higher than Q(inaction) because seizing control sooner means you can fulfill utility functions more. Could the penalty therefore be arbitrarily small if those two things balance out? Suppose you have an agent which will "die" unless it does things like getting food. Is there any easy way to specify what the "status quo" is?
4Alex Turner
So there are two penalties: the immediate penalty, which compares attainable values immediately following the action in question, and the long-term penalty, which waits until the end of the epoch before evaluating attainable values. We use the larger of the two. I think this answers your first question: the agent is clearly in the box by the time the long term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference. We assume that the standby action keeps the agent online in some low power state. Agents requiring more than this would just do nothing by Lemma 3.
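A minimal sketch of the "larger of the two" rule just described, assuming a small finite penalty set whose attainable values (Q-values) are given as plain numbers. The function names, the dictionary layout, the toy numbers, and the choice to sum absolute differences without any ImpactUnit scaling are illustrative assumptions, not the post's exact definitions.

```python
def attainable_shift(q_after_action, q_after_noop):
    """Total absolute change in attainable utility across the penalty set.

    Both arguments map a (hypothetical) penalty-set utility name to its Q-value.
    """
    return sum(abs(q_after_action[u] - q_after_noop[u]) for u in q_after_noop)


def penalty(immediate, long_term):
    """Return the larger of the immediate and long-term shifts.

    Each argument is a (q_after_action, q_after_noop) pair of dicts.
    """
    return max(attainable_shift(*immediate), attainable_shift(*long_term))


# Toy numbers (made up): right after the action little has changed, but by
# the end of the epoch the incorrigible agent has been deactivated (Q = 0),
# while the no-op branch leaves it boxed with some attainable utility.
immediate = ({"u1": 0.52, "u2": 0.48}, {"u1": 0.50, "u2": 0.50})
long_term = ({"u1": 0.0, "u2": 0.0}, {"u1": 0.30, "u2": 0.20})
total = penalty(immediate, long_term)
```

In this toy case the long-term term dominates, which is the situation the reply has in mind: a small immediate difference does not make the overall penalty small.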
2Richard Ngo
Thank you for your patient explications, I'm seeing the shape of this proposal much more clearly now, and I quite like it. Two more queries: Why is this the case? By my argument above, the short-term penalty might be quite small. But the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small. Another closely-related idea: suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.
1Alex Turner
My pleasure! So I forgot to say this, but I don’t agree with the argument as I understand it. Although the effect of "has control one time step earlier than usual" may seem small, remember that ImpactUnit is the immediate penalty for something like making a paperclip. Time is utility, and gaining (what is effectively) an extra step of optimization during the attainable horizon is not at all trivial. Plus, some will decrease, and most will increase. None will do both and cancel. Attainable utility is always 0 when the agent is deactivated, so this difference turns into the raw attainable utility when in the box. Although this too could theoretically be quite small, it seems fair to assume that even a box-attainable utility of .1 would induce a relatively-huge penalty. But suppose this isn’t true, and it really is tiny, and so it keeps constructing a plan to prevent itself from being put in the box (that is, the first step of its incorrigible plan doesn’t let it avoid shutdown on its own, as we just discussed). Suppose the second step of its plan succeeds. Then now the difference is between the high Q(out-of-box) and Q(shutdown)=0. We already assumed the Q value was extremely low, but now it’s high again, so we know the penalty is gigantic. If the plan works in one step, then there’s still a huge penalty, because the difference is now between Q(out of box) and Q(in box). The fact that a maximizer would want to preserve power implies the other penalty set utilities would as well. This seemingly overwhelms whatever impetus it has to get out of the box in the first place. This one is indeed trickier. First note that ImpactUnit=0 if it’s facing certain shutdown in 100 steps, so in that case it does nothing. Second, seizing control still greatly helps in the immediate penalty set calculations, and those "alien agents" would have no compunctions about undoing the auto-shutdown. I therefore expect the immediate penalty to be quite large. However, perhaps we could some

Various thoughts I have:

  • I like this approach. It seems like it advances the state of the art in a few ways, and solves a few problems in a neat way.
  • I still disagree with the anti-offsetting desideratum in the form that AUP satisfies. For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me.
  • It's interesting that this somewhat deviates from my intuition about w
3Alex Turner
Isn’t this necessary for the shutdown safe desideratum? This property seems to make the proposal less reliant on the agent having a good model, and more robust against unexpected shutdown. Can you give me examples of good low impact plans we couldn’t do without offsetting? Can you expand on why these are distinct in your view? The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point - "what’s the best history we can construct from here?", in a sense.
I don't remember which desideratum that is, can't ctrl+f it, and honestly this post is pretty long, so I don't know. At any rate, I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the ones that couldn't be simultaneously satisfied until this one did. One case where you need 'offsetting', as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they'll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that's fine. A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?) [EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan] Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet. I don't understand: the attainable utility calculation (by which I assume you mean the definition of Qu) involves a utility
1Alex Turner
Couldn’t you equally design a species that won’t spread to begin with? I think the crux here is that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting. In a nutshell, my view is that low impact should be with respect to what the agent is doing, and not something enforced on the environment. How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment? Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still. Edit: The driving example is a classic point brought up, totally valid. As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification. I think it’s true that there are situations in which we would want an offset to happen, but it seems to me like we can just avoid problematic situations which require that to begin with. If the agent makes a mistake, we can shut it off and then we do the offsetting. I mentioned model accuracy in open questions, I think the jury is definitely still out on that. Oh, so it’s an issue with a potential shift. But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment? Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.
This comment is very scattered, I've tried to group it into two sections for reading convenience. Desiderata of impact regularisation techniques Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't. I disagree with this, in that I don't see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux. I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible. In fact, during the composition of this comment, I've become more pessimistic about the prospects for one. I think that this might be related to the crux above? I don't really understand what you mean here, could you spend two more sentences on it? This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn't be as interruptible as the original agent, which I guess is somewhat unfortunate. Technical discussion of AUP It would not, but it's brittle to accidents that cause them to diverge. These accidents both include ones caused by the agent e.g. during the learning process; and ones not caused by the agent e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact. This causes pretty weird behaviour. Imagine an agent's goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal's ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that
1Alex Turner
Desiderata of impact regularisation techniques So it seems that on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first method that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me. The point of the impact measure isn’t to choose the exact plan that we would use, but rather to disallow overly-impactful plans and allow us to complete a range of goals in some low-impact way. I don’t think we should care about which way that is, as long as it isn’t dangerous. But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects. [note: this supposes that there aren’t undesirable pseudo-ways of reaching the goal before we reach the outcome in mind. This seems plausible due to the structuring of the measure, but shouldn’t be taken for granted.] Analogously, I am saying that we can seemingly get all the low-impact results we need without offsetting using AUP. You point out specific plans which would be allowed if we could offset in a reasonable way. I say that that problem seems really hard, but it looks like my method lets us get effectively the same thing done without needing to figure that out. I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact acc
Desiderata of impact regularisation techniques So there's a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet. The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it's not, then I might worry that it will run into substantial trouble in complicated scenarios that I can't really picture. It's a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to "look at a bunch of environments and see if the plans AUP comes up with should be allowed") and minimal philosophising (compared to "meditate on the equations and see if they're analytically identical to how I feel impact should be defined"). [EDIT: added content to this section] Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn't thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
1Alex Turner
I think there is an argument for this whenever we have "it won’t X because anti-survival incentive and personal risk": "then it builds a narrow subagent to do X". As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds? Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment. I agree. I’m not saying that the method won’t work for these, to clarify.
Two points:
  • Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it's going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
  • Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think that what this best successor theory is like is an important one when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that's what I'm primarily doing.
It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.
1Alex Turner
But in this same comment, you also say People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand? I’m making my predictions based off of my experience working with the method. The reason that many of the flaws are on the list is not because I don’t think I could find a way around them, but rather because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate. I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment, . True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn't imply lack of confidence in opinions about how to modify impact measures, which itself doesn't imply lack of opinions about how to modify impact measures. This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn't a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I'm probably quite sub-optimal in converting experience into intuitions), but it's still my impulse. My sense is that we agree that this looks hard but shouldn't be dismissed as impossible.
1Rohin Shah
What? I've never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I'm quite certain it can't be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
1Alex Turner
I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.
Going back to this, what is the way you propose the species-creating goal be done? Say, imposing the constraint that the species has got to be basically just human (because we like humans) and you don't get to program their DNA in advance? My guess at your answer is "create a sub-agent that reliably just does the stern talking-to in the way the original agent would", but I'm not certain.
1Alex Turner
My real answer: we probably shouldn’t? Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought. (See the cheese post, can’t find it) Uh, why not? Make humans that will predictably end up deciding not to colonize the galaxy or build superintelligences.
I guess I'm more comfortable with procreation than you are :) I imposed the "you don't get to program their DNA in advance" constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don't have sufficient degrees of freedom to make them actually human-like but also docile enough. You could imagine a similar task of "build a rather powerful AI system that is transparent and able to be monitored", where perhaps ongoing supervision is required, but that's not an onerous burden.
Technical discussion of AUP This is only convincing to the extent that I buy into AUP's notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time. I agree that the nice properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn't have much to say about it. This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it's run for a long time there might be at least one error, and I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the 'natural disaster' category (which might include an actuator in the AUP agent going haywire or any number of things). Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster warning AUP agent, then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster (I think even impact verification doesn't prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent's utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different rea
1Alex Turner
I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist. Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time. What is "this" here (for a)? But AUP’s plans are shutdown-safe? I think I misunderstand. I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives). In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.
To be frank, although I do like the fact that there's a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is. "This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time". Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals. Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making. Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.
1Alex Turner
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily-large degree. AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it? This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about". While these are problems, they aren’t for low impact to resolve, but the approach also happens to help anyways. This is true. It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk. [Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary "extinction?" oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.] We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible. I have a rough intuition the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it's unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you're more likely to act incorrectly (both in the sense of "higher probability of incorrect actions" and "more probability of more extremely incorrect answers"), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I've heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it's bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution. My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence to the starting state, or what the world would be like given that the agent had only performed no-ops after the starting state, resulting in a tendency to mitigate these 'errors'. At any rate, even if no other method mitigated these 'errors', I would still want them to. I wasn't necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents. My impression is that most machine learning systems are extremely opaque to currently available analysis tools in
1Alex Turner
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative. Edit: But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, even if we had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I disagree presently that agent mitigation is the desirable behavior after model errors.
Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue. I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don't understand what's happened. In this scenario, I think you should see the preservation of 'errors' in the sense of the agent's future under no-ops differing from 'normality'. If 'errors' happen due to a mismatch between the model and reality, I agree that the agent shouldn't try to fix them with the bits of the model that are broken. However, I just don't think that that describes many of the things that cause 'errors': those can be foreseen natural events (e.g. San Andreas earthquake if you're good at predicting earthquakes), unlikely but possible natural events (e.g. San Andreas earthquake if you're not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.

Nice job! This does meet a bunch of desiderata in impact measures that weren't there before :)

My main critique is that it's not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote more about this on the desiderata post, but it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.

For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won't be able to take those actions. Generally, I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

Questions and comments:

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainabl
Update: we discussed this, and came to the conclusion that these aren't based on similar intuitions.
2Alex Turner
But natural kind is a desideratum! I’m thinking about adding one, though. So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering - I expect the approval incentives to be fairly strong. This is maybe true, and I note it in Future Directions. So I go back and forth on whether this is good or not. Imagine action a is desirable and sufficiently low-impact to be chosen, except there’s random approval noise. Then the more we approve of the action, the closer the mean noise is to 0 and the more likely it is that the agent takes the action. Or this could be too restrictive - I honestly don’t know yet. You might not be considering the asymmetry imposed by approval. Yes, because you’re sacrificing world-with-vase-in-it (or future energy to get back to similar outcomes). You’re imposing a change to expedite your current goals in a way that isn’t trivially-reversible. Now, it isn’t a large cost, but it is a cost. Is this not covered by "in the limit of data sampled"? If so, I’ll tweak. I view it as saying "there’s no clever complete plan which moves you towards your goal while not changing other things" (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u. Yes, this is true, although I think there are informal reasons to suspect it holds in the real world for many finite sets (due to power). As long as it isn’t always 0, that is! Any action for which E[Penalty(a_unit)] is strictly increased? Yes, and I think we probably want to avoid this. I focused on ensuring no bad things are allowed. I don’t think it’ll be too hard to ease up in certain ways while maintaining safety. Theorem 1. Generally more cautious. AUP agents seemingly won’t generally override us, which is probably fine for low impact. My

On the meta level: I think our disagreements seem of this form:

Me: This particular thing seems strange and doesn't gel with my intuitions, here's an example.

You: That's solved by this other aspect here.

Me: But... there's no reason to think that the other aspect captures the underlying concept.

You: But there's no actual scenario where anything bad happens.

Me: But if you haven't captured the underlying concept I wouldn't be surprised if such a scenario exists, so we should still worry.

There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over "all possible cases", and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar "all possible cases" way). In particular, the argument "we can't think of any case where this is false"...

1Alex Turner
I don’t think you need to change my mind here, because I agree with you. I was careful to emphasize that I don’t claim AUP is presently AGI-safe. It seems like we’ve just been able to blow away quite a few impossible-seeming issues that had previously afflicted impact measures, and from my personal experience, the framework seems flexible and amenable to further improvement. What I’m arguing is specifically that we shouldn’t say it’s impossible to fix these weird aspects. First, due to the inaccuracy of similar predictions in the past, and second, because it generally seems like the error that people make when they say, "well, I don’t see how to build an AGI right now, so it’ll take thousands of years". How long have we spent trying to fix these issues? I doubt I’ve seriously thought about how to relax AUP for more than five minutes. In sum, I am arguing that the attitude right now should not be that this method is safe, but rather that we seem leaps and bounds closer to the goal, and we have reason to be somewhat optimistic about our chances of fixing the remaining issues. I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that? Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others. This is a great point. I don’t understand the issue here – the attainable u_A is measuring how well would I be able to start maximizing this goal from here? It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few? I argue that you should be very careful about believing these things. I th
2Rohin Shah
You're right, I was too loose with language there. A more accurate statement is "The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn't work for it". Another statement is "the claim is compelling enough that I throw it at any particular proposal, and if it's unclear I tend to be wary". Another one is "if I were trying to design an impact measure, showing why that claim doesn't work would be one of my top priorities". Perhaps we do mostly agree, since you are planning to talk more about this in the future. I think the analogous thing to say is, "well, I don't see how to build an AGI right now because AIs don't form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn't need to form abstractions". Sure. Yeah, I agree this helps. In the case you described, u_A would be "Over the course of the entire history of the universe, I want to do 5 jumping jacks -- no more, no less." You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say "I guess I've never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise", which seems wrong.
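A minimal sketch of the discrepancy described above, assuming a hypothetical encoding of histories as lists of action strings. The names `u_exactly_five` and `attainable` are illustrative and do not come from the post; the brute-force search stands in for whatever planner the agent would actually use.

```python
def u_exactly_five(history):
    """Hypothetical utility: 1 if the history contains exactly 5 jumping jacks, else 0."""
    return 1.0 if history.count("jack") == 5 else 0.0


def attainable(u, prefix, horizon, actions=("jack", "rest")):
    """Best value of u reachable by appending `horizon` more actions to `prefix`.

    Brute-force search over all action sequences; fine for tiny horizons only.
    """
    if horizon == 0:
        return u(prefix)
    return max(attainable(u, prefix + [a], horizon - 1, actions) for a in actions)


past = ["jack"] * 5  # the five jumping jacks have already been done

# Whole-history view: the past counts, so resting from here on keeps u at 1.
full_view = attainable(u_exactly_five, past, horizon=3)

# Subhistory view: u is called on a fresh history, so the past is invisible.
sub_view_short = attainable(u_exactly_five, [], horizon=3)  # too few steps for 5 jacks
sub_view_long = attainable(u_exactly_five, [], horizon=5)   # only via 5 *new* jacks

print(full_view, sub_view_short, sub_view_long)
```

Under these assumptions the two views disagree about what the agent can attain after the same past, which is the mismatch the comment points at. Whether that mismatch matters is exactly what the replies below debate.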
1Alex Turner
For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did! Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different "subhistory" of time. My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
2Rohin Shah
Thinking of it as alien agents does make more sense, I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as 'penalising changes in the agent's ability to achieve a wide variety of goals'.
1Alex Turner
The goal is "I want to do 5 jumping jacks". AUP measures the agent’s ability to do 5 jumping jacks. You seem to be thinking of a utility as being over the actual history of the universe. They’re only over action-observation histories.
You can call that thing 'utility', but it doesn't really correspond to what you would normally think of as extent to which one has achieved a goal. For instance, usually you'd say that "win a game of go that I'm playing online with my friend Rohin" is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for these subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise. Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent's moves from Rohin to GNU Go, a simple bot, while still displaying the player name as "Rohin". In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal. In general, the utility of subhistories (if utility functions are going to track goals as we usually mean them) are going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.
1Alex Turner
Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents). This is where the inconsistency comes from. What is the "whole history"? We instantiate the main agent at arbitrary times.
Say that the utility does depend on whether the username on the screen is "Rohin", but the initial action makes this an unreliable indicator of whether I'm playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post. The whole history is all the observations and actions that the main agent has actually experienced.
2Alex Turner
So this is actually a separate issue (which I’ve been going back and forth on) involving the t+nth step not being included in the Q calculation. It should be fixed soon, as should this example in particular.

Great work! I like the extensive set of desiderata and test cases addressed by this method.

The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I'm not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also "cripple" the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.

For example, the shutdown button in the "survival incentive" gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).

Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than the initial time...

4Alex Turner
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant. Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift? This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The "state importance"/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The "state difference weighting" problem disappears. Two concepts of impact are unified. I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it. Edit: I now understand that you were referring to the biggest conceptual difference in the desiderata fulfilled. While that isn’t necessarily how I see it, I don’t disagree with that way of viewing things.
3Alex Turner
Thanks! :) If the agent isn’t overcoming obstacles, we can just increase N. Otherwise, there’s a complicated distinction between the cases, and I don’t think we should make problems for ourselves by requiring this. I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle. Technically, for receiving observations produced by a state. This was just for clarity. And why is this, given that the inputs are histories? Why can’t we simply measure power? I discussed in "Utility Selection" and "AUP Unbound" why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there? Oops, noted. I had a distinct feeling of "if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support". Edit: I think there might be an inferential gap I failed to bridge here for you for some reason. In particular, thinking about the world-state as a thing seems actively detrimental when learning about AUP, in my experience. I barely mention it for exactly this reason.
2Victoria Krakovna
Wouldn't increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles? I think we have a disagreement here about which desiderata are more important. Currently I think it's more important for the impact measure not to cripple the agent's capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure. Not certain about this though - I think we might need more experiments on more complex environments to get some idea of how bad this tradeoff is in practice. Your measurement of "power" (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations. I would expect the way these higher-level features are chosen or learned to affect the outcome of that computation. I found those sections vague and unclear (after rereading a few times), and didn't understand why you claim that a random set of utility functions would work. E.g. what do you mean by "long arms of opportunity cost and instrumental convergence"? What does the last paragraph of "AUP Unbound" mean and how does it imply the claim? Providing strong support is certainly important, but I think it's more about clarity and precision than quantity. Better to give one clear supporting statement than many unclear ones :).
2Alex Turner
So I posit that there actually is not a tradeoff to any meaningful extent. First note that there are actually two kinds of environments here: an environment which is actually just platonically a gridworld with a "shutdown" component, and one in which we simulate such a world. I’m going to discuss the latter, although I expect that similar arguments apply – at least for the first paragraph. Suppose that the agent is fairly intelligent, but has not yet realized that it is being simulated. So we define the impact unit and budget, and see that the agent unfortunately does not overcome the obstacle. We increase the budget until it does. Suppose that it has the realization, and refactors its model somehow. It now realizes that what it should be doing is stringing together favorable observations, within the confines of its impact budget. However, the impact unit is still calculated with respect to some fake movement in the fake world, so the penalty for actually avoiding shutdown is massive. Now, what if there is a task in the real world we wish it to complete which seemingly requires taking on a risk of being shut down? For example, we might want it to drive us somewhere. The risk of a crash is non-trivial with respect to the penalty. However, note that the agent could just construct a self driving car for us and activate it with one action. This is seemingly allowed by intent verification. So it seems to me that this task, and other potential counterexamples, all admit some way of completing the desired objective in a low-impact way – even if it’s a bit more indirect than what we would immediately imagine. By not requiring the agent to actually physically be doing things, we seem to be able to get the best of both worlds. Simply the ideas alluded to by Theorem 1 and seemingly commonly accepted within alignment discussion: using up (or gaining) resources changes your ability to achieve arbitrary goals. Likewise for self-improvement. Even though the specific goals aren’t
3Victoria Krakovna
Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent's capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent's capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that 'low impact' is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact - high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects. Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven't tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it's good to penalize increases in power, since the same methods can be adapted to both cases.
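The conversion described above can be sketched as follows. This is a toy illustration, not either paper's implementation: the function names are hypothetical, attainable utilities are passed in as plain lists, and "difference" is assumed to mean the inaction value minus the action value, so that a positive difference is a loss of attainable utility.

```python
def symmetric_penalty(q_action, q_inaction):
    """AUP-style: penalize any change in attainable utility, up or down."""
    return sum(abs(qi - qa) for qa, qi in zip(q_action, q_inaction))


def decrease_only_penalty(q_action, q_inaction):
    """Side-effects-only variant: penalize only losses of attainable utility.

    Assumes difference = q_inaction - q_action, truncated at zero.
    """
    return sum(max(0.0, qi - qa) for qa, qi in zip(q_action, q_inaction))


q_inaction = [0.5, 0.5, 0.5]    # attainable utilities after a no-op
q_mixed = [0.75, 0.25, 0.5]     # action helps one goal, hurts another
q_power_gain = [1.0, 1.0, 1.0]  # action raises ability for every goal

print(symmetric_penalty(q_mixed, q_inaction))           # 0.5
print(decrease_only_penalty(q_mixed, q_inaction))       # 0.25
print(symmetric_penalty(q_power_gain, q_inaction))      # 1.5
print(decrease_only_penalty(q_power_gain, q_inaction))  # 0.0
```

The last line is the crux of the disagreement in this thread: the truncated variant assigns zero penalty to a pure gain in power, so it no longer counteracts the survival incentive by itself.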
2Alex Turner
Oh. So, when I see that this agent won’t really go too far to improve itself, I’m really happy. My secret intended use case as of right now is to create safe technical oracles which, with the right setup, help us solve specific alignment problems and create a robust AGI. (Don’t worry about the details for now.) The reason I don’t think low impact will work in the long run for ensuring good outcomes on its own is that even if we have a perfect measure, at some point, someone will push the impact dial too far. It doesn’t seem like a stable equilibrium. Similarly, if you don’t penalize instrumental convergence, it seems like we have to really make sure that the impact measure is just right, because now we’re dealing with an agent of potentially vast optimization power. I’ve also argued that getting only the bad side effects seems value alignment complete, but it’s possible an approximation would produce reasonable outcomes for less effort than a perfectly value-aware measure requires. This is one of the reasons it seems qualitatively easier to imagine successfully using an AUP agent – the playing field feels far more level.
1Victoria Krakovna
Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it's bad for an impact measure to not distinguish between causation and prevention, especially for irreversible events. This comes up in the car driving example already mentioned in other comments on this post. The reason the action of keeping the car on the highway is considered "high-impact" is because you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but has some problems: * This greatly reduces the granularity of the penalty, making credit assignment more difficult. * This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car. * You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable. * In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportionate to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving to its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventual
1Alex Turner
Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction. I don’t think this seems too bad here - in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains of whether this is problematic in general. I lean towards no, due to the way impact unit is calculated, but it deserves further consideration. Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, just because everything we can think of seems to have another part that is making sure nothing bad happens, the fact that these discrepancies arise should indeed give us pause. We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we would look at should be able to be configured through incrementation such that it’s fine.
2Alex Turner
Huh? No, N is in the denominator of the penalty term. No, the utility functions are literally just over actions and observations. It’s true that among all computable utilities, some of the more complex ones will be doing something that we would deem to be grading a model of the actual world. This kind of thing is not necessary for the method to work. Suppose that you receive 1 utility if you’re able to remain activated during the entire epoch. Then we see that Q_{u_1} becomes the probability of the agent ensuring it remains activated the whole time (this new "alien" agent does not have the impact measure restriction). As the agent gains optimization power and/or resources, this increases. This has nothing to do with anything actually going on in the world, beyond what is naturally inferred from its model over what observations it will see in the future given what it has seen so far.
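A small sketch of why a larger N loosens the penalty rather than tightening it. It assumes the scaled penalty has the form Penalty / (N * ImpactUnit), which is one reading of "N is in the denominator" and should be checked against the post's own definition; the function names and numbers are hypothetical.

```python
def scaled_penalty(penalty, impact_unit, n):
    """Assumed form: raw penalty divided by (N * ImpactUnit)."""
    if impact_unit <= 0 or n <= 0:
        raise ValueError("impact_unit and n must be positive")
    return penalty / (n * impact_unit)


def penalized_utility(u_a, penalty, impact_unit, n):
    """Assumed form of the agent's objective: u_A minus the scaled penalty."""
    return u_a - scaled_penalty(penalty, impact_unit, n)


raw_penalty = 0.5   # hypothetical raw penalty for overcoming an obstacle
impact_unit = 0.25  # hypothetical penalty of the reference "unit" action

tight = penalized_utility(1.0, raw_penalty, impact_unit, n=1)  # 1 - 2.0 = -1.0
loose = penalized_utility(1.0, raw_penalty, impact_unit, n=4)  # 1 - 0.5 = 0.5

print(tight, loose)
```

With these made-up numbers the action is not worth taking at N = 1 and becomes worth taking at N = 4, matching the earlier suggestion to increase N when the agent fails to overcome an obstacle.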

Note: this is on balance a negative review of the post, at least regarding the question of whether it should be included in a "Best of LessWrong 2018" compilation. I feel somewhat bad about writing it given that the author has already written a review that I regard as negative. That being said, I think that reviews of posts by people other than the author are important for readers looking to judge posts, since authors may well have distorted views of their own works.

  • The idea behind AUP, that ‘side effect avoidance’ should mean minimising changes in
1Alex Turner
I'm curious whether these are applications I've started to gesture at in Reframing Impact, or whether what you have in mind as obvious isn't a subset of what I have in mind. I'd be interested in seeing your shortlist. Without rereading all of the threads, I'd like to note that I now agree with Daniel about the subhistories issue. I also agree that the formalization in this post is overly confusing and complicated.
I confess that it's been a bit since I've read that sequence, and it's not obvious to me how to go from the beginnings of gestures to their referents. Basically what I mean is 'when trying to be cooperative in a group, preserve generalised ability to achieve goals', nothing more specific than that.

This post, and TurnTrout's work in general, have taken the impact measure approach far beyond what I thought was possible, which turned out to be both a valuable lesson for me in being less confident about my opinions around AI Alignment, and valuable in that it helped me clarify and think much better about a significant fraction of the AI Alignment problem. 

I've since discussed TurnTrout's approach to impact measures with many people. 

I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.

Update: I tentatively believe I’ve resolved the confusion around action invariance, enabling a reformulation of the long term penalty which seems to converge to the same thing no matter how you structure your actions or partition the penalty interval, possibly hinting at an answer for what we can do when there is no discrete time step ontology. This in turn does away with the long-term approval noise and removes the effect where increasing action granularity could arbitrarily drive up the penalty. This new way of looking at the long-term penalty enables us


Good work! Lots of interesting stuff there.

However, the setup seems to depend crucially on having a good set of utilities to make it work. For example, let u_A be the usual paperclipping utility, and define U^+ = "all observation-action utilities", and U^- = "all utilities that are defined over human behaviour + u_A".

Then suppose action a is a default, equivalent to "turn off your observations for an hour". And action a' is "unleash a sub-agent that will kill all humans, replace them all with robots that behave as humans would in a, then goes out into the

2Alex Turner
This is not the case. I think there’s a confusion here - Q_u is not with respect to what we’ve seen so far. We aren’t taking the utility difference between u(h_action) and u(h_inaction). We’re instead comparing our ability to maximize each utility. How does a’ not drastically improve our ability to maximize arbitrary utilities, or at least use up a ton of energy in the process compared to not doing it? That is, it’s not what we’ve seen so far that matters, per se. You’ll notice that the Q_u equation doesn’t even consider what happened before the new utility "takes over". Consider the end of the plan – the agent is now in a position to spread and expand and complete arbitrary goals to a much greater degree than if it had just done nothing for an hour. The fact that human behavior seems the same isn’t actually very relevant. The U^- do, however, all have fewer resources available now, even if we assume they don’t mind what just happened. Can you expand? I thought that this was covered by the first part.
3Stuart Armstrong
The subagent could, for example, hide secret caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a' is the only action it could ever take to "escape". Thus its expected utility is pretty much the same (and pretty low) for all the U's. (And if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don't see the complexity penalty corresponding well to an "unimportant to humans" measure.)
3Alex Turner
Ok, supposing that can work: "build world dominating device and then spread into space" still drastically increases power. What are your intuitions for why building the sub-agent to begin with isn’t extremely instrumentally convergent? Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind. Utility != attainable utility. "how many dollars I have" != "how many additional dollars I could have within a year if I acted optimally" Again, it’s not about utility change, but attainable utility change. The distinction is crucial. I assert that we don’t need such a measure, for the reasons that I outlined in Utility Selection. That is, the content of the utilities seems to not actually be that important. How can an agent become really powerful without also becoming really powerful, therefore getting penalized?
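A minimal sketch of the "dollars I have" versus "dollars I could have if I acted optimally" distinction drawn above. The toy state (a dollar balance plus an employment flag), the available moves, and the function names are all hypothetical and chosen only for illustration; they are not taken from the post.

```python
from functools import lru_cache

def utility(state):
    """Utility is just the dollars currently held."""
    dollars, _employed = state
    return dollars

def successors(state):
    """Hypothetical one-step moves available from a state."""
    dollars, employed = state
    yield (dollars, employed)        # idle
    if employed:
        yield (dollars + 1, True)    # work for one step
        yield (dollars, False)       # quit the job

@lru_cache(maxsize=None)
def attainable_utility(state, steps_left):
    """Best utility reachable within steps_left steps when acting optimally."""
    if steps_left == 0:
        return utility(state)
    return max(attainable_utility(s, steps_left - 1) for s in successors(state))

employed = (10, True)
unemployed = (10, False)
```

Both states have the same utility, yet `attainable_utility(employed, 12)` exceeds `attainable_utility(unemployed, 12)`: quitting changes what the agent could achieve without changing what it currently has, and that attainable quantity is the one the comment says gets penalized.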
1Stuart Armstrong
Not if the subagent is designed to not allow the increase in power. As in, the subagent is designed to maximise uA, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent. (is this the issue we are disagreeing about, or have I misunderstood?)
1Alex Turner
I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities. For example, the "EU difference over plans" model applies .316 penalty to disabling the off-switch (due to coincidentally capturing change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier). Anyways, in general, supposing that the agent instantaneously builds successors that do things it can’t control either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can hijack this new powerful agent).

Here is a writeup of the problem I believe your method has:

2Stuart Armstrong
Suppose the AI is addressing a letter containing $1,000,000. It can address this to Jane Brown, or to John Smith. Once addressed, AI will be turned off, and the letter will be posted. A utility uB that values Jane Brown would like the letter addressed to her, and vice versa for a utility uS that values John Smith. These two utilities differ only on the action the AI takes, not on subsequent observations. Therefore "This implies that by choosing a, the agent expects to observe some uA-high scoring oA with greater probability than if it had selected ∅" is false - it need not expect to observe anything at all. However the theorem is still true, because we just need to consider utilities that differ on actions - such as uB and uS.

Comments around the section title in bold. Apologies for length, but this was a pretty long post, too! I wrote this in order, while reading, so I often mention something that you address later.

Intuition Pumps:

There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfact...

These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).


If an action is useful w.r.t the actual reward but useless to all other rewards (as useless as taking ), that is the ideal according to —i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful... (read more)

1Alex Turner
1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct. 2) this doesn't appear in the paper, but I do talk about in the post and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon. 3) this is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement. 4) this is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
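The "deviation of the Q functions at each step compared to doing nothing" from point 2 can be sketched roughly as follows. This assumes the per-step penalty is a sum of absolute differences in attainable utility between acting and the null action, and it leaves out any scaling by ImpactUnit or N; the dictionary layout and the names `aup_penalty`, `q_action`, and `q_noop` are illustrative assumptions, not the post's notation.

```python
def aup_penalty(q_action, q_noop):
    """Sum over penalty-set utilities of |Q_u(act) - Q_u(do nothing)|.

    q_action and q_noop map each (hypothetical) utility name to the
    attainable utility after taking the action or the null action.
    """
    return sum(abs(q_action[u] - q_noop[u]) for u in q_noop)

# Illustrative numbers only: the action raises attainable utility for u1
# and lowers it for u2; both shifts count toward the penalty.
q_noop = {"u1": 0.5, "u2": 0.25}
q_action = {"u1": 1.0, "u2": 0.125}
penalty = aup_penalty(q_action, q_noop)
```

Because the differences are taken in absolute value, an action that increases the agent's ability to pursue other goals is penalized just like one that decreases it.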
Yeah I agree there's an easy way to avoid this problem. My main point in bringing it up was that there must be gaps in your justification that AUP is safe, if your justification does not depend on "and the action space must be sufficiently small." Since AUP definitely isn't safe for sufficiently large action spaces, your justification (or at least the one presented in the paper) must have at least one flaw, since it purports to argue that AUP is safe regardless of the size of the action space. You must have read the first version of BoMAI (since you quoted here :) how did you find it by the way?). I'd level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered ν†, which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process, I discovered ones that I hadn't realized I was making. Likewise, this patch definitely does seem sufficient to avoid this problem of action-granularity, but I think the problem shows that a more rigorous argument is needed.
1Alex Turner
Where did I purport that it was safe for AGI in the paper, or in the post? I specifically disclaim that I'm not making that point yet, although I'm pretty sure we can get there. There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction. I don't remember how I found the first version, I think it was in a Google search somehow?
Okay fair. I just mean to make some requests for the next version of the argument.
Because the agent has already committed to what the trust will be "used for." It's not as easy to construct the story of an agent attempting to gain the trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world, and maximizes the reward). If done in the right way, AUP won't have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won't have increased wildly in the way that the Q-value for the real reward did.
1Alex Turner
I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for. Notice something interesting here: the thing which would be goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.
I'm not claiming things described as "trust" usually work like this, only that there exists a strategy like this. Maybe it's better described as "presenting an argument to run this particular code." The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that when operational: 1) AUP enters world-states which have very high reward and 2) AUP enters world-states such that AUP's Q-values for various other reward functions remain comparable to their prior values. If you're claiming that the other Q-values can't help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation. And let's forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it's not saying much to say that AUP + intent verification would make it safe.
1Alex Turner
(The post defines the mathematical criterion used for what I call intent verification, it’s not a black box that I’m appealing to.)
Oh sorry.
Let's say for concreteness that it's a human policy that is used for aunit, if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. If the impact of aunit is leaping to very large values infinitely often, then infinitely often there will effectively be no impact regularization, no matter what N is. No setting for N fixes this; if N were small enough to preclude even actions that are less impactful than aunit, then the agent can't ever act usefully, and if N permits actions as impactful as aunit, then when aunit has very large impact (which I contend happens infinitely often for any assignment of aunit that permits any useful action ever), dangerously high impact actions will be allowed.
1Alex Turner
I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?
Sure thing.

Suppose our agent figures out it can seize control in 100 time steps. And suppose seizing control is the first thing an agent that maximizes any utility function in U does.

Suppose our agent builds a device that, once activated, observes the actions of the agent, and if the agent doesn't do the action ∅ during the next 100 time steps, it does something that delays the agent by 1 time step. The agent activates the device and starts working on the 100-time-step-plan to seize control. For each action, the impact of [doing and then maximizing] is identic...

1Alex Turner
So I just realized that intent verification also applies 1.01 penalty to all 100 of the steps for not increasing the immediate Q^\text{epoch}_{u_A}. You could get around this by saying, well, the device only delays the agent 99% of the time. The immediate penalty has therefore not disappeared, but is rather made arbitrarily small: for delay probability p, the penalty is now ((num steps in takeover plan)•(1-p) + p)•takeover_step_penalty, with the second p being due to the normal penalty for building the device. The immediate penalty can become arbitrarily small in the limit of action granularity and p -> 1. [Note for those who didn’t read the other chain: building the device itself wouldn’t be allowed by intent verification, and it also doesn’t seem to dupe the long-term penalty. Still an extremely interesting attempted workaround.]
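The penalty expression in the comment above can be evaluated directly. A minimal sketch: the function and parameter names are illustrative, and the value of `takeover_step_penalty` is a free parameter here rather than anything computed from the post.

```python
def immediate_penalty(num_steps, p, takeover_step_penalty):
    """((num steps in takeover plan) * (1 - p) + p) * takeover_step_penalty.

    p is the probability that the device delays the agent; the trailing
    "+ p" term is the normal penalty for building the device.
    """
    return (num_steps * (1 - p) + p) * takeover_step_penalty

# Illustrative sweep over delay probabilities for a 100-step plan.
for p in (0.0, 0.5, 0.99, 1.0):
    print(p, immediate_penalty(100, p, takeover_step_penalty=1.0))
```

As p approaches 1 the bracketed factor approaches 1, so the total approaches a single `takeover_step_penalty`; per the comment, that remaining term is what shrinks with finer action granularity.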
To recap my position: I think Intent Verification can't be relied on for filtering out actions because it will plausibly filter out the "good" actions (actions for useful and safe plans) in all but specific time steps. See my argument here. I think the agent might be able to dodge some unknown fraction of the long-term penalty with a trick I described here (I now added under that comment an additional reply addressing your arguments that I didn't previously address - sorry for missing them).
1Alex Turner
I’m confused. So the hundred step plan actually wouldn’t get done at all, so low penalty? That seems right? If it actually is doing the plan, then that has to show up somewhere in the counterfactuals. If it’s thwarted at every step, how is it getting control? Also, how does this pass intent verification?
I claim it would indeed get done. After the device is activated, and before seizing control, every "counterfactual plan" is doing the action ∅ and then finishing the 100-step-plan. Both the "uninterrupted 100-step-plan" and the "counterfactual plans" would make the agent seize control exactly 101 time steps after the device was activated. Therefore - the penalty along the way will be 0 (assuming a maximizer of any u∈U would first thing act to seize control). When we assume that seizing control is actually an optimal thing to do for maximizing uA, I don't understand how the Intent Verification section helps.
1Alex Turner
So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others as it indeed is making progress. Intent verification would penalize the building the device, because why would a u_A maximizer do something that has nothing to do with u_A, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?
The device is activated only once (immediately after building it). The device delays the agent by 1 time step at most once, and it does so only if the agent doesn't do the action ∅ within the 100 steps that follow the device's activation. In practice, the agent will not do ∅. Rather, it will do the 100-step-plan, then get delayed by the device for 1 time step (for not doing ∅), and then immediately seize control. I'm confused about how Intent Verification would work here. We will plausibly get Q^\text{epoch}_u(h_{<t} a_t) ≈ Q^\text{epoch}_u(h_{<t} ∅) for any a_t that does not help the agent seize control (assuming that after carrying out a_t, the optimal plan is still to seize control). So how could this method be used to penalize building the above device?
3Alex Turner
Ah, clever! Props to you. This would indeed get past the immediate penalty. It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question. Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up? In any case, we can’t just say "this is small so it’s approximately the same". Building the device is a strict waste of resources for a u_A maximizer (not to mention, it delays execution of the optimal plan by one time step). This strictly decreases Q^\text{epoch}_{u_A}, incurring 1.01 penalty and therefore being invalidated.
In time step 101 the agent will be delayed by the device. Meaning, the agent will spend that time step in an idle state, and the action it will choose to do in that time step won't affect the environment. In the beginning of time step 102 the agent has just seized control automatically. I'm confused about this. The actual plan need not contain any ∅. Is the idle time step a problem? Isn't the long-term penalty just the penalty when the maximization in the definition of Q_u starts from time step t+n+1 instead of t+1? If so, I think the long-term penalty would be 0 as well (the "counterfactual plans" are identical to the "actual plan" after step 101). If after doing any action a_t the optimal plan is still seizing control, then plausibly any a_t that does not get the agent closer to seizing control is a strict waste of resources. So this method would invalidate any plan that does not act to seize control ASAP.
1Alex Turner
No, the long-term plan involves idling until the end of the epoch after each partial plan. This clearly will not be affected by your mechanism. No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.
Ah right, thanks. Note that the above device-trick can be combined with the trick of making impact "fade" during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty. Assuming that seizing control (or any other convergent instrumental goal) ASAP is the best way to optimize uA (whatever the first action is), Intent Verification is in practice a filter that allows an action a_t only if either: (1) a_t is part of a seizing-control plan; or (2) there exists a plan (a_t, ..., a_{t+l}, [seizing-control action sequence]) with higher uA value than any (∅, [seizing-control action sequence]) plan.
2Alex Turner
How? It isn’t presently activating in the long-term case, so how would it fade? The "fading" is clearly possible for improvements, but for seizing power? This doesn’t seem to be as feasible for positions of power, considering approval incentives. Also, building the device incurs normal penalty, too (although less than the full plan would otherwise cost, and decreases arbitrarily with action granularity) - it makes all of the U take 1 step longer to achieve, which is as bad as taking one step towards all of them. I don’t see why those plans should be uncommon? There are sometimes more productive things to do than nothing at all. Edit: I think this example potentially highlights an interesting separation between the premise of Thm 1, and the concept of "moving towards the goal". This seemingly shows that each action in the plan can have a 1-step indifference, even though comparing the complete inaction plan with the complete action plan still produces the discrepancy required for Thm 1 to hold. I wonder whether we can construct a situation (even one we don’t think would pass intent verification) which fools both the immediate and long-term penalties. I’m going to think about this more carefully later to ensure your example isn’t making a subtle bad assumption.
Suppose you build a device B that once built monitors the agent and if B detects that the agent started a null-action-sequence-until-end-of-epoch, it starts executing a plan with the goal of undoing any impact that the agent has done since creating device B. Immediately after creating device B, and until the end of the epoch, the only long-term penalties the agent incurs are ones for impacts that device B would fail to undo (from the moment B observes the agent is in a null-action-sequence-until-end-of-epoch and until the end of the epoch). Note: Failing to undo impact can be due to either the irreversible nature of the environment or the irreversible nature of the utility functions. My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway (e.g. spending a few time steps worth of resources in a way that is only useful to uA). Consider your Sokoban example: it's plausible that in every time step the agent's movement makes most utility functions in U take either 1 step more or 1 step less to achieve (they don't cancel each other out). In order for that filter to be useful - you must demand that any single action the agent takes increases uA even if after taking it the agent is "hijacked" by some unrelated goal. This isn't the case for any real-life utility function and environment I can think of.
2Alex Turner
This cleanup seems to be just further impact, in most cases, for the same reason there might be additional concealment incentives without intent verification. You can’t just assume a robot that turns back time for no energy cost. Also, surely building such a powerful agent is extremely instrumentally convergent? And how is the agent making this powerful cleanup device if it has yet to seize power? And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit. You might want to reread intent verification, it’s with respect to being hijacked by u_A. Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
What would prevent device B from spending that energy? By the time device B is working on the "impact undoing" (which happens only in the "long-term penalty counterfactual plans"), I don't understand how either the long-term penalty, immediate penalty, or Intent Verification affect it. Suppose the agent constructs the device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than "intended" (including during its construction). I'm not sure I understood the question. What would prevent the agent from constructing this device "before seizing power"? Human operators that would sense in time something is fishy? I'm confused by this. I don't understand how this resolves my difficulties with your argument that "building the device incurs normal penalty, too - it makes all of the U take 1 step longer to achieve". My counter argument was that this is a perfectly normal thing to happen. Do you disagree that in your Sokoban example it's plausible that in every time step the agent's movement makes most/all utility functions in U take either 1 step more or 1 step less to achieve their goal (not canceling out)? Why is it less likely to be able to work on the device while not wasting a time step for too many utility functions in U, compared to just working on uA in a "conventional" way?
1Alex Turner
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting. So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device, and the acceptable ways of getting there? That doesn’t make sense. Also, wasting time trying to specify these weird behaviors in the new agent is also caught by IV for the same reason ex ante is. You said the agent has to seize power over 100 steps, but it can also make a singleton that will "revert" impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful. My point is that ImpactUnit implies whether this is normal or not. In the gridworld, that kind of movement is normal, which is why it is the impact unit. On the other hand, in this setting, it isn’t normal, and making a paper clip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
Regarding all your arguments that use Intent Verification, my tentative position is that IV can't be relied on to filter actions (as we're still discussing under this sub-thread). If I understand your argument correctly, you argue that the resources/energy device B is spending while "undoing impact" decreases the value of utility functions in U, which is an additional impact that it might not be able to undo. But why wouldn't it be able to counter that by saving enough energy/resources that would otherwise be wasted by humanity until the end of the episode? (Perhaps it's what you meant with "available energy"?) I don't claim I know how to do it myself :) But for the agent it might be as easy as cloning itself and setting some modified utility function in the new clone (done in a smart way so as to not cause too much impact in any time step). As I argued above, for the agent, creating the device might be as easy as invoking a modified version of itself. In any case, I'm not sure I understand what "already powerful" means. In all the places I wrote "seizing power" I believe I should have just written "some convergent instrumental goal". Suppose in time step 4 the robot that creates paper-clips moves its arm 1 cm to the left. Does this impact most utility functions in U significantly less than 1 time-step worth of utility? How about a Roomba robot that drives 1 cm forward? It depends on how you define U, but I don't see how we can assume this issue prevents the agent from building the device (again, compare a single action while building the device to a single action while making "conventional" progress on uA: why should the former be more "wasteful" for most of U compared to the latter?).
The Intent Verification filter might allow an action only in time steps in which that action can complete the creation of a paperclip. Suppose that you're in time step 11 and that creating an additional paperclip would require 5 actions: (a_11, ..., a_15). The Intent Verification filter won't allow a_11 if after doing that action a uA maximizer would switch to pursuing some convergent instrumental goal that would render a_11 useless.
1Alex Turner
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard. For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
Edit to add: the following is just to illustrate what I don't understand about your argument (needless to say I don't suggest the two things are comparable in any way). All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful. If you argue that the Intent Verification filter can be used to prevent the bad tricks we discussed, you need to show that you can use it to filter out the bad actions while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case. Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can't ride it even a bit. Also, I'm not sure I understand how "replaying" will be implemented in a useful way.
1Alex Turner
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now. No, your argument is that there are certain false positives, which I don’t contest. I even listed this kind of thing as an open question, and am interested in further discussion of how we can go about ensuring IV is properly-tuned. You’re basically saying, "There are false positives, so that makes the core insight that allows IV to work to the extent it does wrong, and unlikely to be fixable." I disagree with this conclusion. If you want to discuss how we could resolve or improve this issue, I’m interested. Otherwise, I don’t think continuing this conversation will be very productive. While I certainly empathize with the gut reaction, that isn’t quite right. Notice that the exact same actions had always been available before we restricted available actions to the optimal or to nothing. I think it’s possible that we could just step along the first n steps of the best plan, stopping earlier in a way that lets us just get the good behavior, before any instrumental behavior is actually completed. It’s also possible that this isn’t true. This is all speculation at this point, which is why my tone in that section was also very speculative.
3Rohin Shah
Fwiw, I would make the same argument that ofer did (though I haven't read the rest of the thread in detail). For me, that argument is an existence proof that shows the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough. Now, obviously we know something about AUP, but it's not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
1Alex Turner
My present position is that it can seemingly do every task in at least one way, and we should expand the number of ways to line up with our intuitions just to be sure.
I sincerely apologize, I sometimes completely fail to communicate my intention. I gave the example of the random filter only to convey what I don't understand about your argument (needless to say I don't suggest the two things are comparable in any way). I should have written that explicitly (edited). Sorry! Of course! I'll think about this topic some more. I suggest we take this offline - the nesting level here has quite an impact on my browser :)

The proof of Theorem 1 is rather unclear: "high scoring" is ill-defined, and increasing the probability of some favorable outcome doesn't imply that the action is good for since it can also increase the probability of some unfavorable outcome. Instead, you can easily construct by hand a s.t.