Overfitting the AU landscape

When we act, and others act upon us, we aren’t just changing our ability to do things – we’re shaping the local environment towards certain goals, and away from others.[1] We’re fitting the world to our purposes.

What happens to the AU landscape[2] if a paperclip maximizer takes over the world?[3]

Preferences implicit in the evolution of the AU landscape

Shah et al.'s Preferences Implicit in the State of the World leverages the insight that the world state contains information about what we value. That is, there are agents pushing the world in a certain "direction". If you wake up and see a bunch of vases everywhere, then vases are probably important and you shouldn't explode them.

Similarly, the world is being optimized to facilitate achievement of certain goals. AUs are shifting and morphing, often towards what people locally want done (e.g. setting the table for dinner). How can we leverage this for AI alignment?

Exercise: Brainstorm for two minutes by the clock before I anchor you.

Two approaches immediately come to mind for me. Both rely on the agent focusing on the AU landscape rather than the world state.

Value learning without a prespecified ontology or human model. I have previously criticized value learning for needing to locate the human within some kind of prespecified ontology (this criticism is not new). By taking only the agent itself as primitive, perhaps we could get around this (we don't need any fancy engineering or arbitrary choices to figure out AUs/optimal value from the agent's perspective).

Force-multiplying AI. Have the AI observe which of its AUs most increase during some initial period of time, after which it pushes the most-increased-AU even further.

In 2016, Jessica Taylor wrote of a similar idea:

"In general, it seems like "estimating what types of power a benchmark system will try acquiring and then designing an aligned AI system that acquires the same types of power for the user" is a general strategy for making an aligned AI system that is competitive with a benchmark unaligned AI system."

I think the naïve implementation of either idea would fail; e.g., there are a lot of degenerate AUs it might find. However, I'm excited by this because a) the AU landscape evolution is an important source of information, b) it feels like there's something here we could do which nicely avoids ontologies, and c) force-multiplication is qualitatively different from existing proposals.
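To make the force-multiplying idea concrete, here is a minimal sketch of the naive version: compare the agent's AUs before and after an observation period and pick the one that rose the most. The function name, the goal names, and the numbers are all hypothetical, and the sketch assumes the AU estimates are simply handed to us. It does nothing to filter out degenerate AUs, which is exactly where the naive implementation would be expected to fail.

```python
def most_increased_au(au_before, au_after):
    """Return the goal whose attainable utility rose the most.

    Both arguments map a goal name to the agent's estimated AU for that
    goal (hypothetical inputs; estimating them is the hard part).
    """
    increases = {goal: au_after[goal] - au_before[goal] for goal in au_before}
    return max(increases, key=increases.get)


# Illustrative numbers only, not measurements.
au_at_start = {"set_table": 0.20, "tidy_room": 0.50, "make_paperclips": 0.10}
au_after_watching = {"set_table": 0.70, "tidy_room": 0.55, "make_paperclips": 0.10}

target = most_increased_au(au_at_start, au_after_watching)
print(target)  # set_table
```

After the observation period, the agent would then act to push `target` even further.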

Project: Work out an AU landscape-based alignment proposal.

Why can't everyone be king?

Consider two coexisting agents each rewarded for gaining power; let's call them Ogre and Giant. Their reward functions[4] (defined over each agent's partial observations) are identical. Will they compete? If so, why?

Let's think about something easier first. Imagine two agents each rewarded for drinking coffee. Obviously, they compete with each other to secure the maximum amount of coffee. Their objectives are indexical, so they aren't aligned with each other – even though they share a reward function.

Suppose both agents are able to have maximal power. Remember, Ogre's power can be understood as its ability to achieve a lot of different goals. Most of Ogre's possible goals need resources; since Giant is also optimally power-seeking, it will act to preserve its own power and prevent Ogre from using the resources. If Giant weren't there, Ogre could better achieve a range of goals. So, Ogre can still gain power by dethroning Giant. They can't both be king.
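A toy calculation can illustrate why Giant's presence lowers Ogre's power. The sketch below treats power as the average optimal value at Ogre's starting state over randomly sampled goals. This is a simplified stand-in for "ability to achieve a lot of different goals", not the normalized definition from the paper. The four-state environment, the function names, and the uniform goal distribution are all assumptions made for illustration.

```python
import numpy as np


def optimal_value(next_states, reward, gamma=0.9, iters=200):
    """Value iteration for a small deterministic MDP.

    next_states[s] lists the states reachable from s in one step;
    reward[s] is the reward for being in state s.
    """
    v = np.zeros(len(reward))
    for _ in range(iters):
        v = np.array([
            reward[s] + gamma * max(v[t] for t in next_states[s])
            for s in range(len(reward))
        ])
    return v


def average_optimal_value(next_states, start, n_goals=200, seed=0):
    """Mean of V*_R(start) over goals R with i.i.d. uniform [0, 1] rewards."""
    rng = np.random.default_rng(seed)
    n_states = len(next_states)
    values = [
        optimal_value(next_states, rng.uniform(size=n_states))[start]
        for _ in range(n_goals)
    ]
    return float(np.mean(values))


# State 0 is where Ogre starts; states 1-3 are absorbing "resource" states.
without_giant = [[0, 1, 2, 3], [1], [2], [3]]
# Giant holds resources 2 and 3, so Ogre can no longer reach them.
with_giant = [[0, 1], [1], [2], [3]]

power_without_giant = average_optimal_value(without_giant, start=0)
power_with_giant = average_optimal_value(with_giant, start=0)
print(power_without_giant > power_with_giant)  # True
```

Both calls use the same seed, so they average over the same sampled goals. Removing options can never raise the optimal value for any single goal, and it strictly lowers it whenever the best resource is one Giant controls.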

Just because agents have indexically identical payoffs doesn't mean they're cooperating; to be aligned with another agent, you should want to steer towards the same kinds of futures.

Most agents aren't pure power maximizers. But since the same resource competition usually applies, the reasoning still goes through.

Objective vs value-specific catastrophes

How useful is our definition of "catastrophe" with respect to humans? After all, literally anything could be a catastrophe for some utility function.[5]

Tying one's shoes is absolutely catastrophic for an agent which only finds value in universes in which shoes have never ever ever been tied. Maybe all possible value in the universe is destroyed if we lose at Go to an AI even once. But this seems rather silly.

Human values are complicated and fragile:

Consider the incredibly important human value of "boredom" - our desire not to do "the same thing" over and over and over again. You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.

But the human AU is not so delicate. That is, given that we have power, we can make value; there don’t seem to be arbitrary, silly value-specific catastrophes for us. Given energy and resources and time and manpower and competence, we can build a better future.

In part, this is because a good chunk of what we care about seems roughly additive over time and space; a bad thing happening somewhere else in spacetime doesn't mean you can't make things better where you are; we have many sources of potential value. In part, this is because we often care about the universe more than the exact universe history; our preferences don’t seem to encode arbitrary deontological landmines. More generally, if we did have such a delicate goal, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that entire universe would be partially ruined for us forever. That just doesn't sound realistic.

It seems that most of our catastrophes are objective catastrophes.[6]

Consider a psychologically traumatizing event which leaves humans uniquely unable to get what they want, but which leaves everyone else (trout, AI, etc.) unaffected. Our ability to find value is ruined. Is this an example of the delicacy of our AU?

No. This is an example of the delicacy of our implementation; notice also that our AUs for constructing red cubes, reliably looking at blue things, and surviving are also ruined. Our power has been decreased.

Detailing the catastrophic convergence conjecture (CCC)

In general, the CCC follows from two sub-claims. 1) Given we still have control over the future, humanity's long-term AU is still reasonably high (i.e. we haven't endured a catastrophe). 2) Realistically, agents are only incentivized to take control from us in order to gain power for their own goal. I'm fairly sure the second claim is true ("evil" agents are the exception prompting the "realistically").

Also, we're implicitly considering the simplified frame of a single smart AI affecting the world, and not structural risk via the broader consequences of others also deploying similar agents. This is important but outside of our scope for now.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Let's say a reward function is aligned[7] if all of its Blackwell-optimal policies are doing what we want (a policy is Blackwell-optimal if it's optimal and doesn't stop being optimal as the agent cares more about the future). Let's say a reward function class is alignable if it contains an aligned reward function.[8] The CCC is talking about impact alignment only, not about intent alignment.
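Written out symbolically, these definitions might look as follows. This is only a sketch: $\Pi_{\text{good}}$ is a hypothetical label for the set of policies "doing what we want", which the post leaves informal, and $\mathcal{R}$ stands for a reward function class.

```latex
% \Pi_{\text{good}} is a hypothetical name for "policies doing what we want".
\begin{align*}
\pi \text{ is Blackwell-optimal for } R
  &\iff \exists\, \gamma_0 \in [0, 1) \;\; \forall \gamma \in (\gamma_0, 1) :\;
        \pi \text{ is optimal for } R \text{ at discount rate } \gamma \\
R \text{ is aligned}
  &\iff \text{every Blackwell-optimal } \pi \text{ for } R
        \text{ lies in } \Pi_{\text{good}} \\
\mathcal{R} \text{ is alignable}
  &\iff \exists\, R \in \mathcal{R} :\; R \text{ is aligned}
\end{align*}
```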

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Not all unaligned goals induce catastrophes, and of those which do induce catastrophes, not all of them do it because of power-seeking incentives. For example, a reward function for which inaction is the only optimal policy is "unaligned" and non-catastrophic. An "evil" reward function which intrinsically values harming us is unaligned and has a catastrophic optimal policy, but not because of power-seeking incentives.

"Tend to have" means that realistically, the reason we're worrying about catastrophe is power-seeking incentives – the agent is gaining power to better achieve its own goal. Agents don't otherwise seem incentivized to screw us over very hard; CCC can be seen as trying to explain adversarial Goodhart in this context. If CCC isn't true, that would be important for understanding goal-directed alignment incentives and the loss landscape for how much we value deploying different kinds of optimal agents.

While there exist agents which cause catastrophe for other reasons (e.g. an AI mismanaging the power grid could trigger a nuclear war), the CCC claims that the selection pressure which makes these policies optimal tends to come from power-seeking drives.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

"But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren't taking over the world now". The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multiagent environment).

Edit: The initial version of this post talked about "outer alignment"; I changed this to just talk about alignment, because the outer/inner alignment distinction doesn't feel relevant here. What matters is how the AI's policy impacts us; what matters is impact alignment.

Prior work

In fact even if we only resolved the problem for the similar-subgoals case, it would be pretty good news for AI safety. Catastrophic scenarios are mostly caused by our AI systems failing to effectively pursue convergent instrumental subgoals on our behalf, and these subgoals are by definition shared by a broad range of values.

~ Paul Christiano, Scalable AI control

Convergent instrumental subgoals are mostly about gaining power. For example, gaining money is a convergent instrumental subgoal. If some individual (human or AI) has convergent instrumental subgoals pursued well on their behalf, they will gain power. If the most effective convergent instrumental subgoal pursuit is directed towards giving humans more power (rather than giving alien AI values more power), then humans will remain in control of a high percentage of power in the world.

If the world is not severely damaged in a way that prevents any agent (human or AI) from eventually colonizing space (e.g. severe nuclear winter), then the percentage of the cosmic endowment that humans have access to will be roughly close to the percentage of power that humans have control of at the time of space colonization. So the most relevant factors for the composition of the universe are (a) whether anyone at all can take advantage of the cosmic endowment, and (b) the long-term balance of power between different agents (humans and AIs).

I expect that ensuring that the long-term balance of power favors humans constitutes most of the AI alignment problem...

~ Jessica Taylor, Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors


  1. In planning and activity research there are two common approaches to matching agents with environments. Either the agent is designed with the specific environment in mind, or it is provided with learning capabilities so that it can adapt to the environment it is placed in. In this paper we look at a third and underexploited alternative: designing agents which adapt their environments to suit themselves... In this case, due to the action of the agent, the environment comes to be better fitted to the agent as time goes on. We argue that [this notion] is a powerful one, even just in explaining agent-environment interactions.

    Hammond, Kristian J., Timothy M. Converse, and Joshua W. Grass. "The stabilization of environments." Artificial Intelligence 72.1-2 (1995): 305-327. ↩︎

  2. Thinking about overfitting the AU landscape implicitly involves a prior distribution over the goals of the other agents in the landscape. Since this is just a conceptual tool, it's not a big deal. Basically, you know it when you see it. ↩︎

  3. Overfitting the AU landscape towards one agent's unaligned goal is exactly what I meant when I wrote the following in Towards a New Impact Measure:

    Unfortunately, $u_A = u_H$ almost never,[9] so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, "optimize the environment some according to the utility function you've got, but don't be a weirdo by taking us literally and turning the universe into a paperclip factory. Don't overfit the environment to $u_A$, because that stops you from being able to do well for other utility functions."

    ↩︎
  4. In most finite Markov decision processes, there does not exist a reward function whose optimal value function is $\text{POWER}(s)$ (defined as "the ability to achieve goals in general" in my paper) because $\text{POWER}$ often violates smoothness constraints on the on-policy optimal value fluctuation (AFAICT, a new result of possibility theory, even though you could prove it using classical techniques). That is, I can show that optimal value can't change too quickly from state to state while the agent is acting optimally, but $\text{POWER}(s)$ can drop off very quickly.

    This doesn't matter for Ogre and Giant, because we can still find a reward function whose unique optimal policy navigates to the highest power states. ↩︎

  5. In most finite Markov decision processes, most reward functions do not have such value fragility. Most reward functions have several ways of accumulating reward. ↩︎

  6. When I say "an objective catastrophe destroys a lot of agents' abilities to get what they want", I don't mean that the agents have to actually be present in the world. Breaking a fish tank destroys a fish's ability to live there, even if there's no fish in the tank. ↩︎

  7. This idea comes from Evan Hubinger's Outer alignment and imitative amplification:

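    Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals; that is, they are at least trying to do what we want. More precisely, let $\mathcal{M} = \mathcal{X} \to \mathcal{Y}$ and $\mathcal{L} = \mathcal{M} \to \mathbb{R}$. For a given loss function $L \in \mathcal{L}$, let $\ell_* = \min_{M \in \mathcal{M}} L(M)$. Then, $L$ is outer aligned at optimum if, for all $M^* \in \mathcal{M}$ such that $L(M^*) = \ell_*$, $M^*$ is trying to do what we want.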

    ↩︎
  8. Some large reward function classes are probably not alignable; for example, consider all Markovian linear functionals over a webcam's pixel values. ↩︎

  9. I disagree with my usage of "aligned almost never" on a technical basis: assuming a finite state and action space and considering the maxentropy reward function distribution, there must be a positive measure set of reward functions for which the/a human-aligned policy is optimal. ↩︎

11 comments
Human values are complicated and fragile

It's not clear to me whether you actually meant to suggest this as well, but this line of reasoning makes me wonder if many of our values are actually not that complicated and fragile after all, instead being connected to AU considerations. E.g. self-determination theory's basic needs of autonomy, competence and relatedness seem like different ways of increasing your AU, and the boredom example might not feel catastrophic because of some highly arbitrary "avoid boredom" bit in the utility function, but rather because looping a single experience over and over isn't going to help you maintain your ability to avoid catastrophes. (That is, our motivations and values optimize for maintaining AU among other things, even if that is not the thing that those values feel like from the inside.)

Intriguing. I don't know whether that suggests our values aren't as complicated as we thought, or whether the pressures which selected them are not complicated.

While I'm not an expert on the biological intrinsic motivation literature, I think it's at least true that some parts of our values were selected for because they're good heuristics for maintaining AU. This is the thing that MCE was trying to explain:

The paper’s central notion begins with the claim that there is a physical principle, called “causal entropic forces,” that drives a physical system toward a state that maximizes its options for future change. For example, a particle inside a rectangular box will move to the center rather than to the side, because once it is at the center it has the option of moving in any direction. Moreover, argues the paper, physical systems governed by causal entropic forces exhibit intelligent behavior.

I think they have this backwards: intelligent behavior often results in instrumentally convergent behavior (and not necessarily the other way around). Similarly, Salge et al. overview the behavioral empowerment hypothesis:

The adaptation brought about by natural evolution produced organisms that in absence of specific goals behave as if they were maximizing [mutual information between their actions and future observations].

As I discuss in section 6.1 of Optimal Farsighted Agents Tend to Seek Power, I think that "ability to achieve goals in general" (power) is a better intuitive and technical notion than information-theoretic empowerment. I think it's pretty plausible that we have heuristics which, all else equal, push us to maintain or increase our power.

I have previously criticized value learning for needing to locate the human within some kind of prespecified ontology (this criticism is not new). By taking only the agent itself as primitive, perhaps we could get around this (we don't need any fancy engineering or arbitrary choices to figure out AUs/optimal value from the agent's perspective).

Wouldn't you need to locate the abstract concept of AU within the AI's ontology? Is that easier? Or sorry if I'm misunderstanding.

Wouldn't you need to locate the abstract concept of AU within the AI's ontology? Is that easier? Or sorry if I'm misunderstanding.

To the contrary, an AU is naturally calculated from reward, one of the few things that is ontologically fundamental in the paradigm of RL. As mentioned in the last post, the AU of reward function $R$ is $V^*_R(s)$, which calculates the maximum possible $R$-return from a given state.

This will become much more obvious in the AUP empirical post.

Sure. Looking forward to that. My current intuition is: Humans have a built-in reward system based on (mumble mumble) dopamine, but the existence of that system doesn't make it easy for us to understand dopamine, or reward functions in general, or anything like that, nor does it make it easy for us to formulate and pursue goals related to those things. It takes quite a bit of education and beautifully-illustrated blog posts to get us to that point :-D

Note that when I said

(we don't need any fancy engineering or arbitrary choices to figure out AUs/optimal value from the agent's perspective).

I meant we could just consider how the agent's AUs are changing without locating a human in the environment.

Cool. We're probably on the same page then.

I understand what you mean with the CCC (and that this seems a bit of a nit-pick!), but I think the wording could usefully be clarified.

As you suggest here, the following is what you mean:

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"

However, that's not what the CCC currently says.
E.g. compare:
[Unaligned goals] tend to [have catastrophe-inducing optimal policies] because of [power-seeking incentives].
[People teleported to the moon] tend to [die] because of [lack of oxygen].

The latter doesn't lead to the conclusion: "If people teleported to the moon had oxygen, they wouldn't tend to die."

Your meaning will become clear to anyone who reads this sequence.
For anyone taking a more cursory look, I think it'd be clearer if your clarification were the official CCC:

CCC: (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"

Currently, I worry about people pulling an accidental motte-and-bailey on themselves, and thinking that [weak interpretation of CCC] implies [conclusions based on strong interpretation]. (or thinking that you're claiming this)

The catastrophic convergence conjecture was originally formulated in terms of "outer alignment catastrophes tending to come from power-seeking behavior." I think that this was a mistake: I meant to talk about impact alignment catastrophes tending to be caused by power-seeking. I've updated the post accordingly.

How much are you thinking about stability under optimization? Most objective catastrophes are also human catastrophes. But if a powerful agent is trying to achieve some goal while avoiding objective catastrophes, it seems like it's still incentivized to dethrone humans - to cause basically the most human-catastrophic thing that's not objective-catastrophic.

I'm not thinking of optimizing for "not an objective catastrophe" directly - it's just a useful concept. The next post covers this.