Can we get impact measurement right? Does there exist One Equation To Rule Them All?

I think there’s a decent chance there isn’t a simple airtight way to implement AUP which lines up with AUP_conceptual, mostly because it’s just incredibly difficult in general to perfectly specify the reward function.

Reasons why it might be feasible:

  • We’re trying to get the agent to do the goal without it becoming more able to do the goal, which is conceptually simple and natural.
  • We’ve been able to handle previous problems with AUP through clever design choice modifications, so it’s plausible we can do the same for all future problems.
  • There are a lot of ways to measure power due to instrumental convergence, which increases the chance that at least one of them will work.
  • Intuitively, this sounds like the kind of thing which could work (if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock).

Even so, I am (perhaps surprisingly) not that excited about actually using impact measures to restrain advanced AI systems. Let’s review some concerns I provided in Reasons for Pessimism about Impact of Impact Measures:

  • Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead. This is especially true for training time, assuming the designers slowly increase aggressiveness until they get a reasonable policy.
  • In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.
  • In a What Failure Looks Like-esque scenario, it isn't clear how impact-limiting any single agent helps prevent the world from "gradually drifting off the rails".

You might therefore wonder why I’m working on impact measurement.

Deconfusion

Within Matthew Barnett’s breakdown of how impact measures could help with alignment, I'm most excited about impact measure research as deconfusion. Nate Soares explains:

By deconfusion, I mean something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”

To give a concrete example, my thoughts about infinity as a 10-year-old were made of rearranged confusion rather than of anything coherent, as were the thoughts of even the best mathematicians from 1700. “How can 8 plus infinity still be infinity? What happens if we subtract infinity from both sides of the equation?” But my thoughts about infinity as a 20-year-old were not similarly confused, because, by then, I’d been exposed to the more coherent concepts that later mathematicians labored to produce. I wasn’t as smart or as good of a mathematician as Georg Cantor or the best mathematicians from 1700; but deconfusion can be transferred between people; and this transfer can spread the ability to think actually coherent thoughts.

In 1998, conversations about AI risk and technological singularity scenarios often went in circles in a funny sort of way. People who are serious thinkers about the topic today, including my colleagues Eliezer and Anna, said things that today sound confused. (When I say “things that sound confused,” I have in mind things like “isn’t intelligence an incoherent concept,” “but the economy’s already superintelligent,” “if a superhuman AI is smart enough that it could kill us, it’ll also be smart enough to see that that isn’t what the good thing to do is, so we’ll be fine,” “we’re Turing-complete, so it’s impossible to have something dangerously smarter than us, because Turing-complete computations can emulate anything,” and “anyhow, we could just unplug it.”) Today, these conversations are different. In between, folks worked to make themselves and others less fundamentally confused about these topics—so that today, a 14-year-old who wants to skip to the end of all that incoherence can just pick up a copy of Nick Bostrom’s Superintelligence.

Similarly, suppose you’re considering the unimportant and trivial question of whether seeking power is convergently instrumental, which we can now crisply state as "do most reward functions induce optimal policies which take over the planet (more formally, which visit states with high POWER)?".

You’re a bit confused if you argue in the negative by saying “you’re anthropomorphizing; chimpanzees don’t try to do that” (chimpanzees aren’t optimal) or “the set of reward functions which does this has measure 0, so we’ll be fine” (for any reachable state, there exists a positive measure set of reward functions for which visiting it is optimal).

You’re a bit confused if you argue in the affirmative by saying “unintelligent animals fail to gain resources and die; intelligent animals gain resources and thrive. Therefore, since we are talking about really intelligent agents, of course they’ll gain resources and avoid correction.” (animals aren’t optimal, and evolutionary selection pressures narrow down the space of possible “goals” they could be effectively optimizing).

After reading this paper on the formal roots of instrumental convergence, instead of arguing about whether chimpanzees are representative of power-seeking behavior, we can just discuss how, under an agreed-upon reward function distribution, optimal action is likely to flow through the future of our world. We can think about to what extent the paper's implications apply to more realistic reward function distributions (which don't identically distribute reward over states).[1] Since we’re less confused, our discourse doesn’t have to be crazy.
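To make that concrete, here is a minimal Monte Carlo sketch of the flavor of that formalism. The four-state MDP, the uniform-over-states reward distribution, and the normalization below are illustrative assumptions rather than the paper’s exact POWER definition; the point is only that, averaging optimal value over many sampled reward functions, the state with the most options comes out on top.

```python
import numpy as np

# Toy deterministic MDP (an illustrative construction, not from the paper).
# State 3 is a "hub" that can reach every state; states 1 and 2 are dead ends.
next_states = {
    0: [0, 3],        # stay, or move to the hub
    1: [1],           # dead end
    2: [2],           # dead end
    3: [0, 1, 2, 3],  # hub: can reach everything
}
states = list(next_states)
gamma = 0.9
rng = np.random.default_rng(0)

def optimal_values(reward, iters=200):
    """Value iteration for V*_R in this small deterministic MDP."""
    v = np.zeros(len(states))
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[s2] for s2 in next_states[s])
                      for s in states])
    return v

def power_estimates(n_samples=500):
    """Monte Carlo average of normalized optimal value over reward functions
    drawn i.i.d. uniform on [0, 1] for each state -- roughly the paper's
    POWER, ignoring its exact normalization."""
    acc = np.zeros(len(states))
    for _ in range(n_samples):
        reward = rng.uniform(size=len(states))
        acc += (1 - gamma) * optimal_values(reward)
    return acc / n_samples

for s, p in enumerate(power_estimates()):
    print(f"state {s}: POWER-ish estimate {p:.3f}")
# The hub typically scores highest: for most sampled rewards, the optimal
# policy reaches and holds whichever state pays the most, and the hub can
# reach any of them immediately.
```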

But also since we’re less confused, the thinking we do in the privacy of our own minds doesn’t have to be crazy. It's not that I think any single fact or insight or theorem downstream of my work on AUP is totally obviously necessary to solve AI alignment. But it sure seems good that we can mechanistically understand instrumental convergence and power, know what “impact” means instead of thinking it’s mostly about physical change to the world, think about how agents affect each other, and conjecture why goal-directedness seems to lead to doom by default.[2]

Attempting to iron out flaws in our current-best AUP equation makes one intimately familiar with how and why power-seeking incentives can sneak in even when you’re trying to keep them out in the conceptually correct way. This point is harder for me to articulate, but I think there’s something vaguely important in understanding how this works.

Formalizing instrumental convergence also highlighted a significant hole in our theoretical understanding of the main formalism of reinforcement learning. And if you told me two years ago that you could possibly solve side-effect avoidance in the short-term with one simple trick (“just preserve your ability to optimize a single random reward function, lol”), I’d have thought you were nuts. Clearly, there’s something wrong with our models of reinforcement learning environments if these results are so surprising.
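For concreteness, here is a toy sketch of the general shape of that “one simple trick”: penalize the primary reward by how much an action changes attainable utility for a single auxiliary reward, relative to doing nothing. This is a simplified illustration, not the exact equation from any AUP paper; published versions differ in the baseline, the scaling, and how the auxiliary Q-values are computed.

```python
# Toy sketch (a simplified illustration): penalize the primary reward by how
# much an action changes the agent's attainable utility for an auxiliary
# goal, relative to a no-op.

def aup_reward(primary_reward, q_aux_action, q_aux_noop, lam=0.1, scale=1.0):
    """AUP-shaped reward for one step.
    primary_reward: task reward R(s, a).
    q_aux_action:   attainable utility for the auxiliary (e.g. random) reward
                    after taking the action.
    q_aux_noop:     attainable utility for the auxiliary reward after a no-op.
    lam, scale:     penalty strength and normalization (versions differ here).
    """
    penalty = abs(q_aux_action - q_aux_noop)
    return primary_reward - lam * penalty / scale

# Disabling an off-switch might leave the task reward unchanged while sharply
# raising attainable utility for the auxiliary goal, so it gets penalized:
print(aup_reward(primary_reward=0.0, q_aux_action=5.0, q_aux_noop=1.0, lam=0.5))
# -0.5 * 4.0 = -2.0
```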

In my opinion, research on AUP has yielded an unusually high rate of deconfusion and insights, probably because we’re thinking about what it means for the agent to interact with us.


  1. When combined with our empirical knowledge of the difficulty of reward function specification, you might begin to suspect that there are lots of ways the agent might be incentivized to gain control, many openings through which power-seeking incentives can permeate – and your reward function would have to penalize all of these! If you were initially skeptical, this might make you suspect that power-seeking behavior is harder to avoid than you first thought. ↩︎

  2. If we collectively think more and end up agreeing that AUP solves impact measurement, it would be interesting that you could solve such a complex, messy-looking problem in such a simple way. If, however, CCC ends up being false, I think that would also be a new and interesting fact not currently predicted by our models of alignment failure modes. ↩︎

if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock

This seems broadly reasonable to me, but I don't think it can work under the threat model of optimal agents. "Impact" / "more able" as defined in this sequence can only be caused by events the agent didn't perfectly predict, because impact requires a change in the agent's belief about the reward it can accumulate. In a deterministic environment with a truly optimal agent, the agent's beliefs will never change as it executes the optimal policy, and so there will never be impact. So AUP_conceptual using the definition of impact/power in this sequence doesn't seem like it solves the problem under the threat model of perfectly optimal agents. (That's fine! We won't have those!)

In practice, I interpret variants of AUP-the-method (as in the previous post) as trying to get safety via some combination of two things:

  • Power proxy: When using a set of auxiliary reward functions, the agent's beliefs about attainable utility for the auxiliary rewards changes, because it is not following an optimal policy for them. This forms a good proxy for power that is compatible with the agent having perfect beliefs. The main problem here is that proxies can be gamed (as in various subagent constructions).
  • Starting from a "dumber" belief: (Super unclear / fuzzy) Given that the agent's actual beliefs won't change, we can instead have it measure the difference between its beliefs and some "dumber" beliefs, e.g. its beliefs if it were following an inaction policy or a random policy for N timesteps, followed by an optimal policy. The problem here is that you aren't leveraging the AI's understanding of the environment, and so in practice I'd expect the effect of this is going to depend pretty significantly on the environment. (A rough sketch of this baseline idea follows after this list.)
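Here is a very rough, self-contained sketch of that second idea; the 1-D world, the attainable-utility proxy, and the n-step inaction rollout are all illustrative assumptions rather than anything defined in the sequence.

```python
# Rough toy sketch of the "dumber belief" baseline: compare attainable utility
# after the agent's actual action against attainable utility after it instead
# does nothing for n steps. Everything here (the 1-D world, the utility proxy)
# is an illustrative assumption.

def step(position, action):           # actions: -1 (left), 0 (wait), +1 (right)
    return position + action

def attainable_utility(position):     # crude proxy: farther right = more options
    return float(position)

def dumber_belief_penalty(position, action, n_steps=3):
    actual = step(position, action)
    baseline = position
    for _ in range(n_steps):
        baseline = step(baseline, 0)  # inaction policy: just wait
    return abs(attainable_utility(actual) - attainable_utility(baseline))

print(dumber_belief_penalty(position=0, action=+1))  # moving "gains power": 1.0
print(dumber_belief_penalty(position=0, action=0))   # waiting: 0.0
```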

I like AUP primarily because of the first reason: while the power proxy is not ungameable, it certainly seems quite good, and seems like it only deviates from our intuitive notion of power in very weird circumstances or under adversarial optimization. While this means it isn't superintelligence-safe, it still seems like an important idea that might be useful in other ways.

Once you remove the auxiliary rewards and only use the primary reward R, I think you have mostly lost this benefit: at this point you are saying "optimize for R, but don't optimize for long-term R", which seems pretty weird and not a good proxy for power. At this point I think you're only getting the benefit of starting from a "dumber" belief, or perhaps you shift reward acquisition to be closer to the present than the far future, but this seems pretty divorced from the CCC and all of the conceptual progress made in this sequence. It seems much more in the same spirit as quantilization and/or satisficing, and I'd rather use one of those two methods (since they're simpler and easier to understand).

(I analyzed a couple of variants of AUP-without-auxiliary-rewards here; I think it mostly supports my claim that these implementations of AUP are pretty similar in spirit to quantilization / satisficing.)
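To make the comparison concrete, here is a minimal q-quantilizer sketch, assuming a uniform base distribution over a finite set of candidate actions (under that assumption, sampling uniformly from the top q fraction ranked by estimated utility matches the usual definition); everything else here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(candidate_actions, utility, q=0.1):
    """Rank candidates by estimated utility and sample uniformly from the top
    q fraction. With a uniform base distribution over the candidates, this is
    a q-quantilizer; it exerts far less optimization pressure than argmax."""
    ranked = sorted(candidate_actions, key=utility, reverse=True)
    top_k = max(1, int(q * len(ranked)))
    return ranked[rng.integers(top_k)]

# Example: 100 random candidate "actions", pick among the best 10% of them.
candidates = list(rng.uniform(size=100))
print(quantilize(candidates, utility=lambda a: a, q=0.1))
```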

In a deterministic environment with a truly optimal agent, the agent's beliefs will never change as it executes the optimal policy, and so there will never be impact. So AUP_conceptual using the definition of impact/power in this sequence doesn't seem like it solves the problem under the threat model of perfectly optimal agents.

I don't think this critique applies to AUP_conceptual. AUP_conceptual is defined as penalizing the intuitive version of "change in power", not the formal definition. From our perspective, we could still say an agent is penalized for changes in power (intuitively perceived), even if the world is secretly deterministic.

If I'm an optimal agent with perfect beliefs about what the (deterministic) world will do, even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?

If by "intuitive" you mean "from the perspective of real humans, even if the agent is optimal / superintelligent", then I feel like there are lots of conceptual solutions to AI alignment, like "do what I mean", "don't do bad things", "do good things", "promote human flourishing", etc.

(this comment and the previous both point at relatively early-stage thoughts; sorry if it seems like I'm equivocating)

even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?

I think there's a piece of intuition missing from that first claim, which goes something like "power has to do with easily exploitable opportunities in a given situation", so it doesn't matter if the agent is optimal. In that case, gaining a ton of money would increase power.

If by "intuitive" you mean "from the perspective of real humans, even if the agent is optimal / superintelligent", then I feel like there are lots of conceptual solutions to AI alignment, like "do what I mean", "don't do bad things", "do good things", "promote human flourishing", etc.

While I was initially leaning towards this perspective, I'm leaning away now. However, still note that this solution doesn't have anything to do with human values in particular.

has to do with easily exploitable opportunities in a given situation

Sorry, I don't understand what you mean here.

However, still note that this solution doesn't have anything to do with human values in particular.

I feel like I can still generate lots of solutions that have that property. For example, "preserve human autonomy", "be nice", "follow norms", "do what I mean", "be corrigible", "don't do anything I wouldn't do", "be obedient".

All of these depend on the AI having some knowledge about humans, but so does penalizing power.

Sorry, I don't understand what you mean here.

When I say that our intuitive sense of power has to do with the easily exploitable opportunities available to an actor, that refers to opportunities which e.g. a ~human-level intelligence could notice and take advantage of. This has some strange edge cases, but it's part of my thinking.

The key point is that AUP_conceptual relaxes the problem:

If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?

This is not trivial. I think it's a useful question to ask (especially because we can formalize so many of these power intuitions), even if none of the formalizations are perfect.

The key point is that AUP_conceptual relaxes the problem:

If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?

This is not trivial.

Probably I'm just missing something, but I don't see why you couldn't say something similar about:

"preserve human autonomy", "be nice", "follow norms", "do what I mean", "be corrigible", "don't do anything I wouldn't do", "be obedient"

E.g.

If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?

It seems like the main difference is that for power in particular, there's more hope that we could formalize it without reference to humans (which seems harder to do for e.g. "niceness"), but then my original point applies.

(This discussion was continued privately – to clarify, I was narrowly arguing that AUP_conceptual is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)