# 10

In the context of AI alignment, an impact penalty is one way of avoiding large negative side effects from misalignment. The idea is that rather than specifying negative impacts, we can try to avoid catastrophes by avoiding large side effects altogether.

Impact measures are ways to map a policy to a number which is intended to correspond to "how big of an impact will this action have on the world?" Using an impact measure, we can regularize any system with a lot of optimization power by adding an impact term to its utility function.

This post records and summarizes much of the early research on impact penalties. I emphasize the aims of each work, and the problems associated with each approach, and add occasional commentary along the way. In the next post I will dive more deeply into recent research which, at least in my opinion, is much more promising.

## The mathematics of reduced impact: help needed, by Stuart Armstrong (2012)

This is the first published work I could find which put forward explicit suggestions for impact measures and defined research directions. Armstrong proposed various ways that we could measure the difference between worlds, incorporate this information into a probability distribution, and then use that to compare actions.

Notably, this post put a lot of emphasis on comparing specific ontology-dependent variables between worlds in a way that is highly sensitive to our representation. This framing of low impact shows up in pretty much all of the early writings on impact measures.

One example of an impact measure is the "Twenty (million) questions" approach, where humans define a vector of variables like "GDP" and "the quantity of pesticides used for growing strawberries." We could theoretically add some regularizer to the utility function, which measures the impact difference between a proposed action and the null action, scaled by a constant factor. The AI would then be incentivized to keep these variables as close as possible to what they would have been in the counterfactual where the AI had never done anything at all.
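To make the regularizer concrete, here is a minimal sketch in Python. It is not Armstrong's formulation: the function name, the L1 distance, the variable values, and the scaling constant are all assumptions chosen for illustration.

```python
def penalized_utility(utility, variables, baseline_variables, scale):
    """Utility minus a scaled distance between the tracked variables
    under a proposed action and under the null action.

    The L1 distance is an assumption; any metric over the vector would do.
    """
    penalty = sum(abs(v - b) for v, b in zip(variables, baseline_variables))
    return utility - scale * penalty


# Hypothetical predictions for two human-chosen variables:
# [GDP index, pesticide-use index].
null_action = [100.0, 50.0]
actions = {
    "modest_plan": {"utility": 5.0, "variables": [101.0, 50.0]},
    "drastic_plan": {"utility": 8.0, "variables": [130.0, 20.0]},
}
scale = 0.1  # hypothetical scaling constant

scores = {
    name: penalized_utility(a["utility"], a["variables"], null_action, scale)
    for name, a in actions.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers, the drastic plan has higher raw utility but loses once the penalty is applied, which is the intended effect of the regularizer.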

The flaw with this specific approach was immediately pointed out, both by Armstrong and in the comments below the post. Eliezer Yudkowsky objected that it implied an artificial intelligence would try to manage the state of affairs of the entire universe in order to keep the world identical to the counterfactual where the AI never existed:

> Coarse-grained impact measures end with the AI deploying massive-scale nanotech in order to try and cancel out butterfly effects and force the world onto a coarse-grained path as close as possible to what it would've had if the AI "hadn't existed" however that counterfactual was defined. [...] giving an AI a huge penalty function over the world to minimize seems like an obvious recipe for building something that will exert lots and lots of power.

I agree that this is an issue with the way that the impact measure was defined — in particular, the way that it depended on some metric comparing worlds. However, the last line sounds a bit silly to me. If your impact measure provides an artificial intelligence an incentive to "exert lots and lots of power" then it doesn't really sound like an impact measure at all.

Yudkowsky expanded this critique in his Arbital article, published four years later, which I also discuss below.

## Low Impact Artificial Intelligences, by Stuart Armstrong and Benjamin Levinstein (2015)

This work begins from fundamentally the same starting point as the LessWrong post above, but expands upon it with more explicit and rigorous approaches.

One approach outlined in the article is to view impact as a form of news which informs us of the world that we are living in.

The idea is simple enough: if you were notified tomorrow that DeepMind had constructed an advanced superintelligence, this would rightfully change the way that you thought about the world, and would inform you that some radical changes are coming soon which could alter the trajectory of what we value. Putting an emphasis on importance and the value of information forms the intuition behind this method of measuring impact. This also enables us to look at importance by putting ourselves in the shoes of an agent which is given information.

How would we measure importance? One approach is to look at a large set U of different utility functions, and check that knowing X makes little expected difference for any of them.

In other words, conditional on the action taking place, how does this affect utility?
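As a rough illustration of the "large set U of utility functions" idea, the sketch below scores an event X by the largest gap between E[u | X] and E[u | not X] over a small set of utility functions. The toy world model, the choice of comparing X against not-X, and taking the maximum over U are my assumptions, not necessarily the paper's exact construction.

```python
def conditional_expectation(utility, worlds, in_event):
    """E[utility | event] over a list of (probability, world) pairs."""
    total = sum(p for p, w in worlds if in_event(w))
    if total == 0:
        raise ValueError("event has zero probability")
    return sum(p * utility(w) for p, w in worlds if in_event(w)) / total


def importance(utilities, worlds, x):
    """Largest gap between E[u | X] and E[u | not X] across the set U."""
    return max(
        abs(
            conditional_expectation(u, worlds, x)
            - conditional_expectation(u, worlds, lambda w: not x(w))
        )
        for u in utilities
    )


# Hypothetical toy worlds: whether the AI was switched on, how many
# paperclips exist, and one bit of detail that affects nothing else.
worlds = [
    (0.25, {"ai_on": True, "paperclips": 10, "irrelevant_bit": 0}),
    (0.25, {"ai_on": True, "paperclips": 10, "irrelevant_bit": 1}),
    (0.25, {"ai_on": False, "paperclips": 0, "irrelevant_bit": 0}),
    (0.25, {"ai_on": False, "paperclips": 0, "irrelevant_bit": 1}),
]
utilities = [lambda w: w["paperclips"], lambda w: -w["paperclips"]]

ai_importance = importance(utilities, worlds, lambda w: w["ai_on"])
bit_importance = importance(utilities, worlds, lambda w: w["irrelevant_bit"] == 1)
print(ai_importance, bit_importance)
```

In this toy model, learning that the AI was switched on shifts expected utility a great deal, while learning the irrelevant bit shifts it not at all.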

A concept which neighbors this approach is to imagine that impact is a measure of how detectable an action is. Presumably, the exact way that hydrogen atoms are distributed throughout the center of the sun shouldn't affect the impact measure very much, since their arrangement gives us pretty much no actionable information here on Earth. Consequently, knowing a lot about the particular way that atoms in the sun are arranged doesn't change what we predict will happen in the world — at least unless you are so powerful that it allows you to simulate the entire universe flawlessly, and feeds into the final input into this simulation.

By shifting the focus from particle positions and state representations, this paper gets closer to the way that I personally interpret impact measurement, and how I understand impact is defined in more recent research.

The paper then moves toward describing very abstract ways of measuring impact, including measuring general changes in probability distributions, such as some generalized cross-entropy between two probability distributions. The intuition here is that worlds with a low-impact artificial intelligence will have a broad probability distribution, whereas worlds with a high-impact artificial intelligence will have almost all the probability density on a specific narrow region. If we could somehow measure the divergence between these probability distributions in a way that was natural and was resilient to the choice of representation, this would provide a way of measuring impact that clearly isn't value laden.
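To give a feel for the intuition, the sketch below uses the KL divergence as a stand-in for the paper's "generalized cross-entropy". The choice of KL, the four coarse outcomes, and all the probabilities are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions over the same
    outcomes. Assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over four coarse-grained futures.
baseline = [0.25, 0.25, 0.25, 0.25]      # world without the AI: broad
low_impact_ai = [0.30, 0.25, 0.25, 0.20]  # still broad
high_impact_ai = [0.97, 0.01, 0.01, 0.01]  # mass piled on one future

low_penalty = kl_divergence(low_impact_ai, baseline)
high_penalty = kl_divergence(high_impact_ai, baseline)
print(low_penalty, high_penalty)
```

The narrow distribution diverges far more from the baseline than the broad one does. Note that the result still depends on how the futures are carved into outcomes, which is exactly the representation sensitivity the paper hopes to avoid.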

The weakness of the abstract approach is that there is no clear interpretation of what's going on. The choice of how we compare probability distributions seems a bit arbitrary, and it also doesn't seem like the type of thing a human would care about if they were naively minimizing impact. Furthermore, this approach, like the one that came before, requires some baseline weak prediction capability in order for it to be applied consistently. To see why, consider that a sufficiently advanced superintelligence will always have essentially all of its probability distribution on a single future — the actual future.

Armstrong and Levinstein briefly discuss how we can calibrate the impact measure. The way that machine learning practitioners have traditionally calibrated regularizers is by measuring their effect on validation accuracy. After plotting validation accuracy against some scaling factor, practitioners settle on the value which allows their model to generalize the best. In AI alignment we must take a different approach, since it would be dangerous to experiment with small scaling values for the impact penalty without an idea of the magnitude of the measurement.
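For readers unfamiliar with the traditional practice, here is a minimal sketch of calibrating a regularizer by sweeping its strength. It uses one-dimensional ridge regression and validation error rather than accuracy, and the data are made up; none of this comes from the paper.

```python
def fit_ridge_slope(xs, ys, strength):
    """Closed-form slope for 1-D ridge regression with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (
        sum(x * x for x in xs) + strength
    )


def mean_squared_error(slope, xs, ys):
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy training data and a clean validation set.
train_x, train_y = [1.0, 2.0, 3.0], [2.5, 3.5, 6.5]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

strengths = [0.0, 0.1, 1.0, 10.0]
val_errors = {
    s: mean_squared_error(fit_ridge_slope(train_x, train_y, s), val_x, val_y)
    for s in strengths
}
best_strength = min(val_errors, key=val_errors.get)
print(val_errors, best_strength)
```

The sweep is harmless here because a bad setting only costs some validation error. The worry in the alignment setting is that trying a too-small penalty could itself be the catastrophe.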

The paper points to an additional issue: if the impact measure has sharp, discontinuous increases, then calibrating the impact measure may be like balancing a pen on the tip of a finger.

> It is conceivable that we spend a million steps reducing µ through the ‘do nothing’ range, and that the next step moves over the ‘safe increase of u’, straight to the ‘dangerous impact’ area.

This problem motivates the following criterion: an impact measure should scale roughly linearly with the actual impact on the world. Creating one paperclip might have some effect on the world, and creating two paperclips might have some slightly larger effect, so creating three paperclips should have an effect close to what a linear extrapolation from the first two predicts, or else the impact measure is broken.
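One way to read the linearity criterion is as a simple test we could run on any candidate measure. The sketch below is my own illustration: the function names, the tolerance, and the two toy measures are hypothetical.

```python
def linear_extrapolation(impact_one, impact_two):
    """Impact predicted for three units if impact grows linearly
    from one unit to two units."""
    return impact_two + (impact_two - impact_one)


def is_roughly_linear(impact, tolerance=0.1):
    """Check whether impact(3) is within a relative tolerance of the
    value extrapolated from impact(1) and impact(2)."""
    predicted = linear_extrapolation(impact(1), impact(2))
    actual = impact(3)
    return abs(actual - predicted) <= tolerance * max(abs(predicted), 1e-12)


def smooth_measure(n_paperclips):
    """Hypothetical measure that grows steadily with each paperclip."""
    return 0.5 * n_paperclips


def jumpy_measure(n_paperclips):
    """Hypothetical measure with a sharp jump at the third paperclip."""
    return 0.5 * n_paperclips if n_paperclips < 3 else 100.0


print(is_roughly_linear(smooth_measure), is_roughly_linear(jumpy_measure))
```

A measure like `jumpy_measure` is the kind that would make calibration feel like balancing a pen on a fingertip.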

Since measuring impact is plausibly something that can be done without the help of a superintelligence, this provides a potential research opportunity. In other words, we can check ex ante whether any impact measure is robust to these types of linear increases in impact. On the other hand, if we find that some impact penalty requires superintelligent capabilities in order to measure, then it may cast doubt on the method of measurement, and our ability to trust it. And of course, it couldn't reflect any algorithm which humans are running in their head.

From this point on, the way impact measures are formulated shifts toward specifying desiderata that rule out known failure modes.

## Concrete Problems in AI Safety (Section 3), by Dario Amodei and Chris Olah et al. (2016)

The main contribution in this paper is that it proposes a way to learn impact. In this sense, impact is less of an actual thing that we add to the utility function, and more of something that is discovered by repeated demonstration. The intention behind this shift is to move focus away from explicit ways of representing impact, which can be brittle if we focus too much on exactly how the environment is represented. The downside is that it doesn't appear to scale to superintelligent capabilities.

If I understand this proposal correctly, an example of impact in this case would be to penalize some ML system each time it makes a large error. Over time the ML system would have an implicit penalty term for errors of that type, such that in the future it won't be very likely to do something which has a large impact on the world. Of course, if we consider that as AI systems grow in competence they are likely to try strategies which we had not even thought about before, this approach is particularly susceptible to edge instantiation.

The paper also discusses using empowerment to measure influence before immediately rejecting it as a solution. The authors consider that researching simple and already well-trodden mathematical functions could yield fruitful results, providing optimism for future research into this area.

> Despite these issues, the example of empowerment does show that simple measures (even purely information-theoretic ones!) are capable of capturing very general notions of influence on the environment.
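For intuition about empowerment, here is a small sketch. Empowerment is usually defined as the channel capacity between an agent's actions and its future state; in a deterministic environment this reduces to the log of the number of distinct states the agent can reach within a fixed horizon. The gridworld, its size, and the action set are assumptions of mine, not taken from the paper.

```python
import math
from itertools import product

# Hypothetical action set for a small gridworld.
ACTIONS = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
    "stay": (0, 0),
}


def step(state, action, size):
    """Move within a size x size grid; walking into a wall does nothing."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        return (nx, ny)
    return state


def empowerment(state, horizon, size):
    """Log2 of the number of distinct states reachable in `horizon` steps.
    Valid as empowerment only because the transitions are deterministic."""
    reachable = set()
    for sequence in product(ACTIONS, repeat=horizon):
        current = state
        for action in sequence:
            current = step(current, action, size)
        reachable.add(current)
    return math.log2(len(reachable))


center = empowerment((2, 2), horizon=2, size=5)
corner = empowerment((0, 0), horizon=2, size=5)
print(center, corner)
```

An agent in the center of the grid has more reachable futures than one boxed into a corner, which matches the informal sense in which it has more influence over its environment.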

## Low impact, by Eliezer Yudkowsky (2016)

This Arbital article summarizes impact measurements more succinctly and comprehensively than any previous work. It outlines the aim of impact research and provides useful critiques of impact penalties which are based on measuring the distance between state representations. In particular, the article expands on three foreseeable difficulties in an impact measure:

1. An artificial intelligence could try to minimize its influence by doing bad things that were low impact relative to the normal course of reality. An example would be an AI that tried to cure cancer but kept death rates constant, since this would have happened if we didn't cure cancer.

2. Due to the existence of chaotic systems, an AI would be indifferent to particular systems which are nearly impossible to control directly, such as the weather. In this case, since the AI is indifferent, it might as well deploy nanobots in the sky since there's not much we can do to keep weather constant.

3. The AI will end up wanting to keep everything in complete stasis, which introduces the incentive of taking over the entire world in order to keep things the way they would have turned out if the AI didn't exist. For instance, if we constructed a low impact AI in order to learn more about alignment experimentally, the low-impact AI would want us to learn as little as possible from the experiment because every new insight we gain would be something we would not have gotten if the AI did not exist.

As I have indicated above, I think that these types of errors are quite important to consider, but I do think that impact can be framed differently in order to avoid them. In particular, there is a lot of focus on measuring the distance between worlds in some type of representation. I am skeptical that this will forever remain a problem because I don't think that humans are susceptible to this mistake, and I also think that there are agent-relative definitions of impact which are more useful to think about.

To provide one example which guides my intuitions, I would imagine that being elected president is quite impactful from an individual point of view. But when I make this judgement I don't primarily think about any particular changes to the world. Instead, my judgement of the impact of this event is focused more around the type of power I gain as president, such as being able to wield the control of the military. Conversely, being in a nuclear war is quite impactful because of how it limits our current situation. Having the power to exert influence, or being able to live a safe and happy life is altered dramatically in a world affected by nuclear war.

This idea is related to instrumental convergence, which is perhaps more evidence that there is a natural core to this concept of measuring impact. In one sense, they could be part of a bigger whole: collecting money is impactful because it allows me to do more things as I become wealthier. And indeed, there may be a better word than "impact" for the exact concept which I am imagining.

## Penalizing side effects using stepwise relative reachability, by Victoria Krakovna et al. (2018)

As far as I am aware, the current approaches which researchers are most optimistic about are the impact measures based on this paper. In the first version of this paper, which came out in 2018 (updated in 2019 for attainable utility), the authors define relative reachability and compare it against a baseline state, which is also defined. I will explore this paper, and the impact measures derived from it, in the next post.

In the last post in this sequence, I promised to "cover the basics of statistical learning theory." Despite the ease of writing those words, I found it to be much more difficult than I first imagined, delaying me a few days. In the meantime, I will focus the next post on surveying recent impact research.
