Four Ways An Impact Measure Could Help Alignment

by Matthew Barnett · 8th Aug 2019



Impact penalties are designed to help prevent an artificial intelligence from taking actions which are catastrophic.

Despite the apparent simplicity of this approach, there are in fact several distinct frameworks under which impact measures could prove helpful. In this post, I seek to clarify the different ways that an impact measure could ultimately help align an artificial intelligence or otherwise benefit the long-term future.

I think it's possible that some critiques of impact measures are grounded in an intuition that they don't help us achieve X, where X is something the critic thought impact measures were supposed to provide, or something that would be good to have in general. The obvious reply to these critiques is that impact measures were never intended to do X, and that impact penalties aren't meant to be a complete solution to alignment.

My hope is that in distinguishing the ways that impact penalties can help alignment, I will shed light on why some people are more pessimistic or optimistic than others. I am not necessarily endorsing the study of impact measurements as an especially tractable or important research area, but I do think it's useful to gather some of the strongest arguments for it.

Roughly speaking, I think that an impact measure could potentially help humanity in at least one of four main scenarios.

1. Designing a utility function that roughly optimizes for what humans reflectively value, while recognizing that mistakes are possible, such that regularizing against extreme maxima seems like a good idea (i.e., impact as a regularizer).

2. Constructing an environment for testing AIs that we want to be extra careful about, due to uncertainty regarding their ability to do something extremely dangerous (i.e., impact as a safety protocol).

3. Creating early-stage task AIs that have a limited function, and are not intended to do any large-scale world optimization (i.e., impact as an influence-limiter).

4. Less directly, impact measures could still help humanity with alignment because researching them could allow us to make meaningful progress on deconfusion (i.e., impact as deconfusion).


Impact as a regularizer

In machine learning, a regularizer is a term added to the loss function or training process that reduces the effective capacity of a model, in the hope that it will generalize better.

One common instance of a regularizer is a scaled norm penalty of the model parameters that we add to our loss function. A popular interpretation of this type of regularization is that it represents a prior over what we think the model parameters should be. For example, in Ridge Regression, this interpretation can be made formal by invoking a Gaussian prior on the parameters.
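As a concrete illustration of the Ridge Regression example: the L2 penalty on the parameters has a simple closed-form effect that can be seen in a few lines of code. This is a generic toy sketch (the data and the `ridge_fit` helper are my own, not from the post):

```python
import numpy as np

# Toy illustration: ridge regression adds a scaled L2 penalty on the
# weights, which corresponds to a Gaussian prior over the parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=20)

def ridge_fit(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 (closed form)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_unregularized = ridge_fit(X, y, alpha=0.0)
w_ridge = ridge_fit(X, y, alpha=10.0)

# The penalty shrinks the fitted weights toward zero (toward the prior mean),
# pulling the model away from extreme parameter settings.
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_unregularized)
```

The larger `alpha` is, the stronger the prior, and the more evidence the data must provide before the fitted parameters move far from zero.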

The idea is that, in the absence of strong evidence, we shouldn't allow the model to use its limited information to make decisions that we, the researchers, can see would be rash and unjustified.

One framing of impact measures is that we can apply the same rationale to artificial intelligence. Consider some scheme where an AI has been given the task of undertaking ambitious value learning: whatever utility function the AI initially believes is the true one, it should be extra cautious about optimizing the world heavily until it has gathered a very large amount of evidence that this really is the right utility function.

One way this could be realized is through an impact penalty that is gradually phased out as the AI gathers more evidence. This isn't the way I have seen impact measurement framed so far, but it is still quite intuitive to me.
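The phase-out idea can be sketched as a simple schedule. This is a hypothetical toy model (every name here is illustrative, and "evidence" is abstracted into a single number):

```python
# Hypothetical sketch: scale an impact penalty by how much evidence the
# agent has gathered about the true utility function, so the penalty
# fades as confidence grows. All names and units are illustrative.

def penalty_weight(evidence_bits, initial_weight=10.0, halflife=100.0):
    """Impact-penalty coefficient that halves every `halflife` bits of evidence."""
    return initial_weight * 0.5 ** (evidence_bits / halflife)

def regularized_objective(task_utility, impact, evidence_bits):
    """What the agent actually optimizes: task value minus the scaled impact."""
    return task_utility - penalty_weight(evidence_bits) * impact

# Early on, the impact term dominates and the same action scores poorly;
# with lots of evidence, the penalty has faded and the action looks fine.
early = regularized_objective(task_utility=1.0, impact=0.5, evidence_bits=0.0)
late = regularized_objective(task_utility=1.0, impact=0.5, evidence_bits=1000.0)
assert early < late
```

Under this schedule, an agent with little evidence only takes actions whose task value clearly outweighs the heavily-weighted impact term, which matches the "dishes first, CEO later" progression described below.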

Consider a toy scenario where we have solved ambitious value learning and decide to design an AI to optimize human values in the long term. In this scenario, when the AI is first turned on, it is given the task of learning what humans want. In the beginning, in addition to its task of learning human values, it also tries helping us in low impact ways, perhaps by cleaning our laundry and doing the dishes. Over time, as it gathers enough evidence to fully understand human culture and philosophy, it will have the confidence to do things which are much more impactful, like becoming the CEO of some corporation.

I think that it's important to note that this is not what I currently think will happen in the real world. However, I think it's useful to imagine these types of scenarios because they offer concrete starting points for what a good regularization strategy might look like. In practice, I am not too optimistic about ambitious value learning, but more narrow forms of value learning could still benefit from impact measurements. As we are still somewhat far from any form of advanced artificial intelligence, uncertainty about which methods will work makes this analysis difficult.

Impact as a safety protocol

When I think about advanced artificial intelligence, my mind tends to forward-chain from current AI developments, imagining them scaled up dramatically. In these types of scenarios, I'm most worried about something like mesa optimization, where in the process of building a model that performs some useful task, we end up searching over a very large space of optimizers that ultimately optimize for some other task we never intended.

To oversimplify a bit, there are a few ways we could ameliorate the issue of misaligned mesa optimization. One is to find a way to robustly align arbitrary mesa objectives with base objectives. I am pessimistic about this strategy working without some radical insights: it currently seems really hard, and succeeding would amount to solving a huge chunk of alignment.

Alternatively, we could whitelist our search space so that only certain safe optimizers can be discovered. This is a task where I see impact measurements being helpful.

When we do some type of search over models, we could construct an explicit optimizer that forms the core of each model. The actual parameters that we perform gradient descent over would need to be limited enough such that we could still transparently see what type of "utility function" is being inner optimized, but not so limited that the model search itself would be useless.

If we could constrain and control this space of optimizers enough, then we should be able to explicitly add safety precautions to these mesa objectives. The exact way that this could be performed is a bit difficult for me to imagine. Still, I think that as long as we are able to perform some type of explicit constraint on what type of optimization is allowed, then it should be possible to penalize mesa optimizers in a way that could potentially avoid catastrophe.

During the process of training, the model will start unaligned and gradually shift towards performing better on the base objective. At any point during the training, we wouldn't want the model to try to do anything that might be extremely impactful, both because it will initially be unaligned, and because we are uncertain about the safety of the trained model itself. An impact penalty could thus help us to create a safe testing environment.
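The safety-protocol idea above can be sketched as a toy training loop. This is purely illustrative (the linear "model," the finite-difference gradients, and especially the stubbed-in impact measure are my own stand-ins, not a real proposal):

```python
import numpy as np

# Toy sketch: during training, add an impact term to the base loss so a
# partially-trained (possibly misaligned) model is discouraged from
# high-impact behavior. "Impact" is crudely stubbed as the magnitude of
# the model's outputs; a real measure would be far more involved.
rng = np.random.default_rng(1)

def base_loss(params, x, y):
    return float(np.mean((x @ params - y) ** 2))

def impact_estimate(params, x):
    # Stand-in for how strongly the model's actions would perturb the world.
    return float(np.mean((x @ params) ** 2))

def training_loss(params, x, y, beta):
    return base_loss(params, x, y) + beta * impact_estimate(params, x)

def grad(f, params, eps=1e-5):
    # Finite-difference gradient; fine for a toy example.
    g = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        g[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return g

x = rng.normal(size=(50, 3))
true_params = np.array([2.0, -1.0, 0.5])
y = x @ true_params

params = np.zeros(3)
for _ in range(300):
    params = params - 0.05 * grad(lambda p: training_loss(p, x, y, beta=0.5), params)

# The penalty trades some task performance for lower measured impact:
# the trained model acts more conservatively than the unpenalized optimum.
assert impact_estimate(params, x) < impact_estimate(true_params, x)
assert base_loss(params, x, y) > 0.0
```

The point of the sketch is only the shape of the objective: the penalty applies during training as a safety measure, and need not survive into the deployed system, as the next paragraph notes.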

The intention here is not that we would add some type of impact penalty to the AIs that are eventually deployed. It is simply that, as we perform the testing, there will be some limitation on how much power we give the mesa optimizers. Having a penalty for mesa optimization can then be viewed as a short-term safety patch that minimizes the chances of an AI doing something extremely bad that we didn't expect.

It is perhaps hard at first to see how an AI could be dangerous during the training process. But there is good reason to believe that as our experiments get larger, they will require artificial agents to understand more about the real world while they are training, which incurs significant risk. There are also specific, predictable ways in which a model being trained could become dangerous, such as in the case of deceptive alignment. It is conceivable that having some way to reduce impact for optimizers in these cases will be helpful.

Impact as an influence-limiter

Even if we didn't end up putting an impact penalty directly into some type of ambitiously aligned AGI, or use it as a safety protocol during testing, there are still a few disjunctive scenarios in which impact measures could help construct limited AIs. A few examples would be if we were constructing Oracle AIs and Task AGIs.

Impact measurements could help Oracles by cleanly providing a separation between "just giving us true, important information" and "heavily optimizing the world in the process." This is, as I understand it, one of the main issues with Oracle alignment at the moment, which suggests that an impact measurement could be quite helpful in that regard.

One rationale for constructing a task AGI is that it would allow humanity to perform some important action that buys us more time to solve the more ambitious varieties of alignment. I am personally less optimistic about this particular approach, as in my view it would require a very advanced form of coordination around artificial intelligence. In general I incline towards the view that competitive AIs will take the form of more service-specific machine learning models, which might imply that even if we succeeded at creating a low-impact AGI that achieved a specific purpose, it wouldn't be competitive with other AIs that have no impact penalty at all.

Still, there is broad agreement that if we have a good theory about what is happening within an AI, then we are more likely to succeed at aligning it. Creating agentic AIs seems like a good way to obtain that form of understanding. If this is the route humanity ends up taking, then impact measurements could provide immense value.

This justification for impact measures is perhaps the most salient in the debate over impact measurements. It seems to be behind the critique that impact measurements need to be useful rather than just safe and value-neutral. At the same time, I know from personal experience that there is at least one person currently thinking about ways to leverage current impact penalties to be useful in this scenario. Since I don't have a good model for how this can be done, I will refrain from specific rebuttals of this idea.

Impact as deconfusion

The concept of impact appears to neighbor other relevant alignment concepts, like mild optimization, corrigibility, safe shutdowns, and task AGIs. I suspect that even if impact measures are never actually used in practice, there is still some potential that drawing clear boundaries between these concepts will help clarify approaches for designing powerful artificial intelligence.

This is essentially my model for why some AI alignment researchers believe that deconfusion is helpful. Developing a rich vocabulary for describing concepts is a key feature of how science advances. Particularly clean and insightful definitions help clarify ambiguity, allowing researchers to say things like "That technique sounds like it is a combination of X and Y without having the side effect of Z."

A good counterargument is that there isn't any particular reason to believe this concept deserves priority for deconfusion. It would border on a motte and bailey to claim that some particular research will lead to deconfusion, and then, when pressed, appeal to the value of research in general. I am not trying to do that here. Instead, I think impact measurements are potentially valuable because they focus attention on a subproblem of alignment, namely catastrophe avoidance. And there has empirically been demonstrable progress, in a way that provides evidence this approach is a good idea.

Consider David Manheim and Scott Garrabrant's Categorizing Variants of Goodhart's Law. For those unaware, Goodhart's law is roughly summed up in the saying, "When a measure becomes a target, it ceases to be a good measure." The paper catalogs the different cases in which this phenomenon can arise. Crucially, it isn't necessary for the paper to present a solution to Goodhart's law in order to illuminate how we could avoid the issue. By distinguishing the ways in which the law holds, we can focus on addressing those specific sub-issues rather than blindly coming up with one giant patch for the entire problem.
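Goodhart's law is easy to demonstrate in miniature. In this toy simulation (my own construction, not from the paper), a proxy metric tracks the true goal for typical inputs but diverges under hard optimization:

```python
import numpy as np

# Toy illustration of Goodhart's law: a proxy correlates with the true
# goal on moderate inputs, but optimizing the proxy hard selects extreme
# points where the correlation breaks down.
rng = np.random.default_rng(0)

def true_value(x):
    return x - 0.1 * x ** 2   # real goal: peaks at x = 5, then declines

def proxy(x):
    return x                  # proxy: tracks the goal near 0, diverges for large x

candidates = rng.uniform(0, 20, size=10_000)
best_by_proxy = candidates[np.argmax(proxy(candidates))]
best_by_truth = candidates[np.argmax(true_value(candidates))]

# The proxy optimizer picks an extreme x (near 20) where true value is
# strongly negative, while the true optimum sits near x = 5.
assert true_value(best_by_proxy) < true_value(best_by_truth)
```

Mild selection on the proxy is harmless here; it is only extreme optimization pressure that pushes into the regime where proxy and goal come apart, which is why catastrophe avoidance is a natural sub-problem to isolate.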

Similarly, the idea of impact measurement is a confusing concept. There's one interpretation in which an "impact" is some type of distance between two representations of the world. In this interpretation, saying that something had a large impact is another way of saying that the world changed a lot as a result. In newer interpretations of impact, we like to say that an impact is really about a difference in what we are able to achieve.

The distinction between "differences in the world state" and "differences in what we are able to do" is subtle, and (at least to me) enlightening. It gives me new terminology with which to talk about the impact of artificial intelligence. For example, in Nick Bostrom's founding paper on existential risk studies, his definition of existential risk included events that could

permanently and drastically curtail [humanity's] potential.

One interpretation of the above definition is that Bostrom was referring to potential in the sense of the second definition of impact rather than the first.
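The two notions of impact can be contrasted in a toy sketch. Everything here is illustrative (states as feature vectors, a made-up "attainable value" function), but it shows how the two definitions can disagree about which action was "bigger":

```python
import numpy as np

# Toy sketch contrasting two notions of impact:
#   1. impact as distance between world states ("the world changed a lot")
#   2. impact as change in attainable value ("what we can achieve changed")

def state_distance_impact(s_before, s_after):
    """Definition 1: how different is the world after the action?"""
    return float(np.linalg.norm(np.asarray(s_after) - np.asarray(s_before)))

def attainable_utility_impact(s_before, s_after, attainable_value):
    """Definition 2: how much did our ability to achieve things change?"""
    return abs(attainable_value(s_after) - attainable_value(s_before))

# Illustrative setup: capability depends only on feature 0 (a key resource).
attainable = lambda s: 10.0 * s[0]

# "Repainting" changes many state features but leaves capability intact;
# "destroying the resource" changes one feature but removes capability.
repaint_before, repaint_after = np.array([1.0, 0.0, 0.0]), np.array([1.0, 5.0, 5.0])
destroy_before, destroy_after = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])

# The two measures rank the actions oppositely.
assert state_distance_impact(repaint_before, repaint_after) > \
       state_distance_impact(destroy_before, destroy_after)
assert attainable_utility_impact(repaint_before, repaint_after, attainable) < \
       attainable_utility_impact(destroy_before, destroy_after, attainable)
```

On the second definition, destroying the key resource is the high-impact event even though it barely changes the world state, which matches the reading of Bostrom's "curtailed potential" as lost attainable value rather than mere change.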

A highly unrealistic way that this distinction could help us is if we had some future terminology which allowed us to unambiguously ask AI researchers to "see how much impact this new action will have on the world." AI researchers could then boot up an Oracle AI and ask the question in a crisply formalized framework.

More realistically, I could imagine that the field may eventually stumble upon useful cognitive strategies for framing the alignment problem, such that impact measurement becomes a convenient, precise concept to work with. As AI gets more powerful, the problems of alignment will become more immediate, forcing us to quickly adapt our language and strategies to the specific evidence we are given.

Within a particular subdomain, I think an AI researcher could ask questions about what they are trying to accomplish, and talk about it using the vocabulary of well understood topics, which could eventually include impact measurements. The idea of impact measurement is simple enough that it will (probably) get independently invented a few times as we get closer to powerful AI. Having thoroughly examined the concept ahead of time rather than afterwards offers future researchers a standard toolbox of precise, deconfused language.

I do not think the terminology surrounding impact measurements will ever quite reach the ranks of terms like "regularizer" or "loss function," but I am inclined to think that simple, common-sense concepts should be rigorously defined as the field advances. Since we have immense uncertainty about the types of AI that will end up being powerful, and about the approaches that will be useful, it is perhaps most helpful at this point to develop tools that can reliably be handed off to future researchers, rather than putting too much faith in any one particular method of alignment.
