AI Alignment Metastrategy

Vanessa Kosoy

I call "alignment strategy" the high-level approach to solving the technical problem^[1]. For example, value learning is one strategy, while delegating alignment research to AI is another. I call "alignment metastrategy" the high-level approach to converging on solving the technical problem in a manner which is timely and effective. (Examples will follow.)

In a previous article, I summarized my criticism of prosaic alignment. However, my analysis of the associated metastrategy was too sloppy. I will attempt to somewhat remedy that here, and also briefly discuss other metastrategies, to serve as points of contrast and comparison.

Conservative Metastrategy

The conservative metastrategy follows the following algorithm:

As much as possible, stop all work on AI capability outside of this process.
Develop the mathematical theory of intelligent agents to a level where we can propose adequate alignment protocols with high confidence. Ideally, the theoretical problems should be solved in such order that results with direct capability applications emerge as late as possible.
Design and implement empirical tests of the theory that incur minimal risk in worlds in which the theory contains errors or the assumptions of the theory are violated in practice.
If the tests show problems, go back to step 2.
Proceed with incrementally more ambitious tests in the same manner, until you're ready to deploy an AI defense system.

This is my own favorite metastrategy. The main reason it can fail is if the unconservative research we failed to stop creates unaligned TAI before we can deploy an AI defense system (currently, we have a long way to go to complete step 2).

I think that it's pretty clear that a competent civilization would follow this path, since it seems like the only one which leads to a good long-term outcome without taking unnecessary risks^[2]. Of course, in itself that is an insufficient argument to prove that, in our actual civilization, the conservative metastrategy is the best for those concerned with AI risk. But, it is suggestive.

Beyond that, I won't lay out the case for the conservative metastrategy here. The interested reader can turn to 1 2 3 4 5.

Incrementalist Metastrategy

The incrementalist metastrategy follows the following algorithm:

Find an advance in AI capability (by any means, including trial and error).
Find a way to align the new AI design (prioritizing solutions that you expect to scale further).
Validate alignment using a combination of empiricial tests and interpretability tools.
If validation fails, go back to step 2.
If possible, deploy an AI defense system using current level of capabilities.
Go to step 1.

This is (more or less) the metastrategy favored by adherents of prosaic alignment. In particular, this is what the relatively safety-conscious actors involved with leading AI labs present as their plan.

There are 3 main mutually reinforcing problems with putting our hopes in this metastrategy, which I discuss below. There are 2 aspects to each problem: the "design" aspect, which is what would happen if the best version of the incrementalist metastrategy was implemented, and the "implementation" aspect, which is what happens in AI labs in practice (even when they claim to follow incrementalist metastrategy).

Information Security

Design

If a new AI capability is found in step 1, and the knowledge is allowed to propagate, then irresponsible actors will continue to compound it with additional advances before the alignment problem is solved on the new level. Ideally, either the new capability should remain secret at least until the entire iteration is over, or government policy should prevent any actor from subverting the metastrategy, or some sufficient combination of both needs to be in place. Implementing the above involves major institutional and political challenges, which undermines the feasibility of the entire metastrategy.

Now, one could argue that the same problem exists with the conservative metastrategy: after all, some theoretical discoveries can also be used by irresponsible actors for dangerous AI capability advances. However, when it comes to theoretical research, the problem only shows up in some cases and usually requires additional steps by irresponsible actors (going from theory to practice). Moreover, inasmuch as dangerous AI capability advances only happen when they're grounded at solid theory, there is more hope that the risks will be transparent, predictable and avoidable.

Hence, while the conservative metastrategy also calls for regulation against irresponsible actors, the entire approach allows for graceful degradation when this part fails. On the other hand, the incrementalist metastrategy calls for working directly on the sort of research which presents the most risk: opaque AI designs discovered by trial-and-error and selected only for their capabilities. This is comparable to the dubious rationale behind gain-of-function research.

Implementation

In practice, even "safety conscious" AI labs share a lot of technical details about their research. Even when they don't, their employees are not selected for outstanding trustworthiness and understanding of the risks. Many of them might ultimately continue to other organizations and disseminate the knowledge, even when nominally bound by NDA^[3]. I also doubt they maintain a very high standard of cybersecurity. All this is in sharp contrast to organizations which actually take secrecy seriously (military and government bodies dealing with classified information).

Measuring Alignment

Design

Step 3 in the incrementalist metastrategy requires validating alignment. But, how can we tell whether an AI is aligned? What does it even mean for it to be aligned? One possibility is the economic criterion: "the AI is aligned when it behaves close enough to the designer's intent to be a profitable product". This must be appealing to company executives. However, under this definition there might be abrupt changes in alignment as a function of capability progress (or just time) that undermine the entire case for incrementalism.

The latter is true even if capability progress is perfectly smooth. For example, imagine agentic AIs that go along with our wishes because they know that they are within our power, right until they aren't. We gradually hand over the entire economy to AIs, then one day the AIs decide it's time to get rid of the humans. Even if some problems become apparent in advance, at some point it will be very hard to stop/slow the tide in the face of economic incentives.

Even with the best intentions, measuring alignment is hard. One difficulty is deceptive alignment: agentic AI that strategically searches for ways to appear aligned because this appearance is correlated with utility (for obvious reasons) under its internal world-model. But even if the AI is not deceiving us on purpose, we might still select an AI design whose behavior locally appears aligned but internal reasoning is very different from what we expect, or we can end up with behaviors that are too complicated to interpret. See Wentworth's article for more discussion.

A satisfactory solution to the measurement problem requires interpretability tools that can detect any unintended motivation inside the AIs reasoning, before you can complete even a single iteration. However, we are far from having such tools. Moreover, it's not clear for what kind of AI designs such tools exist even in principle. Worse, even if we had candidate tools, how would we know they work correctly? Without either a theoretical foundation or an alternative reliable method of measurement?

Implementation

AFAICT, AI labs just follow the economic criterion. We have no idea what kind of reasoning processes happen inside e.g. large language models and have no tools to determine that. When problems are found, they are patched as best as the companies know how, but some problems (jailbreaks, hallucinations) still don't have a general solution. I'm pointing this out not because these specific problems are important in themselves, but because it shows the companies don't actually follow the incrementalist metastrategy at all: despite the fact their products are demonstrably misaligned in some sense, they continue to compound capability advances.

Unpredictable Progress

What happens if progress in AI capabilities fails to be smooth and predictable? (Which I believe will happen.) It creates two problems:

When an innovation is found that substantially changes the design, previous alignment techniques might be rendered irrelevant, to such an extent that a working alignment technique for the new design cannot be a small modification of previous techniques. This gets us stuck in steps 2-3 for a dangerously long time, during which information can leak if the information security is less than perfect. Moreover, it might mean that a lot of previous work on alignment was wasted.
When an innovation is found that substantially advances the capabilities, it creates a gap between SOTA AI and SOTA aligned AI. If this gap reaches a decisive strategic advantage, the result is catastrophic. If this gap doesn't reach a decisive strategic advantage in itself, there is still a risk that information leaks will allow irresponsible actors to compound the gap further.

The combination of unpredictable progress with information security failures can create an extremely dangerous situation where responsible actors need to play difficult catch-up with irresponsible actors, even when the irresponsible actors are too weak to "win" the race otherwise. This consideration is in addition to the risk of a "responsible" actor ending the world by a lab experiment.

I don't distinguish between "design" and "implementation" here, because the incrementalist metastrategy has only been defended by AI labs (even nominally) for a few years, and there haven't been many large and unexpected innovations during this time^[4].

Butlerian Metastrategy

The butlerian metastrategy is: postpone TAI indefinitely. Essentially, it requires creating an AI defense system without the help of TAI^[5].

This metastrategy doesn't have many advocates, but I still find it useful to have a name for it. Most metastrategies exist on a continuum where the butlerian metastrategy is an (extreme) endpoint, depending on how bullish they are on stopping irresponsible forms of AI research and for how long.

The major problems with this metastrategy are:

Creating an adequate AI defense system without the help of TAI is extremely difficult, probably much more so than solving the technical alignment problem.
The best case scenario in this metastrategy is a future without TAI, which means giving up on enormous value.

Transhumanist Metastrategy

The transhumanist metastrategy follows the following algorithm:

As much as possible, stop all work on AI, or at least anything with capability implications.
Create technology for substantially augmenting human intelligence.
Delegate the rest of the problem to the augmented humans (who would presumably solve it much better than anything we could come up with).

This metastrategy is advocated by Eliezer Yudkowsky.

There are 3 reasons I'm pessimistic about this metastrategy:

Substantially augmenting adult human intelligence seems to me harder than solving the technical AI alignment problem. I have a plan for the latter, and I'm not sure there is any serious plan for the former. (Although, see this.) Augmenting intelligence by epilogenics seems more tractable, but will take too long.
It's not clear whether this metastrategy is safe. On the one hand, we have an aligned starting point (baseline humans). On the other hand, we are dealing with a giant undocumented mess of spaghetti code (the design of the human brain). It seems easy to imagine that changes which strongly increase intelligence can also substantially mutate values in ways that are critical but difficult to detect (among other reasons, because the augmented humans will be incentivized to hide it). We might end up handing the future to e.g. a race of superintelligent psychopaths.
The sort of experimentation on humans that would be required is (I think) illegal in most countries, and legalizing it is a major challenge, especially under a time limit.

Inasmuch as augmented humans are safe, this metastrategy has the advantage that (in contrast to incrementalism) it doesn't burn the commons by advancing AI capability. Therefore, it might make sense to work on this and the conservative approach in parallel.

^{^}
There is some arbitrary choice about what constitutes "high-level". It would be more precise to say that there is are multiple layers or even a spectrum between "general direction" and "full detailed solution". For example, Physicalist Superimitation is much more detailed than just value learning, but is still insufficiently detailed for actual implementation.
^{^}
Some critics doubt that the sort of mathematical theory I pursue is possible at all. To those I would point out that a competent civilization would at least make a serious attempt to create that theory before trying something else.
^{^}
I think that enforcing an NDA is pretty hard: the company doesn't have an easy way to know the NDA is violated, and even if it suspects that, proving it is hard, especially given that the technical details mutate along the way.
^{^}
As far as I know. If there have been but they were kept secret, then, kudos!
^{^}
In general, an AI defense system can be preventative i.e. aimed at preventing unaligned AI from being created rather than disabling it after it is created. It would definitely have to be preventative in this case.

48

AI Alignment Metastrategy

48

Conservative Metastrategy

Incrementalist Metastrategy

Information Security

Design

Implementation

Measuring Alignment

Design

Implementation

Unpredictable Progress

Butlerian Metastrategy

Transhumanist Metastrategy