Produced during the Stanford Existential Risk Initiative (SERI) ML Alignment Theory Scholars (MATS) Program of 2022, under John Wentworth


“Overconfidence in yourself is a swift way to defeat.”

- Sun Tzu


TL;DR: Escape into the Internet is probably an instrumental goal for an agentic AGI. An incompletely aligned AGI may escape prematurely, and the biggest failure mode for this is probably the AGI socially engineering the alignment researchers. Thus, opening an additional information channel between the researchers and the AGI (e.g., adding an interpretability tool and/or researcher) is inherently risky. The expected cost of adding this channel may even exceed the expected scientific benefit. Whether this is true depends on the informational efficiency of the channel (which increases the expected scientific benefit), the level of the AGI’s capabilities (which increases the risk of its premature escape), and the security level of the AGI’s box setup (which decreases the risk).


Post: Suppose we are building a machine-learning model that has a high chance of becoming Artificial General Intelligence (AGI). We are trying to align it, but we’re uncertain about whether our training is working. Also, rapidly increasing capabilities may push us into a Sharp Left Turn scenario. How do we handle this volatile situation as safely as possible? We will need to adopt a security mindset, aimed at effectively preventing catastrophic outcomes.

We propose a procedure (inspired by the pioneering work of Roman Yampolskiy) for how to train the AGI towards alignment in a secure and effective manner. Most of the ideas here are not new. Nevertheless, we hope that our game-theoretic framing of the problem will be able to help alignment researchers better plan their research directions and future alignment efforts.

Imagining the Scenario

Consider a team of researchers who are trying to build an aligned AGI. This situation can be represented by the following simplified model.

Definition: A “box” is defined by a collection of failsafes designed to confine a system away from the outside world. The empty collection comprised of zero failsafes is called the “trivial box” (i.e., an AI which is allowed unrestricted access to send and receive data on the Internet).

Step 1 - The AI is trained

The AGI emerges in a box. The precise time of the AGI's emergence and/or its superhuman capabilities probably cannot be predicted, though the research team may have a rough estimate.

Step 2 - The AGI may try to escape

At any point in the training or testing process, the AGI may attempt to break out of the box. If all failsafes of the box are breached before the team notices, the AGI escapes and the game is over (Step 4). One example of this is the scenario in which the AGI hacks itself into the Internet. Another example is the scenario in which the AGI releases power-seeking malware agents into the Internet.

If the AGI fails to escape, the researchers may decide either to retrain the AGI from scratch, fine-tune (which is less computationally expensive but may fail to remove the AGI’s escape intent/misalignment), or table the project entirely. 

The trivial box cannot prevent escapes.

Step 3 - The researchers run tests on the AGI

In our simplified model, researchers open a collection of information channels between themselves and the AGI in the box. As noted by Yampolskiy, the information from these channels is dual-use:

  1. The human researchers can use the channel to probe various properties of the AGI, to help them train it to be more aligned. In addition to advancing the science of alignment, the channel can also be used for scientific benefits in general.
  2. The AGI can use the channel to socially engineer the researchers on the other end. Socially engineered researchers can then help the AGI break the failsafes, which makes the AGI’s escape easier.

Consequently, adding information channels is one of the strongest ways to weaken failsafes.

An inadequate box setup could allow the AGI to also open information channels unintended by the research team. The AGI might do so by hacking, by using its internal circuits to send radio waves, and even by using different speeds of the cooling fan to send messages by Morse code.

However, suppose that the box setup uses theoretically robust cybersecurity, combined with an actual physical box that is designed to not let any covert information enter or leave. In this case, risks of human social engineering would mostly arise from the information channels that the researchers themselves choose to open.

Step 4 - Game Over: The AGI is deployed or escapes

The game ends when the AGI escapes or is deployed outside the box. This terminally changes future utility based on how aligned or misaligned the AGI is. We want to maximise the chance that the AGI is aligned in the very likely scenario in which it eventually leaves the box.

What will happen?

It seems that either deletion, escape, or deployment of the AGI is almost certain to happen eventually. Escaping from the box is likely to be an AGI’s instrumental goal. Also, the researchers will likely be eventually forced by overconfident management to remove the failsafes (i.e., deploy the AI), even if the AGI is not yet aligned.

Finally, if there is a perceived race towards AGI deployment by multiple AI organisations, this would further exacerbate their motivations to deploy prematurely. A superintelligent AGI, once on the Internet, will likely change the world in an irreversible manner, in the direction of whatever mysterious goals it evolved during its training. We can, however, guess that one of its instrumental goals will be preventing the emergence of competing AGIs with different goals. AGI creators who are overconfident about their alignment ability could perceive this as a huge first-mover advantage, which would motivate them to deploy prematurely. This in turn forces the non-overconfident AGI creators to choose between also deploying prematurely or being too late: the Unilateralist’s Curse.

Recall that

  1. we have made little progress in how to align an AGI,
  2. a misaligned AGI would probably be a power-seeking risk endangering all of humanity, and
  3. it seems plausible that a superintelligent AGI could emerge in the near-term future.

Thus, it is essential that there is (1) coordination between AI organisations and (2) AI governance towards safe AI, rather than a race towards premature deployment.

But this is probably difficult to achieve, so we will assume that the AGI is deployed in a finite time T unless otherwise stated.

An example alignment plan

Suppose a leading AI organisation (say, on the cusp of being the first to develop an AI agent) approached us today and asked us how to align their AI agent. We would probably tell them that very little is known about how to align an AI agent, that it would probably endanger humanity, and that they shouldn’t build it (at least before the science of alignment is solved).

But if they insist that they will build it anyway even without our input, and there seems to be no way to convince them otherwise, we might have to provide them with a reasonable alignment plan. Such a plan would aim to maximise expected utility, which might look something like this:

  • Train the AI agent in a secure box until it has high capabilities.
  • Run some automated tests that have very high per-bit efficiency (e.g., automated tests that give a percentage).
  • If it fails, either retrain or table the model.
  • Run some higher bit-rate experiments and look for very specific signs of unaligned activity.
  • If it fails, either retrain or table the model.
  • Use some coarse interpretability tools to roughly check the goals/values of the model.
  • If it fails, either retrain or table the model.
  • Use finer interpretability tools to look for signs of deception (e.g., Eliciting Latent Knowledge, or ELK).
  • If it fails, either retrain or table the model.
  • Do one of the following. Either (1) repeat the above process indefinitely (gaining scientific benefits along the way) or (2) if all positive-expected-value tests check out, deploy the model online.

(Since we currently know very little about how to align an AI agent, Option 2 would most likely deploy a misaligned, power-seeking risk endangering all of humanity. Thus, we would strongly recommend Option 1 instead, at least until a miraculous breakthrough occurs in the science of alignment: perhaps with the help of the boxed AGI. However, recall that the researchers might be forced to use Option 2 at a finite time T by overconfident management, or by a perceived race to deployment against other AI organisations.)

Observe that the tests start from those that are the most scientifically informative per bit to those that are less so. Tests whose per-bit expected scientific benefit are outweighed by the per-bit expected cost (due to human social engineering risk by the AGI) are not used.


1.  Maximise utility-per-bit

When testing the model, we want to maximise the scientific value per bit (benefit-per-bit) compared to the possible social-engineering risk due to those bits (cost-per-bit).

For example, researchers who use high-volume, low-efficiency interpretability tools on the AGI’s middle weights may be opening themselves up to a large risk of being socially engineered. This would constitute a sizable expected cost, since premature escape by an incompletely aligned AGI will likely be catastrophic and much worse than the counterfactual.

In contrast, running an automated test that gives a percentage (e.g., of whether a given property occurs) would be much better per-bit than having people read all the outputs and judge for themselves, since it would greatly reduce the ability of the AGI to perform social engineering attacks. As our confidence in the AGI’s alignment increases, we can slowly increase the bits for which we ask.

Overall, we should be maximising expected utility, and this does not always mean that we should aim for a complete victory. It would be great if we could prevent deceptive misalignment in a guaranteed manner. But this could require an inordinate amount of information, especially for complex evolved agents. The expected scientific benefit of the likely inefficient information channel required to get this information may be outweighed by its expected cost. A better course of action would be to use information channels that are more efficient on expectation, even if they may not achieve the best-case scenario of guaranteeing no deceptive misalignment. If the highest-expected-utility training plan we can realistically think of—for a given box setup—is to mostly rely on parsimonious summary statistics and hope to get lucky, then so be it.

Conversely, maximising expected utility does not always mean that we should give up on boxing once the AGI has superhuman intelligence. There are likely AI box designs that are both informative and secure against even the most intelligent AI (e.g., a securely boxed truth oracle with no output channels other than answering multiple-choice questions, whose answers are patiently and extensively audited by many human experts). Given that even a hypothetically aligned AI could phase-transition into misalignment after it increases in capabilities, running high-efficiency tests afterwards in an optimised box setup could potentially be very informative, and well worth considering.

Note: Our framework of raw computer bits, while easy to exposit, might not be the most accurate framing of the problem. For one thing, the same amount of information can be riskier or less risky depending on the rate at which it is audited by the researchers. This is because humans are more likely to make well-calibrated decisions when they have sufficient time to think things through. For another, the AGI might be able to robustly mask its intent with respect to some information channels, but not others. Incorporating these and other considerations may lead to a more realistic generalisation of our model.

2.  When to open or close information channels

We should not open an information channel unless it has a positive expected utility.

If the AI being studied has very low capabilities, then it should not be too risky to open additional channels of information. In fact, doing so might be beneficial, since it could provide valuable information about how a future AGI of a similar architecture might work.

Once you have a highly capable AI, however, the research team should be more cautious about their channels of information. In fact, they might want to consider closing their low-efficiency, high-risk information channels as soon as they suspect that the AI could make a Sharp Left Turn (to superhuman capabilities) with a non-negligible probability: even if the probability is low. This is because if the AGI has superhuman intelligence, then it could be very skilled at both social engineering and hacking: particularly if it is a Large Language Model trained on code, human writing, and/or images. We should plan for the plausible case that the AGI will effectively exploit its information channels to learn about, outmanoeuvre, and manipulate humans to escape.

Thus, we should put the cost-per-bit of information to be quite high. This leads us to a counterintuitive rule-of-thumb:

For a sufficiently capable model, unless the informational benefit is sufficiently high (relative to the riskiness of the box setup), an interpretability tool and/or alignment researcher should be removed, not added, on the margin.

3.  Social engineering is probably the biggest failure mode

This is true even of cybersecurity today, whose adversaries are just human hackers, rather than an AGI with superintelligent capabilities in human social engineering.

Even today's AI, neither superintelligent nor agentic, is sufficient to socially engineer humans. While conversing with Google's LaMDA model as a tester, Google software engineer Blake Lemoine was led to incorrectly believe that LaMDA was just “a sweet kid," to share confidential information about LaMDA publically, to invite a lawyer to represent LaMDA, to propose the unethicality of Google’s activities to a member of the House Judiciary Committee, and to eventually get fired from his dream job at Google for his aggressive actions.

This social engineering occurred by accident rather than by the AI's agency (LaMDA is most likely neither sentient as Lemoine claimed, nor agentic, nor of superhuman general capabilities). But the following is clear. An additional person looking at a superintelligent AGI's output, even a well-informed and well-meaning alignment researcher, constitutes an additional failure point.

Directions for Research

1.  Optimise AI boxing

The most obvious way to bring down expected costs, and thereby make all information channels more viable, is to improve AI boxing. This would entail empirically measuring both the level of security and the level of informativeness of various box setups. We can then work towards answering the following questions:

For a given probability of AI escape, which box setup maximises the expected scientific benefit without increasing the expected AI escape risk by an amount more than that level?


For a given level of scientific benefit, which box setup minimises the expected AI escape risk without constraining the amount of expected scientific benefit to be less than that level?

Excellent ideas on pragmatic AI boxing (e.g., by Roman Yampolskiy, David Chalmers, and others) await implementation, redteaming, and optimisation. They include:

  1. putting the AGI in a virtual machine’s test environment that looks naturally occurring rather than designed, 
  2. using homomorphic encryption to obstruct the AGI from analysing its own source code, and
  3. training the alignment researchers to withstand social engineering,

Other promising failsafes aimed at reducing risks of social engineering include multi-factor authentication, time delays for important decisions, and collective rather than unilateral decision-making. Finally, impeding AI escape with a large and diverse collection of cybersecurity failsafes (some of which can be nested multiple times) is probably ideal from a safety-mindset perspective.

Given that human social engineering is probably the biggest failure mode, it seems especially promising to systematise and optimise Eliezer Yudkowsky’s pioneering roleplaying experiments of AI escape, which could help redteam how vulnerable various box setups are to human social engineering. One way to adapt Yudkowsky’s setup to our game-theoretic framing could be to give the “human researcher” a list of scientific questions that, if answered correctly by the “AGI,” would yield positive points. To obtain the answers to these scientific questions, the “human researcher” opens a collection of information channels allowed by the box setup. The “AGI,” whose role can be played by a team of experts in social persuasion, may use these informational channels to escape its box. Its escape, if successful, would apply a substantial negative penalty to the score of the “human researcher.” The average score (over multiple AI escape games) may then be informative of how robust the given box setup is against human social engineering.

In a sense, we are describing a Box Contest (analogous to the ELK Contest). First, systematise a way to empirically estimate the robustness of various box setups. Then, get as many people as we can to brainstorm new box setups, so that the best such setup may eventually be found.

2.  Get AI organisations to box their research environments

A box is only helpful if it is used. So, we should get AI organisations to envelop their neural-net research environments in optimised box setups. 

Suppose all AI organisations with the capacity to build an AGI agree to preemptively secure their research environments with our box setups. These setups would have been empirically selected to maximise the expected scientific benefit obtainable from taking on a certain level of AI escape risk (the risk level would be shared across all participating AI organisations, so as to discourage defection). This would substantially reduce the perceived race towards premature AGI deployment, thereby enhancing our prospects for solving alignment beforehand.

For this to be achieved, it seems that (1) better coordination between AI organisations and (2) better AI governance would be crucial. This is probably difficult, but not impossible. We would really appreciate hearing any creative suggestions you may have about how to make this happen!

3.  Develop informationally efficient tests for alignment

For an information channel to be helpful rather than counterproductive, it needs to be sufficiently efficient (compared to the riskiness of the box setup). So, we should prioritise quality over quantity. 

Some potentially efficient tests for alignment include:

Automated summary statistics regarding the AGI’s behaviour. Running the AGI in a realistic test environment and outputting summary statistics of its overall behaviour seems likely to be scientifically informative on a per-bit basis.

Coarse-grained interpretability tools. If we had interpretability tools that could look at higher-level features of the model, this could reduce the amount of information we need to understand it deeply.

Interpretability tools that find specific things. Examples of this approach include ELK and natural abstractions research.

4.  Better understand what channels to use and when

It would be helpful to know how to, for any given situation, empirically estimate how much utility-per-bit a given interpretability tool would yield. This would differentiate the helpful interpretability tools (positive utility-per-bit) from the counterproductive ones (negative utility-per-bit).

Note that doing this accurately is probably easier said than done. Predicting a priori the scientific benefit of a given collection of experiments is notoriously difficult. Most likely, predicting the magnitude of the human-social-engineering risk posed by an AGI—of an unknown, potentially superhuman level of capabilities—would also be quite difficult. Even a small advance in our ability to predict either the expected scientific benefit or the expected cost (of adding a given interpretability tool and/or researcher) would likely be a hard-won victory.

5.  Study generalisable facts about existing systems

The risks of human social engineering and of AI escape are probably quite low at our current level of AI capabilities. Thus, if we can probe for patterns shared by current low-risk models that—as far as we know—seem to generalise across the different capability levels in our historical record, then we can hope that these patterns might also generalise to high-risk AGI models in the future. If this turns out to be true, then we would have succeeded in obtaining high-value information without taking on a significant risk.

The downside, of course, is that we a priori have no guarantee that any pattern shared by low-capabilities models will generalise to the high-capabilities regime. Thus, we should also consider using informationally efficient tests on our potentially superintelligent AGI (even if it might be risky) to probe whether the earlier pattern we found (1) remains true or (2) has phase-transitioned into something completely different.


Epistemic status: We still have relatively high uncertainty in our scenario. However, we hope that our attempt at formulating the problem from a fresh perspective can help alignment researchers contemplate what we think are potentially promising but often overlooked research directions. 


Acknowledgements: We are grateful to Matthias Mayer and David Udell for their helpful feedback on the draft.

New Comment
1 comment, sorted by Click to highlight new comments since:

However, suppose that the box setup uses theoretically robust cybersecurity, combined with an actual physical box that is designed to not let any covert information enter or leave.

I think what you want to say here is:

However, suppose that the box setup uses robust cybersecurity, combined with an actual physical box that does not let any covert information enter or leave.

  1. "...theoretically..." weakens rather than strengthens: we need the cybersecurity to be robust, in spite of implementation details.
  2. It doesn't matter what the box "is designed to" do; it matters what it does.

In both cases, this puts us in trouble:
We can check the cybersecurity is theoretically robust; we can't check it's robust.
We know what the box is designed to do; we don't know what it does.

These holes probably aren't too significant for superintelligent AGI (which we should expect to persuade us anyway). They may well be significant for the kinds of about-human-level systems where you're hoping boxing might help.

On the rest, I'd largely agree with John et al. We might as well use boxing, but we shouldn't be spending much effort on it when we could make more important progress.