Toy models of AI control for concentrated catastrophe prevention

Fabien Roger; Buck

In this post, we analyze some simple games that model the problem of ensuring AIs never succeed at taking a single catastrophic action (which we'll refer to as "escaping") during deployment. This exercise highlights some of the difficulties associated with deploying misaligned models given a limited perfect auditing budget, and also suggests some opportunities for technique development.

We analyzed these games while writing AI Control: Improving Safety Despite Intentional Subversion. (Thanks to harfe for solving one of these games, and to Ryan Greenblatt and Kshitij Sachan for discussion about these at the time.)

Core takeaways:

Indistinguishable episodes (i.e. the AI doesn't know what time it is) + tiny escape success rate + catch-then-shut-down can give a high chance of winning even when only a small fraction of episodes is audited, but only if the success rate is really tiny or if failures are loud (i.e. we notice failed attempts).
We can improve our chances by identifying on which episodes it would be harder to escape and auditing them more often. This works even if the AI knows which episodes we think are good opportunities.
We can improve our chances significantly if an escape requires collusion, as long as we audit at least a few examples from the category of data on which AIs plan to escape.

Assumptions used across all toy models

For the sake of simplicity, we make the following assumptions:

The situation is a zero-sum game between a scheming AI and humanity, where the scheming AI wins if it takes over, and humanity wins if it either catches the scheming AI red-handed, or if the deployment period ends without the scheming AI having successfully taken over.
We analyze these scenarios as games where the humans first choose a strategy (which can be randomized) and then the AI chooses a response. This means that we’re assuming that both players are choosing the optimal strategy, and also assuming that the AI knows what game it’s playing and all the parameters of the game (e.g. the human auditing budget, the number of episodes in the game, etc).
The only failures we model are concentrated catastrophic failures, where a single catastrophic action is enough for the deceptive AI to escape.
The AI doesn’t maintain any state, and so e.g. can’t learn from evaluations (but can learn from data on which we trained it).

Distinguishable episodes with limited auditing

The model: In every episode, the AI chooses whether to try to escape. It always succeeds if it tries. Independently of the AI's action, the lab chooses whether to apply an expensive auditing procedure to the AI action, which can catch an escape attempt with probability $e_{o v}$ (the efficiency of auditing).

The AI knows: The current episode number; and the properties of the game it is playing (in particular, it knows $T$ , the number of episodes it will play for).

The optimal AI strategy: Select a random episode and try to escape then.

The optimal lab response: Apply auditing to a random subset of episodes of size $p_{o v} T$ , as large as you can afford.

$P (humanity wins) = p_{o v} e_{o v}$

Takeaway: Auditing a random subset of episodes doesn't work. Since $p_{o v}$ will often be ~1% in order for the use of powerful AI to make any sense (compared to "just using humans"), humanity wins with probability at most ~1%. We really want to either give less information to the AI, or get more signal on which AI actions are the most likely to lead to an escape.
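As a sanity check on $p_{o v} e_{o v}$, here is a minimal Monte Carlo sketch of this game. The function name and the parameter values are illustrative assumptions, not taken from the post's code.

```python
import random

def simulate_distinguishable(T, p_ov, e_ov, n_trials=20000, seed=0):
    """Estimate P(humanity wins) when the AI escapes on one uniformly random episode.

    Illustrative sketch: the AI always succeeds unless its single attempt
    is audited AND the audit catches it (probability e_ov).
    """
    rng = random.Random(seed)
    n_audited = round(p_ov * T)
    wins = 0
    for _ in range(n_trials):
        audited = set(rng.sample(range(T), n_audited))  # lab's random audit subset
        attempt = rng.randrange(T)                      # AI's single escape attempt
        if attempt in audited and rng.random() < e_ov:
            wins += 1
    return wins / n_trials

estimate = simulate_distinguishable(T=1000, p_ov=0.1, e_ov=0.8)
print(f"simulated: {estimate:.3f}, closed form p_ov * e_ov: {0.1 * 0.8:.3f}")
```

The estimate should land close to the closed form, up to sampling noise.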

Indistinguishable episodes with limited auditing

The model: Same as above. We assume $T ≫ 1$ (large number of episodes) and $e_{o v} = 1$ for simplicity (no auditing failure).

The AI knows: Only the properties of the game.

The optimal AI strategy: On each episode, attempt escape with probability $λ = \frac{1}{T} \frac{1}{1 - p_{o v}} log (\frac{1}{p_{o v}})$ .

The optimal lab response: Spend all the auditing budget at the start--keep spending until the auditing budget has been exhausted.

$P (humanity wins) = e^{- λ T} + (1 - e^{- λ p_{o v} T}) = e^{\frac{1}{1 - p_{o v}} log (p_{o v})} + 1 - e^{\frac{p_{o v}}{1 - p_{o v}} log (p_{o v})}$ ( $T ≫ 1$ , therefore $λ ≪ 1$ , therefore the number of escape attempts can be approximated as a Poisson distribution, and there are two win conditions: no escape attempt, or at least one escape attempt in the first epochs where auditing is on). If $p_{o v} ≪ 1$ , this is close to $p_{o v} (1 + log (1 / p_{o v}))$ .
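The AI's attempt rate can be recovered from the expression above by setting its derivative with respect to $λ$ to zero. A short sketch of that step, under the same Poisson approximation:

```latex
\begin{aligned}
f(\lambda) &= e^{-\lambda T} + 1 - e^{-\lambda p_{ov} T} \\
f'(\lambda) &= -T e^{-\lambda T} + p_{ov} T e^{-\lambda p_{ov} T} = 0 \\
&\Rightarrow\; e^{-\lambda T (1 - p_{ov})} = p_{ov} \\
&\Rightarrow\; \lambda = \frac{1}{T} \, \frac{1}{1 - p_{ov}} \, \log\frac{1}{p_{ov}}
\end{aligned}
```

At this point $f''(\lambda) = T^2 e^{-\lambda p_{ov} T} \, p_{ov} (1 - p_{ov}) > 0$, so it is a minimum of $P (humanity wins)$, which is what the AI wants.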

Takeaway: This is somewhat better for humanity than the previous setup: the probability of winning is increased by about an order of magnitude if the proportion of audited episodes is between 1% and 1/1B. The core problem when you have no deployment countermeasures is not distribution shift, but that you want to deploy the AI for much longer than you can validate it for.
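To make the "order of magnitude" claim concrete, the following sketch evaluates the formulas of this section for a few audit fractions. Function names and the chosen values of $T$ and $p_{o v}$ are assumptions made for illustration.

```python
import math

def ai_attempt_rate(T, p_ov):
    """Per-episode escape probability lambda from the formula above."""
    return math.log(1 / p_ov) / (T * (1 - p_ov))

def p_humanity_wins(lam, T, p_ov):
    """exp(-lam*T) + (1 - exp(-lam*p_ov*T)), using expm1 for accuracy at tiny p_ov."""
    return math.exp(-lam * T) - math.expm1(-lam * p_ov * T)

def small_p_approx(p_ov):
    """Approximation p_ov * (1 + log(1/p_ov)), valid when p_ov << 1."""
    return p_ov * (1 + math.log(1 / p_ov))

T = 1_000_000  # illustrative number of episodes
for p_ov in (1e-2, 1e-4, 1e-9):
    lam = ai_attempt_rate(T, p_ov)
    exact = p_humanity_wins(lam, T, p_ov)
    print(f"p_ov={p_ov:g}  P(win)={exact:.3g}  approx={small_p_approx(p_ov):.3g}  "
          f"gain over p_ov: {exact / p_ov:.1f}x")
```

The last column is the improvement over the distinguishable-episode game with perfect auditing, where $P (humanity wins) = p_{o v}$.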

Indistinguishable episodes with limited auditing and escape attempts which sometimes fail silently

The model: Same as above, except that escape attempts only succeed with probability $s$ . The lab doesn’t know when the AI fails to escape, except if it audits the attempt.

The AI knows: Only the properties of the game ( $T$ and $s$ ).

Note: The situation looks the same as one where $p_{o v}$ has been increased to $p_{o v}^{'} = p_{o v} / (p_{o v} + s (1 - p_{o v}))$ and $T$ has been decreased to $T^{'} = T (p_{o v} + s (1 - p_{o v}))$ . In other words, it’s as if the non-audited period were shortened by a factor of $s$ . Indeed, humanity wins either if at least one attempt falls in the audited period (and is caught), or if none does and no attempt in the non-audited period succeeds: $P (humanity wins) = (1 - e^{- λ p_{o v} T}) + e^{- λ p_{o v} T} e^{- λ s (1 - p_{o v}) T} = e^{- λ T (p_{o v} + s (1 - p_{o v}))} + 1 - e^{- λ p_{o v} T} = e^{- λ T^{'}} + 1 - e^{- λ T^{'} p_{o v}^{'}}$

The optimal AI strategy: On each episode, attempt to escape with probability $λ = \frac{1}{T^{'}} \frac{1}{1 - p_{o v}^{'}} log (\frac{1}{p_{o v}^{'}})$

The optimal lab response: Spend all the auditing budget in the first epochs until the auditing budget has been exhausted.

$P (humanity wins) = e^{- λ T^{'}} + 1 - e^{- λ T^{'} p_{o v}^{'}}$ . If $p_{o v}^{'} ≪ 1$ , this is close to $p_{o v} (1 + log (s / p_{o v})) / s$

Takeaway: We multiplied the probability of victory by approximately $1 / s$ . We need a very high failure rate for silent failures to offset a tiny $p_{o v}$ . This is in contrast with loud failures (where humanity wins if the model fails), which give us a high chance of winning even if $p_{o v}$ is orders of magnitude smaller than $s_{l o u d}$ , as long as $s_{l o u d}$ is small.
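The reduction to an effective game $(T^{'}, p_{o v}^{'})$ is easy to check numerically. The sketch below (hypothetical function names, illustrative parameters) computes $P (humanity wins)$ for several silent-failure success rates $s$ and compares it with the small-$p_{o v}^{'}$ approximation.

```python
import math

def effective_game(T, p_ov, s):
    """Map (T, p_ov, s) to the equivalent no-failure game (T', p_ov')."""
    scale = p_ov + s * (1 - p_ov)
    return T * scale, p_ov / scale

def p_win_silent(T, p_ov, s):
    """P(humanity wins) under the AI's optimal rate; assumes 0 < s <= 1 and 0 < p_ov < 1."""
    _, p_eff = effective_game(T, p_ov, s)
    lam_T_eff = math.log(1 / p_eff) / (1 - p_eff)  # lambda * T'
    return math.exp(-lam_T_eff) - math.expm1(-lam_T_eff * p_eff)

T, p_ov = 1_000_000, 1e-4  # illustrative values
for s in (1.0, 0.1, 0.01):
    approx = p_ov * (1 + math.log(s / p_ov)) / s
    print(f"s={s:g}  P(win)={p_win_silent(T, p_ov, s):.4f}  approx={approx:.4f}")
```

The approximation is only meaningful while $p_{o v} ≪ s$ ; as $s$ approaches $p_{o v}$ it overshoots.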

Indistinguishable episodes with limited auditing and escape opportunities of various levels of quality

The model: Same as above, except that there are different kinds of tasks, each with their own success probability $s_{i}$ and number of episodes $T_{i}$ . This can also correspond to cases where there is a cheap monitor running on every input and where the deceptive AI can predict the output of the monitor: succeeding at an escape in a way which is less suspicious to the monitor is harder.

The AI knows: Only the properties of the game ( $s_{1}, \dots, s_{k}$ , and $T_{1}, \dots, T_{k}$ )

The optimal AI strategy: Attempt to escape on task $i$ with probability $λ_{i} = \frac{C}{T_{i}^{'}} = \frac{C}{T_{i} (p_{i} + s_{i} (1 - p_{i}))}$ where $C = \frac{1}{k} \frac{1}{1 - P} log (\frac{1}{P})$ is a constant independent of $i$ (see below for the expression of $p_{i}$ and $P$ ).

The optimal lab response: Audit a fraction $p_{i} = \frac{P}{P + (1 - P) / s_{i}}$ of each task $i$ , where $P = p_{i}^{'}$ is a constant independent of $i$ (see the section above for the interpretation of $p_{i}^{'}$ ). $P$ is the root of the budget condition $B = \sum_{i} \frac{P c_{i} T_{i}}{P + (1 - P) / s_{i}}$ , where $B$ is the auditing budget and $c_{i}$ is the cost of auditing one episode of task $i$ .

Demonstration: $P (humanity wins) = e^{- \sum_{i} λ_{i} T_{i}^{'}} + 1 - e^{- \sum_{i} λ_{i} T_{i}^{'} p_{i}^{'}}$ . At equilibrium, the derivatives of $P (humanity wins)$ with respect to each $λ_{i}$ are zero. Thus, for every $i$ : $e^{- \sum_{j} λ_{j} T_{j}^{'}} = p_{i}^{'} e^{- \sum_{j} λ_{j} T_{j}^{'} p_{j}^{'}}$ , which means $p_{i}^{'}$ is independent of $i$ : $p_{i}^{'} = P$ . $P$ is fixed by the budget constraint: $B = \sum_{i} p_{i} c_{i} T_{i} = \sum_{i} \frac{P c_{i} T_{i}}{P + (1 - P) / s_{i}}$ . And thus $P (humanity wins) = e^{- \sum_{i} λ_{i} T_{i}^{'}} + 1 - e^{- P \sum_{i} λ_{i} T_{i}^{'}}$ . By symmetry, all $λ_{i} T_{i}^{'}$ are equal to a constant $C$ , thus $P (humanity wins) = e^{- k C} + 1 - e^{- k P C}$ . Using the same derivation as in the previous section, the AI minimizes this by choosing $C = \frac{1}{k} \frac{1}{1 - P} log (\frac{1}{P})$ .

$P (humanity wins) = e^{- k C} + 1 - e^{- k C P}$

Takeaway: Given a good enough classification of tasks into dangerous and non-dangerous ones, one can drastically reduce the probability of an escape by focusing auditing on the most dangerous tasks (but still audit the other tasks with a reduced frequency).
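Since the budget condition has no closed-form solution for $P$ in general, it has to be solved numerically. Below is a minimal bisection sketch; the function names, the two-task example, and the reading of $c_{i}$ as a per-episode audit cost are assumptions made for illustration, and this is not the authors' code.

```python
import math

def audit_fractions(P, s):
    """p_i = P / (P + (1 - P) / s_i) for each task's escape success rate s_i."""
    return [P / (P + (1 - P) / s_i) for s_i in s]

def budget_used(P, s, T, c):
    """Total audit cost sum_i p_i * c_i * T_i (c_i: assumed per-episode audit cost)."""
    return sum(p * ci * Ti for p, ci, Ti in zip(audit_fractions(P, s), c, T))

def solve_P(budget, s, T, c, tol=1e-12):
    """Bisection on P in (0, 1); budget_used is increasing in P.

    Assumes 0 < budget < sum_i c_i * T_i and every s_i > 0.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if budget_used(mid, s, T, c) < budget:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def p_win(P):
    """e^{-kC} + 1 - e^{-kCP} with kC = log(1/P) / (1 - P)."""
    kC = math.log(1 / P) / (1 - P)
    return math.exp(-kC) - math.expm1(-kC * P)

# Illustrative example: a small risky task and a large, much safer one,
# with enough budget to audit 1% of all episodes.
s, T, c = [0.5, 0.01], [1_000, 99_000], [1.0, 1.0]
P = solve_P(budget=1_000, s=s, T=T, c=c)
print("audit fractions:", [round(p, 4) for p in audit_fractions(P, s)])
print("P(humanity wins):", round(p_win(P), 3))
```

In this example the risky task is audited at a much higher rate than the safe one, which is the pattern the takeaway describes.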

Indistinguishable episodes with limited average auditing in an infinite game

The model: The AI chooses whether to attempt to escape in each episode, and each attempt succeeds with probability $s$ . Independently of the AI's actions, the lab chooses whether to apply an expensive auditing procedure to the AI's actions, which always catches an escape if it is attempted. The game runs forever, and the lab can audit on average a fraction $p_{o v}$ of episodes (imagine the lab getting a continuous stream of revenue).

The AI knows: Only the properties of the game.

The optimal AI strategy: Attempt to escape on each episode with probability $ϵ$ small enough that the lab can't increase its probability of catching an escape attempt by auditing more at the beginning (i.e. $1 / ϵ$ is large relative to the number of episodes which happen between two moments where the lab gets revenue).

The optimal lab response: Audit each episode with probability $p_{o v}$ .

$P (humanity wins) = 1 - \frac{1}{1 + \frac{p_{o v}}{s (1 - p_{o v})}}$

Demonstration: Conditioning on the next escape attempt, $P (humanity eventually loses) = (1 - p_{o v}) (s + (1 - s) P (humanity eventually loses))$ , because humanity loses when the AI attempts to escape, the attempt is not audited, and either the attempt succeeds or it fails silently and humanity eventually loses in a later round.
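Solving this recurrence for the losing probability, written $L$ here as shorthand, recovers the closed form stated above:

```latex
\begin{aligned}
L &= (1 - p_{ov}) \bigl( s + (1 - s) L \bigr) \\
L \bigl( 1 - (1 - p_{ov})(1 - s) \bigr) &= s (1 - p_{ov}) \\
L &= \frac{s (1 - p_{ov})}{p_{ov} + s (1 - p_{ov})}
   = \frac{1}{1 + \frac{p_{ov}}{s (1 - p_{ov})}} \\
P(\text{humanity wins}) &= 1 - L = \frac{p_{ov}}{p_{ov} + s (1 - p_{ov})}
\end{aligned}
```

The last expression is the same quantity as $p_{o v}^{'}$ in the silent-failure section.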

Takeaway: If the game is infinite, you need $p_{o v}$ to be much bigger than $s$ to win.

Coordinated escape in mostly indistinguishable episodes with limited auditing

The model: Same as indistinguishable episodes with limited auditing, but there are $k$ AIs which each choose whether to attempt to escape in each episode. They always succeed if they all attempt on the same episode, and fail silently otherwise. They can all see a uniformly random feature of the input, which can take one of $N$ possible values.

The AI knows: Only the properties of the game, and what value of the random feature to coordinate on (the lab doesn't have access to this value).

The optimal AI strategy: Attempt to escape with probability $λ$ on episodes where the random feature has the value to coordinate on.

The optimal lab response: Spend all the auditing budget in the first epochs until the auditing budget has been exhausted.

$P (humanity wins)$ :

How $P (humanity wins)$ is computed:

$P (humanity wins) = 1 - (1 - (1 - λ^{k})^{T (1 - p) / N}) (1 - λ)^{T p k / N} \approx 1 - (1 - e^{- λ^{k} T (1 - p) / N}) e^{- λ T p k / N} = e^{- (λ^{k} (1 - p) + λ k p) T / N} + 1 - e^{- λ T p k / N}$

Therefore, by taking the derivative of this expression, we find that $λ$ satisfies $(λ^{k - 1} (1 - p) / p + 1) = e^{λ^{k} (1 - p) T / N}$ , which we solve numerically. The code for this computation and all other plots of this post can be found here.
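For readers who want to reproduce the numerical step, here is a minimal sketch that minimizes the approximate expression for $P (humanity wins)$ directly over a log-spaced grid of $λ$ values. This is not the authors' code: the function names, grid bounds, and parameter values are assumptions. A grid search is used instead of root-finding because $λ = 0$ also satisfies the stationarity condition when $k > 1$.

```python
import math

def p_win_coordinated(lam, k, T, N, p):
    """Approximate P(humanity wins) from the expression above (accurate for small lam)."""
    n = T / N  # episodes showing the agreed-upon feature value
    caught = -math.expm1(-lam * k * p * n)  # 1 - exp(-lam * T * p * k / N)
    uncaught_no_escape = math.exp(-(lam**k * (1 - p) + lam * k * p) * n)
    return uncaught_no_escape + caught

def best_lambda(k, T, N, p, n_grid=4000):
    """AI's best per-episode attempt probability, by grid search over [1e-8, 1]."""
    candidates = [10 ** (-8 + 8 * i / n_grid) for i in range(n_grid + 1)]
    return min(candidates, key=lambda lam: p_win_coordinated(lam, k, T, N, p))

# Illustrative parameters: 1M episodes, 1% audited, feature with 1000 values.
T, N, p = 1_000_000, 1_000, 0.01
for k in (1, 2, 3):
    lam = best_lambda(k, T, N, p)
    print(f"k={k}  lambda={lam:.3g}  P(win)={p_win_coordinated(lam, k, T, N, p):.3f}")
```

With $k = 1$ this reduces to the single-AI game played on $T / N$ episodes, which gives a convenient check against the closed-form $λ$ from the earlier section.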

Takeaways:

Even when AIs can nail down the number of escape candidates to a few thousand, their probability of succeeding at taking over drops compared to the situation where only one catastrophic action is required.
Making coordination harder (increasing $T / N$ ) matters more than increasing the number of coordinated actions necessary for AIs to successfully escape.

Seth Herd:

I applaud the work; detailed thinking about control efforts is one useful alignment strategy.

I find the assumption of no time-awareness in the AI to be unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably take nontrivial compute costs, and throwing away that knowledge after each session will seem very wasteful of resources and will limit usefulness.

Buck:

I agree re time-awareness, with two caveats:

The kind of mechanism you listed probably only allows the AIs to have a rough idea of what time it is.
We can keep some of the instances of our AI very unaware of the time, by restricting their between-episode memory. For example, we might do this for the instances responsible for untrusted monitoring, to reduce collusion.
