Thanks to John Wentworth, Garrett Baker, Theo Chapman, and David Lorell for feedback and discussions on drafts of this post.

In this post I’ll describe some of my thoughts on the AI control research agenda. If you haven’t read the post introducing that agenda, I’m not going to try to summarize it here[1], so I recommend reading it first.

Here’s a TL;DR of my cruxes:

  • Evaluating whether you’re in a regime where control is doable requires strong capability evaluations. I expect that getting our capability evals to be comprehensive enough to be confident of our findings here is really hard.
  • Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc) before transformative AI[2]. I think the assumption that this wouldn’t happen is a very strong one, and that we already have unexpectedly superhuman performance in some problematic domains that I expect to only get better.
  • Models deployed in the real world form part of composite systems that select for performance in often-unexpected ways. Strong restrictions help prevent this, but lead either to crippling alignment taxes or to selecting for models that can evade our control mechanisms.

On requiring very good capability evaluations

From The case for ensuring that powerful AIs are controlled:

How can we decide which models to treat as untrusted? The only approach we know of for confidently ruling out scheming without fundamental breakthroughs is via lack of capabilities—some models (e.g., GPT-4) just aren't capable enough to plausibly be scheming. So by default we'd propose treating models as untrusted if they pass appropriate capability evaluations.

My last post was - very broadly - on why I think we need much more ambitious language model evaluations. Currently, I think we don’t yet know the right questions to be asking to properly explore the space of capabilities that future - and current - models possess.

I think current evals are sorely lacking. They're far from the degree of comprehensiveness we’d need in order to be confident about whether to trust a model. Even if we manage to identify some specific capability expected to be a bottleneck preventing a model from evading control - which, I believe, is quite difficult in general - there are multiple routes by which a model could achieve that capability, and some routes are more challenging to identify than others. Trusting a model because it performed poorly on the routes we tested is a weaker guarantee than knowing it actually can’t perform well in a context that pops up in the real world[3].

I don’t expect this to make control infeasible, but I do expect it to make control harder. I don’t expect doing ambitiously better language model evals to be very hard in itself, but I expect we’d only be making quantitative progress against a problem of the form “there is a very broad space of capabilities here that we’ll struggle to fully identify on any reasonable timescale”.

There’s a related concern: capability gains from better prompting methods, discovered through time and experimentation in the real world, are hard to fully explore in advance using evals. This is somewhat separate, in that it’s about greater performance on specific questions rather than there being more relevant questions, and it’s an issue I mostly expect to be solved by having access to fine-tuning internally during evals.

On expecting superhuman performance before transformative AI

From the same control post as above:

[...] if the AI actually is wildly superhuman in a problematic domain to such an extent that one of the items in the bulleted list above applies, then control techniques are unlikely to work and control evaluations also have a high risk of failure. Control techniques are likely to fail if inscrutability in a problematic domain makes it too hard to check outputs are safe. Control evaluations might fail because the red team will struggle to elicit or estimate these qualitatively wildly superhuman abilities.

We think we probably won't run into issues related to qualitatively wildly superhuman skills in problematic domains.

My criticism here is pretty correlated with my last one. Specifically, I think one of the places where current evals are lacking is in identifying places where language models are already superhuman. For one particularly legible example, see this comment by janus. Importantly, this last part:

This capability isn't very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.

I don’t know to what extent models will be superhuman specifically at highly problematic domains before transformative AI. But my uncertainty here comes from a place of “man, I think we have a pretty bad idea of what models are really good at right now”. It’s entirely possible that on the specific abilities that are useful for evading control mechanisms, this assumption holds, but I don’t yet see a strong reason for believing that right now.

Specifically, in reference to:

There are a relatively small number of core problematic domains (hacking, persuasion, and manipulating scientific results or conceptual arguments) and these problematic domains seem like the types of domains that would be pretty correlated with general capabilities/intelligence so long as we don't train AIs on these specific tasks much more than we train them on the core tasks we care about. Furthermore, we probably won't need to mostly train our AI on some particularly problematic domain in order to make it generally capable. Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research). However, this is based on a small number of data points as AI has only recently become capable of this level of performance.

I think that some of these domains are almost certainly the kind that early transformatively useful models will possess the latent capability to be wildly superhuman at. If you consider persuasion for instance, I think that even current models have ~superhuman understanding of text authors, which seems extremely useful for superhuman persuasion.

I’m very uncertain about how easy accessing these capabilities will be for future models, but I think history here tends toward noisy capabilities foreshadowing robust capabilities. I don’t know if the same is true for domains such as hacking, but I wouldn’t be confident, given how limited our success at understanding capabilities in other domains has been.

On the dynamics of systems in the real world

This is either the least or most important of my criticisms. It’s certainly the most speculative. But I think it points at something that’s often under-considered, and which seems very important for good theories of change.

In Thoughts On (Solving) Deep Deception, I tried to reframe one kind of deception that arises from models externalizing the optimization necessary to come up with deceptive actions[4]. The intuition being that simply removing deception from the system itself doesn’t account for all the ways in which you actually can get deception. One way to generalize this intuition is that the real world throws optimization pressure against you from many directions.

Slightly more concretely:

Much of the hope behind control mechanisms as I understand them comes from the idea that there’s a high-stakes symmetry between the AI system and us. Catching the AI red-handed is extremely bad for the AI, and the AI successfully doing something that leads to a concentrated failure is extremely bad for us. Our means of control could keep this symmetry favorable for us, unless there’s a lot of adversarial optimization to break them.
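
This symmetry can be made slightly more concrete with a toy expected-value sketch. All numbers and the decision rule below are my own illustrative assumptions, not anything from the control post:

```python
# Toy expected-value sketch of the high-stakes symmetry. A risk-neutral
# schemer only benefits from attempting a concentrated failure if the
# expected gain from evading control outweighs the expected loss from
# being caught red-handed. All numbers here are illustrative assumptions.

def attempt_is_worthwhile(p_evade: float, gain: float, loss: float) -> bool:
    """Expected value of one attempt, from the schemer's perspective."""
    return p_evade * gain > (1 - p_evade) * loss

# Strong control: evasion is unlikely, and being caught is catastrophic
# for the AI (it gets studied, and frontier models get shut down).
print(attempt_is_worthwhile(p_evade=0.05, gain=1.0, loss=1.0))  # False

# Enough adversarial optimization against the mechanisms flips the
# calculus - which is the fragility this section is pointing at.
print(attempt_is_worthwhile(p_evade=0.6, gain=1.0, loss=1.0))   # True
```

The point of the sketch is just that the symmetry is a function of the evasion probability: it holds only as long as adversarial optimization can’t push that probability up.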

It’s plausible that a powerful generalist model in the real world, even before being transformatively useful, is capable of efficiently searching through different cognitive strategies and plans it can adopt to find ones that achieve good performance on whatever task it wants to solve. This means that if managing to evade our control is both possible under adversarial optimization and that this leads to greater performance[5], a competent agent could be capable of externalizing this optimization and breaking our mechanisms.

If, however, you manage to enforce policies such that all models are sandboxed from write access to the internet[6], all of the models’ reasoning and actions are audited[7], steganography is removed to a reasonable degree[8], and all frontier models are shut down if any one is caught[9], then this seems like much less of a problem. I have some concerns about RLHF providing selection pressure for models that merely seem like they’re not reasoning about deception in contexts where deception is highly useful, because this seems like a great way to exploit non-robust solutions to steganography. However, I’m highly uncertain about the extent to which this becomes relevant before our first transformatively useful AI systems.

My biggest concern with this setup is the possibility that it ends up producing a lot of false positives. More specifically, consider a situation where a model is given some task and, after realizing that good performance on the task would involve taking actions an overseer would not approve of, reasons about deceiving the overseer. This was the setting for a recent demonstration where GPT-4 reasons about how to deceive humans in service of its goal. This is meaningfully different from the kind of false positive where a model tries to noise our control mechanism by writing things that look like backdoors but aren’t, because here the model actually is being deceptive - it just isn’t very high-stakes. In the high-stakes setting, however, that seems to imply we can’t just brush these threat vectors off.

If our control mechanisms do not go off in this setting, and you only really care about high-stakes situations, then you have to carefully design your mechanisms such that allowing for this kind of deception doesn’t concede broad attack vectors. If on the other hand your control mechanisms do trigger, then immediately shutting down all frontier models seems like a pretty large alignment tax to pay. This also doesn't give us many of the benefits of good control mechanisms, like having deceptively aligned models to examine or extracting useful work out of near-transformatively-useful AI.

I don’t expect any of this to be particularly novel. One obvious solution to this problem could be to come up with a better operationalization of what it is that you want your control mechanisms to trigger on. But that implicitly involves relaxing some constraints to allow for your models to reason as in the setting above, which seems to me like it opens up pretty dangerous channels of attack absent solving some pretty hard problems - though it’s entirely possible I’m just missing something very obvious and this entire section is mistaken.

If you do anything other than shut down your most powerful models when your control mechanisms trigger - anything from simply rolling back your models for a new training run to actively training against your evals - then you introduce a selection pressure for models that manage to evade your control mechanism while still being competent.

The key insight here is that there’s a certain amount of optimization required to specify a plan that evades the control mechanism. By default for near-future models, I think this is likely to come from the model explicitly reasoning about how to evade it. However, when you block off this route and keep selecting for competent models without also robustly selecting against models whose actions lead to evading the control mechanism, you allow for other sources for this optimization pressure. It’s still an empirical question of how much pressure this is and how much it affects your safety guarantees, but it seems like a much more dangerous situation, especially in multi-polar setups with large amounts of selection pressure by default (unless we commit to shutting down all frontier models at once).
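
As a toy illustration of this selection dynamic (the mechanics here are entirely made up, just to show the shape of the argument): suppose each model has latent “competence” and “evasion” traits, and that whenever a model is caught we roll back and redeploy the most competent of a pool of fresh candidates, rather than shutting everything down. The models still in deployment at the end then turn out systematically more evasive than fresh draws:

```python
import random

random.seed(0)

def sample_model():
    # Two latent traits: competence (what we deliberately select for)
    # and evasion (how often actions slip past the control mechanism).
    return {"competence": random.random(), "evasion": random.random()}

def caught_this_round(model):
    # Toy assumption: more competent models attempt risky actions more
    # often, and the control mechanism catches them unless they evade.
    attempts = random.random() < 0.5 * model["competence"]
    return attempts and random.random() > model["evasion"]

def most_competent_candidate(pool=20):
    # Mirrors the pressure to keep deploying frontier-competent models.
    return max((sample_model() for _ in range(pool)),
               key=lambda m: m["competence"])

def rollback_policy(rounds=200):
    # On every catch, roll back and redeploy a fresh competent model
    # instead of shutting everything down.
    model = most_competent_candidate()
    for _ in range(rounds):
        if caught_this_round(model):
            model = most_competent_candidate()
    return model

mean_evasion = lambda ms: sum(m["evasion"] for m in ms) / len(ms)
fresh = [sample_model() for _ in range(500)]
survivors = [rollback_policy() for _ in range(500)]
print(f"fresh models' mean evasion:     {mean_evasion(fresh):.2f}")
# Survivors' mean evasion comes out noticeably higher than fresh draws'.
print(f"surviving models' mean evasion: {mean_evasion(survivors):.2f}")
```

Nothing in this sketch requires any individual model to reason about evading control: evasion is never selected on directly, but evasive models survive the rollback loop longer, so the deployed population drifts toward it anyway.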

The generalized version of this argument is that the system you really want to control is the composite system that all the agents are part of, or even just a single agent with the ability to act on long timescales. There are also emergent dynamics that crop up from multi-agent systems that we care about in this context[10][11]. This is analogous to the idea that looking at individual components of a system doesn’t buy you much in terms of understanding whether the overall system will behave the way you want.

I’m extremely uncertain about the extent to which this is relevant on short timescales, however, even in multi-polar setups. I’m definitely not confident enough in it to think this makes control infeasible. I think it’s an open question that could empirically go either way, and that we have very little evidence one way or the other so far. However, this seems like yet another set of unanswered questions / potential roadblocks to control being feasible or easy in practice.

Aside: Finding directions promising

When I first heard about the AI control agenda, I was… skeptical. I thought of a few reasons why I thought it wouldn’t work, some of which I don’t think people think about enough, and didn’t really find satisfactory answers to them. I decided that the direction didn’t seem very promising to me, and that would’ve been that.

But for various reasons, I ended up spending some more time thinking about it[12]. The more I thought about it, the more interesting the direction seemed to me. There’s a somewhat trivial argument for being generally excited about throwing up more lines of defense, but that wasn’t very convincing. If I had to verbalize what changed between now and then, it’s some mix of the following:

  • The technical problem is hard. Unlike some, I think it’s still pretty solvable to the extent that I’d still rather work directly on it, but I think it’s still really hard. Over the last few months I’ve been updating slowly toward it being harder than I once thought[13], which has the side effect of upweighting agendas that don’t revolve around solving those hard problems.
  • Buying time at the end is very, very good. Governance efforts are varied in what kind of time they buy us - if we paused all AI research right now, we would have more time in an absolute sense, but not more time with very useful models. That’s still good, but mechanisms - control, governance, or otherwise - that purport to buy time directly from the back end seem the most valuable to me[14].
  • While I think my criticisms still hold, it’s quite plausible they’re just not that damaging to the agenda’s theory of change. Much of my criticism revolves around describing quite plausible worlds where some of the cruxes of the agenda fail, but I don’t think they describe a supermajority of all realistic future worlds. It’s easy - at least in my experience - when working on worst-case alignment, to identify many problems that crop up, and much harder to draw clear distinctions between the salience of each problem and its realistic relevance.

To be clear, I expect I’m still more pessimistic about the control agenda than its authors. I think that my original reasons for reservation still lead me to lean toward it being quite difficult to pull off in practice and probably not working in the real world, and that my updates are best characterized as “uncertainty moving in a direction favorable to the agenda”. That said, I’m pretty excited about this agenda, and am looking forward to seeing more progress.

  1. ^

     I’m currently under a pact to write a new post every 100 hours for the next few weeks, so I’m optimizing for speed.

  2. ^

     Redwood operationalizes “transformatively useful AI” as AIs that are capable of substantially reducing risk from subsequent AIs. I’m happy with any operationalization that points at roughly the same time period around which we have to have solved alignment to not get catastrophically bad outcomes.

  3. ^
  4. ^

     What I mean by this is that the model can, over time, search through different cognitive strategies it can adopt to achieve higher performance on whatever task it cares about. In contexts where deception is useful for higher performance, this leads to adopting strategies that perform well because they incidentally lead to humans or an overseer being deceived, without the system itself ever performing deceptive cognition.

  5. ^

     Even if simply along axes of efficiency or speed, or not risking time-consuming manual audits. It’s possible that we could have control mechanisms that come with zero competitiveness costs, but this seems unlikely.

  6. ^

     Such that the models don’t have state.

  7. ^

     Such that the model can never start directly reasoning about how to get past a safeguard without being caught.

  8. ^

     An assumption I’m skeptical of for reasons outlined in this post, but which doesn’t seem implausible for median-case scenarios without optimization pressure.

  9. ^

     Such that you don’t reduce the safety margin you get from failure modes being uncorrelated across different models, but also, more importantly: having any channel of selection pressure for models that aren’t thinking about deception while still having the potential for steganography potentially breaks the setup.

  10. ^

     By emergent dynamics, I mean failures that crop up because the composite system has its own dynamics that may not be immediately obvious from observing individual components. As an example (originally from John Wentworth), consider a company where managers lay down incentives that they think are good, and employees incidentally commit crime following them. None of the people in this company need to individually be malicious, and yet the company itself could lead to bad outcomes.

  11. ^

     Another realistic concern is that in an ecosystem with multiple buyers and sellers of control mechanisms or audits, there may be a mild selection pressure for auditors that impose the lowest alignment tax and trade off against safety.

  12. ^

     Primary among them is the reason I’m writing this post: my post on model evals, which @Buck thought was interesting and mentioned being interested in seeing me write this critique.

  13. ^

     Still not enough to sway me from thinking it’s solvable, yet - and I expect that my reasons for believing it is stem from a specific idea about how feasible a methodological approach to solving some subproblems is.

  14. ^

     Due to a variety of factors, from having much better setups to study specific alignment failures to having very powerful models accelerating alignment research.

8 comments

On requiring very good capability evaluations

Note that there are two pretty different things here:

  1. Capability evaluations for determining if models are plausibly scheming (aka deceptively aligned)
  2. Capability evaluations for determining if models would be able to cause harm if they were scheming (which we call control evaluations).

The first one is always going to be somewhat imprecise, but we can potentially get away with being pretty conservative (e.g. saying that all models more powerful than GPT-4 are plausibly scheming).

(ETA: to be clear, this is one of the trickiest and least precise parts of the control proposal. And further precision here would be quite useful (even for approaches other than control).)

There’s a related concern: capability gains from better prompting methods, discovered through time and experimentation in the real world, are hard to fully explore in advance using evals.

Note that our goal (with control evaluations) will be to estimate the capabilities of the model when it is actually trying to do the task zero-shot. We call this self-elicitation. In the current regime, competing with self-elicitation looks much easier than competing with arbitrary elicitation methods developed in the future. (Models might have superhuman abilities that neither they nor we can effectively utilize, though future humans might be able to utilize these abilities.)

Also, the self-elicitation case we care about is one where the AI needs to do the task without having it be obvious that it's doing something problematic. So we don't necessarily need to find optimal CoT strategies because the AI might not even be able to do CoT.

I expect that in the case where AIs can relatively easily self-elicit their abilities (and conditioning on architectures similar to today's), eliciting abilities also becomes much more sample efficient (a small number of examples yields large returns). We have some forthcoming work which is basically on this.

Overall, I expect that capability elicitation difficulties are probably the main technical difficulty with control evaluations. So, this does indeed seem like a key technical difficulty. (Though perhaps not the main difficulty with control overall, which is more likely to be just getting good enough countermeasures IMO.)

I think that even current models have ~superhuman understanding of text authors, which seems extremely useful for superhuman persuasion.

Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Overall, I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human-level R&D capabilities, given a realistic effort at eliciting R&D capabilities.

(And we should be able to (in principle) handle persuasion abilities comparable with good humans.)

On the dynamics of systems in the real world

I don't think I understand this section well enough to really respond, sorry.

One response, though: it sounds like you're talking about cases where our training directly incentivizes reasoning about how to trick overseers and this will result in "false alarms". If these cases actually involve AIs doing egregiously bad actions, I think this isn't really a false alarm. See also here.

Also, there are various methods to determine how egregious an AI's actions are, as discussed here.

Separately, I think cases where our training method directly selects for particular average-case behavior should be relatively easy to study with model organisms or in the wild. So we should be able to get a pretty good understanding of what's going on. (This might not suffice for avoiding problems.)

Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).

Models don't need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.

Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.

It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.

If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help. E.g. giving huge amounts of context, since [inferences about author given input ( )] are not a subset of [inferences about author given input ( ... )].

However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.

Note that I wasn't making this argument. I was just responding to one specific story and then noting "I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human level R&D capabilities".

This is obviously only one of many possible arguments.

Sure, understood.

However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion.". If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion.

I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)

Yep, I just literally meant, "human coworker level doesn't suffice". I was just making a relatively narrow argument here, sorry about the confusion.

For one particularly legible example, see this comment by janus.

Link should presumably be to this comment.

Hmm I think somehow the problem is that the equals sign in your url is being encoded as an ASCII value with a % sign etc rather than being treated as a raw equals sign, weird.

Fixed, thanks! Yeah that's weird - I copied it over from a Google doc after publishing to preserve footnotes, so maybe it's some weird formatting bug from all of those steps.