I want to draw attention to a new paper, written by myself, David "davidad" Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of "guaranteed safe (GS) AI", which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

  1. A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.
  2. A world model that provides a mathematical description of the environment of the AI system.
  3. A verifier that provides a formal proof (or some other comparable auditable assurance) that the AI system satisfies the safety specification with respect to the world model.
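The interaction between these three components can be sketched in code. The following is a purely illustrative toy (all names are hypothetical; no real verifier works by exhaustive enumeration over a nontrivial system), but it shows how a specification, a world model, and a verifier fit together:

```python
# Hypothetical sketch of the three GS AI components; all names are
# illustrative -- nothing here corresponds to a real library.

def verify(policy, spec_is_safe, model_successors, states):
    """Check that every transition the policy can induce satisfies the
    safety spec -- a stand-in for a real verifier or theorem prover."""
    for s in states:
        a = policy(s)
        for s_next in model_successors(s, a):
            if not spec_is_safe(s, a, s_next):
                return False
    return True

# Toy instance: the "world" is a counter, and the safety specification
# says the counter must never exceed 3.
spec_is_safe = lambda s, a, s_next: s_next <= 3   # 1. safety specification
model_successors = lambda s, a: [s + a]           # 2. world model
policy = lambda s: 1 if s < 3 else 0              # candidate AI controller

print(verify(policy, spec_is_safe, model_successors, range(4)))       # True
print(verify(lambda s: 1, spec_is_safe, model_successors, range(4)))  # False
```

The point of the structure is that the controller, the specification, and the world model are separate artifacts, so the assurance is about their combination rather than about the controller in isolation.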

The first thing to note is that a safety specification is, in general, not the same thing as a reward function, utility function, or loss function (though safety specifications include these objects as special cases). For example, a safety specification may state that the AI system should not communicate outside of certain channels, copy itself to external computers, modify its own source code, or obtain information about certain classes of things in the external world. Safety specifications may be written manually, generated by a learning algorithm, written by an AI system, or obtained through other means. Further detail is provided in the main paper.

The next thing to note is that most useful safety specifications must be given relative to a world model. Without a world model, we can only use specifications defined directly over input-output relations. However, we want to define specifications over input-outcome relations instead. This is why a world model is a core component of GS AI. Also note that:

  1. The world model need not be a "complete" model of the world. Rather, the required amount of detail and the appropriate level of abstraction depends on both the safety specification(s) and the AI system's context of use.
  2. The world model should of course account for uncertainty, which may include both stochasticity and nondeterminism. 
  3. The AI system whose safety is being verified may or may not use a world model, and if it does, we may or may not be able to extract it. However, the world model that is used for the verification of the safety properties need not be the same as the world model of the AI system whose safety is being verified (if it has one).

The world model would likely have to be AI-generated, and should ideally be interpretable. In the main paper, we outline a few potential strategies for producing such a world model.

Finally, the verifier produces a quantitative assurance that the base-level AI controller satisfies the safety specification(s) relative to the world model(s). In the most straightforward form, this could simply take the shape of a formal proof. However, if a direct formal proof cannot be obtained, there are weaker alternatives that would still produce a quantitative guarantee. For example, the assurance may take the form of a proof that bounds the probability of failing to satisfy the safety specification, or a proof that the AI system will converge towards satisfying the safety specification (with increasing amounts of data or computational resources, for example). Such proofs are of course often very hard to obtain, but further progress in automated theorem proving (and related techniques) may make them substantially easier to produce. Furthermore, an automated theorem prover AI could be very powerful without having dangerous capabilities. For more detail, see the main paper.
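To illustrate what the weakest end of this spectrum might look like: a purely statistical (not formal) assurance can bound the probability of a specification violation by sampling rollouts of a stochastic world model and applying a concentration inequality. This is my own toy construction using a Hoeffding bound, not a method from the paper:

```python
import math
import random

def failure_bound(world_step, spec_ok, n_trials=10_000, delta=1e-3, seed=0):
    """Upper confidence bound on P(spec violated), estimated by sampling
    outcomes from a stochastic world model. By Hoeffding's inequality,
    with probability >= 1 - delta the true failure rate is at most the
    empirical rate plus sqrt(ln(1/delta) / (2 * n_trials))."""
    rng = random.Random(seed)
    failures = sum(0 if spec_ok(world_step(rng)) else 1
                   for _ in range(n_trials))
    eps = math.sqrt(math.log(1 / delta) / (2 * n_trials))
    return failures / n_trials + eps

# Toy world model: the outcome is a noisy scalar; the specification
# says the outcome must stay below 1.
bound = failure_bound(
    world_step=lambda rng: rng.gauss(0.0, 0.25),
    spec_ok=lambda outcome: outcome < 1.0,
)
print(f"P(violation) <= {bound:.4f} with 99.9% confidence")
```

A formal proof about the world model itself would of course be much stronger than this kind of sampled bound, but the sketch shows how even the fallback options remain quantitative rather than vibes-based.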

If each of these three components can be created, then together they can be used to provide auditable, quantitative safety guarantees for AI systems. Note that this strategy does not require interpretability to be solved, yet it could still provide a solution to the inner alignment problem (and rule out deceptive alignment, etc.). Moreover, it should be possible to implement this strategy without any new fundamental insights; improving existing techniques (using LLMs and other tools) may be sufficient. Given a substantive research push in this direction, I am optimistic about the prospects of achieving substantially safer AI systems through the GS AI strategy.

For more detail, see the full paper.


(Note that this paper was already posted here, so see comments on that post as well.)

This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])

The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing], or [AI system keeps the world locked into some least-harmful path], depending on the setup - and here of course "least harmful" isn't a utopia, since it's a distribution of safety specifications, not desirability specifications).
Am I mistaken about this?

I'm very pleased that people are thinking about this, but I fail to understand the optimism - hopefully I'm confused somewhere!
Is anyone working on toy examples as proof of concept?

I worry that there's so much deeply technical work here that not enough time is being spent to check that the concept is workable (is anyone focusing on this?). I'd suggest focusing on mental influences: what kind of specification would allow me to radically change my ideas, but not to be driven insane? What's the basis to think we can find such a specification?

It seems to me that finding a fit-for-purpose safety/acceptability specification won't be significantly easier than finding a specification for ambitious value alignment.

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]

> not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.

Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".

> It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details

This is one mechanism by which such a system could cause great downstream harm.
Suppose that we have a process to avoid this. What assurance do we have that there aren't other mechanisms to cause harm?

I don't yet buy the description complexity penalty argument (as I currently understand it - but quite possibly I'm missing something). It's possible to manipulate by strategically omitting information. Perhaps the "penalise heavily biased sampling" is intended to avoid this (??). If so, I'm not sure how this gets us more than a hand-waving argument.
I imagine it's very hard to do indirect manipulation without adding much complexity.
I imagine that ASL-4+ systems are capable of many very hard things.

Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer - which I expect is untrue for any simple x.
I can buy that there are simple properties whose reduction guarantees safety if it's done to an extreme degree - but then I'm back to expecting the system to do nothing useful.

As an aside, I'd note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That's not a criticism of the overall approach - I just want to highlight that I don't think we get to have both [system provides helpful-in-ways-we-hadn't-considered output] and [system can't produce harmful output]. Allowing the former seems to allow the latter.

> I would like to fund a sleeper-agents-style experiment on this by the end of 2025

That's probably a good idea, but this kind of approach doesn't seem in keeping with a "Guaranteed safe" label. More of a "We haven't yet found a way in which this is unsafe".

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.

As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
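One way to make this concrete (my gloss, not a formalism given in the thread): a time-bounded utility function truncates the objective at a fixed horizon $T$,

$$U_T(s_0, s_1, \dots) \;=\; \sum_{t=0}^{T} u(s_t),$$

so that trajectories are scored only up to time $T$, and the system gains nothing from enforcing a lock-in that persists beyond the horizon.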

(understood that you'd want to avoid the below by construction through the specification)

I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes.

It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn't capture everything that we care about.

I haven't thought about it in any detail, but doesn't using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?

I read the paper, and overall it's an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don't know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what's proved in learning theory, for example, that the generalization error will be small for any data distribution as long as the distribution remains the same during training and testing.

But perhaps I have misunderstood what's meant by a world model and maybe it's simply the set of precise assumptions under which the guarantees have been proved. For example, in the learning theory setup, maybe the world model is the assumption that the training and test distributions are the same, as opposed to a description of the data distribution. 

You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the inputs are drawn i.i.d. as a "world model". However, what is imagined is generally something much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc.).

Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.
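The runtime-monitoring idea can be sketched in a few lines. This is a hypothetical toy guard of my own construction (not from the paper): it compares the world model's predicted next state against the observed one, and hands control to a conservative fallback when they diverge:

```python
def run_with_monitor(controller, model_predict, env_step, safe_mode,
                     state=0.0, tolerance=0.1, horizon=20):
    """Hypothetical runtime guard: if the world model's prediction
    diverges from the observed outcome by more than `tolerance`, the
    world model is inaccurate here, so switch to a safe fallback."""
    for _ in range(horizon):
        action = controller(state)
        predicted = model_predict(state, action)
        state = env_step(state, action)       # what actually happens
        if abs(state - predicted) > tolerance:
            safe_mode()                       # e.g. disable the AI
            return "fallback"
    return "completed"

# Accurate world model: the run completes normally.
print(run_with_monitor(
    controller=lambda s: 1.0,
    model_predict=lambda s, a: s + a,
    env_step=lambda s, a: s + a,
    safe_mode=lambda: None))                  # completed

# Unmodeled drift in the environment: the guard trips immediately.
print(run_with_monitor(
    controller=lambda s: 1.0,
    model_predict=lambda s, a: s + a,
    env_step=lambda s, a: s + a + 0.5,        # drift the model misses
    safe_mode=lambda: None))                  # fallback
```

The guarantee this buys is conditional: while the guard has not tripped, the world model has (so far) been accurate to within the tolerance, which is exactly the regime in which the verified safety properties apply.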