davidad (David A. Dalrymple)

Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics And Computation.

Wiki Contributions


For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead building a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11.

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to justify confidence in the proposed policy network, in which case the policy would not exit the box, and the failure is a flop rather than a foom.)

Perhaps the keyphrase you’re missing is “interval MDP abstraction”. One specific paper that combines RL and model-checking and coarse-graining in the way you’re asking for is Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning.

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

Load More