davidad (David A. Dalrymple)

Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics And Computation.

Posts

1 davidad's Shortform

21 What does davidad want from «boundaries»?

3mo

55 A list of core AI safety problems and how I hope to solve them

8mo

26 Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

9mo

33 An Open Agency Architecture for Safe Transformative AI

28 AI Neorealism: a threat model & success criterion for existential safety

19 Side-channels: input versus output

30 Reframing inner alignment

42 You can still fetch the coffee today if you're dead tomorrow

13 The Promise and Peril of Finite Sets

Comments

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

davidad (David A. Dalrymple) 4mo 32

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

RSPs are pauses done right

davidad (David A. Dalrymple) 7mo 1112

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

A list of core AI safety problems and how I hope to solve them

davidad (David A. Dalrymple) 8mo 10

Yes, it's the latter. See also the Open Agency Keyholder Prize.

A list of core AI safety problems and how I hope to solve them

davidad (David A. Dalrymple) 8mo 26

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

A list of core AI safety problems and how I hope to solve them

davidad (David A. Dalrymple) 8mo 20

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead build a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11.

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

A list of core AI safety problems and how I hope to solve them

davidad (David A. Dalrymple) 8mo 30

That being said, I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

A list of core AI safety problems and how I hope to solve them

davidad (David A. Dalrymple) 8mo 60

I think you’re directionally correct; I agree about the following:

A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
I imagine these would be mostly if not entirely learned.
There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to justify confidence in the proposed policy network, in which case the policy would not exit the box, and the failure is a flop rather than a foom.)

Perhaps the keyphrase you’re missing is “interval MDP abstraction”. One specific paper that combines RL and model-checking and coarse-graining in the way you’re asking for is Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning.
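To make the "sound but possibly loose" point concrete, here is a minimal sketch of robust value iteration on a tiny interval MDP. It is not taken from the cited paper; the function names, the three-state example, and its probability intervals are all hypothetical. The sketch assumes the intervals already soundly over-approximate the underlying continuous dynamics, which is the job of the abstraction step and is not shown. Starting from zero, every sweep yields a valid lower bound on the worst-case probability of reaching the goal; more sweeps only tighten it.

```python
from typing import Dict, List, Set, Tuple

# transitions[state][action] = [(successor, lower_prob, upper_prob), ...]
# Successors are assumed to be unique within each list.
Transitions = Dict[int, Dict[str, List[Tuple[int, float, float]]]]


def worst_case_expectation(succs, values):
    """Minimise the expected value over all distributions inside the intervals."""
    probs = {s: lo for s, lo, _ in succs}
    remaining = 1.0 - sum(probs.values())
    # Adversarial nature pushes the spare probability mass toward
    # the lowest-valued successors first.
    for s, lo, hi in sorted(succs, key=lambda t: values[t[0]]):
        extra = min(hi - lo, remaining)
        probs[s] += extra
        remaining -= extra
    return sum(p * values[s] for s, p in probs.items())


def reach_lower_bound(transitions: Transitions, goal: Set[int],
                      n_states: int, sweeps: int = 200) -> List[float]:
    """Lower bound on the max-over-actions, min-over-nature reach probability."""
    values = [1.0 if s in goal else 0.0 for s in range(n_states)]
    for _ in range(sweeps):
        new = list(values)
        for s, actions in transitions.items():
            if s in goal:
                continue
            new[s] = max(worst_case_expectation(succs, values)
                         for succs in actions.values())
        values = new  # iterates increase monotonically from below
    return values


# Hypothetical example: state 0 = start, 1 = goal, 2 = absorbing failure.
EXAMPLE: Transitions = {
    0: {
        "a": [(1, 0.6, 0.8), (2, 0.1, 0.2), (0, 0.1, 0.3)],
        "b": [(1, 0.5, 1.0), (2, 0.0, 0.5)],
    },
}

bounds = reach_lower_bound(EXAMPLE, goal={1}, n_states=3)
print(f"lower bound from start state: {bounds[0]:.4f}")
```

Stopping after fewer sweeps still gives a sound bound, just a looser one, which mirrors the computing-time versus tightness tradeoff mentioned above.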

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

davidad (David A. Dalrymple) 9mo 30

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
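The additive, transitive counting rule can be sketched in a few lines. This is only an illustration: the `TrainingRun` fields, the FLOP figures, and the completion dates are hypothetical, and the exact cutoff (runs finishing on or after 1 October 2021 are counted, earlier ones are not) is an assumption read off the parenthetical above rather than a quotation of the proposed rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TrainingRun:
    name: str
    flop: float                            # compute used by this run alone
    finished: date                         # completion date of this run
    base: Optional["TrainingRun"] = None   # run that produced the input model


# Assumption: "after September 2021" is read as on or after 2021-10-01.
CUTOFF = date(2021, 10, 1)


def counted_compute(run: TrainingRun, cutoff: date = CUTOFF) -> float:
    """Sum a run's own compute with that of all its transitive dependencies."""
    total = 0.0
    current: Optional[TrainingRun] = run
    # Walk the finetuning chain; stop at the first run before the cutoff,
    # since anything it depends on is older still.
    while current is not None and current.finished >= cutoff:
        total += current.flop
        current = current.base
    return total


# Hypothetical numbers, for illustration only.
run_x = TrainingRun("X", flop=1.0e24, finished=date(2022, 3, 1))
run_y = TrainingRun("Y", flop=2.0e22, finished=date(2022, 9, 1), base=run_x)
run_z = TrainingRun("Z", flop=5.0e21, finished=date(2023, 1, 15), base=run_y)

print(f"compute counted against Z: {counted_compute(run_z):.4g} FLOP")
```

Under these assumptions, the figure compared against the size limit for Training Run Z is the sum over Z, Y, and X, not Z's own compute.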

Eight Strategies for Tackling the Hard Part of the Alignment Problem

davidad (David A. Dalrymple) 9mo 11

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

Davidad's Bold Plan for Alignment: An In-Depth Explanation

davidad (David A. Dalrymple) 9mo 10

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.