All of Zac Hatfield-Dodds's Comments + Replies

The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2.

Problem setup: what makes a good decomposition? discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

Example projects you're not allowed to do, if they involve other model families:

  • using Llama 2 as part of an RLAIF setup, which you might want to do when investigating Constitutional AI or decomposition or faithfulness of chain-of-thought or many many other projects;
  • using Llama 2 in auto-interpretability schemes to e.g. label detected features in smaller models, if this will lead to improvements in non-Llama-2 models;
  • fine-tuning other or smaller models on synthetic data produced by Llama 2, which has some downsides but is a great way to check for signs
... (read more)

Llama 2 is not open source.

(a few days after this comment, here's a concurring opinion from the Open Source Initiative - as close to authoritative as you can get)

(later again, here's Yan LeCun testifying under oath: "so first of all Llama system was not made open source ... we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model it was reserved to researchers and academics")

While their custom licence permits some commercial uses, it is not an OSI approved license, and because it violates the ... (read more)

Thanks a lot for the context! Out of curiosity, why does the model training restriction make it much less useful for safety research?

Huh, that's very useful context, thanks! Seems like pretty sad behaviour.

(Zac's note: I'm posting this on behalf of Jack Clark, who is unfortunately unwell today.  Everything below is his words.)

Hi there, I’m Jack and I lead our policy team. The primary reason it’s not discussed in the post is that the post was already quite long and we wanted to keep the focus on safety - I did some help editing bits of the post and couldn’t figure out a way to shoehorn in stuff about policy without it feeling inelegant / orthogonal.

You do, however, raise a good point, in that we haven’t spent much time publicly explaining what we’re up t... (read more)

For Python basics, I have to anti-recommend Shaw's 'learn the hard way'; it's generally outdated and in some places actively misleading. And why would you want to learn the hard way instead of the best way in any case?

Instead, my standard recommendation is Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both readable for free on, or purchasable in books); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

2Neel Nanda1y
Thanks! I learned Python ~10 years ago and have no idea what sources are any good lol. I've edited the post with your recs :)

Unfortunately, the interpretable-composition hypothesis is simply wrong! Many bugs come from 'feature interactions', a term coined by by Pamela Zave - and if it wasn't for that, programming would be as easy as starting from e.g. Lisp's or Forth's tiny sets of primitives.

As to a weaker form, well, yes - using a 'cleanroom' approach and investing heavily in formal verification (and testing) can get you an orders-of-magnitude lower error rate than ordinary software... at orders-of-magnitude greater cost. At more ordinarily-reasonable levels of rigor, I'd em... (read more)

1Luke H Miles1y
Good point about validators failing silently and being more strongly vetted. Abstractly, it seems to me that once the tooling and process is figured out for one task in a narrow domain, you could reuse that stuff on other tasks in the same domain at relatively low cost. But the history of repeated similar vulnerabilities over long time ranges in narrow domains (eg gnu core utils) is perhaps some evidence against that. I agree with the first half and would add that restricting the kind of interface has large additional benefit on top of the interfaces themselves. If you are dealing with classes and functions and singleton module things and macros then you're far more prone to error compared to just using any single one of those things. Even if they are all simple. If system validation has exponential cost with respect to confidence and system size then I think the simplicity of the primitives is perhaps uh the base of the exponent. And 1.011000 is a lot smaller than 21000. My main point is that this the difference between "annoying to validate" and "we will never ever be able to validate"

First, an apology: I didn't mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I'm sorry I wasn't clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler app... (read more)

5Vanessa Kosoy2y
I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don't need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented). I think the pipeline of success looks something like theoretical models => phenomenological models (i.e. models informed by a combination of theory and experiment) => security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) => plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it's not a pure waterfall, there is feedback from each stage to previous stages.

I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I'm less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don't know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class betw

... (read more)
3Buck Shlegeris2y
Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?

You're attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we're going to do is build a theory founded on certain assumptions about the real-world (such as locality / decoupling of dynamics on different scales / some kind of chaos / certain bounds on computational complexity / existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many ... (read more)

High Impact Careers in Formal Verification: Artificial Intelligence

My research focuses on advanced testing and fuzzing tools, which are so much easier to use that people actually use them - eg in Pytorch, and I understand in Deepmind. If people seem interested I could write up a post on relevance to AI safety in a few weeks.

Core idea: even without proofs, writing out safety properties or other system invariants in code is valuable both (a) for deconfusion, and (b) because we can have a computer search for counterexamples using a variety of heuristics an... (read more)

2Rohin Shah3y
My general take is that anything relying on safety properties / system invariants being written out in formal languages / interpretable code seems not that helpful, because we just don't get such clean safety properties / system invariants. I am pretty enthusiastic about relying on learned models of safety properties / system invariants, that are assumed to be good but not perfect specifications; and fuzzing / testing with respect to those learned models seems great.