Zac Hatfield-Dodds

Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; half a PhD in CS (learning / testing / verification), open sourcerer, more at

Wiki Contributions


For Python basics, I have to anti-recommend Shaw's 'learn the hard way'; it's generally outdated and in some places actively misleading. And why would you want to learn the hard way instead of the best way in any case?

Instead, my standard recommendation is Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both readable for free on, or purchasable in books); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

Unfortunately, the interpretable-composition hypothesis is simply wrong! Many bugs come from 'feature interactions', a term coined by by Pamela Zave - and if it wasn't for that, programming would be as easy as starting from e.g. Lisp's or Forth's tiny sets of primitives.

As to a weaker form, well, yes - using a 'cleanroom' approach and investing heavily in formal verification (and testing) can get you an orders-of-magnitude lower error rate than ordinary software... at orders-of-magnitude greater cost. At more ordinarily-reasonable levels of rigor, I'd emphasize understandable interfaces rather than primitives, and agree that ensuring people can think about composition helps albeit in weaker ways than I think you're suggesting.

Most smart contracts are written in macro-crusted rust or in solidity. Most major smart contracts have been hacked. Most blockchain validators were written in go and have not been hacked.

I think your inferences from this are wrong, and the facts are probably wrong too: first, it should be "are known to have been hacked" and "not known to have been hacked". This matters because compromise of a smart contract is globally visible in a way that compromising a validator really really isn't. Second, smart contracts receive wildly more attention by both white-hat and black-hat security experts, because there's a very large bug bounty attached.

First, an apology: I didn't mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I'm sorry I wasn't clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.

Being able to state theorems about AGI seems absolutely necessary for success; but I don't think it's close to sufficient.

I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I'm less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don't know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.

I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.

And doing this wouldn't actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system - and cache and timing attacks like Spectre probably aren't yet either. Your proof system isn't large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could the bounds would be .

I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I'd be comfortable relying on.

High Impact Careers in Formal Verification: Artificial Intelligence

My research focuses on advanced testing and fuzzing tools, which are so much easier to use that people actually use them - eg in Pytorch, and I understand in Deepmind. If people seem interested I could write up a post on relevance to AI safety in a few weeks.

Core idea: even without proofs, writing out safety properties or other system invariants in code is valuable both (a) for deconfusion, and (b) because we can have a computer search for counterexamples using a variety of heuristics and feedbacks. At the current margin this tends to improve team productivity and shift ML culture towards valuing specifications, which may be a good thing for AI x-risk.