Zac Hatfield-Dodds

Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev


I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for my taste, and I think they lead to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2.

"Problem setup: what makes a good decomposition?" discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with a perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

Example projects you're not allowed to do, if they involve other model families:

  • using Llama 2 as part of an RLAIF setup, which you might want to do when investigating Constitutional AI or decomposition or faithfulness of chain-of-thought or many, many other projects;
  • using Llama 2 in auto-interpretability schemes to e.g. label detected features in smaller models, if this will lead to improvements in non-Llama-2 models;
  • fine-tuning other or smaller models on synthetic data produced by Llama 2, which has some downsides but is a great way to check for signs of life of a proposed technique (see the sketch below).
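
To make that last item concrete, here's a minimal sketch of the teacher-to-student workflow. It assumes the Hugging Face transformers pipeline API and the gated meta-llama/Llama-2-7b-chat-hf checkpoint; the prompts and model name are purely illustrative, not part of any specific project.

```python
# A minimal sketch, assuming Hugging Face transformers and access to the gated
# meta-llama/Llama-2-7b-chat-hf checkpoint: use Llama 2 as a teacher to produce
# synthetic (prompt, completion) pairs for fine-tuning a smaller model. The
# prompts are illustrative only.
from transformers import pipeline

teacher = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

prompts = [
    "Explain in one sentence why the sky is blue.",
    "Summarise the plot of Hamlet in one sentence.",
]

# Each pair could seed supervised fine-tuning of a smaller, non-Llama-2 model -
# which is precisely the kind of use the Llama 2 license restricts.
synthetic_data = [
    {
        "prompt": p,
        "completion": teacher(p, max_new_tokens=64, return_full_text=False)[0]["generated_text"],
    }
    for p in prompts
]
```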

In many cases I expect that individuals will go ahead and do this anyway, much like the license of Llama 1 was flagrantly violated all over the place, but remember that it's differentially risky for any organisation which Meta might like to legally harass.

Llama 2 is not open source.

(a few days after this comment, here's a concurring opinion from the Open Source Initiative - as close to authoritative as you can get)

(later again, here's Yann LeCun testifying under oath: "so first of all Llama system was not made open source ... we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model it was reserved to researchers and academics")

While their custom license permits some commercial uses, it is not an OSI-approved license, and because it violates the Open Source Definition it never will be. Specifically, the Llama 2 license violates:

    1. Source Code. It's a little ambiguous what this means for a trained model; I'd claim that an open model release should include the training code (yes) and dataset (no), along with sufficient instructions for others to reproduce the results. However, you could also argue that weights are in fact "the preferred form in which a programmer would modify the program", so this is not an important objection.
    2. No Discrimination Against Persons or Groups. See the ban on use by anyone who has, or is affiliated with anyone who has, more than 700M active users. As a side note, Snapchat recently announced that they had 750M active users, so this looks pretty targeted at competing social media (including TikTok, Google, etc.). As a consequence, the Llama 2 license also violates OSD 7. Distribution of License: "the rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties."
    3. No Discrimination Against Fields of Endeavor. If you can't use Llama 2 to - for example - train another model, it's by definition not open source. Their entire acceptable use policy is included by reference and contains a wide variety of sometimes ambiguous restrictions.

So, why does this matter?

  1. As an open-source maintainer and PSF Fellow, I have no objection to the existence of commercially licensed software. I use much of it, and have sold commercial licenses for software that I've written too. However, people - and especially megacorps - misrepresenting their closed-off projects as open source is an infuriating form of parasitism on a reputation painstakingly built over decades.

  2. The restriction on model training makes Llama 2 much less useful for AI safety research, but it incurs just as much direct risk (roughly all via misuse, IMO) and acceleration risk as an open-source release.

  3. Using a custom license adds substantial legal risk for prospective commercial users, especially given the very broad restrictions imposed by the acceptable use policy. This reduces the economic upside enormously relative to standard open terms, and leaves Meta's competitors particularly at risk of lawsuits if they attempt to use Llama 2.

To summarize, Meta gets a better cost/benefit tradeoff by using a custom, non-open-source license, especially if people incorrectly perceive it as open source; everyone else is worse off; and it seems to me like they're deliberately misrepresenting what they've done for their own gain. This really, really annoys me.

When someone describes Llama 2 as "open source", please correct them: Meta is offering a limited commercial license which discriminates against specific users and bans many valuable use-cases, including in alignment research.

(Zac's note: I'm posting this on behalf of Jack Clark, who is unfortunately unwell today.  Everything below is his words.)

Hi there, I’m Jack and I lead our policy team. The primary reason it’s not discussed in the post is that the post was already quite long and we wanted to keep the focus on safety - I helped edit bits of the post and couldn’t figure out a way to shoehorn in stuff about policy without it feeling inelegant / orthogonal.

You do, however, raise a good point, in that we haven’t spent much time publicly explaining what we’re up to as a team. One of my goals for 2023 is to do a long writeup here. But since you asked, here’s some information:

You can generally think of the Anthropic policy team as doing three primary things:

  1. Trying to proactively educate policymakers about the scaling trends of AI systems and their relation to safety. My colleague Deep Ganguli (Societal Impacts) and I basically co-wrote this paper https://arxiv.org/abs/2202.07785 - you can think of us as generally briefing out a lot of the narrative in here.
  2. Pushing a few specific things that we care about. We think evals/measures for safety of AI systems aren’t very good [Zac: i.e. should be improved!], so we’ve spent a lot of time engaging with NIST’s ‘Risk Management Framework’ for AI systems as a way to create more useful policy institutions here - while we expect labs in the private sector and academia will do much of this research, NIST is one of the best institutions to take these insights and a) standardize some of them and b) circulate insights across government. We’ve also spent time on the National AI Research Resource as we see it as a path to have a larger set of people do safety-oriented analysis of increasingly large models.
  3. Responding to interest. An increasing amount of our work is reactive (huge uptick in interest in past few months since launch of ChatGPT). By reactive I mean that policymakers reach out to us and ask for our thoughts on things. We generally aim to give impartial, technically informed advice, including pointing out things that aren’t favorable to Anthropic to point out (like emphasizing the very significant safety concerns with large models). We do this because a) we think we’re well positioned to give policymakers good information and b) as the stakes get higher, we expect policymakers will tend to put higher weight on the advice of labs which ‘showed up’ for them before it was strategic to do so. Therefore we tend to spend a lot of time doing a lot of meetings to help out policymakers, no matter how ‘important’ they or their country/organization are - we basically ignore hierarchy and try to satisfy all requests that come in at this stage.

More broadly, we try to be transparent on the micro level, but haven’t invested yet in being transparent on the macro. What I mean by that is many of our RFIs, talks, and ideas are public, but we haven’t yet done a single writeup that gives an overview of our work. I am hoping to do this with the team this year!

Some other details that may be useful:

Our wonderful colleagues on the ‘Societal Impacts’ team led this work on Red Teaming and we (Policy) helped out on the paper and some of the research. We generally think red teaming is a great idea to push to policymakers re AI systems; it’s one of those things that is ‘shovel ready’ for the systems of today but, we think, has some decent chance of helping out in future with increasingly large models.

For Python basics, I have to anti-recommend Shaw's 'learn the hard way'; it's generally outdated and in some places actively misleading. And why would you want to learn the hard way instead of the best way in any case?

Instead, my standard recommendation is Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both free to read on inventwithpython.com, or available in print); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

First, an apology: I didn't mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I'm sorry I wasn't clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.

Being able to state theorems about AGI seems absolutely necessary for success; but I don't think it's close to sufficient.

I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I'm less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don't know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images that differ only in the least significant bit of a single pixel? Prove your answer before 2023.

Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.
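For concreteness, here is a minimal sketch (assuming PyTorch and torchvision; the untrained model and random image are stand-ins) of the quantity to be bounded. It merely samples the single-pixel LSB-flip sensitivity empirically, which is exactly the kind of model execution the prize does not accept as a proof:

```python
# A minimal sketch, assuming PyTorch/torchvision, of the quantity the prize asks
# you to bound. It only *samples* the sensitivity for one image; the prize
# demands a proof of the global maximum without executing the model. The
# untrained stand-in model obviously does not meet the 99%-accuracy requirement.
import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b0

model = efficientnet_b0(num_classes=10)
model.eval()

def class_probs(img_uint8: torch.Tensor) -> torch.Tensor:
    """Map a uint8 28x28 MNIST image to a probability distribution over 10 classes."""
    with torch.no_grad():
        x = img_uint8.float() / 255.0
        x = x.expand(3, -1, -1).unsqueeze(0)  # grayscale -> 3-channel batch of one
        return F.softmax(model(x), dim=-1).squeeze(0)

def lsb_flip_sensitivity(img_uint8: torch.Tensor) -> float:
    """Largest change in any class probability over all single-pixel LSB flips."""
    base = class_probs(img_uint8)
    worst = 0.0
    for i in range(img_uint8.shape[0]):
        for j in range(img_uint8.shape[1]):
            perturbed = img_uint8.clone()
            perturbed[i, j] ^= 1  # flip the least significant bit of one pixel
            worst = max(worst, (class_probs(perturbed) - base).abs().max().item())
    return worst

print(lsb_flip_sensitivity(torch.randint(0, 256, (28, 28), dtype=torch.uint8)))
```

A claim on the prize would instead have to bound this quantity over all possible images, for the weights of a trained 99%-accuracy model, without running the network.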

I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.


And doing this wouldn't actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system - and cache and timing attacks like Spectre probably aren't yet either. Your proof system isn't large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could, the bounds would be far too loose to be useful.

I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I'd be comfortable relying on.
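As a sketch of what that could look like, here is a falsifiable property checked with Hypothesis (the property-based testing library); predict_probs is a hypothetical stand-in for a real model, and the 0.01 tolerance is an arbitrary illustrative threshold:

```python
# A minimal sketch of writing a falsifiable property about an ML system and
# letting Hypothesis search for a counterexample. `predict_probs` is a toy
# stand-in for a real model and the 0.01 tolerance is an arbitrary assumption;
# a found counterexample is a disproof, but a passing run proves nothing.
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays

def predict_probs(image: np.ndarray) -> np.ndarray:
    """Toy 'model': softmax over the first ten pixel values."""
    logits = image.astype(np.float64).reshape(-1)[:10] / 255.0
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

@settings(max_examples=200)
@given(
    image=arrays(np.uint8, (28, 28)),
    row=st.integers(0, 27),
    col=st.integers(0, 27),
)
def test_lsb_flip_barely_moves_probabilities(image, row, col):
    perturbed = image.copy()
    perturbed[row, col] ^= 1  # flip the least significant bit of one pixel
    delta = np.abs(predict_probs(perturbed) - predict_probs(image)).max()
    assert delta < 0.01  # the falsifiable claim under test
```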
