Zac Hatfield-Dodds — AI Alignment Forum

AI ALIGNMENT FORUM
AF

In response to critiques of Guaranteed Safe AI

I'm sorry that I don't have time to write up a detailed response to (critique of?) the response to critiques; hopefully this brief note is still useful.

I remain frustrated by GSAI advocacy. It's suited for well-understood closed domains, excluding e.g. natural language, when discussing feasibility; but 'we need rigorous guarantees for current or near-future AI' when arguing for importance. It's an extension to or complement of current practice; and current practice is irresponsible and inadequate. Often this is coming from different advocates, but that doesn't make it less frustrating for me.
Claiming that non-vacuous sound (over)approximations are feasible, or that we'll be able to specify and verify non-trivial safety properties, is risible. Planning for runtime monitoring and anomaly detection is IMO an excellent idea, but would be entirely pointless if you believed that we had a guarantee!
It's vaporware. I would love to see a demonstration project and perhaps lose my bet, but I don't find papers or posts full of details compelling, however long we could argue over them. Nullius in verba!

I like the idea of using formal tools to complement and extend current practice - I was at the workshop where Towards GSAI was drafted, and offered co-authorship - but as much as I admire the people involved, I just don't believe the core claims of the GSAI agenda as it stands.

Ten people on the inside

Zac Hatfield-Dodds 11mo 18

I don't think Miles' or Richard's stated reasons for resigning included safety policies, for example.

But my broader point is that "fewer safety people should quit leading labs to protest poor safety policies" is basically a non sequitur from "people have quit leading labs because they think they'll be more effective elsewhere", whether because they want to do something different or independent, or because they no longer trust the lab to behave responsibly.

Ten people on the inside

Zac Hatfield-Dodds 11mo 32

I agree with Rohin that there are approximately zero useful things that don't make anyone's workflow harder. The default state is "only just working means working, so I've moved on to the next thing" and if you want to change something there'd better be a benefit to balance the risk of breaking it.

Also 3% of compute is so much compute; probably more than the "20% to date over four years" that OpenAI promised and then yanked from superalignment. Take your preferred estimate of lab compute spending, multiply by 3%, and ask yourself whether a rushed unreasonable lab would grant that much money to people working on a topic it didn't care for, at the expense of those it did.

Ten people on the inside

Zac Hatfield-Dodds 11mo 71

My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.

Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.

Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures

Zac Hatfield-Dodds 1y 55 Review for 2023 Review

I think this is the most important statement on AI risk to date. Where ChatGPT brought "AI could be very capable" into the Overton window, the CAIS Statement brought in AI x-risk. When I give talks to NGOs, or business leaders, or government officials, I almost always include a slide with selected signatories and the full text:

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

I believe it's true, that it was important to say, and that it's had an ongoing, large, and positive impact. Thank you again to the organizers and to my many, many co-signatories.

AIS terminology proposal: standardize terms for probability ranges

Zac Hatfield-Dodds 1y* 12

I further suggest that if using these defined terms, instead of including a table of definitions somewhere you include the actual probability range or point estimate in parentheses after the term. This avoids any need to explain the conventions, and makes it clear at the point of use that the author had a precise quantitative definition in mind.

For example: it's likely (75%) that flipping a pair of fair coins will get fewer than two heads, and extremely unlikely (0-5%) that most readers of AI safety papers are familiar with the quantitative convention proposed above - although they may (>20%) be familiar with the general concept. Note that the inline convention allows for other descriptions if they make the sentence more natural!
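As a minimal sketch of the inline convention, the snippet below checks the coin-flip figure by enumeration and renders a term with its estimate in parentheses. The helper name `annotate` and its integer-percent arguments are assumptions made for illustration; they are not part of the original proposal.

```python
from fractions import Fraction
from itertools import product


def annotate(term, low, high=None):
    """Return the term followed by its point estimate or range, in percent.

    Hypothetical helper: `low` and `high` are integer percentages.
    """
    if high is None:
        return f"{term} ({low}%)"
    return f"{term} ({low}-{high}%)"


# Enumerate the four equally likely outcomes of flipping two fair coins.
outcomes = list(product("HT", repeat=2))
# Keep the outcomes with fewer than two heads (everything except HH).
favourable = [o for o in outcomes if o.count("H") < 2]
p_fewer_than_two_heads = Fraction(len(favourable), len(outcomes))

print(p_fewer_than_two_heads)
print(annotate("likely", int(p_fewer_than_two_heads * 100)))
print(annotate("extremely unlikely", 0, 5))
```

Keeping the number next to the word means the sentence carries its own definition, which is the point of putting it inline rather than in a separate table.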

There Should Be More Alignment-Driven Startups

Zac Hatfield-Dodds 2y* 20

Without wishing to discourage these efforts, I disagree on a few points here:

Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them.

If I'm looking for the best expected value around, that's still monotonic in the probability of success! There are good reasons to think that most organizations are risk-averse (relative to the neutrality of linear $=utils) and startups can be a good way to get around this.

Nonetheless, I remain concerned about regressional Goodhart; and that many founders naively take on the risk appetite of funders who manage a portfolio, without the corresponding diversification (if all your eggs are in one basket, watch that basket very closely). See also Inadequate Equilibria and maybe Fooled by Randomness.

Meanwhile, strongly agreed that AI safety driven startups should be B corps, especially if they're raising money.

Technical quibble: "B Corp" is a voluntary private certification; PBC is a corporate form which imposes legal obligations on directors. I think many of the B Corp criteria are praiseworthy, but certification is neither necessary nor sufficient as an alternative to PBC status - and getting certified is probably a poor use of a startup's effort when the founders' time and attention are at such a premium.

There Should Be More Alignment-Driven Startups

Zac Hatfield-Dodds 2y 56

My personal opinion is that starting a company can be great, but I've also seen several fail due to the gaps between the founders' personal goals, a work-it-out-later business plan, and the duties that you and your board owe to your investors.

IMO any purpose-driven company should be founded as a Public Benefit Corporation, to make it clear in advance and in law that you'll also consider the purpose and the interests of people materially affected by the company alongside investor returns. (cf § 365. Duties of directors)

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds 2y 32

The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2.

"Problem setup: what makes a good decomposition?" discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.
