Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for producing chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, the others should catch the problem....
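To make the layering concrete, here is a minimal sketch of an any-layer-blocks pipeline. The check functions and their keyword heuristics are hypothetical stand-ins for trained classifiers, not any lab's actual moderation API:

```python
from typing import Callable, List

def input_filter(prompt: str, response: str) -> bool:
    # Stand-in prompt-level screen; a real system would use a trained classifier.
    return "synthesis route" in prompt.lower()

def output_classifier(prompt: str, response: str) -> bool:
    # Stand-in response-level harm classifier.
    return "precursor" in response.lower()

def defense_in_depth(prompt: str, response: str,
                     checks: List[Callable[[str, str], bool]]) -> bool:
    # Block if ANY layer flags the exchange: an attacker must evade
    # every layer simultaneously, not just one.
    return any(check(prompt, response) for check in checks)

blocked = defense_in_depth("How do I bake bread?", "Mix flour and water...",
                           [input_filter, output_classifier])
print("blocked:", blocked)
```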
Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/ Stephen Casper. “They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be...
Part 15 of 12 in the Engineer’s Interpretability Sequence. Reflecting on past predictions for new work: On October 11, 2024, I posted some thoughts on mechanistic interpretability and presented eight predictions for what I thought the next big paper on sparse autoencoders would and would not do. Then, on March...
Part 14 of 12 in the Engineer’s Interpretability Sequence. Is this market really only at 63%? I think you should take the over. Five tiers of rigor for safety-oriented interpretability work: Lately, I have been thinking of interpretability research as falling...
Thanks to Zora Che, Michael Chen, Andi Peng, Lev McKinney, Bilal Chughtai, Shashwat Goel, Domenic Rosati, and Rohit Gandikota. TL;DR: In contrast to evaluating AI systems under normal "input-space" attacks, using "generalized" attacks, which allow an attacker to manipulate weights or activations, may help us better evaluate...
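As an illustration of the input-space vs. generalized distinction, here is a minimal sketch of an activation-space attack on a toy two-layer network. The model, input, and target label are illustrative assumptions; a real generalized attack would operate on a trained LLM's hidden states or weights:

```python
import torch

torch.manual_seed(0)
layer1 = torch.nn.Linear(8, 16)
layer2 = torch.nn.Linear(16, 2)

x = torch.randn(1, 8)            # fixed input: the attacker does NOT modify it
target = torch.tensor([1])       # behavior the attacker wants to elicit
delta = torch.zeros(1, 16, requires_grad=True)  # perturbation on activations
opt = torch.optim.Adam([delta], lr=0.1)

for _ in range(100):
    hidden = torch.relu(layer1(x)) + delta   # attack applied to activations
    loss = torch.nn.functional.cross_entropy(layer2(hidden), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Unlike an input-space attack, the input x is untouched; only the
# internal activations were perturbed to flip the model's behavior.
print("post-attack prediction:",
      layer2(torch.relu(layer1(x)) + delta).argmax().item())
```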
Part 13 of 12 in the Engineer’s Interpretability Sequence. TL;DR: On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but...
TL;DR: Scaling labs have their own alignment problem, analogous to that of AI systems, and there are some similarities between the labs and misaligned/unsafe AI. Introduction: Major AI scaling labs (OpenAI/Microsoft, Anthropic, Google/DeepMind, and Meta) are very influential in the AI safety and alignment community. They put out cutting-edge research because of...