I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model...
Joshua Clymer, Alek Westover, Anshul Khandelwal. We explore the following hypothesis both conceptually and, to a small extent, empirically. We call this the Alignment Drift Hypothesis: An AI system that is initially aligned will generally drift into misalignment after a sufficient number of successive modifications, even if these modifications select...
I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue. Here is why I think the core argument in "if anyone builds it, everyone...
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...
Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet. It might seem like there’s a thriving research field improving white-box detectors – a new paper about probes...
My goal as an AI safety researcher is to put myself out of a job. I don’t worry too much about how planet-sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out. Instead, I worry about safely replacing human researchers with AI agents,...
I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI security researcher is to think about the more troubling scenarios. I’m like a mechanic scrambling to finish last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the...