joshc — AI Alignment Forum

Can governments quickly and cheaply slow AI training?

I originally wrote this as a private doc for people working in the field - it's not super polished, or optimized for a broad audience. But I'm publishing anyway because inference-verification is a new and exciting area, and there are few bird's-eye-view explainers of what's going on and what the bottlenecks...

Mar 7 · 64

Lessons from building a model organism testbed

I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model...

Nov 17, 2025 · 22

Will AI systems drift into misalignment?

by Joshua Clymer, Alek Westover, and Anshul Khandelwal

We explore the following hypothesis both conceptually and, to a small extent, empirically. We call this the Alignment Drift Hypothesis: An AI system that is initially aligned will generally drift into misalignment after a sufficient number of successive modifications, even if these modifications select...

Nov 15, 2025 · 15

If anyone builds it, everyone will plausibly be fine

I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue. Here is why I think the core argument in "if anyone builds it, everyone...

Sep 18, 2025 · 32

Recent Redwood Research project proposals

by ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, and Joey Yudelson

Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...

Jul 14, 2025 · 93

Alignment faking CTFs: Apply to my MATS stream

Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet. It might seem like there’s a swarming research field improving white box detectors – a new paper about probes...

Apr 4, 2025 · 61

How might we safely pass the buck to AI?

My goal as an AI safety researcher is to put myself out of a job. I don’t worry too much about how planet-sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out. Instead, I worry about safely replacing human researchers with AI agents,...

Feb 19, 2025 · 91

Josh Clymer

How AI Takeover Might Happen in 2 Years

Planning for Extreme AI Risks