Full version on arXiv | X

Executive summary: AI risk scenarios usually portray a relatively sudden loss of human control to AIs outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities or a coordinated betrayal. However, we argue that even an incremental increase in AI...
This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and collaborating researchers, introducing a suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: > Sufficiently capable models could subvert human oversight and decision-making in important contexts....
This is a linkpost for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format more in the future to showcase early-stage research and work-in-progress updates. Twitter thread here. Top-level summary: > In this post we present "defection probes": linear classifiers that use...
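To make the "linear classifiers over activations" idea concrete, here is a minimal sketch of a probe in that spirit. Everything below is illustrative rather than the team's actual setup: the synthetic activation vectors, the single planted "defection direction", and the choice of scikit-learn's `LogisticRegression` are all assumptions; in practice the features would be residual-stream activations extracted from a model on labeled prompts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for model activations: n examples of
# d-dimensional vectors per class. Real probes would use activations
# from a forward pass over "defects" vs. "safe" labeled inputs.
d, n = 512, 1000
safe = rng.normal(0.0, 1.0, size=(n, d))
defect = rng.normal(0.0, 1.0, size=(n, d))
defect[:, 0] += 2.0  # toy "defection direction" along one coordinate

X = np.vstack([safe, defect])
y = np.concatenate([np.zeros(n), np.ones(n)])

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# At monitoring time, score fresh activations; a high probability
# flags internal state the probe associates with defection.
print("p(defect):", probe.predict_proba(defect[:5])[:, 1])
```

The appeal of this design is its simplicity: a single learned direction in activation space is cheap to train and evaluate, which is what makes linear probes attractive as a monitoring tool in the first place.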
I'm not going to add much commentary here beyond what we've already put out, since we put a lot of effort into the paper itself and I'd mostly recommend reading it directly, especially since there are many subtle results that are not easy...