Alex Mallen — AI Alignment Forum

Risk reports need to address deployment-time spread of misalignment

Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future....

May 15 · 64

Clarifying the role of the behavioral selection model

This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity. The main focus of this post is clarifying the basic machinery of the behavioral selection model, and conveying why it...

May 10 · 17

Risk from fitness-seeking AIs: mechanisms and mitigations

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking"—a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking)....

May 1 · 107

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

by Jozdien and Alex Mallen

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on...

Apr 28 · 55

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. In more powerful systems, this kind of failure...

Apr 14 · 182

The case for satiating cheaply-satisfied AI preferences

A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider...

Mar 10 · 103

Will reward-seekers respond to distant incentives?

Reward-seekers are usually modeled as responding only to local incentives administered by developers. Here I ask: Will AIs or humans be able to influence their incentives at a distance—e.g., by retroactively reinforcing actions substantially in the future or by committing to run many copies of them in simulated deployments with...

Feb 16 · 57