Johannes C. Mayer — AI Alignment Forum

Evaluation Avoidance: How Humans and AIs Hack Reward by Disabling Evaluation Instead of Gaming Metrics

Have you ever binge-watched a TV series? Binge-watching puts you in a very peculiar mental state. Assuming you don't reflectively endorse your binge-watching behavior, you'd probably feel pretty bad if you were to reflect on your situation. You might think: "Man, I am wasting my time. I still need to...

Nov 14, 2025 · 19

S-Expressions as a Design Language: A Tool for Deconfusion in Alignment

TL;DR: S-expressions are a minimal structure language for early-stage conceptual engineering (aka deconfusion). They are especially useful in alignment, where the hardest part is often figuring out what the problem even is. S-expressions do not enforce any semantics. This lets you write down structure before you know what the structure...

Jun 19, 2025 · 5

Constraining Minds, Not Goals: A Structural Approach to AI Alignment

TL;DR: Most alignment work focuses either on theoretical deconfusion or interpreting opaque models. This post argues for a third path: constraining general intelligence through structural control of cognition. Instead of aligning outcomes, we aim to bound the reasoning process—by identifying formal constraints on how plans are generated, world models are...

Jun 13, 2025 · 25

Focus on the Hardest Part First

Consider reading this instead. Here is some obvious advice. I think a common failure mode when working on AI alignment[1] is to not focus on the hard parts of the problem first. This is a problem when generating a research agenda, as well as when working on any specific research...

Sep 11, 2023 · 42

Transparency for Generalizing Alignment from Toy Models

Status: Some rough thoughts and intuitions. 🪧 indicates signposting. TL;DR: If we make our optimization procedures transparent, we might be able to analyze them in toy environments to build understanding that generalizes to the real world and to more powerful, scaled-up versions of the systems. 🪧 Let's remind ourselves why...

Apr 2, 2023 · 13

Working towards AI alignment is better

There are people not motivated to solve AI alignment, who do work related to AI alignment. E.g. people work on adversarial robustness, understanding how to do science mechanically, or on advancing other paradigms that are more interpretable than modern ML. These people might be interested in the science, or work...

Dec 9, 2022 · 8

Agent level parallelism

Let's suppose that an Em researcher runs 1000 times faster than the equivalent human brain. To the Em researcher who runs faster, it will seem like the experiments take significantly longer to run. So there is more waiting around. I expect that a collection of Ems would still be able...

Jun 18, 2022 · 5