jenny — AI Alignment Forum

A Toy Environment For Exploring Reasoning About Reward

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment. Setup When we noticed the increase in verbalized alignment...

Mar 25 · 57

Metagaming matters for training, evaluation, and oversight

Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. 1. Metagaming is a more general and, in our experience, more useful concept than evaluation awareness. 2. It arises in frontier training runs...

Mar 18 · 83

Stress Testing Deliberative Alignment for Anti-Scheming Training

by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, Alex Meinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 · 133

jenny's Shortform

Jan 24, 2025 · 4

Attributing to interactions with GCPD and GWPD

This post provides background, motivation, and a nontechnical summary of the purely mathematical https://arxiv.org/abs/2310.06686. Coauthors (alphabetical): Chris MacLeod, Jenny Nitishinskaya, Buck Shlegeris. Work done mostly while at Redwood Research. Thanks to Joe Benton and Ryan Greenblatt for some math done previously. Thanks to Neel Nanda, Fabien Roger, Nix Goldowsky-Dill, and...

Oct 11, 2023 · 20

Impact stories for model internals: an exercise for interpretability researchers

Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity. As part of the Alignment Workshop for AI Researchers in July/August '23, I ran a session on theories of impact for model internals. Many...

Sep 25, 2023 · 29

Causal scrubbing: results on induction heads

by LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck, and Nate Thomas

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...

Dec 3, 2022 · 34

Jenny Nitishinskaya

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
