AI ALIGNMENT FORUM

ariaw — AI Alignment Forum

Aria Wong

MATS 9.0 Scholar with Neel Nanda
https://ariahw.github.io

Steering RL Training: Benchmarking Interventions Against Reward Hacking

This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...

Dec 29, 2025•76