gersonkroiz

Principled Interpretability of Reward Hacking in Closed Frontier Models

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to...

Jan 124

gersonkroiz

gersonkroiz

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

How to Design Environments for Understanding Model Motives

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz's Shortform

gersonkroiz

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

How to Design Environments for Understanding Model Motives

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz's Shortform

How to Design Environments for Understanding Model Motives

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

Principled Interpretability of Reward Hacking in Closed Frontier Models