gersonkroiz

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

by aditya singh, gersonkroiz, Senthooran Rajamanoharan, and Neel Nanda

Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda

Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.

Motivation

Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for...

Feb 27

gersonkroiz

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

How to Design Environments for Understanding Model Motives

The Case for Model Forensics

Can Models be Evaluation Aware Without Explicit Verbalization?

Principled Interpretability of Reward Hacking in Closed Frontier Models