Miles Turpin

Reward hacking behavior can generalize across tasks

TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments. Abstract Machine...

May 28, 2024

Reward hacking behavior can generalize across tasks

Unfaithful Explanations in Chain-of-Thought Prompting

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
