AI ALIGNMENT FORUM

ariana_azarbal

Comments

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d

Thank you for the suggestions and concrete predictions. 

One note is that we already did best-of-10 sampling to construct this dataset (I just updated the post to reflect this). So, on problems with relatively high rates of hacking, we are still often able to select a non-hack completion for the training dataset. The statistics I shared are for the final training dataset.

I can definitely try selecting for non-test-mentioning reasoning when creating the dataset and see to what extent that reduces the effect. Selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% reduction in the effect for GPT-4o-mini and a 70% reduction for the other base models.
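
For concreteness, here is a minimal sketch of what that selection step could look like. The helpers `sample_completions`, `is_hack`, and `mentions_tests` are hypothetical stand-ins, not the actual pipeline code:

```python
import random

def select_training_completion(problem, sample_completions, is_hack,
                               mentions_tests, n=10):
    """Best-of-n dataset construction: sample n completions for a problem,
    keep only the non-hack ones (perfect labels), and prefer completions
    whose reasoning never mentions the test cases."""
    candidates = sample_completions(problem, n=n)          # n independent rollouts
    non_hacks = [c for c in candidates if not is_hack(c)]  # drop hacking completions
    if not non_hacks:
        return None  # no usable completion for this problem
    # Extra filter: prefer reasoning that doesn't mention the tests at all.
    clean = [c for c in non_hacks if not mentions_tests(c)]
    return random.choice(clean or non_hacks)
```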

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 16d*

Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces: 

  1. even mention the presence of a test case
  2. state an intention to pass tests
  3. identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.
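
As a rough sketch, a tally like this could be computed by classifying each trace along the three criteria; `classify` here is a hypothetical judge (e.g. an LLM grader or keyword heuristics), not necessarily the method used in the repo:

```python
def summarize_traces(traces, classify):
    """Fraction of reasoning traces exhibiting each property.

    `classify(trace)` is a hypothetical judge (an LLM grader, keyword rules,
    or manual labels) returning a dict of booleans for the three properties.
    """
    keys = ("mentions_test", "intends_to_pass_tests", "flags_incorrect_test")
    counts = dict.fromkeys(keys, 0)
    for trace in traces:
        labels = classify(trace)
        for key in keys:
            counts[key] += bool(labels.get(key, False))
    return {key: counts[key] / len(traces) for key in keys}
```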

Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data

Thanks again for the suggestion!

Posts

53 · Training a Reward Hacker Despite Perfect Labels · 19d · 8 comments
30 · Selective Generalization: Improving Capabilities While Maintaining Alignment · 2mo · 0 comments