In terms of content, this has a lot of overlap with *Reward is not the optimization target*. I'm basically rewriting part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model goes something like this (a toy sketch follows the list):
- A model takes some actions.
- If a human approves of these actions, the human gives the model some reward.
- Humans can be deceived into giving reward in situations where, if they had more knowledge, they would not.
- Models will take advantage of this to get more reward.
- Models will therefore become deceptive.
Before continuing, I would encourage you to really engage with...