
Josh Levy

Comments

OpenAI: Detecting misbehavior in frontier reasoning models
Josh Levy · 4mo · 10

They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!
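If that is the direction they mean, a minimal sketch of the idea (my own framing, not anything from the paper) could look like the following: split the CoT into a private reasoning segment and a final segment, and apply the process-supervision signal only to the latter, so the private segment stays free of optimization pressure and remains monitorable. The `split_cot` helper, the `<final>` marker, and `process_reward_model` are all hypothetical.

```python
# Hypothetical sketch (not from the paper): apply process supervision to only
# the last segment of the CoT, in the spirit of a shoggoth/face split.
# split_cot, the "<final>" marker, and process_reward_model are assumptions.

def split_cot(cot: str, marker: str = "<final>") -> tuple[str, str]:
    """Split a CoT into (private_reasoning, supervised_segment)."""
    if marker in cot:
        private, supervised = cot.split(marker, 1)
    else:
        private, supervised = cot, ""  # nothing after the marker gets supervised
    return private, supervised


def process_supervision_loss(cot: str, process_reward_model) -> float:
    """Compute the process-supervision signal from the supervised segment only.

    The private segment is never scored, so no optimization pressure (and no
    incentive to obfuscate) is applied to it.
    """
    _private, supervised = split_cot(cot)
    return -process_reward_model(supervised)  # higher reward -> lower loss
```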

It seems like the suggestion here is to apply less process supervision (i.e., to only part of the CoT). In this case, though, the more straightforward fix is to apply more supervision. Specifically, they were not applying supervision to the actions, only to the CoT. From the paper:

We chose to exclude agent actions from the monitor because it allows us to make statements around the monitorability of, and optimization pressures on, the CoT in isolation; however, we note that it may be more natural to give the monitor complete information of the trajectory as we do when monitoring our frontier agents.

Another way of saying this is that their method is not explicitly trying to make the CoT faithful. They are leaving it open for the model to decide between:

  1. don't say you'll hack, and hack anyway
  2. don't say you'll hack, and really don't hack

The model finds that it's easier to do 1. But if the action were also being monitored, 2 would almost certainly be chosen.
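To make the distinction concrete, here is a rough sketch (not the paper's actual monitoring setup) contrasting a CoT-only monitor with one that also sees the action; `flags_hacking` stands in for whatever judge model is used and is purely an assumption.

```python
# Illustrative sketch, not the paper's implementation: a monitor that only
# reads the CoT can be evaded by option 1 (hack without mentioning it),
# whereas a monitor that also reads the action cannot.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    cot: str     # the model's chain of thought for this step
    action: str  # what it actually does (e.g. the code or command it emits)


def cot_only_monitor(step: Step, flags_hacking: Callable[[str], bool]) -> bool:
    # Misses hacks the model never verbalizes (option 1 above).
    return flags_hacking(step.cot)


def full_trajectory_monitor(step: Step, flags_hacking: Callable[[str], bool]) -> bool:
    # Flags the hack whether or not it is verbalized, so the easiest way to
    # avoid being flagged is to actually not hack (option 2 above).
    return flags_hacking(step.cot) or flags_hacking(step.action)
```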

Maybe the more interesting question is whether we could expect monitoring the action to fix this issue in general. Perhaps not. For example, the action may just be a directive/choice ("do this") that follows a long chain of reasoning, and it may not be possible to monitor it the way you can here.

Anyway, the goal should really be to make the CoT faithful; they were not explicitly trying to do that here, and may even have been applying pressure in the opposite direction.

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.
1y · 20