A description from Doing oversight from the very start of training seems hard:
Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.
Examples of oversight techniques include:
- Transparency tools (either used by a human, an AI, or a human assisted by an AI)
- Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
- Relaxed adversarial training (which could be seen as an extension of adversarial inputs)