AI ALIGNMENT FORUM
AF

Wikitags

AI Oversight

Edited by RobertM last updated 20th Sep 2022

AI Oversight as described by Doing oversight from the very start of training seems hard:

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)
  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)
Subscribe
1
Subscribe
1
Discussion0
Discussion0
Posts tagged AI Oversight
38Oversight Misses 100% of Thoughts The AI Does Not Think
johnswentworth
3y
23
16Quick thoughts on "scalable oversight" / "super-human feedback" research
David Scott Krueger (formerly: capybaralet)
3y
9
58Measuring and Improving the Faithfulness of Model-Generated Reasoning
Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, Ethan Perez
2y
12
17Human-AI Complementarity: A Goal for Amplified Oversight
rishubjain, Sophie Bridgers
8mo
3
6Doing oversight from the very start of training seems hard
peterbarnett
3y
1
Add Posts