AI Alignment Forum: Tags

AI Oversight

• Applied to W2SG: Introduction by Maria Kapros 7mo ago
• Applied to The weak-to-strong generalization (WTSG) paper in 60 seconds by Alana 9mo ago
• Applied to Measuring and Improving the Faithfulness of Model-Generated Reasoning by HenningB 1y ago
• Applied to Is there any existing term summarizing non-scalable oversight methods in outer alignment? by Allen Shen 1y ago
• Applied to Trying to measure AI deception capabilities using temporary simulation fine-tuning by Alain Le Noac'h 1y ago
• Applied to Oversight Misses 100% of Thoughts The AI Does Not Think by Thomas Kwa 2y ago
• Applied to "Corrigibility at some small length" by dath ilan by Christopher King 2y ago
• Applied to Quick thoughts on "scalable oversight" / "super-human feedback" research by RobertM 2y ago

RobertM v1.1.0 Sep 20th 2022 GMT (+28/-18)

~~A description from~~ AI Oversight as described by Doing oversight from the very start of training seems hard:

• Applied to Doing oversight from the very start of training seems hard by RobertM 2y ago

RobertM v1.0.0 Sep 20th 2022 GMT (+721)

A description from Doing oversight from the very start of training seems hard:

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.
Examples of oversight techniques include:
• Transparency tools (either used by a human, an AI, or a human assisted by an AI)
• Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
• Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

• Created by RobertM 2y ago