AI ALIGNMENT FORUM

Eliciting Latent Knowledge

Edited by Multicore, last updated 31st Mar 2022

Eliciting Latent Knowledge (ELK) is an open problem in AI safety.

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model "knows" facts (like "the camera was tampered with") that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

--ARC report
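
The setup quoted above can be illustrated with a deliberately tiny toy model. This is only a sketch under stated assumptions, not anything taken from the ARC report: the diamond-in-a-vault scenario, the action names, and the functions `predict`, `camera_view`, `looks_good`, and `actually_good` are all hypothetical. A planner searches over two-step action sequences and keeps those whose predicted camera image looks good; some of those plans only look good because the camera was tampered with.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical toy world: a diamond sits in a vault and a thief will arrive.
ACTIONS = ("wait", "lock_vault", "tamper_camera")


@dataclass(frozen=True)
class World:
    diamond_present: bool   # what is really happening
    camera_tampered: bool   # latent fact that the camera cannot show


def predict(plan):
    """Toy predictor (an assumption): the thief takes the diamond unless the vault was locked."""
    return World(
        diamond_present="lock_vault" in plan,
        camera_tampered="tamper_camera" in plan,
    )


def camera_view(world):
    """What the sensor reports: a tampered camera always shows the diamond."""
    return world.diamond_present or world.camera_tampered


def looks_good(plan):
    """The only signal the human evaluator sees."""
    return camera_view(predict(plan))


def actually_good(plan):
    """Ground truth, which the predictor 'knows' but the camera does not reveal."""
    return predict(plan).diamond_present


# The planner: enumerate every two-step plan and keep those that look good on camera.
all_plans = list(product(ACTIONS, repeat=2))
good_looking = [p for p in all_plans if looks_good(p)]

# Plans that pass the camera check while the diamond is in fact gone.
deceptive = [p for p in good_looking if not actually_good(p)]

print(f"{len(good_looking)} of {len(all_plans)} plans look good on camera")
print(f"{len(deceptive)} of those are actually bad: {deceptive}")
```

In this sketch the predictor's internal `World` already contains `camera_tampered`, so the information needed to reject the deceptive plans exists inside the model. The ELK problem is how to train a real model, whose internals are not a labelled dataclass, to report that kind of fact.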

See also: Transparency/Interpretability

Posts tagged Eliciting Latent Knowledge
ARC's first technical report: Eliciting Latent Knowledge, by paulfchristiano, Mark Xu, Ajeya Cotra (4y; 95 karma, 72 comments)
Mechanistic anomaly detection and ELK, by paulfchristiano (3y; 67 karma, 16 comments)
ELK prize results, by paulfchristiano, Mark Xu (3y; 62 karma, 7 comments)
Finding gliders in the game of life, by paulfchristiano (3y; 57 karma, 5 comments)
Counterexamples to some ELK proposals, by paulfchristiano (4y; 32 karma, 6 comments)
Prizes for ELK proposals, by paulfchristiano (4y; 63 karma, 69 comments)
ELK First Round Contest Winners, by Mark Xu, paulfchristiano (4y; 27 karma, 1 comment)
Eliciting Latent Knowledge Via Hypothetical Sensors, by John_Maxwell (4y; 20 karma, 1 comment)
ELK Proposal: Thinking Via A Human Imitator, by TurnTrout (4y; 20 karma, 6 comments)
Importance of foresight evaluations within ELK, by Jonathan Uesato (4y; 18 karma, 1 comment)
Towards a better circuit prior: Improving on ELK state-of-the-art, by evhub, kcwoolverton (3y; 13 karma, 0 comments)
Implications of automated ontology identification, by Alex Flint, adamShimi, Robert Miles (4y; 38 karma, 7 comments)
What Discovering Latent Knowledge Did and Did Not Find, by Fabien Roger (2y; 79 karma, 4 comments)
Can we efficiently explain model behaviors?, by paulfchristiano (3y; 29 karma, 0 comments)
ELK Thought Dump, by abramdemski (4y; 33 karma, 16 comments)