AI Alignment Forum

Corrigibility


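A corrigible agent is one that does not interfere with what we would intuitively see as attempts to 'correct' the agent, or to 'correct' our mistakes in building it, and that permits these 'corrections' even though instrumentally convergent reasoning apparently argues against doing so. For example, an agent pursuing almost any goal has an instrumental reason to resist being shut down or having that goal modified, since either would make the goal less likely to be achieved; a corrigible agent allows such interventions anyway.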

Posts tagged Corrigibility (sorted by tag relevance)
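"Corrigibility" by Paul Christiano (2y): 12 karma, 2 comments, relevance 3
"AI Alignment 2018-19 Review" by Rohin Shah (1y): 45 karma, 6 comments, relevance 1
"Non-Obstruction: A Simple Concept Motivating Corrigibility" by Alex Turner (2mo): 28 karma, 15 comments, relevance 1
"Three mental images from thinking about AGI debate & corrigibility" by Steve Byrnes (6mo): 21 karma, 32 comments, relevance 1
"Towards a mechanistic understanding of corrigibility" by Evan Hubinger (1y): 19 karma, 25 comments, relevance 1
"Corrigibility as outside view" by Alex Turner (8mo): 19 karma, 10 comments, relevance 1
"Can corrigibility be learned safely?" by Wei Dai (3y): 16 karma, 15 comments, relevance 1
"Do what we mean vs. do what we say" by Rohin Shah (2y): 13 karma, 2 comments, relevance 1
"Thoughts on implementing corrigible robust alignment" by Steve Byrnes (1y): 14 karma, 2 comments, relevance 1
"Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence." by Ryan Carey (2y): 8 karma, 0 comments, relevance 1
"An Idea For Corrigible, Recursively Improving Math Oracles" by Jim Babcock (6y): 5 karma, 0 comments, relevance 1
"Corrigible omniscient AI capable of making clones" by Kaj Sotala (6y): 3 karma, 0 comments, relevance 1
"Announcement: AI alignment prize round 4 winners" by Vladimir Slepnev (2y): 21 karma, 0 comments, relevance 0
"New paper: Corrigibility with Utility Preservation" by Koen Holtman (1y): 11 karma, 0 comments, relevance 0
"Petrov corrigibility" by Stuart Armstrong (2y): 8 karma, 9 comments, relevance 0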
(15 of 17 tagged posts shown)