AI ALIGNMENT FORUM

Gradient Hacking

Written by Multicore, last updated 27th Aug 2022

Gradient hacking describes a scenario in which a mesa-optimizer inside an AI system deliberately influences how gradient descent updates the system, typically in order to preserve its own mesa-objective across future training iterations.
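The core mechanism can be illustrated with a toy sketch (mine, not a construction from any of the posts below): a model couples its training loss to deviations of a "protected" parameter from a reference value, so that ordinary gradient descent on the outer loss restores that parameter even after a perturbation. All names here (`w`, `m`, `m0`, `lam`) are invented for illustration.

```python
# Toy illustration of the gradient-hacking idea: a "task" parameter w
# and a "protected" parameter m (standing in for a mesa-objective).
# The model's own computation adds a loss term penalizing deviations of
# m from a reference value m0, so gradient descent pulls m back toward
# m0 even when something outside training perturbs it.
m0 = 3.0    # the value the hypothetical mesa-optimizer "wants" to keep
lam = 10.0  # strength of the self-imposed coupling

def loss(w, m, x, y):
    task = (w * x - y) ** 2      # ordinary training loss
    guard = lam * (m - m0) ** 2  # loss the model routes through itself
    return task + guard

def grads(w, m, x, y):
    # Analytic gradients of the loss above.
    dw = 2 * (w * x - y) * x
    dm = 2 * lam * (m - m0)
    return dw, dm

w, m = 0.0, m0 + 1.0  # m has been perturbed away from m0
lr = 0.01
for _ in range(500):
    dw, dm = grads(w, m, x=1.0, y=2.0)
    w -= lr * dw
    m -= lr * dm

# Gradient descent both solves the task (w -> 2) and restores m -> m0:
# the "protected" parameter is defended by the loss landscape itself.
```

Real gradient-hacking scenarios concern learned circuits creating such couplings implicitly inside a network, which is far harder than this hand-built version; the sketch only shows why a loss coupling protects a parameter from updates.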

See also: Inner Alignment

Posts tagged Gradient Hacking
53 · Gradient hacking · Evan Hubinger · 6y · 33 comments
71 · Gradient hacking is extremely difficult · Beren Millidge · 2y · 0 comments
17 · Gradient Filtering · Arun Jose, janus · 2y · 0 comments
15 · Challenge: construct a Gradient Hacker · Thomas Larsen, Thomas Kwa · 2y · 2 comments
9 · Some real examples of gradient hacking · Oliver Sourbut · 4y · 0 comments
24 · [Question] How does Gradient Descent Interact with Goodhart? · Scott Garrabrant, Evan Hubinger · 6y · 4 answers
33 · Gradient Hacker Design Principles From Biology · johnswentworth · 3y · 11 comments
26 · Understanding Gradient Hacking · Peter Barnett · 4y · 2 comments
25 · Towards Deconfusing Gradient Hacking · leogao · 4y · 2 comments
19 · Gradient hacking: definitions and examples · Richard Ngo · 3y · 1 comment
22 · Thoughts on gradient hacking · Richard Ngo · 4y · 8 comments
10 · Approaches to gradient hacking · Adam Shimi · 4y · 7 comments
52 · Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation · Fabien Roger, Buck Shlegeris · 2y · 0 comments
30 · Meta learning to gradient hack · Quintin Pope · 4y · 7 comments
8 · Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor · Roger Dearnaley · 1y · 0 comments