AI ALIGNMENT FORUM
AF

Wikitags

Gradient Hacking

Edited by Multicore last updated 27th Aug 2022

Gradient Hacking describes a scenario where a mesa-optimizer in an AI system acts in a way that intentionally manipulates the way that gradient descent updates it, likely to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment

Subscribe
1
Subscribe
1
Discussion0
Discussion0
Posts tagged Gradient Hacking
53Gradient hacking
evhub
6y
33
71Gradient hacking is extremely difficult
beren
3y
0
17Gradient Filtering
Jozdien, janus
3y
0
15Challenge: construct a Gradient Hacker
Thomas Larsen, Thomas Kwa
2y
2
9Some real examples of gradient hacking
Oliver Sourbut
4y
0
24How does Gradient Descent Interact with Goodhart?
Q
Scott Garrabrant, evhub
7y
Q
4
33Gradient Hacker Design Principles From Biology
johnswentworth
3y
11
26Understanding Gradient Hacking
peterbarnett
4y
2
25Towards Deconfusing Gradient Hacking
leogao
4y
2
19Gradient hacking: definitions and examples
Richard_Ngo
3y
1
22Thoughts on gradient hacking
Richard_Ngo
4y
8
10Approaches to gradient hacking
adamShimi
4y
7
52Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger, Buck
2y
0
30Meta learning to gradient hack
Quintin Pope
4y
7
8Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley
2y
0
Load More (15/19)
Add Posts