AI Alignment Forum: Tags

Gradient Hacking

Contributors: Multicore

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, most likely in order to preserve its own mesa-objective in future iterations of the AI.
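
As a rough intuition pump for the definition above, here is a minimal toy sketch in Python. It is not taken from any of the tagged posts. It assumes one mechanism that is sometimes discussed: the model performs worse whenever its mesa-objective is changed, so gradient descent is discouraged from changing it. The parameter names (`w`, `m`, `M0`), the quadratic loss terms, and all constants are hypothetical choices made purely for illustration, and real networks are far more complicated than this.

```python
# Toy illustration only: a two-parameter "model" trained by plain gradient descent.
# w is the task parameter (optimum at 2.0); m stands in for a mesa-objective.
# All names and constants are hypothetical.

M0 = 1.0  # the value of m that the hypothetical mesa-optimizer "wants" to keep


def loss(w, m, self_sabotage):
    task = (w - 2.0) ** 2   # ordinary task loss
    drift = 0.1 * m ** 2    # training pressure that erodes m toward 0
    # Assumed mechanism: performance degrades whenever m moves away from M0,
    # so any update that changes m also raises the loss.
    penalty = 5.0 * (m - M0) ** 2 if self_sabotage else 0.0
    return task + drift + penalty


def grad(w, m, self_sabotage):
    # Analytic gradient of loss() with respect to (w, m).
    dw = 2.0 * (w - 2.0)
    dm = 0.2 * m + (10.0 * (m - M0) if self_sabotage else 0.0)
    return dw, dm


def train(self_sabotage, steps=500, lr=0.05):
    w, m = 0.0, M0
    for _ in range(steps):
        dw, dm = grad(w, m, self_sabotage)
        w -= lr * dw
        m -= lr * dm
    return w, m


if __name__ == "__main__":
    print("plain training:     w=%.4f m=%.4f" % train(False))
    print("with self-sabotage: w=%.4f m=%.4f" % train(True))
```

In both runs `w` converges to the task optimum. Without the self-imposed penalty, `m` decays toward zero; with it, `m` stays close to `M0`. The sketch says nothing about whether a real mesa-optimizer could implement such a coupling, which is the question several of the posts listed below debate.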

See also: Inner Alignment

Posts tagged Gradient Hacking
Gradient hacking (Evan Hubinger; 4y; karma 51; comments 33)
Gradient hacking is extremely difficult (Beren Millidge; 1y; karma 65; comments 0)
Gradient Filtering (Arun Jose, janus; 1y; karma 16; comments 0)
Challenge: construct a Gradient Hacker (Thomas Larsen, Thomas Kwa; 9mo; karma 14; comments 2)
Some real examples of gradient hacking (Oliver Sourbut; 2y; karma 9; comments 0)
How does Gradient Descent Interact with Goodhart? [Question] (Scott Garrabrant, Evan Hubinger; 5y; karma 24; comments 4)
Gradient Hacker Design Principles From Biology (johnswentworth; 1y; karma 32; comments 11)
Towards Deconfusing Gradient Hacking (leogao; 2y; karma 25; comments 2)
Gradient hacking: definitions and examples (Richard Ngo; 1y; karma 19; comments 1)
Thoughts on gradient hacking (Richard Ngo; 2y; karma 22; comments 8)
Approaches to gradient hacking (Adam Shimi; 2y; karma 10; comments 7)
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation (Fabien Roger, Buck Shlegeris; 2mo; karma 49; comments 0)
Meta learning to gradient hack (Quintin Pope; 2y; karma 30; comments 7)
Understanding Gradient Hacking (Peter Barnett; 2y; karma 26; comments 2)
[ASoT] Simulators show us behavioural properties by default (Arun Jose; 1y; karma 8; comments 0)
(15 of 17 posts shown)