AI Alignment Forum: Tags

Gradient Hacking

Contributors: Multicore

Gradient Hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, most plausibly in order to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment

Posts tagged Gradient Hacking
Gradient hacking (Evan Hubinger, 4y)
Gradient hacking is extremely difficult (Beren Millidge, 4mo)
Gradient Filtering (Arun Jose, janus, 4mo)
Challenge: construct a Gradient Hacker (Thomas Larsen, Thomas Kwa, 3mo)
Some real examples of gradient hacking (Oliver Sourbut, 2y)
How does Gradient Descent Interact with Goodhart? (Scott Garrabrant, Evan Hubinger, 4y)
Gradient Hacker Design Principles From Biology (johnswentworth, 9mo)
Thoughts on gradient hacking (Richard Ngo, 2y)
Towards Deconfusing Gradient Hacking (leogao, 2y)
Gradient hacking: definitions and examples (Richard Ngo, 1y)
Approaches to gradient hacking (Adam Shimi, 2y)
Meta learning to gradient hack (Quintin Pope, 2y)
Gradient Hacking via Schelling Goals (Adam Scherlis, 1y)
Understanding Gradient Hacking (Peter Barnett, 1y)
A Toy Model of Gradient Hacking (Oam Patel, 1y)