AI ALIGNMENT FORUM

Gradient Hacking

Written by Multicore, last updated 27th Aug 2022

Gradient hacking describes a scenario in which a mesa-optimizer inside an AI system deliberately influences how gradient descent updates the system, typically in order to preserve its own mesa-objective across future training iterations.
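The core mechanism can be illustrated with a toy sketch (mine, not a construction from any of the posts below): a model couples its training loss to deviations of a "protected" parameter from a reference value, so that ordinary gradient descent on the outer loss restores that parameter even after a perturbation. All names here (`w`, `m`, `m0`, `lam`) are invented for illustration.

```python
# Toy illustration of the gradient-hacking idea: a "task" parameter w
# and a "protected" parameter m (standing in for a mesa-objective).
# The model's own computation adds a loss term penalizing deviations of
# m from a reference value m0, so gradient descent pulls m back toward
# m0 even when something outside training perturbs it.
m0 = 3.0    # the value the hypothetical mesa-optimizer "wants" to keep
lam = 10.0  # strength of the self-imposed coupling

def loss(w, m, x, y):
    task = (w * x - y) ** 2      # ordinary training loss
    guard = lam * (m - m0) ** 2  # loss the model routes through itself
    return task + guard

def grads(w, m, x, y):
    # Analytic gradients of the loss above.
    dw = 2 * (w * x - y) * x
    dm = 2 * lam * (m - m0)
    return dw, dm

w, m = 0.0, m0 + 1.0  # m has been perturbed away from m0
lr = 0.01
for _ in range(500):
    dw, dm = grads(w, m, x=1.0, y=2.0)
    w -= lr * dw
    m -= lr * dm

# Gradient descent both solves the task (w -> 2) and restores m -> m0:
# the "protected" parameter is defended by the loss landscape itself.
```

Real gradient-hacking scenarios concern learned circuits creating such couplings implicitly inside a network, which is far harder than this hand-built version; the sketch only shows why a loss coupling protects a parameter from updates.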

See also: Inner Alignment

Posts tagged Gradient Hacking
53 · Gradient hacking · Evan Hubinger · 6y · 33 comments
71 · Gradient hacking is extremely difficult · Beren Millidge · 2y · 0 comments
17 · Gradient Filtering · Arun Jose, janus · 2y · 0 comments
15 · Challenge: construct a Gradient Hacker · Thomas Larsen, Thomas Kwa · 2y · 2 comments
9 · Some real examples of gradient hacking · Oliver Sourbut · 4y · 0 comments
24 · [Question] How does Gradient Descent Interact with Goodhart? · Scott Garrabrant, Evan Hubinger · 6y · 4 answers
33 · Gradient Hacker Design Principles From Biology · johnswentworth · 3y · 11 comments
26 · Understanding Gradient Hacking · Peter Barnett · 4y · 2 comments
25 · Towards Deconfusing Gradient Hacking · leogao · 4y · 2 comments
19 · Gradient hacking: definitions and examples · Richard Ngo · 3y · 1 comment
22 · Thoughts on gradient hacking · Richard Ngo · 4y · 8 comments
10 · Approaches to gradient hacking · Adam Shimi · 4y · 7 comments
52 · Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation · Fabien Roger, Buck Shlegeris · 2y · 0 comments
30 · Meta learning to gradient hack · Quintin Pope · 4y · 7 comments
8 · Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor · Roger Dearnaley · 1y · 0 comments