AI ALIGNMENT FORUM

Gradient Hacking — AI Alignment Forum

Gradient Hacking

Edited by Multicore. Last updated 27th Aug 2022.

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, most likely in order to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment

Posts tagged Gradient Hacking

[53] Gradient hacking (7y)

[71] Gradient hacking is extremely difficult (3y)

[17] Gradient Filtering (3y)

[16] Challenge: construct a Gradient Hacker, by Thomas Larsen, Thomas Kwa (3y)

[10] Some real examples of gradient hacking (5y)

[38] Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training (4mo)

[24] How does Gradient Descent Interact with Goodhart?, by Scott Garrabrant, evhub (7y)

[33] Gradient Hacker Design Principles From Biology (4y)

[21] Gradient hacking: definitions and examples (4y)

[27] Towards Deconfusing Gradient Hacking (5y)

[26] Understanding Gradient Hacking (4y)

[22] Thoughts on gradient hacking (5y)

[10] Approaches to gradient hacking (5y)

[5] A scheme to credit hack policy gradient training, by Adrià Garriga-alonso (7mo)

[62] Did Claude 3 Opus align itself via gradient hacking?, by Fiora Starlight (3mo)
