AI Alignment Forum: Tags

Inner Alignment

You are viewing revision 1.1.0, last edited by jimv

Inner Alignment is the problem of ensuring that mesa-optimizers (trained ML systems that are themselves optimizers) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success; they use birth control and still go out and have fun. This is a failure of inner alignment.

The term was first defined in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
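The distinction in the quoted definition can be made concrete with a toy sketch. Everything below is a hypothetical illustration, not taken from the paper: the 1-D world, the names `base_objective`, `mesa_objective` and `greedy_optimizer`, and the hand-written proxy (which stands in for an objective that a real system would have learned) are all assumptions. It shows a proxy objective that scores perfectly under the base objective on the training setup yet diverges from it once the setup changes.

```python
from typing import Callable

SIZE = 5            # cells 0..4 on a line (assumed toy world)
ACTIONS = (-1, +1)  # step left or step right


def base_objective(position: int, goal: int) -> float:
    """What the training process rewards: being close to the goal."""
    return -abs(position - goal)


def mesa_objective(position: int, goal: int) -> float:
    """Hypothetical proxy objective: being far to the right. Ignores the goal."""
    return float(position)


def greedy_optimizer(objective: Callable[[int, int], float],
                     start: int, goal: int, steps: int) -> int:
    """At each step, search the reachable cells and move to the one the
    given objective scores highest. Returns the final position."""
    position = start
    for _ in range(steps):
        candidates = [min(max(position + a, 0), SIZE - 1) for a in ACTIONS]
        candidates.append(position)  # staying put is also allowed
        position = max(candidates, key=lambda p: objective(p, goal))
    return position


# Training setup: the goal always sits in the rightmost cell, so pursuing
# the proxy is indistinguishable from pursuing the base objective.
train_end = greedy_optimizer(mesa_objective, start=0, goal=SIZE - 1, steps=6)
train_score = base_objective(train_end, SIZE - 1)

# Changed setup: the goal is at cell 1. The proxy still heads right.
test_end = greedy_optimizer(mesa_objective, start=2, goal=1, steps=6)
test_score = base_objective(test_end, 1)

# For comparison, an optimizer pursuing the base objective directly.
aligned_end = greedy_optimizer(base_objective, start=2, goal=1, steps=6)
aligned_score = base_objective(aligned_end, 1)

print(train_score, test_score, aligned_score)
```

In this sketch the base objective itself is assumed to capture what the programmers want, so outer alignment is not the issue; the gap arises only because the inner optimizer pursues a different objective that happened to agree with the base objective during training.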

...

Posts tagged Inner Alignment

The Inner Alignment Problem, by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant (5y)

Risks from Learned Optimization: Introduction, by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant (5y)

Inner Alignment: Explain like I'm 12 Edition (4y)

Demons in Imperfect Search (4y)

How To Go From Interpretability To Alignment: Just Retarget The Search (2y)

Mesa-Search vs Mesa-Control (4y)

How to Control an LLM's Behavior (why my P(DOOM) went down), by Roger Dearnaley (5mo)

Reward is not the optimization target (2y)

The Solomonoff Prior is Malign (4y)

Matt Botvinick on the spontaneous emergence of learning algorithms (4y)

Searching for Search, by Nicholas Kees Dupuis, janus (1y)

Open question: are minimal circuits daemon-free?, by Paul Christiano (6y)

Concrete experiments in inner alignment (5y)

Relaxed adversarial training for inner alignment (5y)

Why almost every RL agent does learned optimization (1y)