
Inner Alignment

Edited by markovial, Ben Pace, et al. last updated 30th Dec 2024

Inner Alignment is the problem of ensuring that mesa-optimizers (i.e. the optimizers that arise when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure. It occurs when the mesa-optimizer appears to pursue the base objective during training but stops pursuing it during deployment. We mistakenly think that good performance on the training distribution means that the mesa-optimizer is pursuing the base objective. However, this may have occurred only because correlations in the training distribution produced good performance on both the base and mesa objectives. When the distribution shifts from training to deployment, the correlation breaks and the mesa-objective fails to generalize. This is especially problematic when the capabilities generalize to the deployment distribution while the objectives/goals do not, since we are then left with a capable system optimizing for a misaligned goal.
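To make the correlational failure concrete, here is a minimal toy sketch in Python (an illustrative assumption throughout: this is an ordinary linear classifier, not a mesa-optimizer, and all feature names and parameters are made up for the example). A model is trained on data where a "proxy" feature is spuriously correlated with the label, latches onto the proxy because it is the cleaner signal, and then degrades at deployment once the correlation is broken, even though the model itself is unchanged:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, proxy_correlated):
        # Label y in {0, 1}; s = +/-1 is the underlying signal.
        y = rng.integers(0, 2, n)
        s = 2 * y - 1
        x_true = s + rng.normal(scale=1.0, size=n)  # intended feature: informative but noisy
        if proxy_correlated:
            # During "training", the proxy tracks the label almost perfectly.
            x_proxy = s + rng.normal(scale=0.1, size=n)
        else:
            # At "deployment", the proxy is decorrelated from the label.
            x_proxy = rng.choice([-1.0, 1.0], size=n)
        return np.column_stack([x_true, x_proxy]), y

    # Fit logistic regression by plain gradient descent
    # (features are centered around zero, so no bias term is needed).
    X_train, y_train = make_data(2000, proxy_correlated=True)
    w = np.zeros(2)
    for _ in range(3000):
        p = 1 / (1 + np.exp(-np.clip(X_train @ w, -30, 30)))
        w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)

    def accuracy(X, y):
        return ((X @ w > 0) == (y == 1)).mean()

    X_deploy, y_deploy = make_data(2000, proxy_correlated=False)
    print("weights [intended, proxy]:", w)                   # most weight lands on the proxy
    print("train accuracy:", accuracy(X_train, y_train))     # high, thanks to the proxy
    print("deploy accuracy:", accuracy(X_deploy, y_deploy))  # drops once the correlation breaks

The point of the sketch is that training performance alone cannot distinguish "learned the intended feature" from "learned a proxy that happened to correlate with it"; the gap only becomes visible under distribution shift.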

To solve the inner alignment problem, some sub-problems that we would have to make progress on include deceptive alignment, distribution shifts, and gradient hacking.

Inner Alignment vs. Outer Alignment

Inner alignment is often talked about as being separate from outer alignment. The former deals with guaranteeing that we are robustly aiming at something; the latter deals with the problem of what exactly we are aiming at. For more information see the corresponding tag.

Keep in mind that inner and outer alignment failures can occur together. They are not a dichotomy, and often even experienced alignment researchers are unable to tell them apart, which indicates that classifying failures with these terms is fuzzy. Ideally, we should think not of a binary dichotomy of inner and outer alignment that can be tackled individually, but of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Related Pages:

Mesa-Optimization, Deceptive Alignment, Eliciting Latent Knowledge, Treacherous Turn, Deception

External Links:

  • Video by Robert Miles
Posts tagged Inner Alignment

  • The Inner Alignment Problem (Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant)
  • Risks from Learned Optimization: Introduction (Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant)
  • Inner Alignment: Explain like I'm 12 Edition (Rafael Harth)
  • Demons in Imperfect Search (johnswentworth)
  • How To Go From Interpretability To Alignment: Just Retarget The Search (johnswentworth)
  • Mesa-Search vs Mesa-Control (Abram Demski)
  • Reward is not the optimization target (Alex Turner)
  • How to Control an LLM's Behavior (why my P(DOOM) went down) (Roger Dearnaley)
  • Matt Botvinick on the spontaneous emergence of learning algorithms (Adam Scholl)
  • Searching for Search (Nicholas Kees Dupuis, janus)
  • Open question: are minimal circuits daemon-free? (Paul Christiano)
  • Concrete experiments in inner alignment (Evan Hubinger)
  • Relaxed adversarial training for inner alignment (Evan Hubinger)
  • A "Bitter Lesson" Approach to Aligning AGI and ASI (Roger Dearnaley)
  • Why almost every RL agent does learned optimization (Lee Sharkey)