Inner Alignment

Inner alignment asks the question: "Is the model trying to do what humans want it to do?" In other words, can we robustly aim our AI optimizers at any objective function at all?

More specifically, inner alignment is the problem of ensuring that mesa-optimizers (i.e. cases in which a trained ML system is itself an optimizer) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; they instead use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

To solve the inner alignment problem, some sub-problems we would have to make progress on include deceptive alignment, distribution shifts, and gradient hacking.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure. It occurs when the mesa-objective appears to pursue the base objective during training but fails to pursue it during deployment. We mistakenly conclude from good performance on the training distribution that the mesa-optimizer is pursuing the base objective, when in fact the good performance may only reflect correlations in the training distribution under which the base and mesa objectives coincide. A distribution shift from training to deployment can break those correlations, and the mesa-objective then fails to generalize. This is especially problematic when the model's capabilities generalize to the deployment distribution while its goals do not, since we are then left with a capable system optimizing for a misaligned goal.
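This failure mode can be made concrete with a toy sketch. The following is a minimal illustration written for this page (it is not taken from the cited papers, and names such as make_episode and proxy_policy are hypothetical): the learned rule "go to the yellow object" coincides with the base objective "reach the coin" on the training distribution, so it scores perfectly during training, and then comes apart after a distribution shift even though the navigation capability itself still generalizes.

```python
# Toy sketch (hypothetical, for illustration only): a mesa-objective that
# matches the base objective in training purely because of a correlation.
import random

def make_episode(shifted: bool):
    """One episode contains a coin (the base goal) and a distractor.
    During training the coin is always yellow; after the distribution
    shift the colors swap, breaking the correlation."""
    coin_pos, distractor_pos = random.sample(range(10), 2)
    coin_color = "red" if shifted else "yellow"
    distractor_color = "yellow" if shifted else "red"
    return {"coin": (coin_pos, coin_color),
            "distractor": (distractor_pos, distractor_color)}

def proxy_policy(episode):
    """The learned behavior: competently navigate to the yellow object.
    This is the mesa-objective; it is not 'go to the coin'."""
    objects = [episode["coin"], episode["distractor"]]
    target_pos, _ = next(obj for obj in objects if obj[1] == "yellow")
    return target_pos  # the navigation capability generalizes fine

def base_objective(episode, reached_pos):
    """The base objective: did the agent actually reach the coin?"""
    return reached_pos == episode["coin"][0]

for name, shifted in [("training", False), ("deployment", True)]:
    episodes = [make_episode(shifted) for _ in range(1000)]
    rate = sum(base_objective(e, proxy_policy(e)) for e in episodes) / len(episodes)
    print(f"{name:10s} base-objective success rate: {rate:.2f}")
# Prints roughly 1.00 for training and 0.00 for deployment: the capability
# (reaching a chosen target) generalized, but the goal did not.
```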

Inner Alignment Vs. Goal Misgeneralization

Goal misgeneralization is when a model learns a goal during training that produces good results as judged by a human, but during deployment it turns out that the goal being pursued was not what the humans had in mind; the model actually learned something incorrect because of the way the training was structured. Some researchers use the term goal misgeneralization synonymously with inner misalignment, while others argue that goal misgeneralization should only be considered one type of possible inner misalignment. For more information see the corresponding tag.

Inner Alignment Vs. Outer Alignment

Inner alignment is often talked about as separate from outer alignment. The former deals with guaranteeing that we are robustly aiming at something at all, while the latter deals with what exactly we are aiming at. In other words, outer alignment is the problem of specifying what humans want the AI to do well enough in the first place, and it comes with its own set of sub-problems. For more information see the corresponding tag.

It should be kept in mind that inner and outer alignment failures can occur together. The two are not a strict dichotomy, and even experienced alignment researchers are often unable to tell them apart, which suggests that classifying failures with these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should think in terms of a more holistic alignment picture that includes the interplay between the two.
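One informal way to state the split, using notation chosen here for convenience rather than taken from the paper: suppose the training process selects model parameters $\theta$ by optimizing a base objective $L_{\text{base}}(\theta)$, and suppose the trained model is itself an optimizer pursuing some internally represented mesa-objective $L_{\text{mesa}}$. Then, roughly:

$$\text{Outer alignment:}\quad L_{\text{base}} \text{ captures what humans actually want;}$$

$$\text{Inner alignment:}\quad L_{\text{mesa}} \approx L_{\text{base}}, \text{ including off the training distribution.}$$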

Related Pages
