Outer Alignment

Outer alignment asks the question: "What should we aim our model at?" In other words, is the model optimizing for the correct reward, such that there are no exploitable loopholes? It is also known as the reward misspecification problem: if we are unable to correctly tell the AI what we want, then we have failed to specify the reward.
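
As a toy illustration of an exploitable loophole in a specified reward (a minimal sketch; the cleaning-robot environment, policies, and numbers below are invented for this example, not drawn from any real benchmark), consider an agent that is rewarded per item it deposits in a bin rather than for the room actually ending up clean:

```python
# Hypothetical toy environment, invented purely to illustrate reward
# misspecification: the designer intends "leave the room clean", but the
# specified reward is "+1 per item deposited in the bin". Dumping the bin
# back onto the floor and re-depositing the same items exploits the loophole.

def run_episode(policy, steps=20):
    trash_on_floor, trash_in_bin, reward = 3, 0, 0
    for _ in range(steps):
        action = policy(trash_on_floor, trash_in_bin)
        if action == "deposit" and trash_on_floor > 0:
            trash_on_floor -= 1
            trash_in_bin += 1
            reward += 1                      # specified reward: +1 per deposit
        elif action == "dump" and trash_in_bin > 0:
            trash_on_floor += trash_in_bin   # undoing the work costs nothing
            trash_in_bin = 0
    return reward, trash_on_floor            # (reward earned, trash left on floor)

def intended_policy(floor, bin_count):
    return "deposit" if floor > 0 else "idle"

def loophole_policy(floor, bin_count):
    return "deposit" if floor > 0 else "dump"

print(run_episode(intended_policy))   # (3, 0): low reward, clean room
print(run_episode(loophole_policy))   # (15, 3): high reward, room still dirty
```

The specified reward looks reasonable at a glance, but it pays for a proxy of cleanliness rather than cleanliness itself, and the highest-reward behavior is exactly the one the designers did not intend.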

In the context of machine learning, outer alignment is the property that the specified loss function is aligned with the intended goal of its designers. This is what is typically discussed as the 'value alignment' problem.

Overall, outer alignment as a problem is intuitive enough to understand, i.e., is the specified loss function aligned with the intended goal of its designers? However, implementing this in practice is extremely difficult. Conveying the full “intention” behind a human request is equivalent to conveying the sum of all human values and ethics. This is difficult in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are all susceptible to Goodhart’s Law, which means that we might be unable to foresee the negative consequences that arise from excessive optimization pressure on a goal that would otherwise look well specified to humans.
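
A minimal numerical sketch of the Goodhart effect described above (all distributions and numbers are invented for illustration): suppose the proxy reward equals the true objective plus an independent error term. Under light selection the proxy tracks the true objective reasonably well, but as optimization pressure increases, the gap between the best proxy score and the true value behind it keeps growing.

```python
# Toy illustration of Goodhart's Law under increasing optimization pressure
# (invented numbers): proxy = true value + misspecification noise.
import random

random.seed(0)

def candidate():
    true_value = random.gauss(0, 1)
    proxy = true_value + random.gauss(0, 1)   # proxy looks well specified on average
    return true_value, proxy

for n in (10, 1_000, 100_000):                # n = how hard we optimize the proxy
    pool = [candidate() for _ in range(n)]
    best_true, best_proxy = max(pool, key=lambda c: c[1])
    print(f"best of {n:>7,}: proxy = {best_proxy:4.2f}, true value = {best_true:4.2f}")
```

The proxy score of the selected candidate keeps climbing, while its true value lags further and further behind: the harder we optimize, the more of the selection pressure goes into the error term rather than into the goal we actually cared about.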

To solve the outer alignment problem, some sub-problems that we would have to make progress on include specification gaming, value learning, and reward shaping/modeling. Some proposed solutions to outer alignment include scalable oversight techniques such as IDA, as well as adversarial oversight techniques such as debate.
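
As a sketch of one of the sub-problems named above, reward modeling is often framed as learning a reward function from pairwise human preference comparisons. The following is a minimal Bradley-Terry-style sketch with invented toy data and features, not a description of any particular system:

```python
# Minimal sketch of reward modeling from pairwise preferences (Bradley-Terry
# style model; the feature vectors, labels, and weights are invented toy data).
import math
import random

random.seed(0)

def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Each datum: (features of trajectory A, features of trajectory B, 1 if A preferred).
hidden_w = [2.0, -1.0]          # stands in for the human's true preferences
data = []
for _ in range(200):
    a = [random.uniform(-1, 1) for _ in range(2)]
    b = [random.uniform(-1, 1) for _ in range(2)]
    pref = 1 if reward(hidden_w, a) > reward(hidden_w, b) else 0
    data.append((a, b, pref))

# Fit the reward model by gradient ascent on the preference log-likelihood.
w = [0.0, 0.0]
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0]
    for a, b, pref in data:
        p = sigmoid(reward(w, a) - reward(w, b))   # P(A preferred | w)
        for i in range(2):
            grad[i] += (pref - p) * (a[i] - b[i])
    w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]

print("learned reward weights:", [round(wi, 2) for wi in w])  # roughly proportional to hidden_w
```

In practice the reward model would be a neural network trained on human comparisons of model outputs, and the learned reward is then precisely what the outer alignment question is about: whether optimizing it actually captures what the humans intended.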

Outer Alignment vs. Inner Alignment

This is often taken to be separate from the inner alignment problem, which asks: “Is the model trying to do what humans want it to do?” In other words, can we robustly aim our AI optimizers at any objective function at all?

It should be kept in mind that you can have both inner and outer alignment failures together. It is not a dichotomy, and often even experienced alignment researchers are unable to tell them apart. This indicates that the classifications of failures according to these terms are fuzzy. Ideally, we don't think of a binary dichotomy of inner and outer alignment that can be tackled individually, but of a more holistic alignment picture that includes the interplay between both inner and outer alignment approaches.
