Outer alignment asks the question: "What should we aim our model at?" In other words, is the model optimizing for the correct reward, such that there are no exploitable loopholes? It is also known as the reward misspecification problem, because if we cannot correctly tell the AI what we want, then we have failed to specify our reward.

Overall, outer alignment as a problem is intuitive enough to understand: is the specified loss function aligned with the intended goal of its designers? Implementing this in practice, however, is extremely difficult. Conveying the full "intention" behind a human request is equivalent to conveying the sum of all human values and ethics, which is difficult in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are susceptible to Goodhart's Law: we may be unable to foresee the negative consequences of applying excessive optimization pressure to a goal that would otherwise look well specified to humans.
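As a toy illustration of this dynamic (not drawn from any particular alignment paper), the sketch below simulates a proxy reward that is correlated with the true objective near typical behavior but keeps rewarding behavior well past the point where the true objective starts falling. The specific functions and parameters are purely illustrative; best-of-N selection stands in for "more optimization pressure."

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # Hypothetical "true" objective: quality peaks at x = 1 and degrades
    # as behavior becomes more extreme in either direction.
    return -(x - 1.0) ** 2

def proxy_reward(x):
    # Misspecified proxy: "more x is better". It agrees with the true
    # objective for typical candidates (x below 1) but keeps rewarding
    # x well past the point where true value starts falling.
    return x

# Best-of-N selection as a crude stand-in for optimization pressure:
# a stronger optimizer effectively searches over more candidates.
for n_candidates in [1, 10, 100, 10_000]:
    candidates = rng.normal(loc=0.5, scale=1.0, size=n_candidates)
    best = max(candidates, key=proxy_reward)
    print(f"N={n_candidates:>6}  proxy={proxy_reward(best):6.2f}  "
          f"true value={true_value(best):7.2f}")

# Typical output shows the proxy climbing monotonically while the true
# value eventually collapses as optimization pressure increases.
```

The point of the simulation is only that a reward signal which looks reasonable over ordinary behavior can come apart from the designer's intent once it is optimized hard enough.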

To solve the outer alignment problem, we would have to make progress on sub-problems such as specification gaming, value learning, and reward shaping/modeling. Paul Christiano is a researcher who focuses on outer alignment and has proposed techniques such as Iterated Distillation and Amplification (IDA) as potential solutions to this problem.
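As a minimal sketch of what reward modeling looks like at its simplest, the following toy example fits a Bradley-Terry style reward model to simulated pairwise preferences. The setup is entirely illustrative (a hypothetical linear "true" reward over random trajectory features, with noiseless comparisons) and is not a description of IDA or of any specific published system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each trajectory is summarized by a feature vector,
# and the (unknown) human reward is a linear function of those features.
true_w = np.array([1.0, -2.0, 0.5])

def human_prefers(traj_a, traj_b):
    # Simulated human comparison: prefer the trajectory with higher true reward.
    return traj_a @ true_w > traj_b @ true_w

# Collect pairwise preference data over random trajectory pairs.
pairs = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(500)]
labels = [human_prefers(a, b) for a, b in pairs]

# Fit a Bradley-Terry reward model by gradient ascent on the log-likelihood:
# P(a preferred over b) = sigmoid(r(a) - r(b)), with r(x) = w . x.
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = np.zeros(3)
    for (a, b), pref_a in zip(pairs, labels):
        p_a = 1.0 / (1.0 + np.exp(-(a - b) @ w))
        grad += (float(pref_a) - p_a) * (a - b)
    w += lr * grad / len(pairs)

print("learned reward direction:", w / np.linalg.norm(w))
print("true reward direction:   ", true_w / np.linalg.norm(true_w))
```

The learned reward direction roughly recovers the simulated human's preferences, but the broader outer alignment worry remains: the learned model is itself only a proxy, and optimizing against it hard enough reintroduces the Goodhart-style failure described above.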
