I haven't yet had a chance to read the whole thing, but DeepMind has a new paper up from Iason Gabriel titled "Artificial Intelligence, Values and Alignment". Here's the abstract:

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

As I said, I haven't read the whole thing, but based on my skimming I'm somewhat suspicious of the move toward principles as discussed in this paper: it seems to suggest picking the method an aligned AI would use to aggregate human preferences with a higher level of specificity than I suspect is safe, given how confident we would need to be to commit to a particular choice.

Nonetheless, the paper looks interesting and covers in detail a number of topics that I haven't seen show up much in the academic literature, even if we talk about them here, so I look forward to reading it more closely and possibly having more detailed comments to offer then.