I’m Jose. I recently realized I wasn’t taking existential risk seriously enough, and in April, a year after I first applied, I started running a MIRIx group at my college. I’ll write summaries of the sessions I think are worth sharing. Most of the members are very new to FAI, so this will be partly an incentive to push upward and partly my own review process. Hopefully some of this will be helpful to others.
This one focuses on how aligning creator intent with the base objective of an AI might not be enough for outer alignment, starting with an overview of Coherent Extrapolated Volition (CEV) and its flaws. It was created in collaboration with Jacob Abraham and Abraham Francis.
From the wiki,
In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function.
In other words, CEV asks for an AI that has not only a precise model of human values, but also the meta-level understanding needed to resolve contradictions and incompleteness in those values in a friendly way.
Eliezer came to consider this line of research obsolete due to the problems it runs into, some of which make it look like the proposal merely shifts the pointer to the goals rather than specifying them. In the time we spent discussing it, we ended up with a (most likely not comprehensive) list of CEV's major flaws.
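One flaw worth making concrete: CEV assumes the extrapolated desires of humanity converge, but preference aggregation can fail to produce any stable consensus even among three agents. The classic Condorcet cycle illustrates this; the agents and their rankings below are invented purely for the example and are not part of CEV's specification:

```python
# Three agents' rankings over outcomes A, B, C (best first).
# The preferences are invented to produce the classic Condorcet cycle.
rankings = {
    "agent1": ["A", "B", "C"],
    "agent2": ["B", "C", "A"],
    "agent3": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """True if a majority of agents rank outcome x above outcome y."""
    votes = sum(1 for r in rankings.values() if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A.
# No outcome is a stable "converged" aggregate of these desires.
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")
```

Extrapolation is supposed to smooth such conflicts out ("if we knew more, thought faster..."), but nothing guarantees that idealization removes every cycle rather than creating new ones.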
I chose CEV to begin with because it specifically targets the possibility that present human values, by themselves, might not be enough.
There is little consensus on a definition of the entire alignment problem, but a large part of it, intent alignment, i.e. making sure the AI does what its programmers want it to do, is commonly split into two components: inner and outer alignment.
Inner Alignment is about making sure the AI actually optimizes what our reward function specifies. The reward function is the base objective: the objective the training process selects models for. But that search over models may produce one that pursues a proxy objective, the mesa objective, which is easier to optimize and does the job fairly well on the training distribution (think of evolution, where the base objective is reproductive fitness, while the mesa objectives include heuristics like pain aversion, status signalling, etc.). Inner Alignment is aligning the mesa objective with the base objective.
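The base/mesa split can be shown with a toy example (the gridworld setup and all names here are invented for illustration): a learned policy latches onto a proxy that matches the base objective on the training distribution, then fails under distribution shift.

```python
# Base objective: did the agent actually reach the exit cell?
def base_objective(position, exit_pos):
    return position == exit_pos

# Mesa objective (learned proxy): head for the green cell.
# In training, the exit was always green, so this scored perfectly.
def mesa_policy(cells):
    return next(pos for pos, colour in cells.items() if colour == "green")

# Training distribution: exit is the green cell, so proxy == base.
train_cells = {(0, 0): "grey", (2, 3): "green"}
train_exit = (2, 3)
assert base_objective(mesa_policy(train_cells), train_exit)

# Deployment: colours shift, the exit is no longer green.
# The proxy still "succeeds" on its own terms while failing the base objective.
deploy_cells = {(0, 0): "green", (2, 3): "grey"}
deploy_exit = (2, 3)
assert not base_objective(mesa_policy(deploy_cells), deploy_exit)
```

Nothing in training distinguishes the two objectives, which is exactly what makes the misalignment hard to detect before deployment.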
Outer Alignment is about making sure we like the reward function we’re training the AI on. That is, if we had a model that solved inner alignment and was actually optimizing the objective it’s given, would we like that model? This is the centre of much classical alignment discussion (the paperclip-maximizer thought experiment, for example).
Recall that CEV addresses the possibility that aligning our intent with the base objective is insufficient: a model that optimizes an objective we like can still fail in the limit, as it runs into inconsistencies or other problems with our value systems. Resolving those problems in a friendly way may be beyond a base human, or a model of one, at test time. That is far from certain, but I think it is probable enough in at least a few instances to be a problem.
Note: Epistemic status on the following is speculative at best, and is based on what posts and papers we could read in the time we had.
Based on my limited understanding of Outer Alignment, its usual formulation doesn’t require aligning the AI with the values we would hold at the limit. Some of the proposals we looked at also ran into this problem.
Imitative amplification, for example, trains a model to imitate a human who has access to the model. With oversight using transparency tools to catch deceptive or otherwise harmful behaviour, it is plausibly outer aligned. However, a base human may not be able to reliably resolve, in a friendly way, the contradictions and inconsistencies they would face at the limit. That’s fairly uncharted territory, and might require the human model to diverge too far from the human template. I don’t think the oversight would be of much help here either, because these problems need not come up as early as training time. It’s also possible that nearly any resolution would seem misaligned to us.
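The amplification loop itself is simple to sketch. Below is a schematic, not the actual proposal's implementation: the human and the distillation step are reduced to stubs, and all names are invented. The structure it shows is the real one, though: train the model M to imitate Amp(M), a human answering questions with access to M.

```python
# Schematic sketch of the imitative-amplification training loop.
# Everything here is a stub standing in for expensive real components.

def human_with_access(question, model):
    """Stub for a human who may consult the model on subquestions."""
    sub_answer = model(f"subquestion of: {question}")
    return f"human answer to {question!r} using {sub_answer!r}"

def train_to_imitate(targets):
    """Stub distillation step: the new model just replays its targets."""
    return lambda q: targets.get(q, f"model guess for {q!r}")

def model(q):  # initial weak model, before any amplification
    return f"model guess for {q!r}"

questions = ["Q1", "Q2"]
for _ in range(3):  # a few amplification/distillation rounds
    # Amp(M): the human, consulting the current model, produces targets...
    targets = {q: human_with_access(q, model) for q in questions}
    # ...and the next model is trained to imitate Amp(M).
    model = train_to_imitate(targets)
```

The worry in the paragraph above lives inside `human_with_access`: the whole loop bottoms out in a base human's judgment, including on value conflicts the human has never faced.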
Some proposals bypass this problem altogether, but at a steep cost. STEM AI, for example, avoids value modelling entirely, but does so by giving up the class of use cases where values would be relevant.
It’s possible that we wouldn’t need to worry about this problem at all. Perhaps the inconsistencies will be resolved during training, or, instead of resolving them, the AI could adopt them as new value axioms. The former may even be the likely scenario, but the alternative still carries real probability, especially in realistic settings where training until hypothetical future value conflicts are resolved isn’t competitive. And treating inconsistencies as new axioms could be dangerous, and might not even solve the core problem: each new axiom can spawn further inconsistencies, in Gödelian fashion.
Endnote: I hesitated for a while before posting this because it felt like something that must have been addressed already. I didn’t find it addressed in any of the posts we went through, though, so I just peppered this with what was possibly an irksome number of uncertainty qualifiers. Whatever we got wrong, tell us.