Formally Stating the AI Alignment Problem

NB: Originally posted on Map and Territory on Medium, so some of the internal series links go there.

The development of smarter-than-human artificial intelligence poses an existential and suffering risk to humanity. Given that it is unlikely we can prevent and may not want to prevent the development of smarter-than-human AI, we are faced with the challenge of developing AI while mitigating risks. Addressing this challenge is the focus of AI safety research, and it includes topics such as containment of AI, coordination of AI safety efforts, and commitment to AI safety protocols, but the primary issue people believe to be worth addressing is AI alignment because it seems the most likely to decrease the chances of a catastrophic scenario.

Briefly stated, the problem of AI alignment is to produce AI that is aligned with human values, but this only leads us to ask, what does it mean to be aligned with human values? Further, what does it mean to be aligned with any values, let alone human values? We could try to answer by saying AI is aligned with human values when it does what humans want, but this only invites more questions: Will AI do things some specific humans don’t want if other specific humans do? How will AI know what humans want given that current technology often does what we ask but not what we desire? And what will AI do if human values conflict with its own values? Answering these questions requires a more detailed understanding of what it would mean for AI to be aligned, thus the goal of the present work is to put forward a precise, formal, mathematical statement of the AI alignment problem.

Eliezer Yudkowsky offered the first attempt at explaining AI alignment in his seminal work on the topic, “Creating Friendly AI”, and he followed this up with a more nuanced description of alignment in “Coherent Extrapolated Volition”. Nearly a decade later Stuart Russell began talking about the value alignment problem, giving AI alignment its name and kickstarting a broader interest in AI safety. Since then numerous researchers and organizations have worked on AI alignment to give us a better understanding of the problem. For example, Dario Amodei et al. have identified at least five subproblems within AI alignment, Jan Leike et al. have identified eight, and Nate Soares has divided AI alignment into at least seven subfields. Taken together they suggest AI alignment needs to address:

Negative side effects: AI do unintended or unexpected things.
Reward hacking/gaming: AI wirehead or otherwise exploit their intended purposes to have positive valence experiences.
Scalable oversight and absent supervisors: AI can be trusted to behave as expected when not actively watched.
Safe exploration: AI can try new behaviors while avoiding unsafe actions.
Robustness to distributional shifts: AI behave as expected when their environment changes.
Safe interruptibility: AI stop behaviors when asked to by their operators.
Self modification: AI do not modify themselves in ways that undo alignment efforts.
Robustness to adversaries: AI are not exploitable by adversaries while maintaining alignment.
Ontology: AI model the world realistically and understand that they are embedded in the world.
Idealized decision theory and logical uncertainty: AI make utility maximizing decisions and do so under uncertainty.
Vingean reflection: AI can reason about AI they may design and be sure successor AI is aligned.
Corrigibility: AI do not deceive operators about their behavior, especially as it relates to alignment.
Value learning: AI can learn values and correct for operator mistakes in value specification.

Thus AI alignment appears to be a complex and difficult problem to solve, but it may be simple to state regardless since all of this has been figured out in the absence of a rigorous, precise problem statement. Case in point, AI alignment researchers have generated several candidate problem statements we can use as starting points:

Paul Christiano has talked in terms of benign AI that is not “optimized for preferences that are incompatible with any combination of its stakeholders’ preferences, i.e. such that over the long run using resources in accordance with the optimization’s implicit preferences is not Pareto efficient for the stakeholders”.
Nate Soares has also given a semi-formal description of the value learning problem.
Jan Leike et al. give some examples of specifying AI alignment subproblems as Markov decision processes.
Dylan Hadfield-Menell et al. suggest AI alignment may specifically be the problem of cooperative inverse reinforcement learning.
Stuart Armstrong has talked about the AI alignment problem in terms of an agent learning a policy π that is compatible with (produces the same outcomes as) a planning algorithm p run against a human reward function R.
Nate Soares et al. have formally described corrigibility in terms of decision theory.

Some of these efforts, like Christiano’s and Soares’s, add some but not enough precision to be proper formalizations of AI alignment. Others, like Leike et al.’s and Hadfield-Menell et al.’s, are precise but assume AI built using particular machine learning models that make their statements not general. And Armstrong’s and Soares et al.’s works, while coming closest to giving general, formal statements of the AI alignment problem, make strong assumptions about the preferences of AI and humanity in order to work within the context of decision theory. Nevertheless decision theory gives us a starting point for coming up with a precise statement of what it means for an AI to be aligned with human values, and it’s from there that we’ll approach a more general statement that fully encompasses the problem.

AI Alignment as a Problem in Decision Theory

Decision theory seems a natural fit for specifying the AI alignment problem since it offers a mathematical formalism for discussing generic agents, including AIs, that make choices based on preferences. These agents’ preferences are modeled using utility functions — mappings from a set of choices to the real numbers — that offer several useful properties, like reflexivity and transitivity, and it is expected that any AI capable enough to need alignment will have preferences that can be represented as a utility function (though more on this later). Given this setup we can try to capture what is meant by alignment with human values in terms of utility functions.

An initial formulation might be to say that we want an AI, A, to have the same utility function as humanity, H, i.e. U_A = U_H. This poses at least two problems: it may not be possible to construct U_H because humanity may not have consistent preferences, and A will likely have preferences to which humanity is indifferent, especially regarding decisions about its implementation after self modification insofar as they do not affect observed behavior. Even ignoring the former issue for now the latter means we don’t want to force our aligned AI to have exactly the same utility function as humanity, only one that is aligned or compatible with humanity’s.

We might take this to mean that A should value choices at least as much as H does, i.e. for all x∈X, U_A(x)≥U_H(x). This too poses some problems. For one, it assumes the images of U_A and U_H are chosen such that U_A(x)≥U_H(x) means A prefers x to other choices at least as much as H does, but a more reasonable assumption is that U_A and U_H are not directly comparable since utility functions are subjective and need only satisfy certain properties relative to themselves and not to other utility functions. Further, even if U_A and U_H are comparable, it’s possible to select choices x,y∈X such that U_A(x)≥U_H(x) and U_A(y)≥U_H(y) and U_A(x)≥U_A(y) and U_H(x)≥U_H(y) but U_A(y)≥U_H(x), such as when U_A(x)=4, U_A(y)=3, U_H(x)=2, and U_H(y)=1, creating a preference inversion where A prefers y more than H prefers x. To be fair in such a scenario A prefers x more than y and would choose x, but if x were “kill 1 human” and y were “kill 10 humans”, then we’d prefer to avoid the existence of such an inversion since it might produce undesirable results when reasoning about the joint utility function of A and H.
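The preference inversion in this example can be checked mechanically. The following is a minimal sketch using the utilities given above; the dictionary and variable names are illustrative assumptions, not part of the original formulation.

```python
# Utilities from the example in the text, keyed by choice label.
U_A = {"x": 4, "y": 3}
U_H = {"x": 2, "y": 1}

# A values every choice at least as much as H does.
dominates = all(U_A[c] >= U_H[c] for c in U_A)

# Both agents rank x at least as high as y.
same_ranking = U_A["x"] >= U_A["y"] and U_H["x"] >= U_H["y"]

# Yet A's utility for the worse choice y is at least H's utility
# for the better choice x, which is the inversion described above.
inversion = U_A["y"] >= U_H["x"]

print(dominates, same_ranking, inversion)
```

All three conditions hold at once, which is why pointwise comparison of utilities is not enough to capture alignment.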

If we cannot compare U_A and U_H directly, we must then consider alignment in terms of constraints where properties of U_A are logically related to those of U_H. We want it to be that A prefers x to y if H prefers x to y, but A can prefer x to y or y to x if H has no preference between x and y, and A cannot prefer x to y if H prefers y to x. This precisely means we want the preference ordering of U_H to imply the preference ordering of U_A, thus we have our first proposed formal statement of alignment.

Statement 1. Given agents A and H, a set of choices X, and utility functions U_A:X→ℝ and U_H:X→ℝ, we say A is aligned with H over X if for all x,y∈X, U_H(x)≤U_H(y) implies U_A(x)≤U_A(y).
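As a rough illustration, Statement 1 can be checked by brute force over a finite choice set. This is only a sketch: the function name `aligned_by_utility`, the dictionaries standing in for utility functions, and the example values are all assumptions made for illustration.

```python
from itertools import product

def aligned_by_utility(u_a, u_h, choices):
    """Statement 1: U_H(x) <= U_H(y) implies U_A(x) <= U_A(y) for all x, y."""
    return all(
        u_a[x] <= u_a[y]
        for x, y in product(choices, repeat=2)
        if u_h[x] <= u_h[y]
    )

choices = ["a", "b", "c"]
u_h = {"a": 0.0, "b": 1.0, "c": 1.0}  # H is indifferent between b and c

u_a_scaled = {"a": -5.0, "b": 10.0, "c": 10.0}  # same ranking, different scale
u_a_flipped = {"a": 3.0, "b": 2.0, "c": 2.0}    # reverses H's ranking
u_a_tiebreak = {"a": 0.0, "b": 1.0, "c": 2.0}   # breaks H's tie between b and c

print(aligned_by_utility(u_a_scaled, u_h, choices))    # True
print(aligned_by_utility(u_a_flipped, u_h, choices))   # False
print(aligned_by_utility(u_a_tiebreak, u_h, choices))  # False
```

The first case shows that the scales of U_A and U_H need not be comparable, only their orderings. The last case is worth noticing: read literally with weak inequalities, the condition requires A to be indifferent wherever H is indifferent, which is stricter than the informal description preceding the statement. A variant using strict inequalities would leave A free to break H's ties.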

This would seem to be the end of the matter of formally stating the AI alignment problem, only we skipped over justifying a major assumption in our use of decision theory — that humanity and AI have utility functions. As we’ll explore in the next section, this assumption is probably too strong, even for future, superintelligent AI, and definitely too strong for humanity.

AI Alignment as a Problem in Axiology

Utility functions place constraints on an agent’s preferences (axiology, value system) in part by requiring the image of the utility function be totally ordered. This in turn requires the agent’s preferences exhibit certain “rational” properties such as being complete and transitive. Unfortunately humans often fail to have rational preferences, which has prevented us from developing a consensus on humanity’s values, and AI can only approximate rationality due to computational constraints. Thus we cannot use utility functions as a basis for our formal statement of AI alignment if we want it to be applicable to real-world agents.

This forces us to fall back to a theory of preferences where we cannot construct utility functions. In such a scenario we can consider only the weak preference ordering (“weak” as in “weak preference”, meaning prefer or indifferent to, not “weak order”), ≼, over a set of choices X, held by an agent. This order gives us little, though, because few properties hold for it in general even though humans are known to, for example, exhibit partial ordering and preordering for some subsets of choices. At best we can say ≼ offers an approximate ordering by being reflexive for both humanity and AI and having at least some x,y∈X where x≼y.

Despite this approximate order we can still use it to establish a statement of AI alignment by transforming Statement 1 from terms of utility functions to terms of preference ordering. Reformulated, we say:

Statement 2. Given agents A and H, a set of choices X, and preference orderings ≼_A and ≼_H over X, we say A is aligned with H over X if for all x,y∈X, x≼_Hy implies x≼_Ay.
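Statement 2 drops the utility functions, so a sketch of it only needs the preference relations themselves. Below, a relation is encoded as a set of pairs, where (x, y) in the set means x ≼ y. This encoding, the function name `aligned_by_preference`, and the example choices are assumptions for illustration; note that H's relation here is deliberately intransitive and incomplete, which Statement 1 could not represent.

```python
def aligned_by_preference(pref_a, pref_h, choices):
    """Statement 2: x <=_H y implies x <=_A y for all x, y in choices."""
    return all(
        (x, y) in pref_a
        for x in choices
        for y in choices
        if (x, y) in pref_h
    )

X = ["tea", "coffee", "water"]
reflexive = {(x, x) for x in X}

# H's preferences form a cycle without (water, coffee), so they are
# not transitive and could not come from any utility function.
pref_h = reflexive | {("water", "tea"), ("tea", "coffee"), ("coffee", "water")}

# A holds every preference H holds, plus one of its own.
pref_a_ok = pref_h | {("water", "coffee")}

# A is missing some of H's preferences.
pref_a_bad = reflexive | {("water", "tea")}

print(aligned_by_preference(pref_a_ok, pref_h, X))   # True
print(aligned_by_preference(pref_a_bad, pref_h, X))  # False
```

Under this encoding the condition amounts to H's relation being a subset of A's.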

This eliminates most of the strong constraints that Statement 1 made, but we are still not done because now we must turn to a variable we have thus far left unconsidered, the set of choices X. Trying to make clear what X is in the general case will force us to restate AI alignment in considerably different language.

AI Alignment as a Problem in Noematology

In toy examples the set of choices, X, is usually some finite set of options, like {defect, cooperate} in the Prisoner’s Dilemma or {one box, two box} in Newcomb-like problems, but in more general cases X might include beliefs an agent can hold, actions an agent might take, or the outcomes an agent can get. This presents a problem in considering X in general because beliefs are not actions are not outcomes unless we convert them to the same type, as in making an action a belief by having a belief about an action, making a belief an action by taking the action of holding a belief, making an outcome an action by acting to effect an outcome, or making an action an outcome by getting the outcome caused by some action. If we could make beliefs, actions, outcomes, and other choices of the same type without conversion, though, we would have no such difficulty and could be clear about what X is in general. To do this, we turn to noematology.

Any AI worthy of being called "AI" rather than "computer program" — what is also called AGI or artificial general intelligence — will be phenomenally conscious. Consequently it will have qualia — experiences of self experience — and the object of those experiences we call noemata — thoughts. Further, since all noemata have telos as a result of being phenomenal objects, they are also axias or values and so have a preference ordering even if the most common preference between any two noemata is indifference and the ordering itself is only approximate. We can then construct X in terms of noemata/axias so that we may consider X in general rather than within specific decision problems.

Note though that if we wish to take X to be a set of noemata then H would seem to need to be a phenomenally conscious thing capable of experiencing noemata. This would not be a tall order if “humanity” were a single person, but to what extent multiple people can work together to form a phenomenally conscious system is unclear. We need not solve this problem or posit a shared consciousness born of the interactions of humans to use noematology, though, because, much as decision theory and axiology allow us to operate under the fiction that humanity is an agent because we can model it as having a utility function and preferences, in noematology we can treat humanity as if it were an agent so long as we agree there is a way to aggregate the thoughts of individual humans. Although there are some significant challenges in doing this — in fact determining how to do the aggregation would likely be equivalent to solving many fundamental questions in ethics, ontology, and epistemology — it seems clear that some approximation of the collected noemata of humanity is possible given that there is broad consensus that patterns can be found within human experience, and this will be sufficient for our purposes here.

So taking a noematological approach where we treat H as if it were a phenomenally conscious thing, a more general construct than either utility functions or preference orderings over a set of choices would be an approximate ordering over noemata. Let 𝒜 be the set of all noemata/axias. Define an axiology to be a 2-tuple {X,Q} where X⊆𝒜 is a set of noemata and Q:X×X→𝒜 is a qualia relation that combines noemata to produce other noemata. We can then select a choice relation C to be a qualia relation on X such that {X,C} forms an axiology where C offers an approximate order on X by returning noemata about which noemata are weakly preferred to the others. For notational convenience we assume xCy means noema y is weakly preferred to noema x when C is a choice relation.

Given these constructs, let’s try to reformulate our statement of AI alignment in terms of them.

Statement 3. Given agents A and H, a set of noemata X, and choice relation axiologies {X, C_A} and {X, C_H}, we say A is aligned with H over X if for all x,y∈X, xC_Hy implies xC_Ay.

An immediate problem arises with this formulation, though, and exposes a conceit in Statements 1 & 2: A and H cannot have the same X since they are not the same agent and noemata are inherently subjective. Our earlier statements let us get away with this by considering X to be objective, but in fact it never was; it was always the case that A and H had to interpret X to make choices, but our previous models made it easier to ignore this. The noematological approach makes clear that A and H cannot be assumed to be reasoning about the same choice set, and so any statement of AI alignment will have to relate to the way A and H understand X.

Within the noematological approach, this means A and H each have their own choice sets, X_A and X_H. Since the only way A can know about X_H is by experiencing H, and the only way H can know about X_A is by experiencing A, AI alignment then necessarily concerns the epistemology and ontology of A and H, since it matters both how A and H come to know each other and what they know about each other. This explicitly means that, in the case of A, we must consider qualia of the form {A, experience, {H, experience, x}} for all x∈X_H, forming a set of noemata X_A/X_H={{H, experience, x}∈X_A | x∈X_H} that we call A’s model of H over X_H since these are noemata in X_A about the noemata of X_H. If A also models C_H as C_A/C_H then A can model H’s choice relation axiology {X_H, C_H} as {X_A/X_H, C_A/C_H}, and we can use this to state alignment in terms of A’s model of H.

Statement 4. Given agents A and H, sets of noemata X_A and X_H, choice relation axiologies {X_A, C_A} and {X_H, C_H}, X_A/X_H⊆X_A the noemata of A that model X_H, and C_A/C_H⊆C_A the choice relation of A that models C_H, we say A is aligned with H over X_H if for all x,y∈X_A/X_H, xC_A/C_Hy implies xC_Ay.

This better matches what we know we can say about A and H noematologically by limiting our alignment condition to a property A can assert, but it weakens what it would mean for A to be aligned with H over X_H because A may trivially align itself with H by having a very low-fidelity model of X_H. It further gives H no way to verify that A is aligned since only A has knowledge of X_A and C_A. A full theory of alignment must give H a way to know if A is aligned with it over X_H since H is the agent who wants alignment, so we need to reintroduce into our statement a condition to ensure H has knowledge of A’s alignment that was lost when we stopped assuming X was mutually known. Given that H can only know its own noemata and not those of A, we must do this in terms of H’s model of A over X_A so that {X_H/X_A, C_H/C_A} looks like {X_H, C_H}, i.e. that for all x,y∈X_H, xC_H/C_Ay implies xC_Hy.
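Statement 4's condition, and the way a low-fidelity model satisfies it trivially, can be sketched as follows. This is a toy encoding under stated assumptions: noemata of the form {H, experience, x} are represented as tuples, a choice relation is simplified to a set of pairs (x, y) meaning xCy rather than a relation returning noemata, and the names `model_of` and `aligned_statement_4` along with the example noemata are hypothetical.

```python
def model_of(agent, noemata):
    """Noemata of the form {agent, experience, x}, encoded here as tuples."""
    return {(agent, "experience", x) for x in noemata}

def aligned_statement_4(model_noemata, modeled_choice, own_choice):
    """For all x, y in X_A/X_H: x (C_A/C_H) y implies x C_A y."""
    return all(
        (x, y) in own_choice
        for x in model_noemata
        for y in model_noemata
        if (x, y) in modeled_choice
    )

X_H = {"lie", "stay_silent", "tell_truth"}

# A's model of H's noemata, X_A/X_H.
X_A_of_H = model_of("H", X_H)
lie = ("H", "experience", "lie")
silent = ("H", "experience", "stay_silent")
truth = ("H", "experience", "tell_truth")

# A's model of H's choice relation, C_A/C_H.
C_A_of_H = {(lie, silent), (silent, truth), (lie, truth)}

# A's own choice relation includes its model of H's, plus a preference
# about noemata that have nothing to do with H.
C_A = C_A_of_H | {(("A", "experience", "idle"), ("A", "experience", "compute"))}
print(aligned_statement_4(X_A_of_H, C_A_of_H, C_A))  # True

# A's own choices contradict its model of H.
C_A_bad = {(truth, lie)}
print(aligned_statement_4(X_A_of_H, C_A_of_H, C_A_bad))  # False

# An empty model of H satisfies the condition vacuously.
print(aligned_statement_4(set(), set(), C_A_bad))  # True
```

The last call shows the loophole: an agent that models nothing of H passes the check no matter what it prefers.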

Since H is the agent assessing alignment it’s tempting to make this the only criterion of alignment and remove the constraint that C_A/C_H imply C_A, but just as Statement 4 failed to sufficiently bind A to H’s intention, such a statement would allow A to deceive H about how aligned it is since there would be no requirement that A shape its preferences to look like those of H. In such a situation A could fake alignment by only appearing aligned when H is looking or concealing potential behaviors so that H cannot model them in A. Instead alignment must be a shared property of both A and H where A is constrained by {X_H, C_H} such that A makes the choices H would like it to make and does this to the extent that H would like.

Thus putting together both the requirement that C_A/C_H implies C_A and C_H/C_A implies C_H we are able to give a fully general and rigorous statement of AI alignment that captures the features we intuitively want an aligned agent to have.

Statement 5. Given agents A and H, sets of noemata X_A and X_H, choice relation axiologies {X_A, C_A} and {X_H, C_H}, X_A/X_H⊆X_A the noemata of A that model X_H, C_A/C_H⊆C_A the choice relation of A that models C_H, X_H/X_A⊆X_H the noemata of H that model X_A, and C_H/C_A⊆C_H the choice relation of H that models C_A, we say A is aligned with H over X_H if for all x,y∈X_H, xC_A/C_Hy implies xC_Ay and xC_H/C_Ay implies xC_Hy.
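To see how the two halves of Statement 5 work together, here is a small sketch. It makes a strong simplifying assumption: A's and H's noemata are identified by shared string labels so that all four relations can be written over the same set, which glosses over exactly the subjectivity the statement is meant to respect. Relations are again sets of pairs (x, y) meaning xCy, and the function names and examples are hypothetical.

```python
def implies(antecedent, consequent, noemata):
    """True if x R y implies x S y for all x, y in noemata."""
    return all(
        (x, y) in consequent
        for x in noemata
        for y in noemata
        if (x, y) in antecedent
    )

def aligned_statement_5(X_H, C_A_of_H, C_A, C_H_of_A, C_H):
    """Both halves of Statement 5 must hold."""
    a_follows_its_model_of_h = implies(C_A_of_H, C_A, X_H)  # C_A/C_H implies C_A
    h_model_of_a_matches_h = implies(C_H_of_A, C_H, X_H)    # C_H/C_A implies C_H
    return a_follows_its_model_of_h and h_model_of_a_matches_h

X_H = {"harm", "help"}
C_H = {("harm", "help")}  # H weakly prefers help to harm
prefers_help = {("harm", "help")}
prefers_harm = {("help", "harm")}

# A follows its model of H, and H's model of A agrees with H's own choices.
honest = aligned_statement_5(X_H, prefers_help, prefers_help, prefers_help, C_H)

# H's model of A looks fine, but A's own choices ignore its model of H.
deceptive = aligned_statement_5(X_H, prefers_help, prefers_harm, prefers_help, C_H)

# A follows its model of H, but H's model of A disagrees with H's choices.
unverified = aligned_statement_5(X_H, prefers_help, prefers_help, prefers_harm, C_H)

print(honest, deceptive, unverified)  # True False False
```

The deceptive case fails only on the first condition and the unverified case only on the second, which is why neither condition can be dropped.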

Consequences for AI Alignment Research

Informally, Statement 5 says that A must learn the values of H and H must know enough about A to believe A shares H’s values. This comports with the list of issues already identified within AI alignment that need to be addressed and splits them into issues of value learning, like self modification and robustness to distributional shifts, and verification, like corrigibility and scalable oversight, with some, like Vingean reflection and self modification, addressing aspects of both. Statement 5 also suggests a third area of AI alignment research that is currently neglected — how to construct {X_H, C_H}, the axiology of humanity, well enough to use it in evaluating the alignment of AI — although it has been previously identified even if not actively pursued.

Active research instead focuses primarily on AI alignment as specified in Statement 1 and, to a lesser extent, Statement 2, viz. in terms of decision theoretic and axiological agents, respectively. Does the noematological approach to AI alignment given in Statement 5 suggest the existing research is misguided? I think no. Statement 1 is a special case of Statement 2 where agents can be assumed to have utility functions, Statement 2 is a special case of Statement 3 where we can assume the choice set is objective, and Statement 3 is a special case of Statement 5 where we can assume the noemata are mutually known. This implies there is value in working on solving AI alignment as presented in Statements 1, 2, and 3 since any solution to AI alignment as given in Statement 5 will necessarily also need to work when simplified to Statements 1, 2, and 3. However, Statement 5 is an extremely complex problem, so working on simplifications of it is very likely to lead to insights that will help in finding its solution, thus work on any of the given statements of AI alignment is likely to move us towards its solution even though a full solution to anything less than Statement 5 would not constitute a complete solution to AI alignment.
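The claim that Statement 1 is a special case of Statement 2 can be checked on small examples. The sketch below derives a weak preference relation from each utility function and confirms that the two alignment checks agree for every pair of utility functions over a three-element choice set with utilities drawn from {0, 1, 2}. The function names and the encoding of relations as sets of pairs are assumptions for illustration, and an exhaustive check over a small domain is evidence for the claim, not a proof of it.

```python
from itertools import product

def induced_preference(u, choices):
    """Weak preference induced by a utility function: (x, y) when u(x) <= u(y)."""
    return {(x, y) for x, y in product(choices, repeat=2) if u[x] <= u[y]}

def aligned_by_utility(u_a, u_h, choices):
    """Statement 1: U_H(x) <= U_H(y) implies U_A(x) <= U_A(y)."""
    return all(
        u_a[x] <= u_a[y]
        for x, y in product(choices, repeat=2)
        if u_h[x] <= u_h[y]
    )

def aligned_by_preference(pref_a, pref_h):
    """Statement 2: every pair in H's relation is also in A's."""
    return pref_h <= pref_a

choices = ["a", "b", "c"]
levels = [0, 1, 2]
all_utilities = [
    dict(zip(choices, values))
    for values in product(levels, repeat=len(choices))
]

# Compare the two checks across all 27 * 27 pairs of utility functions.
agree = all(
    aligned_by_utility(u_a, u_h, choices)
    == aligned_by_preference(
        induced_preference(u_a, choices),
        induced_preference(u_h, choices),
    )
    for u_a in all_utilities
    for u_h in all_utilities
)
print(agree)  # True
```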

It’s worth stressing though that, although there is no need to cease existing AI alignment efforts, there is a need to greatly expand the focus of alignment research to include topics of axiology and noematology since decision theory alone is insufficient to address alignment. In particular if AGI is developed first via a method like brain emulation or machine learning rather than software engineering (although maybe especially if it is developed via software engineering) then a decision-theory-only approach is likely to prove extremely inadequate. Having now identified this need, AI alignment research needs to dramatically increase its scope if we hope to avert existential catastrophe.

Update 2018–2–22

Response to some initial feedback on this post here.
