
When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.

## Analogy

Consider a human assistant who is trying their hardest to do what H wants.

I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.

“Aligned” doesn’t mean “perfect”:

• They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time.
• They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
• They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
• They may build an unaligned AI (while attempting to build an aligned AI).

I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won’t make them more aligned.

(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)

## Clarifications

• The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.
• An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn’t part of my definition of alignment except insofar as it’s part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
• An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask.
• This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we’ve made significant progress.
• One reason the definition is imprecise is that it’s unclear how to apply the concepts of “intention,” “incentive,” or “motive” to an AI system. One naive approach would be to equate the incentives of an ML system with the objective it was optimized for, but this seems to be a mistake. For example, humans are optimized for reproductive fitness, but it is wrong to say that a human is incentivized to maximize reproductive fitness.
• “What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.

## Postscript on terminological history

I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).

I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.

I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.

This post was originally published here on 7th April 2018.

The next post in this sequence will be published on Saturday: "An Unaligned Benchmark" by Paul Christiano.

Tomorrow's AI Alignment Sequences post will be the first in a short new sequence of technical exercises from Scott Garrabrant.

## Comments

Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:

• It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
• Humans seem t

I agree with habryka that this is a really good explanation. I also agree with most of your pros and cons, but for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of "AI alignment" and into the "competence" part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.

In other words, the motivation-competence decomposition seems potentially very useful to me as a way to break down a larger problem into smaller parts so it can be solved more easily, but I don't agree that the urgent/not-urgent divide lines up neatly with the motivation/competence divide.

Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition), even using "AI alignment" in a context where it's clear that I'm using Paul's definition gives me the feeling that I'm implicitly agreeing to his understanding of how various subproblems should be prioritized.

**Paul Christiano:** I think it's bad to use a definitional move to try to implicitly prioritize or deprioritize research. I think I shouldn't have written: "I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment."

That said, I do think it's important that these seem like conceptually different problems and that different people can have different views about their relative importance---I really want to discuss them separately, try to solve them separately, and compare their relative values (and separate that from attempts to work on either). I don't think it's obvious that alignment is higher priority than these problems, or than other aspects of safety. I mostly think it's a useful category to be able to talk about separately.

In general I think that it's good to be able to separate conceptually separate categories, and I care about that particularly much in this case because I care particularly much about this problem. But I also grant that the term has inertia behind it, and so choosing its definition is a bit loaded, and so someone could object on those grounds even if they bought that it was a useful separation.

(I think that "defending their values against manipulation from other AIs" wasn't included under any of the definitions of "alignment" proposed by Rob in our email discussion about possible definitions, so it doesn't seem totally correct to refer to this as "moving" those subproblems, so much as there already existing a mess of imprecise definitions, some of which included and some of which excluded those subproblems.)
> Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition)

I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using "AI alignment" to refer to what Bostrom calls the "second principal-agent problem" (he objected to my use of "control"). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that---if the agent is trying to do what the principal wants, it seem like you've solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.

In practice, essentially all of MIRI's work seems to fit within this narrower definition, so I'm not too concerned at the moment with this practical issue (I don't know of any work MIRI feels strongly about that doesn't fit in this definition). We had a thread about this after it came up on LW in April, where we…

**Wei Dai:** Note that Arbital defines ["AI alignment"](https://arbital.com/p/ai_alignment/) as: … and ["total alignment"](https://arbital.com/p/total_alignment/) as: … I think this clearly includes the kinds of problems I'm talking about in this thread. Do you agree?

Also supporting my view is the history of "Friendliness" being a term that included the problem of better understanding the user's values (as in CEV), and then MIRI giving up that term in favor of "alignment" as an apparently exact synonym. See [this MIRI post](https://forum.effectivealtruism.org/posts/GxmJ2ntyMiaG2PPSu/miri-2017-fundraiser-and-strategy-update), which talks about the "full alignment problem for fully autonomous AGI systems" and links to Arbital.

I think you may have misunderstood what I meant by "practical issue". My point was that if you say something like "I think AI alignment is the most urgent problem to work on", the listener could easily misinterpret you as meaning "alignment" in the MIRI/Arbital sense. Or if I say "AI alignment is the most urgent problem to work on" in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning "alignment" in your sense.

Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don't know how to do.)
**Paul Christiano:** I think MIRI's first use of this term was [here](https://intelligence.org/files/TechnicalAgenda.pdf), where they said: "We call a smarter-than-human system that reliably pursues beneficial goals 'aligned with human interests' or simply 'aligned,'" which is basically the same as my definition. (Perhaps slightly weaker, since "do what the user wants you to do" is just one beneficial goal.) [This talk](https://intelligence.org/wp-content/uploads/2016/10/fundamental-difficulties-handout.pdf) never defines alignment, but the slide introducing the big picture says "Take-home message: We're afraid it's going to be technically difficult to point AIs in an intuitively intended direction", which also really suggests it's about trying to point your AI in the right direction.

The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field "alignment" and then defining it to be "whatever is important" as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing.

All of the topics in the "AI alignment" domain (except for mindcrime, which is borderline) fit under the narrower definition; the list of alignment researchers are all people working on the narrower problem. So I think the way this term is used in practice basically matches this narrower definition.

As I mentioned, I was previously happily using the term "AI control." Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me. I don't think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn't seem to be matching up with reality in any particular way, except insofar as it's capturing the problems that a certain group of people work on. I don't really see any argument in favor except the historical precedent,
**Wei Dai:** But the page includes: … which seems to be outside of just "pointing an AI in a direction".

I think so, at least for certain kinds of predictions that seem especially important (i.e., may lead to x-risk if done badly); see [this Arbital page](https://arbital.com/p/Vingean_reflection/), which is under [AI Alignment](https://arbital.com/explore/2v): …

It seems to me that Rohin's proposal of distinguishing between "motivation" and "capabilities" is a good one, and then we can keep using "alignment" for the set of broader problems that are in line with the MIRI/Arbital definition and examples. It seems fine to me to include 1) problems that are greatly exacerbated by AI and 2) problems that aren't caused by AI but may be best solved/ameliorated by some element of AI design, since these are problems that AI researchers have a responsibility over and/or can potentially contribute to. If there's a problem that isn't exacerbated by AI and does not seem likely to have a solution within AI design, then I'd not include that.

Sure, agreed.
**Rohin Shah:** Yeah, that seems right. I would probably defend the claim that motivation contains the most urgent part in the same way that Paul has done in the past---it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values, and that it should not do irreversible high-impact actions until then. I'm less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can't take your time to become more competent, but that seems like a further-away problem to me and so less urgent.

(I don't think I have much to add over the discussions you and Paul have had in the past, but I'm happy to clarify my opinion if it seems useful to you---perhaps my way of stating things will click where Paul's way didn't, idk. Or I might have different opinions and not realize it.)

I would support having this simply as a decomposition, without also packing in the implication that motivation/competence corresponds to urgent/not-urgent, though I suspect it is quite hard to do that now.

> I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t

I would highly welcome that. BTW if you see me argue with Paul in the future (or in the past) and I seem to be not getting something, please feel free to jump in and explain it a different way. I often find it easier to understand one of Paul's ideas from someone else's explanation.

> it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values

Yes, that seems easy, but actually helping seems much harder.

> and that it should not do irreversible high-impact actions until then

How do you determine what is "high-impact" before you have a utility function? Even "reversible" is relative to a utility function, right? It doesn't mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?

It seems to me that "avoid irreversible high-impact actions" would only work if one had a small amount of uncertainty over one's utility function, in which case you could just avoid actions that are considered "irreversible high-impact" by

How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic "power corrupts" problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

My position on this (that might be clear…

**Paul Christiano:** If you think this risk is very large, presumably there is some positive argument for why it's so large? That seems like the most natural way to run the argument. I agree it's not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.

In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: "(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don't have sufficiently good understanding." (Each of those claims obviously requires more argument, etc.)

One version of the case for worrying about value corruption would be:

• It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
• It may be that historical variation is itself problematic, and we care mostly about our particular values. Or it may be that values are "hardened" against certain kinds of environment shift that occur in nature, and that they will go to some lower "default" level of robustness under new kinds of shifts. Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
• If so, the rate of change in subjective time could be reasonably high---perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn't endorsed for decision-theoretic reasons / hardened against).
• It's pl
**Wei Dai:** Yeah, I didn't literally mean that I don't have any arguments, but rather that we've discussed it in the past and it seems like we didn't get close to resolving our disagreement. I tend to think that Aumann agreement doesn't apply to humans, and it's fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don't think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there are too many people working on this problem, which might never actually happen).

I don't understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can't do anything about it, so 2% is how much you expect we can potentially "save" from value drift/corruption? Or are you taking an anti-realist position and saying something like: if someone doesn't care about averting drift/corruption, then however their values drift, that doesn't constitute any loss?

I don't understand "better" in what sense. Whatever it is, why wouldn't it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower-priority problems and problems whose priority we're not sure about, and another one that is defined to be a specific urgent problem. Do you currently have any objections to using "AI alignment" as the broader term (in line with the MIRI/Arbital definition and examples) and "AI motivation" as the narrower term (as suggested by Rohin)?
**Paul Christiano:** Yes:

• The vast majority of existing usages of "alignment" should then be replaced by "motivation," which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that A should be the one that keeps the old word.
• The word "alignment" was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values; it's a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it [here](https://intelligence.org/files/TechnicalAgenda.pdf), where they said "We call a smarter-than-human system that reliably pursues beneficial goals 'aligned with human interests' or simply 'aligned.'") Everywhere that anyone talks about alignment they use the analogy with "pointing," and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
• In contrast, "alignment" doesn't really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term "beneficial AI," which really means exactly that. In explaining why MIRI doesn't like that term, Rob said: …
• [Continuing the last point.] The proposed usage of "alignment" doesn't meet this desideratum though; it has exactly the same problem as "beneficial AI," except that it's historically associated with this community. In particular it absolutely includes "garden-variety machine ethics and moral philosophy." Yes, there is all sorts of stuff that MIRI or I wouldn't care about that is relevant to "beneficial" AI, but under the proposed definition of alignment it's also relevant to "aligned" AI. (This statement by Rob also makes me think that you wouldn't i
**Paul Christiano:** "Do what H wants me to do" seems to me to be an example of a beneficial goal, so I'd say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it's wrong about what H wants or has other mistaken empirical beliefs. I don't think anyone could be advocating the definition "pursues no harmful subgoals," since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you? I've been assuming that "reliably pursues beneficial goals" is weaker than the definition I proposed, but practically equivalent as a research goal.

I think it's reasonable for me to be more careful about clarifying what any particular line of research does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying "AI alignment" regardless of how the term was defined; I normally clarify by saying something like "an AI which is at least trying to help us get what we want."

My guess is that MIRI folks won't like the "beneficial AI" term because it is too broad a tent. (Which is also my objection to the proposed definition of "AI alignment," as the "overarching research topic of how to develop [sufficiently advanced machine intelligences](https://arbital.com/p/sufficiently_advanced_ai/) such that running them produces [good](https://arbital.com/p/beneficial/) outcomes in the real world.") My sense is that if that were their position, then you would also be unhappy with their proposed usage of "AI alignment," since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right? (They might also dislike "beneficial AI" because of random contingent facts about how it's been used in the past, and so might want a different term with the same meaning.)

My own feeling is that using "beneficial AI" to mean
**Wei Dai:** I guess both "reliable" and "beneficial" are matters of degree, so "aligned" in the sense of "reliably pursues beneficial goals" is also a matter of degree. "Do what H wants A to do" would be a moderate degree of alignment, whereas "successfully figuring out and satisfying H's true/normative values" would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned, and the latter might actually be less aligned if H has a wrong idea of metaethics or of what his true/normative values are, and as a result trying to figure out and satisfy those values is not something that H wants A to do.

That seems good too.

This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining "AI alignment" as the "overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world", which satisfies my desire to have a broad-tent term that makes minimal assumptions about what problems will turn out to be important. I'm fine with calling this "beneficial AI" instead of "AI alignment" if everyone can coordinate on this (but I don't know how MIRI people feel about this). I don't understand why you think MIRI folks won't like the "beneficial AI" term because it is too broad a tent, given that someone from MIRI gave a very broad definition to "AI alignment". Do you perhaps think that Arbital article was written by a non-MIRI person?
**Paul Christiano:** In what sense is that a more beneficial goal?

• "Successfully do X" seems to be the same goal as X, isn't it?
• "Figure out H's true/normative values" is manifestly a subgoal of "satisfy H's true/normative values." Why would we care about that except as a subgoal?
• So is the difference entirely between "satisfy H's true/normative values" and "do what H wants"?

Do you disagree with one of the previous two bullet points? Is the difference that you think "reliably pursues" implies something about "actually achieves"?

If the difference is mostly between "what H wants" and "what H truly/normatively values", then this is just a communication difficulty. For me adding "truly" or "normatively" to "values" is just emphasis and doesn't change the meaning. I try to make it clear that I'm using "want" to refer to some hard-to-define idealization rather than some narrow concept, but I can see how "want" might not be a good term for this; I'd be fine using "values" or something along those lines if that would be clearer. (This is why I wrote: …)
**Wei Dai:** Ah, yes, that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant "want" in an idealized sense but then forgot and didn't re-read the post to pick up that understanding again.)

ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by "values" you mean something like "current understanding of values" or "interim values" rather than "true/normative values", since it doesn't seem to make sense to want one's true/normative values to change over time.

I don't think "values" is good either. Both "want" and "values" are commonly used words that typically (in everyday usage) mean something like "someone's current understanding of what they want", or what I called "interim values". I don't see how you can expect people not to be frequently confused if you use either of them to mean "true/normative values". Like the situation with de re / de dicto alignment, I suggest it's not worth trying to economize on the adjectives here.

Another difference between your definition of alignment and "reliably pursues beneficial goals" is that the latter has "reliably" in it, which suggests more of a de re reading. To use your example, "Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges": I think most people would call an A that correctly understands H's preferences (and gets oranges) more reliably pursuing beneficial goals.

Given this, perhaps the easiest way to reduce confusion moving forward is to just use some adjectives to distinguish your use of the words "want", "values", or "alignment" from other people's.
**Paul Christiano:** 10x worse was originally my estimate for cost-effectiveness, not for total value at risk. People not caring about X prima facie decreases the returns to research on X, but may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
**Wei Dai:** It's not obvious that applies here. If people don't care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people's values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation by other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.

As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figure out what their true or normative values are, or protect their values against manipulation), than to convince all the potential users to want that themselves.
**Paul Christiano:** I agree that:

* If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
* A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).

I think that both (a) trying to have influence over aspects of value change that people don't much care about, and (b) better understanding the important processes driving changes in values, are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum, and I think it's worth being thoughtful about that.)

(I don't agree with the sign of the effect described in your comment, but don't think it's an important point / it may just be a disagreement about what else we are holding equal, so it seems good to drop.)
**Paul Christiano:** I don't see why the anti-realist version is any easier; my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realist mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more prospect for error-correction, but I don't think it's a big difference.

It doesn't sound that way to me, but I'm happy to avoid framings that might give people the wrong idea. My main complaint with this framing (and the reason that I don't use it) is that people respond badly to invoking the concept of "corruption" here: it's a fuzzy category that we don't understand, and people seem to interpret it as the speaker wanting values to remain static. But in terms of the actual meanings, rather than their impacts on people, I'd be about as happy with "avoiding corruption of values" as with "having our values evolve in a positive way." I think both of them have small shortcomings as framings. My main problem with "corruption" is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
**Wei Dai:** It seems easier in that the AI / AI designer doesn't have to worry about the user being wrong about how they want their values to evolve. But you're right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and that (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed does not convey a proper amount of uncertainty about its difficulty.

That's a good point that I hadn't thought of. (I guess talking about "drift" has a similar issue, though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I'd highly welcome that.
**Paul Christiano:** That may be a connotation of "preferences about how their values evolve," but it doesn't seem to follow from the anti-realist position. I have preferences over what actions my robot takes, yet if you asked me "what action do you want the robot to take?" I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don't know). My preferences over value evolution can be similar. Indeed, if moral realists are right, "ultimately converge to the truth" is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing "help people's preferences evolve in the way they want them to evolve.")

Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it's easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems). I agree that "drift" is also problematic.
**Wei Dai:** Would you agree with this way of stating it: there are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier, because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
**Paul Christiano:** Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn't be a precisely analogous error on the non-realist perspective? There is some uninteresting semantic sense in which there are "more ways to be wrong" (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.

I might be using the word "values" in a different way than you. I think I can say something like "I'd like to deliberate in way X" and be wrong. I guess under non-realism I'm "incorrectly stating my preferences" and under realism I could be "correctly stating my preferences but be wrong," but I don't see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
**Wei Dai:** I'm not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?

Think of the human as a really badly designed AI with a convoluted architecture that nobody understands: spaghetti code, full of security holes, no idea what its terminal values are, really confused even about its "interim" values, with all kinds of potential safety problems like not being robust to distributional shifts, and only "safe" in the sense of having passed certain tests for a very narrow distribution of inputs. Clearly it's not safe for a much more powerful outer AI to query the human about arbitrary actions that it's considering, right? Instead, if the human is to contribute anything at all to safety in this situation, the outer AI has to figure out how to generate a bunch of smaller queries that the human can safely handle, from which it would then infer what the human would say if it could safely consider the actual choice under consideration. If the AI is bad at this "competence" problem it could send unsafe queries to the human and corrupt the human, and/or infer the wrong thing about what the human would approve of. Is it clearer now why this doesn't seem like an easy problem to me?

I'm not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like: based on historical data, it would learn a classifier to predict what kinds of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us? It seems far from easy to do this in a robust way. I mean, this classifier would be facing lots of unpredictable distributional shifts... I guess you made a similar point when you said "On the other hand, there may be similar types of events in the future that we can't back out by looking at the past."
ETA: Do you expect that different AIs would do dif
**Rohin Shah:** I don't know; I want to outsource that decision to humans + AI at the time when it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don't know how exactly it will happen.

Yes. It seems to me like you're considering the case where a human has to be able to give the correct answer to any question of the form "is this action a good thing to do?" I'm claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI's questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.

I was thinking more of the case where a single human amassed a lot of power. Humans haven't seemed to solve the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.

Yes. Kind of? I'd amend that slightly to say that to the extent that I think it is a problem (I'm not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious; obtain a decisive strategic advantage and enforce that everyone is cautious.) Same as above. Same as above. All of these problems that you're talking about would also apply to technology that could make a human smarter.
It seems like it would be easiest to address on that level, rather than trying to build an AI system that can deal
**Wei Dai:** Why isn't that also an argument against the urgency of solving AI motivation? I.e., we don't need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure? It seems to me that coordination is really hard. Yes, we have to push on that, but we also have to push on potential technical solutions, because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.

Aside from that, I think it's also really important to better predict/understand just how difficult those problems are to solve (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve, so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That's why I was asking you for details of what you think the social solutions would look like.

I see; in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend. Ok, I appreciate that.
**Rohin Shah:** Two reasons come to mind:

* Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
* Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you're going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we're going to have lives of leisure due to automation, or to delay full automation if we're soon going to get 4-hour workdays.

Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try and direct these efforts at the right target. I want to emphasize, though, that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.

I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator's wishes-upon-reflection. (If your technical solution is in line with the operator's wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator's wishes-upon-reflection?) as well as ha
**Wei Dai:** Do you think that at the time when AI development wasn't an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument "working on alignment isn't urgent because humans can probably coordinate to stop AI development" would have been a good one?

Same question here. Back when "don't develop AI" was still binding on our future selves, should we have expected that we would coordinate to stop AI development, and it's just bad luck that we haven't succeeded in doing that?

Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I'm worried about (e.g., unintentional corruption of humans by "aligned" AIs that give humans too much power or options they can't handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?

For example, create an AI that can help the user with philosophical questions at least as much as with technical questions. (This could be done, for example, by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like "Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?" or "Is this just an extremely persuasive attempt at manipulation, or an actually good moral argument?" As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.
Can you point
**Wei Dai:** I forgot to follow up on this important part of our discussion: it seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I'm talking about (which are largely caused by technological progress outpacing philosophical/moral progress). I could make some arguments about this, but I'm curious whether or not this seems obvious to you.

Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it is a moral responsibility of AI researchers/companies to try to solve the problems that they create (for example via technological solutions, by coordinating amongst themselves, by convincing policymakers to coordinate, or by funding others to work on these problems), and they are currently neglecting to do this (especially with regard to the particular problems that I'm pointing out).
**Rohin Shah:** Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren't specific to AI; they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.

This seems true. Just to make sure I'm not misunderstanding: this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
**Wei Dai:** This doesn't make much sense to me. Why is this any kind of reason to expect that solutions are likely to come from outside of AI? Can you give me an analogy where this kind of reasoning more obviously makes sense?

Right, this argument wasn't targeted at you, but I think there are other reasons for you to personally prioritize this. See my comment in the parallel thread.
**Alex Turner:** From the [AUP perspective](https://www.alignmentforum.org/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure), this only seems true in a way analogous to the statement that "any hypothesis can have arbitrarily long description length". It's possible to make practically no assumptions about what the true utility function is and still recover a sensible notion of "low impact". That is, penalizing shifts in attainable utility for even random or simple functions still yields the desired behavior; I have experimental results to this effect which aren't yet published. This suggests that the notion of impact captured by AUP isn't dependent on realizability of the true utility, and hence the broader thing Rohin is pointing at should be doable.

While it's true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, it seems by and large negligible. This is because such loss occurs either as a continuation of the status quo or as a consequence of something objectively mild, which seems to correlate strongly with being mild with respect to human values.
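The penalty Alex describes can be sketched numerically. The following toy example is my own illustration, not code from the AUP post: the Q-values, the actions `step`/`disable`, and the λ coefficient are all made up. It penalizes an action by how much it shifts attainable utility for a handful of randomly generated auxiliary reward functions, compared to doing nothing:

```python
import random

def aup_penalty(q_aux, state, action, noop="noop"):
    """Mean absolute shift in attainable utility across auxiliary
    reward functions, comparing the action against doing nothing."""
    shifts = [abs(q[(state, action)] - q[(state, noop)]) for q in q_aux]
    return sum(shifts) / len(shifts)

def aup_reward(r, q_aux, state, action, lam=1.0):
    """Primary reward minus the scaled attainable-utility penalty."""
    return r[(state, action)] - lam * aup_penalty(q_aux, state, action)

# Toy Q-values for three *random* auxiliary reward functions.
# "disable" is meant to be a high-impact action that destroys most
# attainable utility; "step" barely changes what remains attainable.
random.seed(0)
q_aux = []
for _ in range(3):
    base = random.uniform(0.0, 1.0)
    q_aux.append({
        ("s", "noop"): base,
        ("s", "step"): base + random.uniform(-0.05, 0.05),
        ("s", "disable"): base * 0.1,
    })

r = {("s", "noop"): 0.0, ("s", "step"): 1.0, ("s", "disable"): 1.2}

# The mild action keeps most of its primary reward; the impactful one
# is penalized even though none of the auxiliary rewards is the "true" one.
print(aup_reward(r, q_aux, "s", "step"))
print(aup_reward(r, q_aux, "s", "disable"))
```

The point of the sketch is the one Alex makes: the auxiliary functions here are arbitrary, yet the impactful action still gets penalized relative to the mild one.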

This is a great comment, and maybe it should even be its own post. It clarified a bunch of things for me, and I think was the best concise argument for "we should try to build something that doesn't look like an expected utility maximizer" that I've read so far.

**Rohin Shah:** Thanks! The hope is to write something a bit more comprehensive that expands on many of these points, which would be its own post (or sequence).
**Wei Dai:** Another con of the motivation-competence decomposition: unlike definition-optimization, it doesn't actually seem to be a clean decomposition of the larger task, such that we can solve each subtask independently and then combine the solutions. For example, one way we could solve the motivation problem is by building a perfect human imitation (of someone who really wants to help H do what H wants), but then we seem to be stuck on the "competence" front, and there's no clear way to plug this solution to "motivation" into a better generic solution to "competence" to get a more competent intent-aligned agent. Instead it seems like we have to solve the competence problem particular to that specific solution to motivation, or solve motivation and competence together as one large problem. In contrast, the problem of specifying an aligned utility function and the problem of building a safe EU maximizer seem to be naturally independent, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.

Furthermore, I think this lack of clean decomposition shows up at the conceptual level too, not just the pragmatic level. For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn't very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimization. Is this a failure of motivation or a failure of competence, or both? It seems arguable, or hard to say. In contrast, in a system built using the definition-optimization decomposition, it seems like it would be easy to trace any safety failure to either the "definition" solution or the "optimization" solution.
**Rohin Shah:** I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can't talk coherently about their "motivation". Probably all deep-learning-based systems fall into this category. I also agree that (at least for now, and probably in the future as well) you can't formally specify the "type signature" of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem. My hope here would be to solve the motivation problem and leave the competence problem for later, since on my view that solves most of the problem (I'm aware that you disagree with this).

I don't agree that it's not clean at the conceptual level. It's perhaps less clean than the definition-optimization decomposition, but not much less. This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don't want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the Oracle.
**Wei Dai:** It also seems like a failure of motivation, though, because as soon as the Oracle started to do malign optimization, the system as a whole was no longer trying to do what H wants. Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don't count, so we're excluding problems like inner alignment from motivation / intent alignment? I'm unsure what your answer would be, what Paul's answer would be, and whether they would be the same, which at least suggests that the concepts haven't been cleanly decomposed yet.

ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won't cause the Oracle to perform malign optimization. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., the combination of human imitation and Oracle)? It seems really counterintuitive if the answer is "no".
**Rohin Shah:** Oh, I see, you're talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn't apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (though if you insisted on it, I'd say it fails motivation, mostly because the system doesn't really have a single "motivation"). It doesn't seem like the definition-optimization decomposition helps either? I don't know whether I'd call that a failure of definition or of optimization.

I would say the human imitation was intent-aligned, and this helped improve the competence of the human imitation. I mostly wouldn't apply this framework to the system (and I also wouldn't apply definition-optimization to the system).
**Wei Dai:** This was an unexpected answer. Isn't HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn't what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn't apply to IDA either. I'm pretty sure Paul would give a different answer if we asked him about "intent alignment". It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
**Paul Christiano:** Yes, I'd say that to the extent that "trying to do X" is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent. Even a very theoretically simple system like AIXI doesn't seem to be "trying" to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to "know" that its actions won't lead to reward. You could say that AIXI is "optimizing" the right thing and just messing up when it suffers inner alignment failures, but I'm not convinced that this division is actually doing much useful work.

I think it's meaningful to say "defining what we want is useful," but beyond that it doesn't seem like a workable way to analyze the hard parts of alignment or divide up the problem. (For example, I think we can likely get OK definitions of what we value, along the lines of [A Formalization of Indirect Normativity](https://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/), but I've mostly stopped working along these lines because it no longer seems directly useful.)

I agree. Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
**Wei Dai:** So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent-aligned period, or intent-aligned at some points in time and not at others, or simultaneously aligned and not aligned, or something else? (I feel like we've had a similar discussion before and either it didn't get resolved or I didn't understand your position. I didn't see a direct attempt to answer this in the comment I'm replying to, and it's fine if you don't want to go down this road again, but I want to convey my continued confusion.)

I don't understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments than, say, Rohin's. I'm not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but I'm offering feedback in case you didn't realize.) This makes sense.
**Paul Christiano:** The Oracle is not aligned when asked questions that cause it to do malign optimization. The human+Oracle system is not aligned in situations where the human would pose such questions.

For a coherent system (e.g. a multiagent system which has converged to a Pareto-efficient compromise), it makes sense to talk about the one thing that it is trying to do. For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use [benign](https://ai-alignment.com/benign-ai-e4eb6ec6d68e) when talking about possibly-incoherent systems, or things that don't even resemble optimizers. The definition in this post is a bit sloppy here, but I'm usually imagining that we are building roughly coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and want to admit vagueness in "what H wants it to do" (such that there can be several different preferences that are "what H wants"), we could say something like: That's not great either, though (and I think the original post is at a more appropriate level of attempted precision).
**Wei Dai:** (In the following I will also use "aligned" to mean "intent aligned".) Ok, sounds like "intent aligned at some points in time and not at others" was the closest guess. To confirm, would you endorse: "the system was aligned when the human imitation was still trying to figure out what questions to ask the Oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the Oracle started working on the unsafe question"?

Given that intent alignment in this sense seems to be a property of a system+situation rather than of the system itself, how would you define when the "intent alignment problem" has been solved for an AI, or when would you call an AI (such as IDA) itself "intent aligned"? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use "intent alignment" you always have some specific situation or set of situations in mind?
**Rohin Shah:** Fwiw, having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment: yes, I shouldn't have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is "trying to do", i.e. I wouldn't say it has a single "motivation". This allows you to say "the system is not intent-aligned", even though you can't say "the system is trying to do X". Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn't make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement: Also, I want to note strong agreement with this:
**Wei Dai:** HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) "What is a good approximation of the user's utility function?" followed by "What action would maximize EU according to this utility function?"

ETA: If this isn't clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.

In this essay Paul Christiano proposes a definition of "AI alignment" which is narrower than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.

Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often-used decomposition is "definition-optimization": first define what it means for an AI to be safe, then understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first learn how to design AIs with good motivations, then learn how to make them competent. Both Paul and Rohin argue that "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.

In contrast, I will argue that the "motivation-competence

**Rohin Shah:** Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly. If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent-aligned.

That, or it could depend on the agent's counterfactual behavior in other situations. I agree it can't be just the action chosen in the particular state. I guess you wouldn't count [universality](https://ai-alignment.com/towards-formalizing-universality-409ab893a456). Overall I agree; I'm relatively pessimistic about mathematical formalization. (Probably not worth debating this point; it feels like people have talked about it at length in [Realism about rationality](https://www.lesswrong.com/posts/suxvE2ddnYMPJN9HD/realism-about-rationality#YMNwHcPNPd4pDK7MR) without making much progress.) I do want to note that all of these require you to make assumptions of the form "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

> This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.

> Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).

> If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Now, I do believe that if you set up the prior correctly then …

**Rohin Shah** (3 points, 2y):

Sorry, that's right. Fwiw, I do think subjective regret bounds are significantly better than the thing I meant by definition-optimization.

Why doesn't this also apply to subjective regret bounds? My guess at your answer is that Alpha wouldn't take the irreversible action as long as the user believes that Alpha is not in Beta-simulation-world. I would amend that to say that Alpha has to know that [the user doesn't believe that Alpha is in Beta-simulation-world]. But if Alpha knows that, then surely Alpha can predict that the user would not confirm the irreversible action? It seems like for subjective regret bounds, avoiding this scenario depends on your prior already "knowing" that the user thinks that Alpha is not in Beta-simulation-world (perhaps by excluding Beta-simulations). If that's true, you could do the same thing with intent alignment / corrigibility.

It isn't equivalent to intent alignment; but it is meant to be used as part of an argument for safety, though I guess it could be used in definition-optimization too, so never mind.

That is hard to say. I would want to have the reaction "oh, if I built that system, I expect it to be safe and competitive". Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space. I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models. Tbc, I also don't think intent alignment will lead to a mathematical formalization I'm happy with -- it "passes the buck" to the problem of defining what "trying" is, or what "corrigibility" is.
**Vanessa Kosoy** (3 points, 2y):

In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this [https://www.alignmentforum.org/posts/5bd75cc58225bf0670375510/catastrophe-mitigation-using-drl] for one toy model of how "safe in the short term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.

Now, you can say "with the right prior intent-alignment also works". To which I answer: sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward can then be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.

This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis. It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason of principle that significantly limits the strong feasibility results we can expect. …
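For readers unfamiliar with the term, a regret bound has, schematically, the following shape. This is my own rough paraphrase to fix intuitions, not Kosoy's precise definition of a *subjective* regret bound; the distinctive feature gestured at here is that the expectation is taken under the agent's own prior:

```latex
% Schematic regret bound: the agent's expected shortfall relative to the
% best policy, measured under the agent's own prior \zeta (hence
% "subjective"), grows sublinearly in the time horizon T.
\operatorname{Reg}(T)
  \;:=\; \mathbb{E}_{\zeta}\!\Big[\,\max_{\pi^{*}} \sum_{t=1}^{T} r_{t}(\pi^{*})
  \;-\; \sum_{t=1}^{T} r_{t}(\pi)\,\Big]
  \;=\; o(T).
```

The role of the prior in the surrounding discussion is that $\zeta$ must already rule out (or render harmless) hypotheses like the Beta-simulation attack.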
**Rohin Shah** (3 points, 2y):

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization? Like, when you write (emphasis mine):

> second *the assumptions about the prior are doing all the work*

Isn't the assumption about the prior "doing all the work"? Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places. Just picking one sentence (emphasis mine):

And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world. That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world. The abstract argument is an attempt to explain it; but I wouldn't have much faith in the abstract argument by itself (which is trying to quantify over all possible ways of getting a strong feasibility result).

Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though. (As an aside, I hope this is what MIRI is doing, but they probably aren't.)

Basically what you said right after: …
**Vanessa Kosoy** (3 points, 2y):

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol, (ii) formalizing a set of assumptions about reality, and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in i+ii is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...).

I disagree with the OP that (emphasis mine):

I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.

I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it only on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with a particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.

The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example, the Solomonoff prior is too rich …
> I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.

Okay, so there seem to be two disagreements:

• How bad is it that intent alignment is ill-defined?
• Is work on intent alignment urgent?

The first one seems primarily about our disagreements on the utility of theory, which I'll get to later.

For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely (maybe the first few AGIs don't think about simulations; maybe it's impossible to construct such a convincing hypothesis). I especially don't see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want.

(You might respond that intent alignment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out …)

> For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long compared to the timeline until the risk materializes. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.

For a more "mundane" example, take IRL. Is IRL intent-aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned, since it is trying to do what the user wants and is merely wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual …

**Rohin Shah** (3 points, 2y):

Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely", since by hypothesis it is unlikely), or that we need a lot of time to solve it?

I agree with this; I don't think this is one of our cruxes. (I do think that in most cases, if we have all the information about the situation, it will be fairly clear whether something is intent aligned or not, but certainly there are situations in which it's ambiguous. I think corrigibility is better-informally-defined, though still there will be ambiguous situations.)

Depends on the details, but the way you describe it, no, it isn't. (Though I can see the fuzziness here.) I think it is especially clear that it is not corrigible.

Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.

I meant that K was set considering wind forces, cars, etc. and was set too low to account for resonance, because you didn't think about resonance beforehand. (I guess resonance doesn't involve large forces, it involves coordinated forces. The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.)

I'm not denying that? I'm not arguing against theory in general; I'm arguing against theoretical safety guarantees. I think in practice our confidence in safety often comes from empirical tests.

Probably? Honestly, I don't think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe), that alone would qualify. It would be fine if it sometimes said things are unsafe when they are safe, as long as it isn't too conservative …
**Vanessa Kosoy** (3 points, 2y):

The reasons the risk is great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn't seem to be an instance of some standard problem type which we have standard tools to solve, and (ii) some people have been thinking on these questions for a while by now and did not come up with an easy solution.

Then, I don't understand why you believe that work on anything other than intent-alignment is much less urgent?

"Resonance" is not something you need to explicitly include in your model; it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don't need to know all possible failure modes, you only need a reasonable set of assumptions.

I think that in practice our confidence in safety comes from a combination of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you're doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you're sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.

I disagree. For example, suppose that we have a theorem saying that an ANN with a particular architecture and learning algorithm can learn any function inside some space …
**Rohin Shah** (2 points, 2y):

Sorry, I meant: what are the reasons that the risk is greater than the risk from a failure of intent alignment? The question was meant to be compared to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve. I'm claiming that intent alignment captures a large proportion of possible failure modes, which seem particularly amenable to a solution.

Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:

1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situation, machine A is a much better plan. (The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it's not meant to properly model intent alignment vs. other stuff on other axes.)

I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t. resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don't know, but at least in the past that would not have been the case.) In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties about a complex real environment, your two options are 1. don't prove things, or 2. prove things with unrealistic assumptions that …
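The coin-flip comparison above is easy to check numerically. The sketch below is my own illustration: the 21 flips, 1% per-flip error rate, and 50% build probability come from the comment, while the assumption that a missing machine B forces a 50/50 guess is mine.

```python
import random

random.seed(0)
N_FLIPS, ERR, TRIALS = 21, 0.01, 20_000

def machine_a_correct():
    """One trial: do machine A's noisy reports still give the right majority?"""
    flips = [random.random() < 0.5 for _ in range(N_FLIPS)]    # True = heads
    reports = [f != (random.random() < ERR) for f in flips]    # each report flipped with prob ERR
    return (sum(flips) > N_FLIPS // 2) == (sum(reports) > N_FLIPS // 2)

acc_a = sum(machine_a_correct() for _ in range(TRIALS)) / TRIALS

# Machine B is perfect but exists only with probability 0.5; assuming a
# missing machine forces a 50/50 guess, its expected accuracy is:
acc_b = 0.5 * 1.0 + 0.5 * 0.5

print(acc_a, acc_b)
```

With these numbers machine A answers correctly roughly 96-97% of the time (a per-flip error only matters when it flips a narrow majority), comfortably above machine B's expected 75%, which is the point of the example.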
**Vanessa Kosoy** (2 points, 2y):

I am struggling to understand how this works in practice. For example, consider dialogic [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform#Wi65Ahs9abL63gPSe] RL [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform#Dhd7eFegPFNHSm2kj]. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:

1. Basic question-answer protocol
2. Natural language annotation
3. Quantilization of questions
4. Debate over annotations
5. Dealing with no user answer
6. Dealing with inconsistent user answers
7. Dealing with changing user beliefs
8. Dealing with changing user preferences
9. Self-reference in user beliefs
10. Quantilization of computations (to combat non-Cartesian daemons; this is not in the original proposal)
11. Reverse questions
12. Translation of counterfactuals from user frame to AI frame
13. User beliefs about computations
14. Confidence threshold for risky actions (added in an edit)

Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can't tell.

I am not an expert, but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don't fall, don't we? On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.

I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just …
**Rohin Shah** (2 points, 2y):

As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is "doesn't manipulate the user" or something like that.) I'm not sure what 9, 11 and 13 are about. For the others, I'd say they're all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.

This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven't yet seen a bridge fall down from resonance, and so you don't think about it.

Maybe I'm falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant. Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a "mathematical theory of how to write code".

I'm curious what you think doesn't require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don't have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that's how I interpret the rocket alignment and security mindset posts.)
**Vanessa Kosoy** (3 points, 2y):

Hmm. I appreciate the effort, but I don't understand this answer. Maybe discussing this point further is not productive in this format.

Yes, and in that perspective, the mathematical model can tell me about resonance. It's actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about external force magnitudes and spacetime spectrum, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn't mean we should be equally careless about AGI.

I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don't understand why we should give up so soon. We can work towards realistic mathematical models and prepare fallbacks, and even if we don't arrive at a realistic mathematical model, it is likely that the effort will produce valuable insights.

First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don't literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly "intuitive": I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don't write proofs that are machine verifiable (some people do that today, but it's a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with a reasonable background in the field (but actually doing that would be very labor intensive). …
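The claim that resonance already appears in simple harmonic oscillators can be checked numerically. The sketch below is my own illustration (the parameter values are arbitrary): it integrates an undamped driven oscillator and compares the peak amplitude when driving at the natural frequency versus far from it.

```python
import math

def max_amplitude(drive_freq, omega0=1.0, force=0.1, dt=0.001, t_end=100.0):
    """Integrate x'' = -omega0^2 * x + force * cos(drive_freq * t)
    with semi-implicit Euler and return the peak |x| over the run."""
    x, v, peak = 0.0, 0.0, 0.0
    for i in range(int(t_end / dt)):
        t = i * dt
        v += (-omega0 ** 2 * x + force * math.cos(drive_freq * t)) * dt
        x += v * dt
        peak = max(peak, abs(x))
    return peak

on_resonance = max_amplitude(1.0)   # driving at the natural frequency
off_resonance = max_amplitude(2.0)  # driving well away from it

print(on_resonance, off_resonance)
```

Driving at the natural frequency makes the amplitude grow roughly linearly in time, while the off-resonance amplitude stays bounded near force/|omega0^2 - drive_freq^2|; this is the qualitative behavior the comment appeals to, falling out of the equations of motion with no special "resonance" term in the model.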
**Rohin Shah** (2 points, 2y):

You made a claim a few comments above:

> Without theory you cannot extrapolate.

I'm struggling to understand what you mean by "theory" here, and the programming example was trying to get at that, but not very successfully. So let's take the sandwich example: presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to "extrapolate" to figure out how to make a sandwich anyway. Why didn't you need theory for that extrapolation?

Obviously this is a silly example, but I don't currently see any qualitative difference between sandwich-making-extrapolation and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.
**Vanessa Kosoy** (1 point, 2y):

I decided that the answer deserves its own post [https://www.alignmentforum.org/posts/qpbYwTqKQG8G7mdFK/the-reasonable-effectiveness-of-mathematics-or-ai-vs].

I hadn't realized this post was nominated, partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.

Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions, and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect the agent to instrumentally maintain a belief over what the human cares about, and gather information to refine this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivizes all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL does (e.g. that the human has a fixed known reward function).
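As a concrete, heavily simplified illustration of the assistance-game picture (this is my own toy sketch, not the CIRL formalism): the agent keeps an explicit posterior over candidate human reward functions and Bayes-updates it from observed choices under a noisy-rationality assumption. The hypothesis names, reward values, and rationality parameter below are all invented for the example.

```python
import math

# Hypothetical reward functions for two candidate preference hypotheses.
rewards = {
    "likes_apples":  {"apple": 1.0, "orange": 0.0},
    "likes_oranges": {"apple": 0.0, "orange": 1.0},
}
posterior = {"likes_apples": 0.5, "likes_oranges": 0.5}
BETA = 2.0  # assumed "rationality": higher = H more reliably picks the better item

def p_choice(item, reward):
    """Softmax (noisy-rational) probability that H picks `item` under a reward function."""
    z = sum(math.exp(BETA * r) for r in reward.values())
    return math.exp(BETA * reward[item]) / z

def observe(item):
    """Bayes-update the posterior over hypotheses after seeing H choose `item`."""
    for h in posterior:
        posterior[h] *= p_choice(item, rewards[h])
    total = sum(posterior.values())
    for h in posterior:
        posterior[h] /= total

observe("orange")  # H picks an orange once
print(posterior)   # belief shifts toward "likes_oranges"
```

In CIRL this belief maintenance is built into the game by construction; the hope described above is that an intent-aligned agent would do the same kind of refinement instrumentally, without the modeling assumptions.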

> I do think that some term needs to refer to this problem, to separate it from other problems like "understanding what humans want," "solving philosophy," etc.

Worth noting here that (it looks like) Paul eventually settled upon "intent alignment" as the term for this.

> I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.

I think it would be helpful for understanding your position and what you mean by "AI alignment" to have a list or summary of those other subproblems and why you think they're much less urgent. Can you link to or give one here?

Also, do you have a preferred term for the broader definition, or the de re reading? What should we call those things if not "AI alignment"?

**Paul Christiano** (3 points, 3y):

Other problems related to alignment, which would be included by the broadest definition of "everything related to making the future good":

• We face a bunch of problems other than AI alignment (e.g. other destructive technologies, risk of value drift), and depending on the competencies of our AI systems they may be better or worse than humans at helping handle those problems (relative to accelerating the kinds of progress that force us to confront those problems). So we'd like AI to be better at (helping us with) {diplomacy, reflection, institution design, philosophy...} relative to {physical technology, social manipulation, logistics...}.
• Beyond alignment, AI may provide new advantages to actors who are able to make their values more explicit, or who have explicit norms for bargaining/aggregation, and so we may want to figure out how to make more things more explicit.
• AI could facilitate social control, manipulation, or lock-in, which may make it more important for us to have more robust or rapid forms of deliberation (that are robust to control/manipulation, or that can run their course fast enough to prevent someone from making a mistake). This also may increase the incentives for ordinary conflict amongst actors with differing long-term values.
• AI will tend to empower groups with few people (but lots of resources), making it easier for someone to destroy the world and so requiring stronger enforcement/stabilization.
• AI may be an unusually good opportunity for world stabilization, e.g. because it is associated with a disruptive transition, in which case someone may want to take that opportunity. (Though I'm concerned about this because, in light of disagreement/conflict about stabilization itself, someone attempting to do this or being expected to attempt to do this could undermine our ability to solve alignment.)

That's a very partial list. This is for the broadest definition of "everything related to making the future good." …

Crystallized my view of what the "core problem" is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.

Nominating this primarily for Rohin’s comment on the post, which was very illuminating.