Thanks to Rebecca Gorman for discussions that lead to these insights.
How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once we've solved, we can point towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.
I'm of the second perspective, but I think it's very important to sort this out. So I'll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.
More strawberry, less trouble
As an example of the first perspective, I'll take Eliezer's AI task, described here:
- "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level." A 'safely' aligned powerful AI is one that doesn't kill everyone on Earth as a side effect of its operation.
If an AI accomplishes this limited task without going crazy, this shows several things:
- It is superpowered; the task described is beyond current human capabilities.
- It is aligned (or at least alignable) in that it can accomplish a task in the way intended, without wireheading the definitions of "strawberry" or "cellular".
- It is safe, in that it has not heavily dramatically reconfigured the universe to accomplish this one goal.
Then, at that point, we can add human values to the AI, maybe via "consider what these moral human philosophers would conclude if they thought for a thousand years, and do that".
I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it's only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.
So we might call an AI "superpowered and aligned" if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.
Values are needed
I think that an AI could not be "superpowered and aligned" unless it is also aligned with human values.
The reason is that the AI can and has to interact with the world. It has the capability to do so, by assumption - it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have impacts in the world simply through its existence being known, and if the AI finds it efficient to have interactions with the world (eg. ordering some extra resources) then it will do so.
So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.
Suppose that the AI realises that its actions have slightly imbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn't want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!
There are very "large" impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people's personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task with a universal constructor or unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry - and which humans to allow this access to.
So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating this requires it to have a good understanding of our values. It will also need to estimate the value of certain situations beyond the human training distribution - if only to avoid these situations. Thus a "superpowered and aligned" AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.
Model splintering sufficient?
The previous sections argue that learning human values (including model splintering) is necessary for instantiating an aligned AI; thus the "define alignment and then add human values" approach will not work.
Thus, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it's also (almost) sufficient, in that the skill in navigating model splintering, combined with some basic human value information (as given, for example, by the approach here) is enough to get alignment even at high AI power.
Which path to pursue for alignment
It's important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I'd also like to know if I'm wasting my time on an unnecessary diversion.
I often say things that I think you would interpret as belonging to the first category ("general alignment plus human values").
This feels like the crux. I certainly agree that "dangerous" and "large" are not orthogonal to / independent of human values, and that as a result any realistic safe AI system will contain some information about human values.
But this seems like a very weak conclusion to me. Of course a superintelligent AI will contain some information about human values. GPT-3 isn't superintelligent and it already contains tons of knowledge about human values; possibly more than I do. You'd have to try really hard to prevent it from containing information about human values.
It seems like you conclude something much stronger, which is something like "we must build in all of human values". I don't see why we can't instead have our AI systems do whatever a well-motivated human would do in a similar principal-agent problem. This certainly involves knowing some amount about human values, but not some extraordinarily large amount that means we might as well just learn everything including in exotic philosophical cases.
(I think my position is pretty similar to Steve's.)
From a later comment:
The same way a well-motivated personal assistant would deal with it. Tell the human of these two possibilities, and ask them which one should be done. Help them with this decision by providing them with true, useful information about what consequences arise from each of the possibilities.
If you are able to perfectly predict their responses in all possible situations, and the final answer depends on (say) the order in which you ask the questions, then go up a meta level: ask them for their preferences about how you go about eliciting information from them and/or helping them with reflection.
If going up meta levels doesn't solve the problem either, then pick randomly amongst the options, or take an average.
If there's time pressure and you can't get their opinions, take your best guess as to which one they'd prefer, and do that one. (One assumes that such a scenario doesn't come up often.)
Generally with these sorts of hypotheticals (EDIT: where a very competent AI fails due to not understanding all of human values), it feels to me like it either (1) isn't likely to come up, or (2) can be solved by deferring to the human, or (3) doesn't matter very much.
What do you think about the following examples:
In addition to
I should probably add a fourth:
I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn't seem like an ok outcome to me. Can you explain more why you aren't worried?
I can imagine that if you subscribe to a metaethics in which a person can't be wrong about morality (i.e., some version of anti-realism), then you might think it's fine to lock in whatever values one currently thinks they ought to have. Is this your reason for "seems fine", or something else? (If so, I think nobody should be so certain about metaethics at this point.)
If the AI isn't very good at dealing with "exotic philosophical cases" then it's not going to be of much help with this problem, and a lot of humans (including me) probably aren't very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.
Do you have any suggestions for this? Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?
Agreed that's a possible good outcome, but seems far from a sure thing. Such laws would have to more intrusive than anything people are currently used to, since attackers can create simulated suffering within the "privacy" of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.
Another issue is if there are powerful unaligned AIs or rogue states who think they can use such threats to asymmetrically gain advantage, they won't agree to such laws.
I think COVID shows that we often can't do this even when it's relatively trivial (or can only do it with a huge time delay). For example COVID could have been solved at very low cost (relative to the actual human and economic damage it inflicted) if governments had stockpiled enough high filtration elastomeric respirators for everyone, mailed them out at the start of the pandemic, and mandated their use. (Some EAs are trying to convince some governments to do this now, in preparation for the next pandemic. I'm not sure how much success they're having.)
To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn't know human values, rather than cases where the AI knows what the human would want but still a failure occurs. (I didn't state this explicitly because I was replying to the post, which focuses specifically on the problem of not knowing all of human values.) I think most of your failures are of the latter type, and I wouldn't make a similar claim for such failures -- they seem plausible and worth attention. I do still think they are not as important as intent alignment. Talking about each one individually:
Mostly I'd hope that AI can tell what philosophy is optimized for persuasion, or at least is capable of presenting counterarguments persuasively as well. But if your AI can't even tell what is persuasive then you're in trouble, but I'm not sure why to expect that.
No, it just doesn't seem hugely terrible for a few people to lock in bad values, as long as the vast majority do not. (And I don't expect a large number of people to explicitly try to lock in their values.)
It seems odd to me that it's sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don't particularly expect this to happen.
I agree it is far from a sure thing.
I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AI don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?
As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?
How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?
You mean every time you hear a philosophical argument, you ask you AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?
A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?
The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?
I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else and maybe forcing them to follow.
I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)
In contrast, something like a threat doesn't count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don't know how to act in a way that both disincentivizes threats and also doesn't lead to (too many) threats being enforced. In particular, the problem is not that you don't know which outcomes are bad.
No, the expected value of marginal effort aimed at solving these problems isn't as large as the expected value of marginal effort on intent alignment.
(I don't like talking about "expected value lost" because it's not always clear what does and doesn't count as part of that. For example I think it's nearly inevitable that different people will have different goals and so the future will not be exactly as any one of them desired; should I say that there's a lot of expected value lost from "coordination problems" for that reason? It seems a bit weird to say that if you think there isn't a way to regain that "expected value".)
Uh, idk. It's not something I have numbers on. But I suppose I can try and make up some very fake numbers for, say, AI persuasion. (Before I actually do the exercise, let me note that I could imagine the exercise coming out with numbers that favor persuasion over intent alignment; this probably won't change my mind and would instead make me distrust the numbers, but I'll publish them anyway.)
To change an existentially bad outcome from AI persuasion, I'd imagine first figuring out some solutions, then figuring out how to implement them, and then getting the appropriate people to implement them; seems like you need all of these steps in order to make any difference. (Could be technical or governance though.) It feels especially hard to do so at the moment, given how little we know about future AI capabilities and when each potential capability arrives. Maybe:
So I guess overall I'm at ~100x, very very roughly?
Putting on my "what would I do" hat, I'm imagining that the AI doesn't know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren't being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments. Or it says that there are other counterfactual letters you could have received, such that after you read them you'd be convinced of the opposite position, and then it asks whether you still want to read the letter.
If your AI doesn't know about the counterarguments, and the letter is persuasive even to the AI, then it seems like you're hosed, but I'm not sure why to expect that.
I totally expect them to ask AI for help with such games. I don't expect (most of) them to lock in their values such that they can't change their mind.
I'm not entirely sure why you do expect this. Maybe you're viewing them as consciously optimizing for winning status games + a claim that the optimal policy is to lock in your values? But what if the values rewarded by the status games (as they already seem to, e.g. moving from atheism to anti-racism)? In that case it seems like you don't want to lock in your values, to better play the status game in the future.
If you've determined a set of universes to care about, then shouldn't at least decision theory reduce to mathematical / empirical matters about which decision procedure gets the most value across the set of universes? I do agree that moral questions are still not mathematical / empirical, but I don't find that all that persuasive. I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
I'm guessing you feel better about mathematical / empirical reasoning because there's a ground truth that says when that reasoning is done well. I don't particularly find the existence of a ground truth to be all that big a deal -- it probably helps but doesn't seem tremendously important.
Fair enough -- I agree this is plausible (though only plausible; it doesn't seem like we've sacrificed everything to Moloch yet).
If we're imagining coordination between a billion AIs, that seems less obviously doable. I still think that, if you've solved intent alignment, it seems much easier. Democratic elections are not just about coordination -- they're also about alignment, since politicians have to optimize for being re-elected. It seems like you could do a lot better if you didn't have to worry about the alignment part.
I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)
I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)
This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.
I broadly agree with the things you're saying; I think it mostly comes down to the actual numbers we'd assign.
Yeah, that's about right. I'd note that it isn't totally clear what the absolute risk number is meant to capture -- one operationalization is that it is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs) -- I realize I didn't say exactly this above but that's the one that is mutually exhaustive across risks, and the one that determines expected value of solving the problem.
To justify the absolute number of 1/1000, I'd note that:
The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don't apply to alignment, point (1) applies only in part -- given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
That's a fair point, I agree this is a way in which full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn't.
More detailed response: https://www.lesswrong.com/posts/DjTKMEwRqpuKkJzTo/are-there-alternative-to-solving-value-transfer-and
Under my current understanding of the "general alignment" angle, a core argument goes like:
I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent has direct read-access to its own values is an important distinction, and separately there's an argument that an AGI will be better-equipped to figure out its own succession problem than we will). I also don't know whether this is an accurate representation of anybody's view.
The successor problem is important, but it assumes we have the values already.
I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).
Here are three things that I believe:
For example, I feel like it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic. So if you put it in the trolley problem, it says "I don't know, neither of those options seems obviously acceptable, so I am going to default to NOOP and let my supervisor take actions." Meanwhile, the AGI is also motivated to make me a cup of tea. Such an AGI seems pretty good to me. But it's contrary to (3).
I think this post is mainly arguing in favor of (2), and maybe weakly / implicitly arguing against (1). Is that right? And I'm not sure whether it's for or against (3).
I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.
Ah, so you are arguing against (3)? (And what's your stance on (1)?)
Let's say you are assigned to be Alice's personal assistant.
Hmm, I'm probably misunderstanding, but I feel like maybe you're making an argument like this:
But I don't think it's like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of "doing things that people would widely regard as uncontroversial and compatible with prevailing norms", and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI's motivation such that it wants to follow prevailing norms would not be any harder.)
But I think it would take a lot more value-loading effort than that to really get a particular person's preferences, including all its idiosyncrasies and edge-cases.
Thanks for developing the argument. This is very useful.
The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not as a "on balance, things are ok", but a genuinely low impact AI that ensure that we don't move towards a world where our preference might be ambiguous or underdefined.
But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?
(At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I've kinda wandered off in a different direction.)
So then in the scenario you mentioned, let's assume that we've set up the AI such that actions that pattern-match to "push the world into uncharted territory" are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done—say, solve global warming. And it finds a possible course of action which pattern-matches very well to "solve global warming", but alas, it also pattern-matches to "push the world into uncharted territory". The AI could reason that, if it queries the human (by "pressing the button" to send the data-dump), there's at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.
In other words, this is a situation where the AI's motivational system is sending it mixed signals—it does want to "solve global warming", but it doesn't want to "push the world into uncharted territory", but this course of action is both. And let's assume that the AI can't easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.
I haven't thought this through very much and look forward to you picking holes in it :)
My take is that if you gave an optimization process access to some handwritten acceptability criteria and searched for the nearest acceptable points to random starting points, you would get adversarial examples that violate unstated criteria. In order for the handwritten acceptability criteria to be useful, they can't be how the AI generates its ideas in the first place.
So: what is the base level that we would find if we peeled away the value learning scheme that you lay out? Is it a very general, human-agnostic AI with some human-value constraints on top? Or will we peel away a layer that gets information from humans just to reveal another layer that gets information from humans (e.g. learning a "human distribution")?
I'm with Steve on the idea that there's a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning).
Still, you (Stuart) made me realize that I didn't think explicitly about this need for broad human preferences in my splitting of the problem (be able to align, then point to what we want), but it's indeed implicit because I don't care about being able to do "anything", just the sort of things humans might want.
More detailed response: https://www.lesswrong.com/posts/DjTKMEwRqpuKkJzTo/are-there-alternative-to-solving-value-transfer-and
If we have an algorithm that aligns an AI with X values, then we can add human values to get an AI that is aligned with human values.
On the other hand, I agree that it doesn't really make sense to declare an AI safe in the abstract, rather than in respect to say human values. (Small counterpoint: in order to be safe, it's not just about alignment, you also need to avoid bugs. This can be defined without reference to human values. However, this isn't sufficient for safety).
I suppose this works as a criticism of approaches like quantisers or impact-minimisation which attempt abstract safety. Although I can't see any reason why it'd imply that it's impossible to write an AI that can be aligned with arbitrary values.
I don't think we are indifferent to these outcomes. We leave them to luck, but that's a fact about our limited capabilities, not about our values. If we had enough control over "chaotic weather changes" to steer a hurricane away from a coastal city, we would very much care about it. So if a strong AI can reason through these impacts, it suddenly faces a harder task than a human: "I'd like this apple to fall from the table, and I see that running the fan for a few minutes will achieve that goal, but that's due to subtly steering a hurricane and we can't have that".
Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.