Wei Dai

Comments

General alignment plus human values, or alignment via human values?

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?

Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.

Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)

I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.

I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)

I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.

General alignment plus human values, or alignment via human values?

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AI don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?

I do still think they are not as important as intent alignment.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?

Mostly I’d hope that AI can tell what philosophy is optimized for persuasion

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

or at least is capable of presenting counterarguments persuasively as well.

You mean every time you hear a philosophical argument, you ask you AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?

And I don’t expect a large number of people to explicitly try to lock in their values.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?

I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).

I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else and maybe forcing them to follow.

I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)

Morality is Scary

This seems interesting and novel to me, but (of course) I'm still skeptical.

I gave the relevant example of relatively well-understood values, preference for lower x-risks.

Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment. What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)

General alignment plus human values, or alignment via human values?
  1. Your AI should tell you that it’s worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.

I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn't seem like an ok outcome to me. Can you explain more why you aren't worried?

  1. Seems fine. Maybe your AI warns you about the risks before helping.

I can imagine that if you subscribe to a metaethics in which a person can't be wrong about morality (i.e., some version of anti-realism), then you might think it's fine to lock in whatever values one currently thinks they ought to have. Is this your reason for "seems fine", or something else? (If so, I think nobody should be so certain about metaethics at this point.)

  1. Seems like an important threat that you (and your AI) should try to resolve.

If the AI isn't very good at dealing with "exotic philosophical cases" then it's not going to be of much help with this problem, and a lot of humans (including me) probably aren't very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.

  1. Mostly I would hope that this situation doesn’t arise, because none of the humans can come up with utility functions in this way, and the AIs that are aligned with humans have other ways of cooperating that don’t require eliciting a utility function over universe histories.

Do you have any suggestions for this? Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?

  1. Idk, seems pretty unclear, but I’d hope that these situations can’t come up thanks to laws that prevent people from enforcing such threats.

Agreed that's a possible good outcome, but seems far from a sure thing. Such laws would have to more intrusive than anything people are currently used to, since attackers can create simulated suffering within the "privacy" of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.

Another issue is if there are powerful unaligned AIs or rogue states who think they can use such threats to asymmetrically gain advantage, they won't agree to such laws.

(4) can be solved through governance (laws, regulations, norms, etc)

I think COVID shows that we often can't do this even when it's relatively trivial (or can only do it with a huge time delay). For example COVID could have been solved at very low cost (relative to the actual human and economic damage it inflicted) if governments had stockpiled enough high filtration elastomeric respirators for everyone, mailed them out at the start of the pandemic, and mandated their use. (Some EAs are trying to convince some governments to do this now, in preparation for the next pandemic. I'm not sure how much success they're having.)

General alignment plus human values, or alignment via human values?

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn’t likely to come up, or (2) can be solved by deferring to the human, or (3) doesn’t matter very much.

What do you think about the following examples:

  1. AI persuasion - My AI receives a message from my friend containing a novel moral argument relevant to some decision I'm about to make, but it's not sure if it's safe to show the message to me, because my friend may have been compromised by a hostile AI and is now in turn trying to compromise me.
  2. User-requested value lock-in - I ask my AI to help me become more faithful to some religion/cause/morality.
  3. Acausal attacks - I (or my AI) become concerned that I may be living in a simulation and will be punished unless I do what my simulators want.
  4. Bargaining/coordination - Many AIs are merging together for better coordination and economy of scale, for example by setting the utility function of the merged AI to a weighted average of their individual utility functions, so I have to come up with a utility function (or whatever the merger will be based on) if I want to join the bargain. If I fail to do this, I risk falling behind in subsequent military or economic competition.
  5. Threats - Someone (in the same universe) communicates to me that unless I do what they want (i.e., hand most of my resources to them), they'll create a vast amount of simulated suffering.
Risks from AI persuasion

Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive versus training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers if they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct).

AFAICT, the biggest difficulty gap is (and probably will be) in philosophy, since it's just as easy as any other area to ask labellers if they are persuaded by some philosophical argument, but we have little idea (both compared to other areas, and in an absolute sense) what constitutes "philosophical truth" or what makes an explanation "correct" in philosophy. So I see solving these metaphilosophical problems as crucial to defending against AI persuasion. Do you agree, and if so why no mention of metaphilosophy in this otherwise fairly comprehensive post on AI persuasion?

ARC's first technical report: Eliciting Latent Knowledge

Thanks, very helpful to understand your motivations for that section better.

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI's perspective) features that make it hard for the AI to predict correctly.

Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit's mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H's situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they're not living in the real world.

Morality is Scary

Well, I linked my toy model of partiality before. Are you asking about something more concrete?

Yeah, I mean aside from how much you care about various other people, what concrete things do you want in your utopia?

ARC's first technical report: Eliciting Latent Knowledge

Can you talk about the advantages or other motivations for the formulation of indirect normativity in this paper (section "Indirect normativity: defining a utility function"), compared to your 2012 formulation? (It's not clear to me what problems with that version you're trying to solve here.)

ARC's first technical report: Eliciting Latent Knowledge

Ok, this all makes sense now. I guess when I first read that section I got the impression that you were trying to do something more ambitious. You may want to consider adding some clarification that you're not describing a scheme designed to block only manipulation while letting helpful arguments through, or that "letting helpful arguments through" would require additional ideas outside of that section.

Load More