Wei Dai


Decoupling deliberation from competition

Current human deliberation and discourse are strongly tied up with a kind of resource gathering and competition, and because of this I don't have a good picture of how things will look after the two are decoupled, nor know how to extrapolate past performance (how well human deliberation worked in the past and present) into this future.

Currently, people's thinking and speech are in large part ultimately motivated by the need to signal intelligence, loyalty, wealth, or other "positive" attributes, which help to increase one's social status and career prospects, and attract allies and mates, which are of course hugely important forms of resources, and some of the main objects of competition among humans.

Once we offload competition to AI assistants, what happens to this motivation behind discourse and deliberation, and how will that affect discourse and deliberation itself? Can you say more about what you envision happening in your scenario, in this respect?

Decoupling deliberation from competition

As another symptom of what's happening (the rest of this comment is in a "paste" that will expire in about a month, to reduce the risk of it being used against me in the future)

Some Thoughts on Metaphilosophy

having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

There has been some subsequent discussion (expressing concern/doubt) about this at https://www.lesswrong.com/posts/7jSvfeyh8ogu8GcE6/decoupling-deliberation-from-competition?commentId=bSNhJ89XFJxwBoe5e

Decoupling deliberation from competition

Here's an idea of how random drift of epistemic norms and practices can occur. Beliefs (including beliefs about normative epistemology) function in part as a signaling device, similar to clothes. (I forgot where I came across this idea originally, but a search produced a Robin Hanson article about it.) The social dynamics around this kind of signaling produce random drift in epistemic norms and practices, similar to random drift in fashion / clothing styles. Such drift coupled with certain kinds of competition could have produced the world we have today (i.e., certain groups happened upon especially effective norms/practices by chance and then spread their influence through competition), but may lead to disaster in the future in the absence of competition, as it's unclear what will then counteract future drift that will cause continued deterioration in epistemic conditions.

Another mechanism for random drift is technological change that disrupts previous epistemic norms/practices without anyone specifically intending to. I think we've seen this recently too, in the form of, e.g., cable news and social media. It seems like you're envisioning that future humans will deliberately isolate their deliberation from technological advances (until they're ready to incorporate those advances into how they deliberate), so in that scenario perhaps this form of drift will stop at some point, but (1) it's unclear how many people will actually decide to do that, and (2) even in that scenario there will still be a large amount of drift between the recent past (when epistemic conditions still seemed reasonably ok, although I had my doubts even back then) and the point at which they do so, which (together with other forms of drift) might never be recovered from.

Decoupling deliberation from competition

We’ve talked about this a few times but I still don’t really feel like there’s much empirical support for the kind of permanent backsliding you’re concerned about being widespread.

I'm not claiming direct empirical support for permanent backsliding. That seems hard to come by, given that we can't see into the far future. I am observing quite severe current backsliding. For example, explicit ad hominem attacks, as well as implicitly weighing people's ideas/arguments/evidence differently, based on things like the speaker's race and sex, have become the norm in local policy discussions around these parts. AFAICT, this originated from academia, under "standpoint epistemology" and related ideas.

On the other side of the political spectrum, several people close to me became very sure that "the election was stolen" due to things like hacked Dominion machines and that the military and/or Supreme Court was going to intervene in favor of Trump (to the extent that it was impossible for me to talk them out of these conclusions). One of them, who I had previously thought was smart/sane enough to entrust a great deal of my financial resources with, recently expressed concern for my life because I was going to get the COVID vaccine.

Is this an update for you, or have you already observed such things yourself or otherwise known how bad things have become?

There are some fuzzy borders here, and unclarity about how to define the concept, but maybe I’d guess 10% from “easy” failures to deliberate (say those that could be avoided by the wisest existing humans and which might be significantly addressed, perhaps cut in half, by competitive discipline) and a further 10% from “hard” failures (most of which I think would not be addressed by competition).

Given these numbers, it seems that you're pretty sure that almost everyone will eventually "snap out of" any bad ideas they get talked into, or they talk themselves into. Why? Is this based on some observations you've made that I haven't seen, or history that you know about that I don't? Or do you have some idea of a mechanism by which this "snapping out of" happens?

Decoupling deliberation from competition

I reasonably often find myself grateful that some dysfunctional norms or epistemic practices will most likely become obsolete. It’s a bit scary to think about a world where the only solution is waiting for someone to snap out of it.

I've been thinking a lot about this lately, so I'm glad to see that it's on your mind too, although I think I may still be a bit more concerned about it than you are. Couple of thoughts:

  1. What if our "deliberation" only made it as far as it did because of "competition", and nobody or very few people know how to deliberate correctly in the absence of competitive pressures? Basically, our current epistemic norms/practices came from the European Enlightenment, and they were spread largely via conquest or people adopting them to avoid being conquered or to compete in terms of living standards, etc. It seems that in the absence of strong competitive pressures of a certain kind, societies can quickly backslide or drift randomly in terms of epistemic norms/practices, and we don't know how to prevent this.

  2. What's your expectation of the fraction of total potential value that will be lost due to people failing to deliberate correctly (e.g., failing to ever "snap out of it", or getting "persuaded" by bad memes and then asking their AIs to lock in their beliefs/values)? It seems to me that it's very large, easily >50%. I'm curious how others would answer this question as well.

Alice and Bob can try to have an agreement to avoid racing ahead or engaging in some kinds of manipulation, and analogous a broader society could adopt such norms or divide into communities with internal agreements of this form.

In a sane civilization, tons of people would already be studying how to make and enforce such agreements, e.g., how to define what kinds of behaviors count as "manipulation", and more generally what are good epistemic norms/practices and how to ensure that many people adopt such norms/practices. If this problem is solved, then maybe we don't need to solve metaphilosophy (in the technical or algorithmic sense), as far as preventing astronomical waste arising from bad deliberation. Unfortunately it seems there are approximately zero people working on either problem.

Another (outer) alignment failure story

This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.

How will your AI compute "the extent to which M leads to deviation from idealized deliberation"? (I'm particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that's not the kind of approach you're imagining.)

If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.

The attack I have in mind is to imitate a normal human conversation about philosophy or about what's normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion. This may well involve convincing you of a false claim, but of a philosophical nature such that you and your AI can't detect the error (unless you've solved the problem of metaphilosophy and know what kinds of reasoning reliably lead to true and false conclusions about philosophical problems).

Another (outer) alignment failure story

Trying to imagine myself how an automated filter might work, here's a possible "solution" I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I've been compromised by an AI-powered attack and am now trying to attack you. (I'm talking in binary terms of "compromised" and "uncompromised" for simplicity but of course it will be more complicated than that in reality.)

Is this close to what you're thinking? (If not, apologies for going off on a tangent.) If so, given that I would "naturally" change my mind over time (i.e., based on my own thinking or talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in an unexpected (by the AI model) direction and wanted to talk to you about that, my own AI might say something like "If you say this to Paul, his AI will become more suspicious that you've been compromised by an AI-powered attack and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?" So at this point, collective human philosophical/moral progress would be driven more by what AI filters expect and let pass than by what physical human brains actually compute, so we'd better get those models really right. But that faces the seemingly difficult problems I mentioned at "Replicate the trajectory with ML?", and it doesn't seem like anyone is working on such problems.

If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can't distinguish whether they're genuinely trying to make better models with you, or just trying to change your models as part of an attack.
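
To make the filter idea in this comment concrete, here is a deliberately crude sketch in Python. Everything in it is an assumption made for illustration: the class name `ToySpeakerModel`, the choice of a smoothed word-frequency model as the "probability distribution of things the sender might naturally say", and the threshold value are all hypothetical, and nothing here is a description of how a real filter would or should work.

```python
import math
from collections import Counter


class ToySpeakerModel:
    """Hypothetical stand-in for a model of what an uncompromised sender says.

    Assumption: a unigram word-frequency model with add-one smoothing.
    A real system would need something far richer than this.
    """

    def __init__(self, past_messages):
        tokens = [t for m in past_messages for t in m.lower().split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        # One extra slot reserves probability mass for unseen words.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, message):
        """Average per-word log-probability of a message under the model."""
        tokens = message.lower().split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return log_prob / len(tokens)

    def is_suspicious(self, message, threshold):
        """Flag messages on the fringes of the modeled distribution."""
        return self.avg_log_prob(message) < threshold


if __name__ == "__main__":
    history = [
        "metaphilosophy seems hard",
        "philosophy seems neglected",
        "deliberation seems hard",
    ]
    model = ToySpeakerModel(history)
    # The threshold is an arbitrary illustrative value, not a recommendation.
    for text in ["philosophy seems hard", "join ideology x today"]:
        print(text, model.is_suspicious(text, threshold=-2.5))
```

Note that this toy model is frozen at construction time, so any genuine change of mind by the sender looks exactly like an attack. That is the difficulty raised above: a usable filter would also have to model how the sender would "naturally" change over time.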

Another (outer) alignment failure story

Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message in order to respond to appropriately or decide whether to read it.

This seems like the most important part so I'll just focus on this for now. I'm having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, "interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then start spreading X to his friends" (while implicitly minimizing the chances of these messages/interactions being flagged by your automated systems as adversarial). How would your automation distinguish between me doing this, versus me trying to have a normal human conversation with you about various topics, including what's moral/normative? Or if the automation isn't trying to directly make this judgment, what is it telling you to allow you to make this judgment? Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?

Another (outer) alignment failure story

(Apologies for the late reply. I've been generally distracted by trying to take advantage of perhaps fleeting opportunities in the equities markets, and occasionally by my own mistakes while trying to do that.)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)

How are people going to avoid contact with adversarial content, aside from "go into an info bubble with trusted AIs and humans and block off any communications from the outside"? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)

I think they do, but it’s not clear whether any of them change the main dynamic described in the post.

Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don't have the desired effect, like, what happened when the safety researchers tried this?

I’d like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that’s the only path that can really grow out of the current trajectory in a way that’s not super locally super objectionable to lots of people, and so I’m focusing on people’s attempts and failures to construct such an AI.

I can empathize with this motivation, but argue that "a kind of AI that will reach the right conclusions about everything" isn't necessarily incompatible with "humans retain enough control to do whatever they decide is right down the line" since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone.

I don’t know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here.

For example, all or most humans lose their abilities for doing philosophical reasoning that will eventually converge to philosophical truths, because they go crazy from AI-powered memetic warfare, or come under undue influence of AI advisors who lack such abilities themselves but are extremely convincing. Or humans lock in what they currently think are their values/philosophies in some form (e.g., as utility functions in AI, or asking their AIs to help protect the humans themselves from value drift while unable to effectively differentiate between "drift" and "philosophical progress") to try to protect them from a highly volatile and unpredictable world.
