Persuasion Tools: AI takeover without AGI or agency?

Daniel Kokotajlo

[epistemic status: speculation]

I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.

--Wei Dai

What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand, and focus on whether the person making the argument seems friendly, or unfriendly, using hard to fake group-affiliation signals?

--Benquo

1. AI-powered memetic warfare makes all humans effectively insane.

--Wei Dai, listing nonstandard AI doom scenarios

This post speculates about persuasion tools—how likely they are to get better in the future relative to countermeasures, what the effects of this might be, and what implications there are for what we should do now.

To avert eye-rolls, let me say up front that I don’t think the world is likely to be driven insane by AI-powered memetic warfare. I think progress in persuasion tools will probably be gradual and slow, and defenses will improve too, resulting in an overall shift in the balance that isn’t huge: a deterioration of collective epistemology, but not a massive one. However, (a) I haven’t yet ruled out more extreme scenarios, especially during a slow takeoff, and (b) even small, gradual deteriorations are important to know about. Such a deterioration would make it harder for society to notice and solve AI safety and governance problems, because society would be worse at noticing and solving problems in general. Such a deterioration could also be a risk factor for world war three, revolutions, sectarian conflict, terrorism, and the like. Moreover, such a deterioration could happen locally, in our community or in the communities we are trying to influence, and that would be almost as bad. Since the date of AI takeover is not the day the AI takes over, but the point at which it’s too late to reduce AI risk, these things basically shorten timelines.

Six examples of persuasion tools

Analyzers: Political campaigns and advertisers already use focus groups, A/B testing, demographic data analysis, etc. to craft and target their propaganda. Imagine a world where this sort of analysis gets better and better, and is used to guide the creation and dissemination of many more types of content.

Feeders: Most humans already get their news from various “feeds” of daily information, controlled by recommendation algorithms. Even worse, people’s ability to seek out new information and find answers to questions is also to some extent controlled by recommendation algorithms: Google Search, for example. There’s a lot of talk these days about fake news and conspiracy theories, but I’m pretty sure that selective/biased reporting is a much bigger problem.

Chatbot: Thanks to recent advancements in language modeling (e.g. GPT-3) chatbots might become actually good. It’s easy to imagine chatbots with millions of daily users continually optimized to maximize user engagement--see e.g. Xiaoice. The systems could then be retrained to persuade people of things, e.g. that certain conspiracy theories are false, that certain governments are good, that certain ideologies are true. Perhaps no one would do this, but I’m not optimistic.

Coach: A cross between a chatbot, a feeder, and an analyzer. It doesn’t talk to the target on its own, but you give it access to the conversation history and everything you know about the target and it coaches you on how to persuade them of whatever it is you want to persuade them of. [EDIT 5/21/2021: For a real-world example (and worrying precedent!) of this, see the NYT's getting-people-to-vaccinate persuasion tool, and this related research]

Drugs: There are rumors of drugs that make people more suggestible, like scopolamine. Even if these rumors are false, it’s not hard to imagine new drugs being invented that have a similar effect, at least to some extent. (Alcohol, for example, seems to lower inhibitions. Other drugs make people more creative, etc.) Perhaps these drugs by themselves would not be enough, but would work in combination with a Coach or Chatbot. (You meet the target for dinner, and slip some drug into their drink. It is mild enough that they don’t notice anything, but it primes them to be more susceptible to the ask you’ve been coached to make.)

Imperius Curse: These are a kind of adversarial example that gets the target to agree to an ask (or even switch sides in a conflict!), or adopt a belief (or even an entire ideology!). Presumably they wouldn’t work against humans, but they might work against AIs, especially if meme theory applies to AIs as it does to humans. The reason this would work better against AIs than against humans is that you can steal a copy of the AI and then use massive amounts of compute to experiment on it, finding exactly the sequence of inputs that maximizes the probability that it’ll do what you want.

We might get powerful persuasion tools prior to AGI

The first thing to point out is that many of these kinds of persuasion tools already exist in some form or another. And they’ve been getting better over the years, as technology advances. Defenses against them have been getting better too. It’s unclear whether the balance has shifted to favor these tools, or their defenses, over time. However, I think we have reason to think that the balance may shift heavily in favor of persuasion tools, prior to the advent of other kinds of transformative AI. The main reason is that progress in persuasion tools is connected to progress in Big Data and AI, and we are currently living through a period of rapid progress in those things, and probably progress will continue to be rapid (and possibly accelerate) prior to AGI.

However, here are some more specific reasons to think persuasion tools may become relatively more powerful:

Substantial prior: Shifts in the balance between things happen all the time. For example, the balance between weapons and armor has oscillated at least a few times over the centuries. Arguably persuasion tools got relatively more powerful with the invention of the printing press, and again with radio, and now again with the internet and Big Data. Some have suggested that the printing press helped cause religious wars in Europe, and that radio assisted the violent totalitarian ideologies of the early twentieth century.

Consistent with recent evidence: A shift in this direction is consistent with the societal changes we’ve seen in recent years. The internet has brought with it many inventions that improve collective epistemology, e.g. Google Search, Wikipedia, the ability of communities to create forums... Yet on balance it seems to me that collective epistemology has deteriorated in the last decade or so.

Lots of room for growth: I’d guess that there is lots of “room for growth” in persuasive ability. There are many kinds of persuasion strategy that are tricky to use successfully. Like a complex engine design compared to a simple one, these strategies might work well, but only if you have enough data and time to refine them and find the specific version that works at all, on your specific target. Humans never have that data and time, but AI+Big Data does, since it has access to millions of conversations with similar targets. Persuasion tools will be able to say things like "In 90% of cases where targets in this specific demographic are prompted to consider and then reject the simulation argument, and then challenged to justify their prejudice against machine consciousness, the target gets flustered and confused. Then, if we make empathetic noises and change the subject again, 50% of the time the subject subconsciously changes their mind so that when next week we present our argument for machine rights they go along with it, compared to 10% baseline probability."
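To make the arithmetic in that hypothetical tool output concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the two quoted rates chain (the 50% applies only to targets who got flustered) and that targets who do not get flustered stay at the 10% baseline. The function name and that fallback assumption are not from the post.

```python
def chained_success_rate(p_flustered, p_convert_if_flustered, p_baseline):
    """Overall chance the target goes along with the later argument.

    Assumption (hypothetical, not from the post): targets who do not
    get flustered simply convert at the baseline rate.
    """
    converted_via_script = p_flustered * p_convert_if_flustered
    converted_otherwise = (1.0 - p_flustered) * p_baseline
    return converted_via_script + converted_otherwise


# Numbers taken from the hypothetical quote: 90%, 50%, 10% baseline.
overall = chained_success_rate(0.90, 0.50, 0.10)
uplift = overall / 0.10
print(f"overall success: {overall:.2f}, uplift over baseline: {uplift:.1f}x")
```

Under these assumptions the scripted sequence succeeds a bit under half the time, several times the baseline, which is the kind of gap that only shows up once there is enough data to find the specific sequence that works.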

Plausibly pre-AGI: Persuasion is not an AGI-complete problem. Most of the types of persuasion tools mentioned above already exist, in weak form, and there’s no reason to think they can’t gradually get better well before AGI. So even if they won't improve much in the near future, plausibly they'll improve a lot by the time things get really intense.

Language modelling progress: Persuasion tools seem to be especially benefitted by progress in language modelling, and language modelling seems to be making even more progress than the rest of AI these days.

More things can be measured: Thanks to said progress, we now have the ability to cheaply measure nuanced things like user ideology, enabling us to train systems towards those objectives.

Chatbots & Coaches: Thanks to said progress, we might see some halfway-decent chatbots prior to AGI. Thus an entire category of persuasion tool that hasn’t existed before might come to exist in the future. Chatbots too stupid to make good conversation partners might still make good coaches, by helping the user predict the target’s reactions and suggesting possible things to say.

Minor improvements still important: Persuasion doesn’t have to be perfect to radically change the world. An analyzer that helps your memes have a 10% higher replication rate is a big deal; a coach that makes your asks 30% more likely to succeed is a big deal.
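One way to see why a 10% edge in replication rate is a big deal is a toy compounding model. The sketch below assumes, hypothetically, that a meme's audience multiplies by a fixed factor each sharing generation and that the analyzer raises that factor by 10%. The function name and the generation counts are illustrative, not estimates.

```python
def relative_reach(advantage, generations):
    """Ratio of boosted reach to baseline reach after some generations.

    With a constant per-generation growth factor g, baseline reach is
    g**n and boosted reach is (g * (1 + advantage))**n, so the ratio is
    (1 + advantage)**n regardless of g.
    """
    return (1.0 + advantage) ** generations


# Illustrative generation counts (assumed, not from the post).
for n in (1, 10, 20, 50):
    ratio = relative_reach(0.10, n)
    print(f"after {n:2d} generations: {ratio:.1f}x the baseline reach")
```

In this toy model the advantage compounds: a meme with a 10% edge reaches more than six times the baseline audience after 20 generations and more than a hundred times after 50.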

Faster feedback: One way defenses against persuasion tools have strengthened is that people have grown wise to them. However, the sorts of persuasion tools I’m talking about seem to have significantly faster feedback loops than the propagandists of old; they can learn constantly, from the entire population, whereas past propagandists (if they were learning at all, as opposed to evolving) relied on noisier, more delayed signals.

Overhang: Finding persuasion drugs is costly, immoral, and not guaranteed to succeed. Perhaps this explains why it hasn’t been attempted outside a few cases like MKULTRA. But as technology advances, the cost goes down and the probability of success goes up, making it more likely that someone will attempt it, and giving them an “overhang” with which to achieve rapid progress if they do. (I hear that there are now multiple startups built around using AI for drug discovery, by the way.) A similar argument might hold for persuasion tools more generally: We might be in a “persuasion tool overhang” in which they have not been developed for ethical and riskiness reasons, but at some point the price and riskiness drops low enough that someone does it, and then that triggers a cascade of more and richer people building better and better versions.

Speculation about effects of powerful persuasion tools

Here are some hasty speculations, beginning with the most important one:

Ideologies & the biosphere analogy:

The world is, and has been for centuries, a memetic warzone. The main factions in the war are ideologies, broadly construed. It seems likely to me that some of these ideologies will use persuasion tools--both on their hosts, to fortify them against rival ideologies, and on others, to spread the ideology.

Consider the memetic ecosystem--all the memes replicating and evolving across the planet. Like the biological ecosystem, some memes are adapted to, and confined to, particular niches, while other memes are widespread. Some memes are in the process of gradually going extinct, while others are expanding their territory. Many exist in some sort of equilibrium, at least for now, until the climate changes. What will be the effect of persuasion tools on the memetic ecosystem?

For ideologies at least, the effects seem straightforward: The ideologies will become stronger, harder to eradicate from hosts and better at spreading to new hosts. If all ideologies got access to equally powerful persuasion tools, perhaps the overall balance of power across the ecosystem would not change, but realistically the tools will be unevenly distributed. The likely result is a rapid transition to a world with fewer, more powerful ideologies. They might be more internally unified, as well, having fewer spin-offs and schisms due to the centralized control and standardization imposed by the persuasion tools. An additional force pushing in this direction is that ideologies that are bigger are likely to have more money and data with which to make better persuasion tools, and the tools themselves will get better the more they are used.

Recall the quotes I led with:

... At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.

--Wei Dai

What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand … ?

--Benquo

1. AI-powered memetic warfare makes all humans effectively insane.

--Wei Dai, listing nonstandard AI doom scenarios

I think the case can be made that we already live in this world to some extent, and have for millennia. But if persuasion tools get better relative to countermeasures, the world will be more like this.

This seems to me to be an existential risk factor. It’s also a risk factor for lots of other things, for that matter. Ideological strife can get pretty nasty (e.g. religious wars, gulags, genocides, totalitarianism), and even when it doesn’t, it still often gums things up (e.g. suppression of science, zero-sum mentality preventing win-win-solutions, virtue signalling death spirals, refusal to compromise). This is bad enough already, but it’s doubly bad when it comes at a moment in history where big new collective action problems need to be recognized and solved.

Obvious uses: Advertising, scams, propaganda by authoritarian regimes, etc. will improve. This means more money and power to those who control the persuasion tools. Maybe another important implication would be that democracies would have a major disadvantage on the world stage compared to totalitarian autocracies. One of many reasons for this is that scissor statements and other divisiveness-sowing tactics may not technically count as persuasion tools but they would probably get more powerful in tandem.

Will the truth rise to the top: Optimistically, one might hope that widespread use of more powerful persuasion tools will be a good thing, because it might create an environment in which the truth “rises to the top” more easily. For example, if every side of a debate has access to powerful argument-making software, maybe the side that wins is more likely to be the side that’s actually correct. I think this is a possibility but I do not think it is probable. After all, it doesn’t seem to be what’s happened in the last two decades or so of widespread internet use, big data, AI, etc. Perhaps, however, we can make it true for some domains at least, by setting the rules of the debate.

Data hoarding: A community’s data (chat logs, email threads, demographics, etc.) may become even more valuable. It can be used by the community to optimize their inward-targeted persuasion, improving group loyalty and cohesion. It can be used against the community if someone else gets access to it. This goes for individuals as well as communities.

Chatbot social hacking viruses: Social hacking is surprisingly effective. The classic example is calling someone pretending to be someone else and getting them to do something or reveal sensitive information. Phishing is like this, only much cheaper (because automated) and much less effective. I can imagine a virus that is close to as good as a real human at social hacking while being much cheaper and able to scale rapidly and indefinitely as it acquires more compute and data. In fact, a virus like this could be made with GPT-3 right now, using prompt programming and “mothership” servers to run the model. (The prompts would evolve to match the local environment being hacked.) Whether GPT-3 is smart enough for it to be effective remains to be seen.

Implications

I doubt that persuasion tools will improve discontinuously, and I doubt that they’ll improve massively. But minor and gradual improvements matter too.

Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now...

--Me, from a previous post

I think this is potentially (5% credence) the new Cause X, more important than (traditional) AI alignment even. It probably isn’t. But I think someone should look into it at least, more thoroughly than I have.

To be clear, I don’t think it’s likely that we can do much to prevent this stuff from happening. There are already lots of people raising the alarm about filter bubbles, recommendation algorithms, etc. so maybe it’s not super neglected and maybe our influence over it is small. However, at the very least, it’s important for us to know how likely it is to happen, and when, because it helps us prepare. For example, if we think that collective epistemology will have deteriorated significantly by the time crazy AI stuff starts happening, that influences what sorts of AI policy strategies we pursue.

Note that if you disagree with me about the extreme importance of AI alignment, or if you think AI timelines are longer than mine, or if you think fast takeoff is less likely than I do, you should all else equal be more enthusiastic about investigating persuasion tools than I am.

Thanks to Katja Grace, Emery Cooper, Richard Ngo, and Ben Goldhaber for feedback on a draft. This research was conducted at the Center on Long-Term Risk and the Polaris Research Institute.

Related previous work:

Epistemic Security report

Aligning Recommender Systems

Stuff I’d read if I was investigating this in more depth:

Not Born Yesterday

The stuff here and here

EDIT: This ultrashort sci-fi story by Jack Clark illustrates some of the ideas in this post:

The Narrative Control Department
[A beautiful house in South West London, 2030]

“General, we’re seeing an uptick in memes that contradict our official messaging around Rule 470.”
“What do you suggest we do?”
“Start a conflict. At least three sides. Make sure no one side wins.”
“At once, General.”

And with that, the machines spun up – literally. They turned on new computers and their fans revved up. People with tattoos of skeletons at keyboards high-fived each other. The servers warmed up and started to churn out their fake text messages and synthetic memes, to be handed off to the ‘insertion team’ who would pass the data into a few thousand sock puppet accounts, which would start the fight.

Hours later, the General asked for a report.
“We’ve detected a meaningful rise in inter-faction conflict and we’ve successfully moved the discussion from Rule 470 to a parallel argument about the larger rulemaking process.”
“Excellent. And what about our rivals?”
“We’ve detected a few Russian and Chinese account networks, but they’re staying quiet for now. If they’re mentioning anything at all, it’s in line with our narrative. They’re saving the IDs for another day, I think.”

That night, the General got home around 8pm, and at the dinner table his teenage girls talked about their day.
“Do you know how these laws get made?” the older teenager said. “It’s crazy. I was reading about it online after the 470 blowup. I just don’t know if I trust it.”
“Trust the laws that gave Dad his job? I don’t think so!” said the other teenager.
They laughed, as did the General’s wife. The General stared at the peas on his plate and stuck his fork into the middle of them, scattering so many little green spheres around his plate.

EDIT: Finally, if you haven't yet, you should read this report of a replication of the AI Box Experiment.

I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordination harder. (Analogy to how vaccine and mask hesitancy during Covid was partly due to insufficient trust in public health advice.) Or more speculatively I could also imagine an extreme version of sticky, splintered epistemic bubbles leading to moral stagnation/value lock-in.

Minor question on framing: I'm wondering why you chose to call this post "AI takeover without AGI or agency?" given that the effects of powerful persuasion tools you talk about aren't what (I normally think of as) "AI takeover"? (Rather, if I've understood correctly, they are "persuasion tools as existential risk factor", or "persuasion tools as mechanism for power concentration among humans".)

Somewhat related: I think there could be a case made for takeover by goal-directed but narrow AI, though I haven't really seen it made. But I can't see a case for takeover by non-goal-directed AI, since why would AI systems without goals want to take over? I'd be interested if you have any thoughts on those two things.

Thanks! The post was successful then. Your point about stickiness is a good one; perhaps I was wrong to emphasize the change in number of ideologies.

The "AI takeover without AGI or agency" bit was a mistake in retrospect. I don't remember why I wrote it, but I think it was a reference to this post which argues that what we really care about is AI-PONR, and AI takeover is just a prominent special case. It also might have been due to the fact that a world in which an ideology uses AI tools to cement itself and take over the world, can be thought of as a case of AI takeover, since we have AIs bossing everyone around and getting them to do bad things that ultimately lead to x-risk. It's just a weird case in which the AIs aren't agents or general intelligences. :)

I also don't really see the situation as about AI at all. It's a structural advantage for certain kinds of values that tend to win out in memetic competition / tend to be easiest to persuade people to adopt / etc. Let's call such values themselves "attractive."

The most attractive values given a new technological/social situation are likely to be similar to those given the immediately preceding situation, so I'd generally expect the most attractive values to generally be endemic anyway or close enough to endemic values that they don't look like they are coming out of left field.

And of course for any given zero-sum conflict and any given human, one of the participants in that conflict would prefer to push the human towards more attractive values, so they would be introduced even if not initially endemic.

I don't think you can get paperclips this way, because people trying to get humans to maximize paperclips would be at a big disadvantage in memetic competition compared with the most attractive values (or even compared to more normal human values, which are presumably more attractive than random stuff).

Then the usual hope is that we are happy with attractive values, e.g. because deliberation and intentional behavior by humans makes "smarter" forms of current values more attractive relative to random bad stuff. And your concern is basically that under distributional shift, why should we think that?

Or perhaps more clearly: if which values are "most attractive" depends on features of the technological landscape, then it's hard to see why we should be happy just to "take the hand we're dealt" and be happy with the values that are most attractive on some default technological trajectory. Instead, we would end up with preferences over the technological trajectory.

This is not really distinctive to persuasion, it applies just as well to any changes in the environment that would change the process of deliberation/discussion. The hypothesis seems to be that "how good humans are at persuasion" is just a particularly important/significant kind of shift.

But it seems like what really matters is some ratio between how good you are at persuasion and how good you are at other skills that shape the future (or else perhaps you should be much more concerned about other increases in human capability, like education, that make us better at arguing). And in this sense it's less clear whether AI is better or worse than the status quo. I guess the main thing is that it's a candidate for a sharp distributional change and so that's the kind of thing that you would want to be unusually cautious about.

I mostly think the most robust thing is that it's reasonable to be very interested in the trajectory of values, to think about how much you like the process of deliberation and discourse and selection and so on that shapes those values, and to think of changes as potentially irreversible (since future people would have no interest in reversing them).

The usual response to this argument is that perhaps future values are basically unrelated to present values anyway (since they will also converge to whatever values are most attractive given future technological situations). But this seems relatively unpersuasive because eventually you might expect to have many agents who try to deliberately make the future good rather than letting what happens, happen, and that this could eventually drive the rate of drift to 0. This seems fairly likely to happen eventually, but you might think that it will take long enough that existing value changes will still wash out.

Then we end up with a complicated set of moral / decision-theoretic questions about which values we are happy enough with. It's not really clear to me how you should feel about variation across humans, or across cultures, or for humans in new technological situations, or for a particular kind of deep RL, or what. It seems quite clear that we should care some, and I think given realistic treatments of moral uncertainty you should not care too much more about preventing drift than about preventing extinction given drift (e.g. 10x seems very hard to justify to me). But it generally seems like one of the more pressing questions in moral philosophy, and even if you care equally about those two things (suggesting that you'd value some drifted future population's values 50% as much as some kind of hypothetical ideal realization) you could still get much more traction by trying to prevent forms of drift that we don't endorse.
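The link between "caring equally" and the 50% figure can be made explicit with a toy model. The sketch below assumes (this framing is an illustration, not something stated in the comment) that an ideal future is worth 1, extinction is worth 0, and a drifted future is worth v. Preventing drift then gains 1 - v, preventing extinction given drift gains v, and the ratio of the two fixes v. The function name is hypothetical.

```python
def implied_drifted_value(care_ratio):
    """Value v of a drifted future implied by a given care ratio.

    Toy model (an assumption): ideal future = 1, extinction = 0,
    drifted future = v. Preventing drift gains 1 - v; preventing
    extinction given drift gains v. Setting
    care_ratio = (1 - v) / v gives v = 1 / (1 + care_ratio).
    """
    if care_ratio < 0:
        raise ValueError("care_ratio must be non-negative")
    return 1.0 / (1.0 + care_ratio)


# A ratio of 1 means caring equally; 10 is the "10x" case from the comment.
for ratio in (1, 10):
    v = implied_drifted_value(ratio)
    print(f"care ratio {ratio}:1 implies a drifted future worth {v:.1%} of the ideal")
```

In this model, caring equally corresponds to valuing the drifted future at 50% of the ideal, while a 10x ratio corresponds to valuing it at roughly 9%.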

given realistic treatments of moral uncertainty you should not care too much more about preventing drift than about preventing extinction given drift (e.g. 10x seems very hard to justify to me).

I think you already believe this, but just to clarify: this "extinction" is about the extinction of Earth-originating intelligence, not about humans in particular. So AI alignment is an intervention to prevent drift, not an intervention to prevent extinction. (Though of course, we could care differently about persuasion-tool-induced drift vs unaligned-AI-induced drift.)

Thanks for this! Re: it's not really about AI, it's about memetics & ideologies: Yep, totally agree. (The OP puts the emphasis on the memetic ecosystem & thinks of persuasion tools as a change in the fitness landscape. Also, I wrote this story a while back.) What follows is a point-by-point response:

The most attractive values given a new technological/social situation are likely to be similar to those given the immediately preceding situation, so I'd generally expect the most attractive values to generally be endemic anyway or close enough to endemic values that they don't look like they are coming out of left field.

Maybe? I am not sure memetic evolution works this fast though. Think about how biological evolution doesn't adapt immediately to changes in environment, it takes thousands of years at least, arguably millions depending on what counts as "fully adapted" to the new environment. Replication times for memes are orders of magnitude faster, but that just means it should take a few orders of magnitude less time... and during e.g. a slow takeoff scenario there might just not be that much time. (Disclaimer: I'm ignorant of the math behind this sort of thing). Basically, as tech and economic progress speeds up but memetic evolution stays constant, we should expect there to be some point where the former outstrips the latter and the environment is changing faster than the attractive-memes-for-the-environment can appear and become endemic. Now of course memetic evolution is speeding up too, but the point is that until further argument I'm not 100% convinced that we aren't already out-of-equilibrium.

And of course for any given zero-sum conflict and any given human, one of the participants in that conflict would prefer to push the human towards more attractive values, so they would be introduced even if not initially endemic.

Not sure this argument works. First of all, very few conflicts are actually zero sum. Usually there are some world-states that are worse by both players' lights than some other world-states. Humans being in the most attractive memetic state may be like this.

I don't think you can get paperclips this way, because people trying to get humans to maximize paperclips would be at a big disadvantage in memetic competition compared with the most attractive values (or even compared to more normal human values, which are presumably more attractive than random stuff).

Agreed.

Then the usual hope is that we are happy with attractive values, e.g. because deliberation and intentional behavior by humans makes "smarter" forms of current values more attractive relative to random bad stuff. And your concern is basically that under distributional shift, why should we think that?

Agreed. I would add that even without distributional shift it is unclear why we should expect attractive values to be good. (Maybe the idea is that good = current values because moral antirealism, and current values are the attractive ones for the current environment via the argument above? I guess I'd want that argument spelled out more and the premises argued for.)

Or perhaps more clearly: if which values are "most attractive" depends on features of the technological landscape, then it's hard to see why we should be happy just to "take the hand we're dealt" and be happy with the values that are most attractive on some default technological trajectory. Instead, we would end up with preferences over the technological trajectory.

Yes.

This is not really distinctive to persuasion, it applies just as well to any changes in the environment that would change the process of deliberation/discussion. The hypothesis seems to be that "how good humans are at persuasion" is just a particularly important/significant kind of shift.

Yes? I think it's particularly important for reasons discussed in the "speculation" section, and because it seems to be in our immediate future and indeed our present. Basically, persuasion tools make ideologies (:= a particular kind of memeplex) stronger and stickier, and they change the landscape so that the ideologies that control the tech platforms have a significant advantage.

But it seems like what really matters is some ratio between how good you are at persuasion and how good you are at other skills that shape the future (or else perhaps you should be much more concerned about other increases in human capability, like education, that make us better at arguing). And in this sense it's less clear whether AI is better or worse than the status quo. I guess the main thing is that it's a candidate for a sharp distributional change and so that's the kind of thing that you would want to be unusually cautious about.

Has education increased much recently? Not in a way that's made us significantly more rational as a group, as far as I can tell. Changes in the US education system over the last 20 years presumably made some difference, but they haven't exactly put us on a bright path towards rational discussion of important issues. My guess is that the effect size is swamped by larger effects from the Internet.

I mostly think the most robust thing is that it's reasonable to be very interested in the trajectory of values, to think about how much you like the process of deliberation and discourse and selection and so on that shapes those values, and to think of changes as potentially irreversible (since future people would have no interest in reversing them).

The usual response to this argument is that perhaps future values are basically unrelated to present values anyway (since they will also converge to whatever values are most attractive given future technological situations). But this seems relatively unpersuasive because eventually you might expect to have many agents who try to deliberately make the future good rather than letting what happens, happen, and that this could eventually drive the rate of drift to 0. This seems fairly likely to happen eventually, but you might think that it will take long enough that existing value changes will still wash out.

Then we end up with a complicated set of moral / decision-theoretic questions about which values we are happy enough with. It's not really clear to me how you should feel about variation across humans, or across cultures, or for humans in new technological situations, or for a particular kind of deep RL, or what. It seems quite clear that we should care some, and I think given realistic treatments of moral uncertainty you should not care too much more about preventing drift than about preventing extinction given drift (e.g. 10x seems very hard to justify to me). But it generally seems like one of the more pressing questions in moral philosophy, and even if you care equally about those two things (suggesting that you'd value some drifted future population's values 50% as much as some kind of hypothetical ideal realization) you could still get much more traction by trying to prevent forms of drift that we don't endorse.

I agree that way of thinking about it seems useful and worthwhile. Are you also implying that thinking specifically about the effects of persuasion tools is not so useful or worthwhile?

I should say btw that you've been talking about values but I meant to talk about beliefs as well as values. Memes, in general. Beliefs can get feedback from reality more easily and thus hopefully the attractive beliefs are more likely to be good than the attractive values. But even so, there is room to wonder whether the attractive beliefs for a given environment will all be true... so far, for example, plenty of false beliefs seem to be pretty attractive...

To elaborate on this idea a bit more:

If a very persuasive agent AGI were to take over the world by persuading humans to do its bidding (e.g. maximize paperclips), this would count as an AI takeover scenario. The boots on the ground, the "muscle," would be human. And the brains behind the steering wheels and control panels would be human. And even the brains behind the tech R&D, the financial management, etc. -- even they would be human! The world would look very human and it would look like it was just one group of humans conquering the others. Yet it would still be fair to say it was an AI takeover... because the humans are ultimately controlled by, and doing the bidding of, the AGI.

OK, now what if it isn't an agent AGI at all? What if it's just a persuasion tool, and the humans (stupidly) used it on themselves? For example, as a joke they program the tool to persuade people to maximize paperclips, they test it on themselves, it works surprisingly well, and in a temporary fit of paperclip-maximization the humans decide to constantly use the tool on themselves & upgrade it, thus avoiding "value drift" away from paperclip-maximization... Then we have a scenario that looks very similar to the first scenario, with a growing group of paperclip-maximizing humans conquering the rest of the world, all under the control of an AI, except that whereas in the first scenario the muscle, steering, and R&D were done by humans rather than AI, in this scenario the "agenty bits" such as planning and strategic understanding are also done by humans! It still counts as an AI takeover, I say, because an AI is making a group of humans conquer the world and reshape it according to inhuman values.

Of course the second scenario is super unrealistic -- humans won't be so stupid as to use their persuasion tools on themselves, right? Well... they probably won't try to persuade themselves to maximize paperclips, and if they did it probably wouldn't work because persuasion tools won't be that effective (at least at first.) But some (many?) humans probably WILL use their persuasion tools on themselves, to persuade themselves to be truer, more faithful, more radical believers in whatever ideology they already subscribe to. Persuasion tools don't have to be that powerful to have an effect here; even a single-digit-percentage-point effect size on various metrics would have a big impact, I think, on society.

Persuasion tools will take as input a payload -- some worldview, some set of statements, some set of goals/values -- and then work to create an expanding faction of people who are dogmatically committed to that payload. (The people who are using said tools with said input on themselves.)

I think it's an understatement to say that the vast majority of people who use persuasion tools on themselves in this manner will be imbibing payloads that aren't 100% true and good. Mistakes happen: in the past, even the great philosophers were wrong about some things, and surely we are all wrong about some things today, even some things we feel very confident are true/good. I'd bet that it's not merely the vast majority, but literally everyone!

So this situation seems both realistic to me (unfortunately) and also fairly described as a case of AI takeover (though certainly a non-central case). And I don't care about the terminology we use here; I just think it's amusing.

You mention "defenses will improve" a few times. Can you go into more detail about this? What kind of defenses do you have in mind? I keep thinking that in the long run, the only defenses are either to solve meta-philosophy so our AIs can distinguish between correct arguments and merely persuasive ones and filter out the latter for us (and for themselves), or go into an info bubble with trusted AIs and humans and block off any communications from the outside. But maybe I'm not being imaginative enough.

I think I mostly agree with you about the long run, but I think we have more short-term hurdles that we need to overcome before we even make it to that point, probably. I will say that I'm optimistic that we haven't yet thought of all the ways advances in tech will help collective epistemology rather than hinder it. I notice you didn't mention debate; I am not confident debate will work but it seems like maybe it will.

In the short run, well, there's also debate I guess. And the fact that conversations on the internet are recorded by default and easily findable by everyone has probably worked in favor of collective epistemology. Plus there is Wikipedia, etc. I think the internet in general has lots of things in it that help collective epistemology... it just also has things that hurt, and recently I think the balance is shifting in a negative direction. But I'm optimistic that maybe the balance will shift back. Maybe.

This fits with discussions I've been having with researchers about recommender systems like YouTube's, and all sorts of risks related to them. I'm glad this post tries to push the discussion around the subject!

Thanks! The post was successful then. Your point about stickiness is a good one; perhaps I was wrong to emphasize the change in number of ideologies.

The most attractive values given a new technological/social situation are likely to be similar to those given the immediately preceding situation, so I'd generally expect the most attractive values to be endemic anyway, or close enough to endemic values that they don't look like they are coming out of left field.

And of course for any given zero-sum conflict and any given human, one of the participants in that conflict would prefer to push the human towards more attractive values, so they would be introduced even if not initially endemic.

I don't think you can get paperclips this way, because people trying to get humans to maximize paperclips would be at a big disadvantage in memetic competition compared with the most attractive values (or even compared to more normal human values, which are presumably more attractive than random stuff).

Agreed.

