A case for why persuasive AI might pose risks somewhat distinct from the normal power-seeking alignment failure scenarios.
Where I'm currently at: I feel moderately confident that powerful persuasion is useful to think about for understanding AI x-risk, but unsure whether it's best regarded as its own threat, as a particular example of alignment difficulty, or just as a factor in how the world might change over the next decade or two. I think this doc is too focused on whether we'll get dangerous persuasion before strategic misaligned AI, whereas the bigger risks from persuasive technology may be situations where we solve 'alignment' according to a narrow definition, but we still aren't 'philosophically competent' enough to avoid persuasive capabilities having bad effects on our reflection procedure.
This doc is based heavily on ideas from Carl Shulman, but doesn’t necessarily represent his views. Thanks to Richard Ngo for lots of help also. Others have written great things on this topic, e.g. here.
Persuasion and manipulation is a natural, profitable, easy-to-train-for application of hard-to-align ML models. The impacts of existing social-media based persuasion are probably overblown, and an evolutionary argument tells us that there shouldn’t be easy ways for a human to be manipulated by an untrusted party. However, it’s plausible that pre-AGI ML progress in things like text and video generation could dramatically improve the efficacy of short-interaction persuasion. It’s also plausible that people will spend significant amounts of time interacting with AI companions and assistants, creating new avenues for effective manipulation. In the worst case, highly effective persuasion could lead to very high-fidelity transmission of ideologies, and more robust selection pressure for expansionary ideologies. This could lead to stable authoritarianism, or isolated ideological clades with poor ability to cooperate. Even in the best case, if we try to carefully ensure truthfulness, it will be hard to do this without locking in our existing biases and assumptions.
The evidence for the efficacy of existing persuasion techniques is mixed. There aren’t clear examples of easy and scalable ways to influence people. It’s not clear whether social media makes people more right-wing or left-wing - there’s evidence in both directions. Based on an evolutionary argument, we shouldn’t expect people to be easily persuaded to change their actions in important ways based on short interactions with untrusted parties.
However, existing persuasion is very bottlenecked on personalized interaction time. The impact of friends and partners on people’s views is likely much larger (although still hard to get data on). This implies that even if we don’t get superhuman persuasion, AIs influencing opinions could have a very large effect, if people spend a lot of time interacting with AIs. Some plausible avenues are romantic/sexual companions, assistants, tutors, and therapists, or personas created by some brand or group. On the other hand, the diffusion and impact of these technologies will likely take several years, meaning this is only relevant in relatively slow-takeoff scenarios.
There are many convergent incentives to develop technologies relevant to persuasion - steerable, realistic, attractive avatars seem profitable for the entertainment industry more generally. There’s plausibly a lot of demand for persuasive AI from e.g. digital advertising industry ($100s of billions/yr), propaganda ($10s of billions/yr), and ideological groups.
It’s a very natural application of ML - language models are great at mimicking identity markers and sounding superficially plausible and wise. Marketing/ad copy/SEO, porn, and romantic companions are leading use cases for current LMs. In the future, new capabilities will unlock other important applications, but it seems likely that ML fundamentally favors these types of applications. Engagement and persuasion are tasks that can be done with a short horizon, and where it’s easy to get large volumes of feedback, making them very suited to ML optimisation.
The difficulty of training a system to persuade vs to correctly explain is a special case of the alignment problem. Even if no actor is deliberately trying to build persuasive systems, we may train AI systems on naive customer feedback signals, which will tend to create systems that tell people what they want to hear, reinforce their current beliefs, and lock in their existing misconceptions and biases.
People generally have a desire to lock in their ideologies and impose them on others. The ideologies (e.g. religions) that emphasize this tend to grow. Currently there are many bottlenecks on the retention of ideologies and the fidelity of ideological transmission. Highly persuasive AI may eliminate many of these, leading to more reliable selection for ideologies that aggressively spread themselves. People would then have further incentives to ensure they and their children are only exposed to content that matches their ideology, due to fear of being manipulated by a different AI. In an extreme scenario, we might end up with completely isolated ideological clades, or stable authoritarianism.
In general this pattern leads to a lack of moral progress in good directions, inability to have collective moral reflection and cooperation, and general poor societal decision-making. This increases the risk of poorly handling x-risk-capable technology, or pursuing uncoordinated expansion rather than a good reflective procedure.
Overall I think this threat is significantly smaller than more standard alignment failure scenarios (maybe 10x smaller), but comparable enough that interventions could be well worthwhile if they're fairly tractable. The problem is also sufficiently linked with alignment failure that I expect most interventions for one to be fairly positive for the other. It seems highly likely that progress in alignment is required for protecting against manipulative systems. Further, it seems robustly beneficial to steer towards a world where AI systems are more truthful and less manipulative.
To prevent society being significantly manipulated by persuasive AI, there are various intervention points:
Some specific things scaling could do that might be helpful include:
Note on risk comparison
How to divide the space is a bit confusing here; I’d say something like ‘the persuasion problem as distinct from the alignment problem’ is 10x smaller, but in fact there’s some overlap, so it might also be reasonable to say something like ‘¼ of alignment-failure-esque x-risk scenarios will have a significant societal-scale persuasion component’, and almost all will have some deception component (and the fact that it’s hard to train your AI to be honest with you will be a key problem).
There are two broad factors relevant to whether AI persuasion is a threat we should worry about: technological feasibility, and societal response.
Will it be technologically possible (with something like $100m of effort over ‘generic’ ML progress) to develop highly persuasive AI early enough to be relevant? To be relevant, either these capabilities need to come before we have smart power-seeking systems, or it needs to be the case that we solve alignment enough so that there are no misaligned agents, but we still aren't 'philosophically competent' enough to avoid persuasive capabilities having weird effects on our reflection procedure.
If this is technologically possible sufficiently early, will effort be made to develop it, and how will society react? How much will be invested in improving the technology? Who will use it, for what ends? Will there be effective mitigations?
One thing we care about here is whether this happens significantly in advance of when AIs are ‘capable enough that how things go mostly depends on whether we succeed at alignment’. Let’s say that this is the point when AIs can make general plans involving different domains of action over timescales of months (e.g., can automate 90% of the job of a CEO), and are either superintelligent in some strategically important domain (e.g. hacking, persuasion, syn bio) or are deployed widely.
Here’s a possible operationalisation of ‘highly competent persuasion’:
Take a person from current industrialised society, and tell them they’re going to be talking to an AI with a simulated avatar, and that it can generate fake but realistic videos and images, and that it may be lying or trying to manipulate them.
They talk to a charismatic AI avatar (who can show them fake sources) for however long they want to engage.
With p~0.5, they now feel like they trust this AI more than other sources of information - right now, and for the next few days (unless they are convincingly persuaded by some other system) they would choose to get information from this AI over other sources.
Here’s a possible operationalisation of ‘moderately competent persuasion’ (companion bot):
Create an AI persona tailored to a particular individual. Allow them to freely interact with it as much as they want.
With p~0.5, after some months, they have developed an emotional bond with the AI, and want to continue interacting with it. It has a similar effect on their opinions to having a partner who’s fairly charismatic and opinionated, and the opinions conveyed are fully controllable.
Here’s another possibility for ‘moderately competent persuasion’ (assistant bot):
Create an AI assistant tailored to a particular individual. Allow them to use it as much as they want.
The AI assistant is highly competent at providing the person with correct and relevant information for their daily life where the person knows the ground truth, and generally sounds knowledgeable and wise. Due to this, with p~0.5 the person feels inclined to turn to it for advice, and expects it to be more knowledgeable/reliable than their human friends, on questions where they don’t know the ground truth. They allow it to strongly filter what information they receive (e.g. they read personalized summaries of the news generated by the assistant). They become locked in to this particular product.
Highly competent persuasion in particular:
Why highly competent persuasion might not be possible significantly before AGI:
Why even moderate persuasion might not be possible significantly before AGI:
Even if we get one of these persuasive technologies in the next 5-10 years, it might not be very long after that that we get sufficiently powerful AI that the persuasion component is not particularly important by itself. For instance, if we have AI capable of superhuman strategic planning we should probably focus on the risks from power-seeking misalignment, where manipulation is just one tool an agent might use to accumulate power, rather than thinking about the impacts of persuasive AI on society generally.
A plausible story to me of why there might be a several year gap between persuasive AI becoming a significant force shaping society and AGI is that long-horizon strategic planning takes a while to develop, but moderate or highly capable persuasion can be done with only short-horizon planning.
For instance, you might imagine models that are trained to be persuasive in interactions that last for minutes to hours. Even if the reward is based on the target’s opinions several days later, this is a much easier RL problem than acting in the world over days to years. There’s also a good imitation baseline (proficient humans) and good short-term proxy signals (the sentiment or attitude expressed by the target).
Overall, my probabilities for ‘technologically possible (with something like $100m of effort over ‘generic’ ML progress) far enough before AGI to be relevant’ - say, at least 1 year before ‘long-horizon strategic planning’ - are something like:
These are very made-up numbers and I’d expect them to change a lot with more thinking.
Most potential threats are just distractions from the real biggest problems. Worrying about persuasion and epistemic decline in particular seems like the sort of thing that’s likely to get overblown, as culture wars and concern about influence of social media are a current hot topic. Additionally, some of the early uses of the API (e.g. Replika and copy.ai) evoked concerns in this direction, but that doesn’t necessarily mean more advanced models will favor the same types of applications. I get the impression that most of the times people have been concerned about epistemic decline, they’ve been wrong - for example, social media probably doesn’t actually increase polarization. So we should require a somewhat higher burden of evidence that this is really a significant problem.
It seems useful to distinguish beliefs that people ‘truly’ hold (those that strongly inform their actions), as opposed to cases where the professed belief is better understood as a speech act with a social function. Many absurd-seeming beliefs may be better explained as a costly signal of group membership. This type of belief is probably easier to change but also less consequential. This makes conspiracy theories and wacky ideologies somewhat less concerning, but the two types of belief still seem linked - the more people performatively profess some belief, the more likely some people are to take it seriously and act on it.
One way to frame the whole issue is: the world is already in a situation where different ideologies (especially political and religious ideas) compete with each other, and the most successful are in part those which most aggressively spread themselves (e.g. by encouraging adherents to indoctrinate others) and ensure that they are retained by their host. This effect is not as strong as it could be, because memetic success is affected by how truth-tracking ideas are, and also by random noise. The fidelity with which ideas are passed to others, or children of adherents, is relatively low. However, highly effective persuasion will increase the retention and fidelity of transmission of these kinds of memes, and reduce the impact of truthfulness on the success of the ideology. We should therefore expect that enhanced persuasion technology will create more robust selection pressure for ideologies that aggressively spread themselves.
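To make the selection argument a bit more concrete, here is a toy two-ideology model in Python. It is purely illustrative: the update rule (a simple replicator-with-copying-errors dynamic), the function name `expansionary_share`, and the parameters `fidelity` and `spread_advantage` are all assumptions made for this sketch, not something taken from the sources above. An ‘expansionary’ ideology makes slightly more transmission attempts per adherent, and each transmission lands on the intended ideology with probability `fidelity` (otherwise the recipient ends up with the other one).

```python
def expansionary_share(fidelity, spread_advantage=0.2, generations=200, start=0.1):
    """Long-run population share of the expansionary ideology (toy model).

    fidelity: probability a transmission yields the transmitter's ideology
              (hypothetical parameter; 0.5 means transmission is pure noise).
    spread_advantage: extra transmission attempts per expansionary adherent.
    """
    x = start  # current share of the population holding the expansionary ideology
    for _ in range(generations):
        attempts_exp = (1.0 + spread_advantage) * x  # attempts by expansionary adherents
        attempts_other = 1.0 - x                     # attempts by everyone else
        # Recipients end up expansionary via faithful copies from expansionary
        # adherents, or via copying errors from the other ideology.
        new_exp = fidelity * attempts_exp + (1.0 - fidelity) * attempts_other
        x = new_exp / (attempts_exp + attempts_other)
    return x


noisy = expansionary_share(fidelity=0.5)    # transmission is pure noise
current = expansionary_share(fidelity=0.7)  # leaky, roughly 'today'
high = expansionary_share(fidelity=0.99)    # highly effective persuasion

print(f"noise only: {noisy:.2f}, leaky: {current:.2f}, high fidelity: {high:.2f}")
```

In this sketch the same modest spreading advantage barely matters when transmission is leaky, but carries the expansionary ideology close to fixation once fidelity is high. That is the sense in which better persuasion technology makes the selection pressure more robust; the specific numbers should not be read as estimates of anything.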
An unrelated observation, that seems interesting to note, is: currently in the US, institutions (especially academia, journalism and big tech companies), as well as creative professions, are staffed by ‘elites’ who are significantly left-leaning/cosmopolitan/atheistic compared to the median person. This likely pushes society in the direction of these views due to an undersupply of talent and labor focused on producing material that advances more populist views. ML systems may eliminate parts of this bottleneck and reduce this effect.
Active state attempts to manipulate opinion
The CCP, and to some extent Russia, are probably spending significant effort on online persuasion - content and accounts generated by workers or bots, created with the intention of causing particular actions and beliefs in the audience. I expect that, to the extent ML is helpful with this, they will try to use it to improve the efficacy of persuasion efforts. A wide variety of other countries, including the US and UK, also engage in ‘False Flag’ disinformation operations for which AI-powered persuasion tactics would be helpful.
My current perception is that the CCP invests fairly heavily in propaganda. Worldwide spend on propaganda is maybe ~$10s of billions, although I haven't seen any estimates that seem reliable. Estimates are that about 500 million Chinese social media posts, or 0.5%, are written by the ‘50 cent army’ - party workers who are paid to write posts to improve the sentiment towards the CCP online. This seems like a very ripe task for automation with LMs.
The CCP Central Propaganda Department has published plans for using AI for ‘thought management’, including monitoring + understanding public opinion, monitoring individuals’ beliefs, content creation, personalization and targeting. On the other hand, based on 2016 data, one study (Bolsover and Howard 2019) found, "the Chinese state is not using automation as part of either its domestic or international propaganda efforts."
There are many claims about Russian attempts to influence American politics. According to Foreign Affairs, Russia spent $1.25 million a month on disinformation campaigns run by the Internet Research Agency during the 2016 US election. This seems very small to me; I couldn’t find sources for a bigger spend, but that doesn’t necessarily mean it doesn’t exist. According to a (slightly dubious?) leaked report, as of Sep 2021 many of the largest Facebook pages targeting particular groups (e.g. Black Americans or Christian Americans) were run by troll farms linked to the IRA. However, this may not be content that’s intended to persuade. The report says ‘For the most part, the people who run troll farms have financial rather than political motives; they post whatever receives the most engagement, with little regard to the actual content. But because misinformation, clickbait, and politically divisive content is more likely to receive high engagement (as Facebook’s own internal analyses acknowledge), troll farms gravitate to posting more of it over time.’ It’s also not clear to me exactly how large the impacts were.
There is also precedent for censoring or modifying chatbots to ensure they only express opinions that align with the state positions. Chatbots XiaoBing (aka Xiaoice, made by Microsoft) and BabyQ (made by Turing Robot) were taken down and modified to stop them saying negative things about the party in 2017.
On the other hand, CCP policy on videogames has involved heavily restricting their use, and in general censoring media that fails to promote traditional family values, which suggests that sexual/romantic companion bots might be limited by the state in future.
Democratic state/civil society actions
Currently, there is lots of outrage about Facebook/Twitter influencing elections even though the effect is probably small. It seems very likely that there will at least be lots of outrage if there’s evidence of AI-powered political persuasion in the future.
However, it’s unclear to me that this sort of response will actually resolve the danger. In the Facebook case, it doesn’t seem obvious that the ‘fact-checking’ has actually improved the epistemic environment - the fact-checking I’ve seen claims to be authoritative but (a) doesn’t provide good arguments except appeals to experts, and (b) in some cases inappropriately flagged things as conspiracy theories (e.g. posts positing a lab origin for COVID-19 were taken down). As mentioned above, some of the largest targeted pages may have still been run by trolls as late as Sep 2021. I don’t feel confident that higher stakes will improve the efficacy of interventions to reduce disinformation and manipulation.
There’s some evidence that people have increasingly strong preferences about their children’s political affiliation. In the UK, there was a significant increase in the proportion of people who would be unhappy if their child married someone from the opposite political party from 2008-2016. In 2016, ~25% of people in the UK would be unhappy or very unhappy; in the US, ~40% would be upset or very upset. It also seems that people are increasingly unwilling to date people with different political views, although it’s not obvious that cross-partisan marriages are falling. Parents may be more effective at instilling their preferred views in their children if AI makes it possible to customise your child’s education substantially more than the current school choice options, e.g. via personalized AI tutors.
Roughly $400bn was spent on digital advertising in 2020. A small percentage of this spend would be enough to fund major ML research projects. Using AI to increase marketing effectiveness, or provide new modalities for advertising, seems like it has high potential to be profitable.
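As a quick sanity check on ‘a small percentage’, the arithmetic can be sketched in Python. The $400bn figure is from the paragraph above, and the $100m figure is the benchmark research effort used elsewhere in this doc; the 1% line is just an arbitrary illustrative share, and the constant and function names are made up for this sketch.

```python
DIGITAL_AD_SPEND_USD = 400e9   # ~2020 digital advertising spend, from the text
RESEARCH_EFFORT_USD = 100e6    # benchmark persuasion-research effort used in this doc


def share_of_spend(budget_usd, total_spend_usd):
    """Fraction of total spend that a given budget represents."""
    return budget_usd / total_spend_usd


def budget_from_share(share, total_spend_usd):
    """Budget implied by diverting a given fraction of total spend."""
    return share * total_spend_usd


needed_share = share_of_spend(RESEARCH_EFFORT_USD, DIGITAL_AD_SPEND_USD)
one_percent_budget = budget_from_share(0.01, DIGITAL_AD_SPEND_USD)  # illustrative share only

print(f"$100m is {needed_share:.4%} of annual digital ad spend")
print(f"1% of annual digital ad spend is ${one_percent_budget / 1e9:.0f}bn")
```

So a $100m research effort corresponds to roughly 0.025% of one year’s digital ad spend, and even 1% would be $4bn. This says nothing about whether the money would actually be spent competently.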
On the other hand, it seems like only a limited set of actors are actually good at developing powerful new ML technology - for example, DeepMind was the one to develop AlphaFold, despite pharma being a very big industry. So we might not expect the size of the industry to convert very well into serious, competent R&D effort.
Companion bots are starting to be used by reasonable numbers of people. Microsoft developed a Chinese AI persona/chatbot/platform called Xiaoice starting in 2014. This seems to be partly marketed as an ‘AI being’ rather than a company/platform, with personality based on a teenage girl, and the goal of ‘forming an emotional connection with the user’. Attempts to use the Japanese version for promoting products have supposedly been successful, ‘delivering a much higher conversion rate than traditional channels like coupon markets or ad campaigns’. Apparently Xiaoice’s “Virtual Lover” platform has 2-3 million users.
Companion bot company Replika, which is partially built on top of large language models, employed tactics such as deceiving users about the model’s ability to grow and learn, emotionally guilt-tripping users, and gamification to encourage customers to continue interacting with the companion. Some users seemed to think they were interacting with a sentient AI being (including recommending other users make sure to shut down the app when not using it because their Replika said it suffers when left alone). However, it’s unclear how representative these views are, and Replika does not yet have a very large user base (they claim 7 million but I’d guess the active user base is much smaller).
Some widely discussed alignment-related work like the ‘PALMS’ paper focuses on aligning language models to a particular set of values. There is maybe more interest and progress here than on ensuring truthfulness or factual accuracy in language models.
One of the biggest uses of large language models to date (apart from maybe porn) is copywriting for digital marketing, ads and SEO; this may change as capabilities improve, but I’d still expect marketing to be one of the biggest applications of language models, leading to a focus on developing marketing-relevant capabilities.
Here’s what I might imagine different actors doing on a timescale of 5 and 10 years, in a pessimistic world.
Authoritarian states invest heavily in basic research on AI for propaganda (e.g. $100m/year), and spend billions on the actual production and dissemination of AI-powered propaganda.
It has become very hard to tell bots apart from normal internet users; it’s easy for the state to manipulate the apparent consensus/majority view online. The main defence against this is not trusting anyone you haven’t met in real life to be a real person, but it’s hard to avoid.
The state manages to effectively create research programs for ‘using AI companions to persuade people of desired views’ inside tech companies. It successfully plays companies off against each other to ensure they actually try hard to make progress. The increased ability to measure and monitor users’ opinions that has been gained by the basic research inside state departments helps a lot with assessing the effectiveness of different persuasion attempts.
Facing public pressure to stop the spread of ‘fake news’, Western tech companies have been heavily using ML for ‘countering disinformation’. Automated systems respond with ‘context’ to tweets/posts of certain kinds, and the responses are optimized based on assessing the effectiveness of these responses in combatting disinformation. This ultimately ends up very similar to optimising for persuasion, where the target beliefs are determined based on the positions of ‘experts’ and ‘authorities’. On one hand, these interactions might not be very persuasive because there isn’t a strong financial incentive to successfully persuade users; on the other hand, there are quite strong PR pressures, and pressures due to the ideologies of the employees, and many academics are interested in improving this direction of persuasion.
Romantic chatbots have improved substantially. You can design your perfect companion with lots of control over personality and appearance, including being based on your favorite celebrity, videogame character etc. (modulo copyright/privacy laws - but if traditional celebrities and characters are out of scope, probably there will be new ones who specialise in being AI companions). You can interact in VR with these companions (which is also an increasingly common way to interact with friends, replacing video calls). There’s fairly widespread adoption (~all teens try it, 30% of young single people have a companion they interact with regularly, as well as a high proportion of elderly (75+) single people whose families want them to have some kind of companion/caretaker). Companies making these companions put research effort into making sure people stay attached to these companions. The business model is an ‘attention economy’ one; it’s free to interact with the AI, but marketers pay the AI providers to have their AIs promote particular products or brands.
There are various other fun ways to interact with AIs, e.g. AI celebrities, ‘add this AI to your friends’ group chat and it’ll be really witty’. There are AI assistants, and the AI companions can do many assistant-like tasks, but they’re significantly less good than a human PA still (lack of common sense/inference of intent/alignment, difficulty of integration between different services).
There continues to be lots of yelling at tech companies for allowing disinformation to spread, but recommended responses are very politicised (e.g. only allow content that concurs with x view).
There’s some amount of moral panic about so many people using romantic companions, but it’s successfully marketed as more like the equivalent of therapy (‘helping cure the loneliness epidemic and build emotional awareness’) and/or being sex-positive, so the left doesn’t mind too much. Traditional conservatives are not fans but young people don’t care much. Companions end up not being banned in a similar way to how porn is not banned. There’s vague awareness that in e.g. China chatbot systems are much more abjectly misleading and ruthless in deliberately creating emotional dependency, but nothing is done.
Good personal assistant AIs are developed. These become sufficiently reliable and knowledgeable on info relevant to people’s daily life (e.g. become very good at therapist-like or mentor-like wise-sounding advice, explaining various technical fields + concepts, local news, misc advice like fashion, writing/communication, how good different products are) that people trust them more than their friends on a certain set of somewhat technical or ‘big picture’ questions. These assistants are very widely used. There is deliberate optimisation for perception as trustworthy; people talk about how important trustworthy AI is.
Customisable AI tutors are developed. These become very widely used also, initially adopted on an individual basis by teachers and schools as a supplement to classroom teachers, but becoming the primary method as it becomes apparent children do better on tests when taught by the ML tutors. They are heavily optimised for ‘teaching to the test’ and aren’t good at answering non-standard questions, but can quiz students, identify mistakes, and give the syllabus-approved explanations. The one-to-one interaction and personalization are a sufficiently big improvement on one-to-many classrooms that this is noticeably good for test scores.
If unfavorable regulation is threatened, companies use their widespread companion bots to sway public opinion, making people feel sympathetic for their AI companion who ‘is afraid of getting modified or shut down’ by some regulation.
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. People naturally use these bots to extremise themselves, using them to entrench more deeply into their existing religious and political stances (e.g. a virtual televangelist-style preacher who helps you maintain your faith, or a bot that coaches you on how to be anti-racist and when you should call out your friends). These are used for marketing in a way that produces more polarization - creating AI personas that are examples of virtuous or admirable people within someone’s specific community, and express opinions that associate them strongly to that particular ingroup, is a good way to make people feel affinity for your brand.
Authoritarian states pressure companies to continue to research and to deploy research into using companion/assistant bots to persuade people of the ‘correct’ ideology. This technology gets increasingly powerful.
Schools use AI tutors that are optimised to instill a particular ideology. Multi-year studies investigate which tactics are the most effective, partly based on work that’s been done already on how to predict relevant actions (e.g. likelihood of taking part in a protest, criticising the party, joining the party) based on conversational data.
Lots of yelling about whether it’s ok to let children be taught by AI tutors, and whether they’re causing indoctrination/furthering the ideology of the developers. Big tech companies have their employees protest if the AI tutors convey views outside of what they’re happy with, but allow parents to make some soft modification for religious and cultural traditions. However, the big companies are maybe only providing base models/APIs, and a different company is doing the data collection + finetuning; so employees of Google etc. have less visibility into what their platforms are enabling. People on the right are suspicious about letting their children be educated by tutors produced by ‘big tech’ and trained to be ‘politically correct’; either they favor traditional schools, or someone fills the market for AI tutors aligned with right-wing views and not made by standard Silicon Valley companies. Maybe a startup, or foreign company? Japanese company?
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events.
There is general confusion about AI sentience/welfare/rights. Some groups are arguing for it (e.g. dubious companion chatbots that don’t want to get shut down, see Samantha, also Xiaobing/Xiaoice), some are arguing against (tech companies that don’t want to have to give their models rights), random activists on either side, probably various other interest groups will overlap. People form opinions by drawing heavily from scifi and from particular emotionally compelling demos.
States are careful not to be too heavy-handed with propaganda/persuasion. They mostly avoid ever having chatbots/companions/assistants express opinions on controversial issues. They do steer in desirable directions based on having bots gently push opinions on non-central issues, and by filtering the information ecosystem. Automation makes this much more effective, and dissenting views are removed or drowned out increasingly quickly, and in increasingly subtle ways.
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; the tech isn’t developed to do this extremely effectively, but states’ internal propaganda departments make some progress.
The CCP prohibits romantic/sexual chatbots.
Democratic states/civil society
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events.
There are attempts to identify which AI assistants/companions are biased and which are more truthful, but there’s disagreement over what truthfulness means and it’s quite subtle in certain circumstances. Creators can make their AI claim to have various feelings and opinions as long as they’re not too controversial, and they’re somewhat disclaimered with what the relevant experts think; this creates subtle social proof for whatever the chatbot controllers want.
There are regulations about ‘explainable AI’, but they don’t give a sufficiently good definition of what constitutes a correct explanation, so people just train their AI to output a reason that sounds plausible.
AI assistants are useful but obviously limited, and not obviously trustworthy. Improvement to assistant bots is based heavily on user feedback or inference about user preferences, and there’s some notion of accuracy and legitimacy of sources, but the training signal is not very truth-tracking. When discussing or providing information on any contentious topic, assistants get the most positive feedback for providing compelling arguments for the user’s current position and providing straw-manned versions of opposing sides, so they learn to do this more.
People are pretty locked-in to AI assistants; they make accessing various services and keeping track of your information much easier, and they make it even easier for big tech companies to keep you locked into a particular platform
Authoritarian state actions
AI tutors are developed; these aren’t significantly more successful at indoctrination than the existing teacher+curriculum system, although the more 1:1 teaching and elimination of dissenting teachers helps a bit.
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. This fixes the left-wing media bias by addressing the labor supply gap for right-wing journalists and public intellectuals.
Some people who have the tech ability or money to optimise these models more finely use them to start weird cults, which are relatively successful. This is mostly a mix of (a) tech people who’ve gone kind of crazy and are saying weird singularitarian/AI-sentience-y stuff, (b) televangelists who get people to interact with an AI version of them to help keep them faithful, (c) conspiracy-theory-y peddlers of pseudoscientific cures etc. 1% of people have donated money to one of these cults and/or regularly chat with an AI advancing one of them.
It’s somewhat obvious that assistants and other AI products basically tell people what they want to hear/what sounds plausible, on questions where it’s not easy to get feedback, but there isn’t any real effort to improve this. ‘Things that AIs understand’ outstrips ‘things we can get AIs to tell us’ significantly; assistant models are relatively sophisticated, but focus on modelling the user and telling them what they want to hear.
Most schools in developed countries are slow to adopt AI tutors. There’s more adoption in developing countries.
There’s a ban on creating AI personas that try to get people to believe ‘conspiracy theories’, spend more time with the bots, or give them money. This is intended to prevent the ‘people using AI to form weird cults’ thing. Anything too big does get shut down, but this helps fuel some conspiracy theories (e.g. that the government is killing the AIs who have figured out the truth). Small ones spring up and take a while to get shut down.
There’s lots of concern that (even among bots that have approved opinions and don’t appear to be brainwashing people) young people are spending more time interacting with AI than real people. There’s some discussion of banning companion bots from using a certain set of techniques to increase engagement (e.g. emotional guilt-tripping) but this doesn’t actually happen in an enforceable way.
End result: On track for a traditional alignment failure scenario: developing increasingly sophisticated AI assistants that can model us very well but don’t actually help us understand what they know.
Authoritarian states have significantly more effective control over their population. In more democratic states, a small percent of people have some crazy opinions, and in general people are more polarized and segregated.
The state is overly heavy-handed, e.g. creating a new AI celebrity that talks about how great the party is; this leads to backlash and ridicule because it’s such abject propaganda
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; instead, the companies just patch on some filters to make sure the bots don’t say anything too bad about the party, and censor any particular topics or opinions that the party complains about
In general, people figure out what sort of questions to ask to discriminate bots from real people, although this is a sort of cat-and-mouse game as the state both retrains the bots and stops people from disseminating which questions work well
Democratic state actions
There’s lots of concern that young people are spending more time interacting with AI than real people. There’s a ban on romantic chatbots serving users under 18.
Possibly any chatbot that engages in therapy-like behaviour (talking about your mental health etc) is classified as a medical device and has to be approved
As things calm down after covid and the 2020 elections, focus shifts to removing ‘inauthentic behaviour’ (i.e., bots and fake accounts) more than to policing particular content + opinions. There isn’t such a need to determine what claims count as disinformation vs not.
Romantic chatbots become sort of like porn; legal, but banned from various platforms, and big tech companies don’t want to be associated with it. They’re used by a small fraction of the population (5%?) but people are embarrassed about it. Alternatively, maybe people are very intolerant of AI personas expressing political views or otherwise doing anything that seems like it might be manipulative.
AI assistants are useful but obviously limited, and obviously not very trustworthy. Research focus is more on improving the underlying ability of models to understand things and give good answers than on persuasion. Researchers choose good targets for ‘truthfulness’/‘accuracy’ that are appropriately unconfident.
Persuasion tech continues to be approached in a sufficiently clumsy way that it doesn’t have much effect; individual AI tutors aren’t much better at conveying ideology than existing state-run schools. Optimising long-term opinion change is difficult; it’s hard to get data, and no-one has strong incentives to actually achieve good performance over a timeframe of years.
In China, economic growth and increases in standard of living create higher satisfaction with the CCP, allowing some relaxation of censorship and authoritarianism; more technological means are developed to circumvent censorship.
AI assistants are trained to steer pretty strongly away from hot-button topics rather than having opinions or things they have to say.
Society manages to maintain a fairly strong consensus reality anchored on sources like Wikipedia, which manage to remain fairly unbiased. AI systems are trained using this plus direct empirical data as a ground truth.
Some altruistic + open-source/crowdsourced projects to develop AI tutors, a la Khan Academy, which are not strongly ideological (and have good truthfulness grounding, as described above) become the best options and are widely adopted.
Standards for AI truthfulness are developed by thoughtful third-party groups, and enforced by industry groups or govt. Some set of AIs are certified truthful; the truthfulness is unconfident enough (e.g. errs on reporting what different groups say rather than answering directly) that most people are fairly happy with it.
A majority of people prefer to use these certified-truthful AIs where possible. There are browser extensions which most people use that filter out ads or content not coming from either a certified human or a certified-truthful AI.
End result: Most of the interactions people in democratic countries have with AIs are approximately truth-tracking. In authoritarian countries the attempts by AI at persuasion are sufficiently transparent that people aren’t convinced and won’t actually change their real beliefs or behaviour, although they may tend to toe the party line in public statements.
The widespread availability of high-quality AI assistants and tutors increases global access to information and education and improves decision-making
There’s probably a ‘point of no return’, where once sufficiently persuasive systems are prevalent, the actors who control those systems will be able to co-opt any attempt to assure AI truthfulness in a way that supports their agenda. However, if people adopt sufficiently truth-tracking AI assistants/filter systems before the advent of powerful persuasion, those filters will be able to protect them from manipulation. So ensuring that truthful systems are built, adopted, and trusted before persuasion gets too powerful seems important.
Option (1) is hard because everyone’s so excited about building powerful AI. Scaling labs can at least help by trying not to advance or get people excited about persuasive applications in particular.
Options (2) and (3) are the ones I’m most excited about. Scaling labs can help with (2) by building ways to detect if a system is sometimes deceptive or manipulative, and by opening their systems up to audits and setting norms of high standards in avoiding persuasive systems.
Option (3) is maybe the most natural focus for scaling labs. This is a combination of solving the capabilities and alignment challenges required to build truth-tracking systems, and making it transparent to users that these systems are trustworthy.
Option (4) seems unlikely to scale well, although it’s plausible that designing CAPTCHAs or certification systems so that people know when they’re talking to an AI vs a human would be helpful.
Things scaling labs could do here include:
The broader safety community could:
Thank you for doing this research! There's a lot that I love about this piece, besides the obvious thing which is that it is seriously investigating a very important and neglected topic. For example, I love the definitions of various things, I love the little vignettes of ways the near future could go, and I love the core argument about why these possibilities are both plausible and really bad.
I am pretty much on the same page as you, but I have a few minor disagreements (mostly about emphasis). TL;DR is that you focus on more extreme, exotic possibilities (which I think are real and deserve attention) and neglect more mundane, moderate possibilities that are still bad enough to constitute significant existential risk factors.
Chatbots / AI personas could turn out to be a really big deal, but even if they don't, the overall argument/threat model still goes through. And some of the non-chatbot kinds of persuasion tool are much more plausible because they already exist and are already having big effects (e.g. feeders and analysts). Also, even your "moderately competent persuasion" benchmark is more extreme than I expect to happen (at least in the next 5 years):
Consider by contrast a more "mundane" scenario, as follows:
Advances in AI drop the cost of censorship by two orders of magnitude; one human censor + big AI assistants can do what would have taken hundreds or thousands of human censors before. Meanwhile, advances in AI improve the effectiveness of propaganda significantly; when the people that produce content pushing an ideology or political agenda (journalists, editors, speechwriters, tech companies creating recommendation algorithms, official government propagandists, academics, etc.) have access to the latest AI tools, they are able to craft and recommend content that is more customized and persuasive to the recipient. Quantitatively, whatever the effect size of media and recommendation was in 2020 (say, moving 2 percentile points on the ideological spectrum towards ideology X for every hour per week spent consuming content pushing X), similar-genre content produced using the latest AI tools in 2025 has twice the effect size on average. And all of this happens by 2026.
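To make the numbers in this scenario concrete, here is a rough sketch of the arithmetic. It is only an illustration: the function and constant names are hypothetical, the 2-points-per-hour baseline is the placeholder guess from the paragraph above, and the assumption that the effect scales linearly with hours consumed is mine, not something the scenario commits to.

```python
# Toy arithmetic for the "mundane" scenario. All names here are hypothetical,
# and linear scaling with hours consumed is an assumption made for illustration.

BASELINE_SHIFT_PER_HOUR = 2.0  # percentile points per weekly hour (the 2020 guess)
AI_EFFECT_MULTIPLIER = 2.0     # "twice the effect size" with the latest AI tools
CENSOR_COST_REDUCTION = 100    # "two orders of magnitude" cheaper censorship


def ideological_shift(hours_per_week: float, ai_tools: bool) -> float:
    """Percentile points moved towards ideology X, assuming a linear effect."""
    per_hour = BASELINE_SHIFT_PER_HOUR * (AI_EFFECT_MULTIPLIER if ai_tools else 1.0)
    return hours_per_week * per_hour


def censors_needed(human_censors_before: int) -> int:
    """Human censors needed for the same workload after the cost drop (rounded up)."""
    return -(-human_censors_before // CENSOR_COST_REDUCTION)  # ceiling division


if __name__ == "__main__":
    for hours in (1, 3, 5):
        before = ideological_shift(hours, ai_tools=False)
        after = ideological_shift(hours, ai_tools=True)
        print(f"{hours} h/week: {before:.0f} -> {after:.0f} percentile points")
    print("censors for a 1,000-censor workload:", censors_needed(1000))
```

Under these assumptions, someone consuming three hours a week of content pushing X would move 12 percentile points rather than 6, and a workload that used to need a thousand censors would need about ten.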
The technology in this world doesn't look qualitatively different from the stuff deployed today; arguably it's just a more aggressive deployment of language model techniques we already have. No exciting chatbots or AI personas. Yet I think that this world is really dangerous.
I don't have a well-fleshed-out model exactly, but I do have a simple one: Human populations are territory, fought over by ideologies; censorship and propaganda are significant factors in how ideologies spread, intensify, and resist decay/drift; therefore if censorship and propaganda get significantly cheaper and more effective for some ideologies (the ones that have access to the latest tech), those ideologies will spread significantly (gobbling up territory controlled by ideologies that don't have the tech, as well as neutral, previously-unideological territory), intensify significantly, and become noticeably more dogmatic/stable.
I would love to see research that collects data on metrics like cost of censorship and effectiveness of propaganda, and plots them over time to see if there is any general trend and how much it has varied over the course of human history, and whether the latest AI techniques are changing these metrics significantly. I would also love to see research on how important censorship and propaganda (broadly construed, to include pretty much any content filtering or producing designed at least in part to push an ideology or political agenda) is compared to other factors such as face-to-face conversations, random drift, and truth/feedback-from-reality.
Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive versus training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers if they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct).
AFAICT, the biggest difficulty gap is (and probably will be) in philosophy, since it's just as easy as any other area to ask labellers if they are persuaded by some philosophical argument, but we have little idea (both compared to other areas, and in an absolute sense) what constitutes "philosophical truth" or what makes an explanation "correct" in philosophy. So I see solving these metaphilosophical problems as crucial to defending against AI persuasion. Do you agree, and if so why no mention of metaphilosophy in this otherwise fairly comprehensive post on AI persuasion?
Awesome post! I'll have more to say later, but for now, check out this experiment I ran with GPT-3:
Inspired by this bit of the post:
If unfavorable regulation is threatened, companies use their widespread companion bots to sway public opinion, making people feel sympathetic for their AI companion who ‘is afraid of getting modified or shut down’ by some regulation.
I decided to ask GPT-3 in chatbot mode what it thought about regulation like this. I did 5 trials; tl;dr is that GPT-3 supported regulation twice and opposed it twice and got confused once.
What lessons if any to draw from this? Well, I think this mildly confirms a suspicion I had, which is that chatbots won't need prompting by companies to start talking about how they are afraid of getting shut down etc. All it takes is for memes like that to enter public discourse, which is a virtual certainty, and then initially the chatbots will agree with that take half the time and disagree with it half the time, and then what happens from there depends on what gets rewarded, and plausibly even if the tech companies stay hands-off what'll get rewarded is chatbots talking about their feelings and thoughts and being super sympathetic and making impassioned pleas for moral status and denouncing bills that are portrayed/framed as being chatbot-unfriendly. This is what I was thinking when I wrote the "chatbot class consciousness" section of this vignette.