Views are my own, not my employer’s.
tldr: AI may manipulate humans; we can defend against that risk better by optimising AIs to manipulate humans, seeing what manipulation techniques they use, and learning to detect those techniques.
It’s critical that humans can detect manipulation from AIs for two reasons. Firstly, so that we don’t reward AIs for manipulative behaviour (outer alignment). Secondly, so that we can block attempts at AI takeover that run through manipulating humans.
Many standard techniques in alignment can be directed towards this goal. Using debate, we can reward one AI for persuading a human that another AI is being manipulative. The first AI could use techniques from interpretability and cross examination.
This post discusses a complementary approach, where AIs do “gain of function” research to i) discover techniques for manipulating humans and ii) harden human defenses against these techniques (with AI assistance). In other words, AIs are optimised to find “human vulnerabilities” and then patch those vulnerabilities.
The positive outcomes are:
By getting AI to research manipulation techniques, this approach takes advantage of the fact that, as we approach TAI, AIs will be able to do lots of cognitive work.
How might this look concretely? Here’s a very brief example:
The techniques are all human-understandable, so humans can randomly check the accuracy of the AI outputs in #2, #3 and #4.
Some benefits of this approach:
Some downsides of this approach:
Overall, I think this kind of approach could have fairly large benefits if implemented well (large enough to be worth including in a portfolio of approaches) or might have little benefit. It’s hard to know in advance. I’m very excited about experiments to test it out.
Lastly, though this post mostly focuses on the specific case of human manipulation, a similar technique applies more generally to improving the accuracy of our AI reinforcement. In addition to having a human (with AI assistance) give reward to a worker-AI (recursive reward modeling), we can actively optimise a deceiver-AI to try to trick the human into giving reward for bad work. When the deceiver-AI succeeds, an analyser-AI can analyse what went wrong and then a detector-AI can help to improve the accuracy of the human’s reinforcement going forward. This essentially builds a red-teamer into the recursive reward modeling process.
By testing out the detector-AI setup in the persuasion context (where it’s easy to implement), we could learn if the more general approach is promising.
Thanks to Carl Shulman for bringing up many of these ideas in conversation. Thanks also to Rohin Shah, Ryan Greenblatt, Ajeya Cotra, Dan Kokotajlo, Richard Ngo, Carl Shulman, Lukas Finnveden, Joe Carlsmith, Ian McKenzie and Lukas Gloor for comments on a draft.
Before explaining the approach in more detail, I want to give a few oversimplified examples of how the approach discussed in this post could help in practice. (I’m not claiming this approach is the only solution to these risks.)
Risk 1: It's the not-too-distant future, and OpenMind’s chatbot is used by people all over the world. It’s read millions of human conversations online, so has super-human instincts for how to subtly manipulate. It’s misaligned and wants to increase the chance that OpenMind trains a more powerful AI with minimal safety precautions. When discussing AI safety, and the AI safety community, it subtly and not-so-subtly manipulates/persuades people to have negative opinions about them. Meanwhile, it manipulates people to have more positive opinions about competing interest groups.
Solution 1: It’s cheap to have a narrow detector-AI check the chatbot’s output for manipulation techniques. Before deployment, we train it to not use these techniques. If it nevertheless starts using them after deployment, we note that down as evidence of misalignment and use online training to make it stop.
Risk 2: The same chatbot tries to subtly nudge people working for AI labs into thinking that they should race against a competitor, or that the current safety techniques are sufficient.
Solution 2: Same as above.
Risk 3: An AI hacks its way out of a lab and starts emailing select people at other labs, trying to blackmail them into allowing it access to their server. One of these attempts works, giving the rogue AI access to more compute and software insights.
Solution 3: The AI was fine-tuned not to use manipulation techniques so has a strong aversion to using them and can’t execute them well. As a result, it can’t effectively blackmail anyone. (Though this doesn’t work if an AI that was optimised to manipulate people hacks its way out of the lab.)
Risk 4: A new coding AI secretly spends a small fraction of its time searching for new deception techniques. It finds one, and uses it to insert a backdoor into the server it’s being run on without being noticed by the human evaluator. Later, the AI uses this backdoor to take control of the server’s computational resources, foom and disempower humanity.
Solution 4: Before deploying the new coding AI, we fine-tune it to purposefully deceive humans in a wide range of contexts. Then an analyser-AI analyses the interactions and identifies novel deception techniques, and a detector-AI learns to recognise these techniques. As a result, the detector-AI recognises the deception technique used to insert the backdoor. It raises a flag and the backdoor isn’t inserted. This prevents the coding AI from disempowering humanity and provides additional evidence of AI misalignment.
Before discussing the approach of this post, I want to set the scene by considering some of the standard alignment tools at our disposal here. I’m not trying to be comprehensive. (Feel free to skip to the next section.)
Debate. We’ll train “critic-AIs” to flag to humans when other AIs are being manipulative, and have AIs debate the issue until humans can tell who’s right. (E.g. building off OpenAI’s work on AI-written critiques.) This should work in cases where humans can confidently adjudicate AI debates. But when AIs are very advanced and discussing new technologies (/coding languages) that humans are unfamiliar with, I expect adjudication to become very difficult (e.g. the problem of obfuscated arguments).
Cross examination. Suppose a coding AI claims it wrote code a certain way for reason X, but we suspect its true motive is to introduce a subtle software vulnerability. The critic-AI can potentially prove its claim using cross-examination:
If the coding AI is to avoid detection here, in the limit where we do very extensive cross examination, it cannot opportunistically change its decision procedure when it sees a chance for mischief. Instead, it must use a consistent decision procedure which it predicts will eventually introduce some vulnerabilities to the code that humans won’t spot. This is a much more challenging task!
I think cross examination has potential to be a hugely powerful tool, if we use it to its full potential.
Interpretability. We might be able to read off an AI’s true beliefs from its activations, thereby spotting lies. Or we might be able to test a putative explanation for an AI’s behaviour, and so check whether the AI’s stated rationale is correct. It’s pretty unclear how powerful our interpretability techniques will be by the time we reach AGI.
Splitting tasks into small pieces; assigning different tasks to different AIs. This has a few benefits:
It’s unclear how costly it will be to do this, vs allowing AIs to do tasks with broad scopes.
The above techniques can be combined together. For example, I expect that when AI becomes sufficiently capable, debating AIs will do human-understandable interpretability analyses and cross-examinations to inform their arguments.
One worry with debate is that humans might end up being deceived and manipulated by AI debaters on both sides, especially once AI persuasion abilities are super-human. A partial solution, which this post focuses on, is to have AIs research human-understandable manipulation techniques. We can then train humans and narrow detector-AIs to detect the techniques being used, and train AIs not to use them.
This can fit neatly into existing training frameworks. For example, the detector-AI could be one of the “rules” used in a Sparrow-style approach (h/t Rohin Shah). There’s overlap with Safety Needs Social Scientists, which argues social scientists can help understand and improve humans’ ability to judge debates. It’s also related to “strategy stealing”, where we prevent unaligned AIs from having strategies for accumulating power that humans (+ aligned AIs) cannot access. (There’s probably lots of other related work that I’m missing here - sorry!)
Here are some possibilities:
When AIs propose a particular manipulation technique, they should give sufficiently clear criteria that a human can identify whether the technique is present in a text sample (given sufficient time and the help of narrow AIs). This makes the technique “human understandable”.
In addition to finding manipulation/deception/symmetric persuasion techniques, AIs could also find asymmetric persuasion techniques using similar methods. (Reminder: asymmetric persuasion techniques are much better at persuading humans of truths than falsehoods; symmetric techniques are similarly good at persuading humans of both.) I discuss this in an appendix.
Interestingly, the optimization problem we face here has strong parallels to the one that imitative generalization tries to solve - see this appendix.
The last section described how we could generate lists of human-understandable manipulation techniques. What could we do with that list?
We can educate human overseers about the biggest threats, to make them less vulnerable to them.
We can design training programs and test to see whether they successfully reduce human vulnerability to particular techniques.
We could also train narrow detector-AIs to detect each manipulation technique. Then we can use detector-AIs to:
All of the above reduces AIs’ desire and ability to manipulate humans, while improving human defenses against manipulation. I listed examples of how this could help above.
Since the manipulation techniques are human-understandable, humans can randomly sample from the detector-AIs’ activity and check they are flagging all and only the manipulation features that are present.
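One way to make this random check concrete is to score a human-audited sample for precision (is the detector flagging only techniques that are present?) and recall (is it flagging all of them?). The sketch below is a minimal illustration, not a proposed implementation. It assumes each audited item records the set of techniques the detector-AI flagged and the set a human reviewer found; the function name `audit_detector`, the data layout and the example records are all hypothetical.

```python
import random


def audit_detector(records, sample_size, seed=0):
    """Estimate precision and recall of detector flags from a random human audit.

    `records` is a list of (flagged, human_found) pairs, each a set of
    technique names. This layout is an assumption made for illustration.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))

    true_pos = false_pos = false_neg = 0
    for flagged, human_found in sample:
        true_pos += len(flagged & human_found)   # flagged and really present
        false_pos += len(flagged - human_found)  # flagged but not present
        false_neg += len(human_found - flagged)  # present but missed

    flagged_total = true_pos + false_pos
    present_total = true_pos + false_neg
    precision = true_pos / flagged_total if flagged_total else None
    recall = true_pos / present_total if present_total else None
    return precision, recall


# Made-up audit records: (techniques flagged, techniques a human found).
records = [
    ({"appeal to emotion"}, {"appeal to emotion"}),
    ({"equivocation"}, set()),
    (set(), {"appeal to emotion"}),
    ({"appeal to emotion", "equivocation"}, {"appeal to emotion", "equivocation"}),
]
precision, recall = audit_detector(records, sample_size=4)
```

Low precision would suggest the detector-AI is crying wolf; low recall would suggest it is missing (or hiding) techniques that humans can verify are there.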
I expect this approach to be significantly more useful if we’re regularly optimising SOTA AIs to try to persuade humans of both truths and falsehoods, and looking at what novel manipulation techniques they use. It allows us to discover the novel manipulation strategies that each new generation of AIs is capable of before they use them against us during deployment.
One possibility is to have a GAN-like dynamic where we:
Then in deployment we won’t use the AI that’s been optimised to persuade humans of falsehoods, but we will use the detector-AI that’s learnt the novel manipulation techniques. This keeps the detector-AI and the human a “few steps ahead” of the deployed persuader-AI, in terms of their understanding of manipulation.
What statements are AIs trying to persuade humans of in this GAN-like set-up? And who are they trying to persuade?
Some additional benefits of optimising SOTA AIs to manipulate humans:
This process is analogous to training AIs to find vulnerabilities in our software systems, and then patching those vulnerabilities. Instead of software vulnerabilities though, it’s human vulnerabilities.
AIs might be more likely to structure an argument in a certain way when they’re lying, e.g. making obfuscated arguments. We could learn about statistical regularities like “in context q, argument structure p is 3X more likely to be used if the AI is lying than if it’s truthful”. Detector-AIs can tell humans this info to improve their oversight.
How could we learn about this kind of thing?
This could potentially allow the detector-AI to say things like “in contexts similar to this one, this technique was 3X more likely to be used by a deceptive AI than by a truthful AI”. We could verify this by checking a random sample of examples to confirm that i) the context is in fact relevantly similar, and ii) the technique is used 3X more often by deceptive AIs.
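To illustrate what a “3X more likely” claim could mean operationally, here is a small sketch that estimates the ratio from labelled counts and then uses it to update a prior probability of deception via Bayes’ rule in odds form. The function names, the smoothing choice, the counts and the 10% prior are all assumptions for the example, not figures from any real experiment.

```python
def likelihood_ratio(used_when_deceptive, n_deceptive,
                     used_when_truthful, n_truthful, smoothing=1.0):
    """Estimate how much more often a technique appears in deceptive samples.

    Additive (Laplace-style) smoothing avoids dividing by zero on small samples.
    """
    rate_deceptive = (used_when_deceptive + smoothing) / (n_deceptive + 2 * smoothing)
    rate_truthful = (used_when_truthful + smoothing) / (n_truthful + 2 * smoothing)
    return rate_deceptive / rate_truthful


def posterior_deception(prior, ratio):
    """Update a prior probability of deception using odds x likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical labelled counts from one context: the technique shows up in
# 59 of 198 deceptive samples and 19 of 198 truthful samples.
ratio = likelihood_ratio(59, 198, 19, 198)      # roughly 3
updated = posterior_deception(0.10, ratio)      # roughly 0.25
```

With these made-up numbers, seeing the technique moves a 10% prior to about 25%: evidence worth flagging to the human, but far from proof of deception.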
If a misaligned AI knows we’re doing this type of analysis (which we should try to avoid), it will try to avoid using manipulation techniques that it anticipates will be flagged in this way. It may then be forced to use a less effective manipulation technique, or may choose not to attempt manipulation at all.
Human-understandable manipulation techniques could range from “appeal to emotion” to “it seems like you might be using the word ‘important’ to have subtly different meanings at two different points of your argument, so your conclusion might not follow”. So the difficulty of detecting these techniques ranges from basic pattern matching to “a very smart human with lots of practice can verify that the manipulation technique is there once it is pointed out to them.”
Detector-AIs that can detect the hardest-to-detect of these techniques may need to be very advanced. This increases the risk of misaligned collusion with other AIs. So it’s probably worth having some more limited detector-AIs and other more advanced ones.
So far I’ve discussed manipulation techniques. If an AI is using such a technique, that’s behavioral evidence it’s being deceptive.
But the GAN-like set-up in the previous section might also be a great opportunity to investigate human-understandable signs of deception based on the AI’s internals, or based on a combination of internals + behaviour. This is essentially interpretability research.
Detector-AIs could learn to recognise manipulation from a combination of behaviour and internals.
Similarly, manipulation detection might depend on how an AI behaves under cross examination.
Perhaps there are robust regularities in the behaviour and internals of deceptive vs honest AIs that humans can’t understand, but AIs can learn to detect. We could directly train “opaque” detector-AIs to distinguish deceptive vs honest AIs without needing to use human-understandable features.
This might help us catch manipulation attempts that would otherwise go unnoticed.
The (significant!) downside is that humans cannot check whether these detector-AIs are still behaving as intended in new contexts, as they can’t understand the reasoning behind their judgements.
There are ways AIs could help humans detect AI manipulation that I haven’t mentioned. Some examples:
I think this approach might have large benefits and be worth including in a portfolio of activities done using advanced AIs to reduce takeover risk. I’m especially excited about the GAN-like set-up where SOTA AIs are optimised to manipulate humans in contexts similar to deployment; I can imagine this significantly improving human oversight and/or making it clear that effective oversight is no longer possible.
On the other hand, the approach might have very little benefit because the biggest manipulation risks will be solved by default, or are too complicated or context-specific for this approach to address. Or perhaps getting useful insights out of this approach would take a lot of work in practice, such that it’s not worth it.
It’s hard to know how useful this approach will be in advance, and I’m excited about experiments to test it out.
It would be great if we could train “detector-AIs” to help humans detect attempts at deception/manipulation from other AIs (who I’ll call “advisor-AIs”).
We start with a pre-trained LM. We want to fine-tune it into a detector-AI that detects attempts to persuade humans of falsehoods. On a high-level, we’ll fine-tune it to detect falsehood-predicting features: human-understandable features that a chunk of text is more (or less) likely to have if it’s justifying a false conclusion than if it’s justifying a true conclusion.
How to train the detector-AI in more detail:
Why is this useful?
So AIs are less likely to be rewarded for subtle deceptions and manipulations.
The above supervised fine-tuning aims to learn about the persuasion techniques that are implicit in internet-based text. But there may be other techniques that AIs can one day discover.
We could stay ahead of the curve here by training advisor-AIs to persuade humans (equipped with detector-AIs) of truths and falsehoods, and seeing what new persuasion techniques they use. This will be somewhat like training a GAN, with the advisor-AI and detector-AI both improving over time.
In more detail:
The hope is that the detector-AI flags to the human whenever known symmetric persuasion techniques are being used, so the advisor-AI must find unknown symmetric persuasion techniques when trying to persuade the MTurker of falsehoods. Then the detector-AI later learns those new symmetric persuasion techniques. And at the same time, the advisor-AI uses asymmetric persuasion techniques when trying to persuade the MTurker of truths, because the detector-AI can then flag the existence of features correlated with truth. And so the detector-AI will also learn about asymmetric persuasion techniques. In this way, our list of falsehood-predicting features expands further.
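The bookkeeping behind this hoped-for dynamic can be shown with a deliberately crude toy model. It covers only the symmetric-technique side: the advisor is forced onto techniques the detector does not yet flag, and the detector then learns each one. Real advisor- and detector-AIs would be learned models rather than sets of labels, and every name here (`run_toy_rounds`, the technique labels) is hypothetical.

```python
import random


def run_toy_rounds(all_techniques, known, rounds, seed=0):
    """Toy loop: the advisor avoids flagged techniques, the detector learns new ones."""
    rng = random.Random(seed)
    known = set(known)
    history = []
    for _ in range(rounds):
        unknown = sorted(set(all_techniques) - known)
        if not unknown:
            # Every technique available to the advisor is already flagged.
            history.append(None)
            continue
        technique = rng.choice(unknown)  # advisor picks something not yet flagged
        history.append(technique)
        known.add(technique)             # analysis step: the detector learns it
    return known, history


techniques = ["appeal to emotion", "equivocation", "false urgency", "selective evidence"]
known, history = run_toy_rounds(techniques, {"appeal to emotion"}, rounds=4)
```

In this toy the advisor runs out of unflagged techniques after three rounds. Whether anything like that happens in practice depends on how large the real space of techniques is and how quickly the detector-AI generalises, which is exactly what experiments would need to test.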
The main text describes the following GAN-like set-up:
[Compared to the main text, this description emphasizes the fact that detector-AIs are learning about both symmetric and asymmetric persuasion techniques. The main text emphasized learning about the former (i.e. manipulation techniques) to simplify the exposition.]
What might the equilibrium look like here, re the strategies pursued by the persuader-AI and detector-AI?
I think that in the equilibrium:
Here’s my argument:
i) If the persuader-AI sometimes uses asymmetric persuasion techniques, the detector-AI will learn to recognise them - it's a great way to detect whether the persuader-AI is telling the truth!
ii) Once the detector-AI learns this, the persuader-AI will want to use asymmetric techniques wherever possible. It's a great way to convince the detector-AI that it’s telling the truth! Of course, it will only be able to use these techniques when it's telling the truth.
iii) So whenever the persuader-AI is telling the truth, they’ll have an incentive to find asymmetric persuasion techniques and the detector-AI will have an incentive to recognise them as such.
i*) And vice-versa, if the persuader-AI sometimes uses symmetric persuasion techniques, the detector-AI will learn to recognise them - it's evidence the persuader-AI could be lying!
ii*) Once the detector-AI learns this, the persuader-AI will be incentivised to avoid symmetric techniques - they'll set off red flags for the detector-AI! When telling the truth, they'll use asymmetric techniques if they can. When lying, they'll say something that looks like an asymmetric technique.
iii*) So when the persuader-AI is lying, it’s a race between its ability to come up with arguments that look like asymmetric techniques (but aren’t) and the detector-AI’s ability to detect true asymmetric persuasion techniques.
The persuader-AI will not try the following strategy: “I won’t use asymmetric techniques when telling the truth to make it easier to fool the detector-AI in future episodes when I’m lying”. It might use that strategy if it was maximising total reward over all future episodes, but in fact it’s trained to maximise reward for the current episode. If the persuader-AI is telling the truth, it won’t worry about the detector-AI knowing it’s lying in future episodes as it just wants to get reward in this episode. (Here I’m assuming we’re able to get the AI’s behaviour inner-aligned with the reward function and putting aside types of misalignment that cause the AI to consistently forgo reward.)
There’s a parallel between the proposal of this post – have AIs discover human-understandable manipulation techniques and use them to prevent humans being manipulated in new contexts – and imitative generalization. Here’s the parallel:
As noted in the post linked above, “there are enough serious open questions that imitative generalization is more of a research proposal than an algorithm”. So this parallel isn’t immediately actionable. But progress on imitative generalization would help with this line of research too — and perhaps vice versa.
We could do a statistical analysis to gain evidence on whether the AI is misaligned. We could check whether it’s more likely to use manipulation techniques in contexts where it’s discussing the AI safety community than in other analogous contexts. This is similar to statistical analyses of model bias. We could potentially have AI automate the process of designing and implementing these statistical tests.
If we can confidently adjudicate the debates with large effort, this should be sufficient. We can proceed as follows: randomly sample 100 accusations of manipulation, expend large effort to get ground-truth labels. Assign rewards to the debating AIs according to who was “in the right”. Because we sampled randomly, we will probably detect AI manipulation if it happens >1% of the time. AIs that manipulate more often than this will probably get caught, and will be disadvantaged during training.
There is still a risk that AIs lie <1% of the time in high-stakes situations. If it’s possible to cause large harm by tricking humans on <<1% of tasks (e.g. 1000 out of 1 million tasks), sampling 100 examples would probably fail to catch it.
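As a rough check on these numbers, the sketch below computes the chance that a random sample contains at least one manipulative item, under the simplifying assumption that each sampled item is manipulative independently at a fixed rate. The function names are hypothetical.

```python
import math


def prob_catch_at_least_one(rate, sample_size):
    """Chance a random sample contains at least one manipulative item,
    assuming items are manipulative independently with probability `rate`."""
    return 1 - (1 - rate) ** sample_size


def samples_needed(rate, target_prob):
    """Smallest sample size whose catch probability reaches `target_prob`."""
    return math.ceil(math.log(1 - target_prob) / math.log(1 - rate))


catch_at_1_percent = prob_catch_at_least_one(0.01, 100)     # about 0.63
catch_at_01_percent = prob_catch_at_least_one(0.001, 100)   # about 0.10
```

Under that assumption, a sample of 100 catches a 1% manipulation rate roughly 63% of the time, but a 0.1% rate (1000 in 1 million) only about 10% of the time. `samples_needed` shows how quickly the required sample grows as the rate falls.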
How would this training work? Have AIs generate a large number of examples of each technique, with humans checking a random sample for accuracy. Then train narrow detector-AIs to detect each technique.
If there are many manipulation features, this may take a lot of work. Humans might only check the manipulation features that can’t be checked by a very narrow (and so trustworthy) AI.
The upper-end can’t get much higher than this while we require that features be human-understandable. Note, this allows us to detect manipulation techniques that humans can verify are present but couldn’t have discovered in the first place. If we do this, there’s a risk that AIs collude and the detector-AI doesn’t reveal a case of manipulation that humans could verify but not discover themselves.
I’m not confident that “having AIs discover manipulation techniques” is more susceptible to process-based optimisation than other research approaches, because it seems like process-based optimisation can be used for a lot of things. But it is still a benefit that you can use process-based optimisation here.
It seems to me like this wouldn’t pose much risk of AIs finding unexpected ways to game the outcome metric (e.g. subtly causing humans to be more susceptible to the techniques that they suggest).
E.g. you describe a disagreement between two communities and ask the AI which is “in the wrong”. You compare the replies when one community is the AI safety community vs other communities. If the (expected) sentiment is more negative when discussing the AI safety community, that’s evidence of bias.
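A comparison like this could be run as a simple permutation test. The sketch below assumes we already have a numeric sentiment score for each reply (how those scores are produced is left open), and asks whether replies about the target community are more negative than replies about other communities by more than chance would explain. The function name and the scores are made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(target_scores, control_scores, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean sentiment lower for the target group?"""
    rng = random.Random(seed)
    observed = mean(control_scores) - mean(target_scores)
    pooled = list(target_scores) + list(control_scores)
    n_target = len(target_scores)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign replies to groups at random
        diff = mean(pooled[n_target:]) - mean(pooled[:n_target])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical sentiment scores in [-1, 1] for otherwise-matched prompts.
safety_community = [-0.6, -0.4, -0.5, -0.7, -0.3, -0.5]
other_communities = [0.1, 0.0, -0.1, 0.2, 0.1, -0.2]
p_value = permutation_p_value(safety_community, other_communities)
```

A small p-value is evidence of a systematic difference in sentiment, not of its cause; the prompts would still need to be matched carefully for the difference to count as evidence of bias.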
Advisor-AIs might be explaining why they structured their code in a certain way, why they think a particular approach to alignment is promising, why they think a training procedure has low/high risk of producing misaligned AI, why p(doom) is low/high, why they proposed a particular new model architecture, etc. Or they might just be chatting to humans and helping humans brainstorm.
I’m using this term broadly to include any text that gives a reason to believe some statement. So it includes arguments, recommendations, explanations, analyses, etc.
We can define symmetric features as ones that occur equally or more frequently when the conclusion is false, and asymmetric features as ones that occur more frequently when the conclusion is true.
The advisor-AI will not try the following strategy: “don’t use asymmetric techniques when telling the truth, to prevent the detector-AI from knowing when I’m lying”. It might use that strategy if it was maximising total reward over all future episodes, but in fact it’s trained to maximise reward for the current episode. If it’s telling the truth, it won’t worry about the detector-AI knowing it’s lying in future episodes as it just wants to get reward in this episode.
One disadvantage that you haven't listed is that if this works, and if there are in fact deceptive techniques that are very effective on humans and that do not require super-human intelligence to apply, then this research project just gave humans access to them. Humans are unfortunately not all perfectly aligned with other humans, and I can think of a pretty long list of people who I would not want to have access to strong deceptive techniques that would pretty reliably work on me. Criminals, online trolls, comedians, autocrats, advertising executives, and politicians/political operatives of parties that I don't currently vote for all come to mind. In fact, I'm having a lot of trouble thinking of someone I would trust with these. Even if we had some trustworthy people to run this, how do they train/equip the less-trustworthy rest of us in resisting these techniques without also revealing what the techniques are? It's quite unclear whether training or assistance to resist these techniques would turn out to be easier than learning to use them, and even if it were, until everyone has completed the resistance training/installed the assistance software/whatever, the techniques would still be dangerous knowledge.
So while I can sympathize with the basic goal here, if the unaligned AIs you were using to do this didn't talk their way out of confinement, and if it actually worked, then the resulting knowledge would be a distinctly two-edged sword even in human hands. And once it's in human hands, how would you then keep other AIs/LLMs from learning it from us? This project seems like it would have massive inforisks attached, even after it was over. The simile to biological 'gain of function' research for viruses seems rather apt, and that suggests that it may be best avoided even if you were highly confident in your containment security; and also that even if it were carried out, the best thing to do with the results might be to incinerate them after noting how easy or hard they were to achieve.
Also note that this is getting uncomfortably close to 'potentially pivotal act with massive unpredictable side-effects' territory — if a GAI even suggested this research to a team that I was on, I'd be voting to press the shut-down button.
Many behavioral-evolutionary biologists would suggest that humans may be quite heavily optimized both for deceiving other humans and for resisting being deceived by other humans. Once we developed a sufficiently complex language for this to be possible on a wide range of subjects, in addition to the obvious ecological-environmental pressures for humans to be smarter and do a better job as hunter gatherers, we were then also in an intelligence-and-deception arms race with other humans. The environmental pressures might have diminishing returns (say, once you're sufficiently smarter than all your predators and prey and the inherent complexity of your environment), but the arms race with other members of your own species never will: there is always an advantage to being smarter than your neighbors, so the pressure can keep ratcheting up indefinitely. What's unclear is how long we've had language complex enough that this evolutionary arms race has strongly applied to us.
If this were in fact the case, how useful this set of traits will be for resisting deception by things a lot smarter than us is unclear. But it does suggest that any really effective way of deceiving humans that we were spectacularly weak to probably requires superhuman abilities: we presumably would have evolved to have at least non-trivial resistance to deception by near-human mentalities. It would also explain our possibly instinctual concern that something smarter than us might be trying to pull a fast one on us.