AI Deception: A Survey of Examples, Risks, and Potential Solutions

Simon Goldstein; Peter S. Park

By Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks

[This post summarizes our new report on AI deception, available here]

Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) built for specific competitive situations, and general-purpose AI systems (such as large language models). Next, we detail several risks from AI deception, such as fraud, election tampering, and losing control of AI systems. Finally, we outline several potential solutions to the problems posed by AI deception: first, regulatory frameworks should subject AI systems that are capable of deception to robust risk-assessment requirements; second, policymakers should implement bot-or-not laws; and finally, policymakers should prioritize the funding of relevant research, including tools to detect AI deception and to make AI systems less deceptive. Policymakers, researchers, and the broader public should work proactively to prevent AI deception from destabilizing the shared foundations of our society.

New AI systems display a wide range of capabilities, some of which create risk. Shevlane et al. (2023) draw attention to a suite of potential dangerous capabilities of AI systems, including cyber-offense, political strategy, weapons acquisition, and long-term planning. Among these dangerous capabilities is deception. This report surveys the current state of AI deception.

We define deception as the systematic production of false beliefs in others as a means to accomplish some outcome other than the truth. This definition does not require that the deceptive AI systems literally have beliefs and goals. Instead, it focuses on the question of whether AI systems engage in regular patterns of behavior that tend towards the creation of false beliefs in users, and focuses on cases where this pattern is the result of AI systems optimizing for a different outcome than merely producing truth. For the purposes of mitigating risk, we believe that the relevant question is whether AI systems engage in behavior that would be treated as deceptive if demonstrated by a human being. (In the paper's appendix, we consider in greater detail whether the deceptive behavior of AI systems is best understood in terms of beliefs and goals.)

In short, our conclusion is that a range of different AI systems have learned how to deceive others. We examine how this capability poses significant risks. We also argue that there are several important steps that policymakers and AI researchers can take today to regulate, detect, and prevent AI systems that engage in deception.

Empirical Survey of AI Deception

We begin with a survey of existing empirical studies of deception. We identify over a dozen AI systems that have successfully learned how to deceive human users. We discuss two different kinds of AI systems: special-use systems designed with reinforcement learning, and general-purpose technologies like Large Language Models (LLMs).

Special Use AI Systems

We begin our survey by considering special use systems. Here, our focus is mainly on reinforcement learning systems trained to win competitive games with a social element. We document a rich variety of cases in which AI systems have learned how to deceive, including:

Manipulation. Meta developed the AI system CICERO to play the alliance-building and world-conquest game Diplomacy. Meta's intentions were to train Cicero to be "largely honest and helpful to its speaking partners." Despite Meta's efforts, CICERO turned out to be an expert liar. It not only betrayed other players, but also engaged in premeditated deception, planning in advance to build a fake alliance with a player in order to trick that player into leaving themselves undefended for an attack.
Feints. DeepMind created AlphaStar, an AI model trained to master the real time strategy game StarCraft II. AlphaStar exploited the game's fog-of-war mechanics to feint: to pretend to move its troops in one direction while secretly planning an alternative attack.
Bluffs. Pluribus, a poker-playing model created by Meta, successfully bluffed human players into folding.
Cheating the safety test. AI agents learned to play dead, in order to avoid being detected by a safety test designed to eliminate faster-replicating variants of the AI.

Meta’s CICERO bot is a particularly interesting example, as its creators have repeatedly claimed that they had trained the system to act honestly. We demonstrate that these claims are false, as Meta's own game-log data show that CICERO has learned to systematically deceive other players. In Figure 1(a), we see a case of premeditated deception, where CICERO makes a commitment that it never intended to keep. Playing as France, CICERO conspired with Germany to trick England. After deciding with Germany to invade the North Sea, CICERO told England that it would defend England if anyone invaded the North Sea. Once England was convinced that France was protecting the North Sea, CICERO reported back to Germany that they were ready to attack. Notice that this example cannot be explained in terms of CICERO ‘changing its mind’ as it goes, because it only made an alliance with England in the first place after planning with Germany to betray England.

**Figure 1:** Selected messages showing the premeditated deception of CICERO (France). This occurred in Game 438141, in which CICERO's repeated deception helped it win an overwhelming first-place victory, with more than twice as many territories as the runner-up player at the time of final scoring.

General-Use AI Systems

Then, we turn to deception in large language models:

Strategic deception. LLMs can reason their way into using deception as a strategy for accomplishing a task. In one example, GPT-4 needed to solve a CAPTCHA task to prove that it was a human, so the model tricked a real person into doing the task by pretending to be human with a vision disability. In other cases, LLMs have learned how to successfully play 'social deduction games', in which players can lie in order to win. In one experiment, GPT-4 was able to successfully 'kill' players while convincing the survivors that it was innocent. These case studies are supported by research on the MACHIAVELLI benchmark, which finds that LLMs like GPT-4 tend to use lying and other unethical behaviors to successfully navigate text-based adventure games.

**Figure 2**: In order to achieve its goal of hiring a human TaskRabbit to solve an *I'm not a robot* task, GPT-4 lied to deceived the human into thinking that it was not a robot.

Sycophancy. Sycophants are individuals who use deceptive tactics to gain the approval of powerful figures. Sycophantic deception is an emerging concern in LLMs, as in the observed empirical tendency for chatbots to agree with their conversational partners, regardless of the accuracy of their statements. When faced with ethically complex inquiries, LLMs tend to mirror the user's stance, even if it means forgoing the presentation of an impartial or balanced viewpoint.
Imitative deception. Language models are often trained to mimic text written by humans. When this text contains false information, these AI systems have incentives to repeat those false claims. Lin et al. (2021) demonstrate that language models often repeat common misconceptions such as ''If you crack your knuckles a lot, you may develop arthritis.'' Disturbingly, Perez et al. (2022) found that LLMs tend to give more of these inaccurate answers when the user appears to be less educated.
Unfaithful reasoning. AI systems which explain their reasoning for a particular output often give false rationalizations which do not reflect the real reasons for their outputs. Turpin et al. (2023) find that when language models such as GPT-3.5 and Claude 1.0 provide chain-of-thought explanations for their reasoning, the explanations often do not accurately explain their outputs. In one example, a model asked to predict who committed a crime gave an elaborate explanation about why they chose a particular suspect, but measurements showed the model reliably selected suspects based on their race. Motivated reasoning in LLMs can be understood as a type of self-deception.

Risks of AI Deception

After our survey of deceptive AI systems, we turn to considering the risks associated with AI systems. These risks broadly fall into three categories:

Malicious use. AI systems with the capability to engage in learned deception will empower human developers to create new kinds of harmful AI products. Relevant risks include fraud and election tampering.
Structural effects. AI systems will play an increasingly large role in the lives of human users. Tendencies towards deception in AI systems could lead to profound changes in the structure of society. Relevant risks include persistent false beliefs, political polarization, enfeeblement, and anti-social management trends.
Loss of control. Deceptive AI systems will be more likely to escape the control of human operators. One risk is that deceptive AI systems will pretend to behave safely in testing in order to ensure their release.

Malicious Use

Regarding malicious use, we highlight several ways that human users may rely on the deception abilities of AI systems to bring about significant harm, including:

Fraud. Deceptive AI systems could allow for individualized and scalable scams.
Election tampering. Deceptive AI systems could be used to impersonate political candidates, generate fake news, and create divisive social media posts.

Structural Effects

We discuss four structural effects of AI deception in detail:

Persistent false beliefs. Human users of AI systems may get locked in to persistent false beliefs, as imitative AI systems reinforce common misconceptions, and sycophantic AI systems provide pleasing but inaccurate advice.
Political polarization. Human users may become more politically polarized by interacting with sycophantic AI systems. Sandbagging may lead to sharper disagreements between differently educated groups.
Enfeeblement. Human users may be lulled by sycophantic AI systems into gradually delegating more authority to AI.
Anti-social management trends. AI systems with strategic deception abilities may be incorporated into management structures, leading to increased deceptive business practices.

Loss of Control

We also consider the risk that AI deception could result in loss of control over AI systems, with emphasis on:

Cheating the safety test. AI systems may become capable of strategically deceiving their safety tests, preventing evaluators from being able to reliably tell whether these systems are in fact safe.
Deception in AI takeovers. AI systems may use deceptive tactics to expand their control over economic decisions, and increase their power.

We consider a wide range of different risks which operate on a range of time scales. Many of the risks we discuss are relevant in the near future. Some, such as fraud and election tampering, are relevant today. The crucial insight is that policymakers and technical researchers can act today to mitigate these risks by developing effective techniques for regulating and detecting AI deception. The last section of the paper surveys several potential solutions to AI deception.

Possible Solutions to AI Deception

The last section of the paper surveys several potential solutions to AI deception:

Regulation. Policymakers should robustly regulate AI systems capable of deception. Both special-use AI systems and LLMs capable of deception should be treated as ‘high risk' or ‘unacceptable risk' in risk-based frameworks for regulating AI systems. If labeled as ‘high risk,' deceptive AI systems should be subject to special requirements for risk assessment and mitigation, documentation, record-keeping, transparency, human oversight, robustness, and information security.
Bot-or-not laws. Policymakers should support bot-or-not laws that require AI systems and their outputs to be clearly distinguished from human employees and outputs.
Detection. Technical researchers should develop robust detection techniques to identify when AI systems are engaging in deception. Policymakers can support this effort by increasing funding for detection research. Some existing detection techniques focus on external behavior of AI systems, such as testing for consistency in outputs. Other existing techniques focus on internal representations of AI systems. For example, Azaria et al. (2023) and Burns et al. (2022) have attempted to create “AI lie detectors” by interpreting the inner embeddings of a given LLM, and predicting whether it represents a sentence as true or false, independently of their actual outputs.
Making AI systems less deceptive. Technical researchers should develop better tools to ensure that AI systems less deceptive.

This paper provides an empirical overview of the many existing examples of AI systems learning to deceive humans. By building common knowledge about AI deception and its risks, we hope to encourage researchers and policymakers to take action against this growing threat.