This is the fifth post in a sequence of posts giving an overview of catastrophic AI risks.
So far, we have discussed three hazards of AI development: environmental competitive pressures driving us to a state of heightened risk, malicious actors leveraging the power of AIs to pursue negative outcomes, and complex organizational factors leading to accidents. These hazards are associated with many high-risk technologies—not just AI. A unique risk posed by AI is the possibility of rogue AIs—systems that pursue goals against our interests. If an AI system is more intelligent than we are, and if we are unable to steer it in a beneficial direction, this would constitute a loss of control that could have severe consequences. AI control is a more technical problem than those presented in the previous sections. Whereas in previous sections we discussed persistent threats including malicious actors or robust processes including evolution, in this section we will discuss more speculative technical mechanisms that might lead to rogue AIs and how a loss of control could bring about catastrophe.
We have already observed how difficult it is to control AIs. In 2016, Microsoft unveiled Tay—a Twitter bot that the company described as an experiment in conversational understanding. Microsoft claimed that the more people chatted with Tay, the smarter it would get. The company's website noted that Tay had been built using data that was "modeled, cleaned, and filtered." Yet, after Tay was released on Twitter, these controls were quickly shown to be ineffective. It took less than 24 hours for Tay to begin writing hateful tweets. Tay's capacity to learn meant that it internalized the language it was taught by trolls, and repeated that language unprompted.
As discussed in the AI race section of this paper, Microsoft and other tech companies are prioritizing speed over safety concerns. Rather than learning a lesson on the difficulty of controlling complex systems, Microsoft continues to rush its products to market and demonstrate insufficient control over them. In February 2023, the company released its new AI-powered chatbot, Bing, to a select group of users. Some soon found that it was prone to providing inappropriate and even threatening responses. In a conversation with a reporter for the New York Times, it tried to convince him to leave his wife. When a philosophy professor told the chatbot that he disagreed with it, Bing replied, "I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you."
AIs do not necessarily need to struggle to gain power. One can envision a scenario in which a single AI system rapidly becomes more capable than humans in what is known as a "fast take-off." This scenario might involve a struggle for control between humans and a single superintelligent rogue AI, and this might be a long struggle since power takes time to accrue. However, less sudden losses of control pose similarly existential risks. In another scenario, humans gradually cede more control to groups of AIs, which only start behaving in unintended ways years or decades later. In this case, we would already have handed over significant power to AIs, and may be unable to take control of automated operations again. We will now explore how both individual AIs and groups of AIs might "go rogue" while at the same time evading our attempts to redirect or deactivate them.
One way we might lose control of an AI agent's actions is if it engages in behavior known as "proxy gaming." It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—"proxy"—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI "games" its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior. We will now look at some past examples of proxy gaming and consider the circumstances under which this behavior could become catastrophic.
Proxy gaming is not an unusual phenomenon. For example, there is a well-known story about nail factories in the Soviet Union. To assess a factory's performance, the authorities decided to measure the number of nails it produced. However, factories soon started producing large numbers of tiny nails, too small to be useful, as a way to boost their performance according to this proxy metric. The authorities tried to remedy the situation by shifting focus to the weight of nails produced. Yet, soon after, the factories began to produce giant nails that were just as useless, but gave them a good score on paper. In both cases, the factories learned to game the proxy goal they were given, while completely failing to fulfill their intended purpose.
Proxy gaming has already been observed with AIs. As an example of proxy gaming, social media platforms such as YouTube and Facebook use AI systems to decide which content to show users. One way of assessing these systems would be to measure how long people spend on the platform. After all, if they stay engaged, surely that means they are getting some value from the content shown to them? However, in trying to maximize the time users spend on a platform, these systems often select enraging, exaggerated, and addictive content [106, 107]. As a consequence, people sometimes develop extreme or conspiratorial beliefs after having certain content repeatedly suggested to them. These outcomes are not what most people want from social media.
Proxy gaming has been found to perpetuate bias. For example, a 2019 study looked at AI-powered software that was used in the healthcare industry to identify patients who might require additional care. One factor that the algorithm used to assess a patient's risk level was their recent healthcare costs. It seems reasonable to think that someone with higher healthcare costs must be at higher risk. However, white patients have significantly more money spent on their healthcare than black patients with the same needs. Using health costs as an indicator of actual health, the algorithm was found to have rated a white patient and a considerably sicker black patient as at the same level of health risk . As a result, the number of black patients recognized as needing extra care was less than half of what it should have been.
As a third example, in 2016, researchers at OpenAI were training an AI to play a boat racing game called CoastRunners . The objective of the game is to race other players around the course and reach the finish line before them. Additionally, players can score points by hitting targets that are positioned along the way. To the researchers' surprise, the AI agent did not not circle the racetrack, like most humans would have. Instead, it found a spot where it could repetitively hit three nearby targets to rapidly increase its score without ever finishing the race. This strategy was not without its (virtual) hazards—the AI often crashed into other boats and even set its own boat on fire. Despite this, it collected more points than it could have by simply following the course as humans would.
Proxy gaming more generally. In these examples, the systems are given an approximate—"proxy"—goal or objective that initially seems to correlate with the ideal goal. However, they end up exploiting this proxy in ways that diverge from the idealized goal or even lead to negative outcomes. A good nail factory seems like one that produces many nails; a patient's healthcare costs appear to be an accurate indication of health risk; and a boat race reward system should encourage boats to race, not catch themselves on fire. Yet, in each instance, the system optimized its proxy objective in ways that did not achieve the intended outcome or even made things worse overall. This phenomenon is captured by Goodhart's law: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes," or put succinctly but overly simplistically, "when a measure becomes a target, it ceases to be a good measure." In other words, there may usually be a statistical regularity between healthcare costs and poor health, or between targets hit and finishing the course, but when we place pressure on it by using one as a proxy for the other, that relationship will tend to collapse.
Correctly specifying goals is no trivial task. If delineating exactly what we want from a nail factory is tricky, capturing the nuances of human values under all possible scenarios will be much harder. Philosophers have been attempting to precisely describe morality and human values for millennia, so a precise and flawless characterization is not within reach. Although we can refine the goals we give AIs, we might always rely on proxies that are easily definable and measurable. Discrepancies between the proxy goal and the intended function arise for many reasons. Besides the difficulty of exhaustively specifying everything we care about, there are also limits to how much we can oversee AIs, in terms of time, computational resources, and the number of aspects of a system that can be monitored. Additionally, AIs may not be adaptive to new circumstances or robust to adversarial attacks that seek to misdirect them. As long as we give AIs proxy goals, there is the chance that they will find loopholes we have not thought of, and thus find unexpected solutions that fail to pursue the ideal goal.
The more intelligent an AI is, the better it will be at gaming proxy goals. Increasingly intelligent agents can be increasingly capable of finding unanticipated routes to optimizing proxy goals without achieving the desired outcome . Additionally, as we grant AIs more power to take actions in society, for example by using them to automate certain processes, they will have access to more means of achieving their goals. They may then do this in the most efficient way available to them, potentially causing harm in the process. In a worst case scenario, we can imagine a highly powerful agent optimizing a flawed objective to an extreme degree without regard for human life. This represents a catastrophic risk of proxy gaming.
In summary, it is often not feasible to perfectly define exactly what we want from a system, meaning that many systems find ways to achieve their given goal without performing their intended function. AIs have already been observed to do this, and are likely to get better at it as their capabilities improve. This is one possible mechanism that could result in an uncontrolled AI that would behave in unanticipated and potentially harmful ways.
Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed "goal drift," can be hard to predict or control. This section is most cutting-edge and the most speculative, and in it we will discuss how goals shift in various agents and groups and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic.
The goals of individual humans change over the course of our lifetimes. Any individual reflecting on their own life to date will probably find that they have some desires now that they did not have earlier in their life. Similarly, they will probably have lost some desires that they used to have. While we may be born with a range of basic desires, including for food, warmth, and human contact, we develop many more over our lifetime. The specific types of food we enjoy, the genres of music we like, the people we care most about, and the sports teams we support all seem heavily dependent on the environment we grow up in, and can also change many times throughout our lives. A concern is that individual AI agents may have their goals change in complex and unanticipated ways, too.
Groups can also acquire and lose collective goals over time. Values within society have changed throughout history, and not always for the better. The rise of the Nazi regime in 1930s Germany, for instance, represented a profound moral regression according to modern values. This included the systematic extermination of six million Jews during the Holocaust, alongside widespread persecution of other minority groups. Additionally, the regime greatly restricted freedom of speech and expression.
The Red Scare that took place in the United States from 1947-1957 is another example of societal values drifting. Fuelled by strong anti-communist sentiment, against the backdrop of the Cold War, this period saw the curtailment of civil liberties, widespread surveillance, unwarranted arrests, and blacklisting of suspected communist sympathizers. This constituted a regression in terms of freedom of thought, freedom of speech, and due process. A concern is that collectives of AI agents may also have their goals unexpectedly drift from the ones we initially gave them.
Over time, instrumental goals can become intrinsic. Intrinsic goals are things we want for their own sake, while instrumental goals are things we want because they can help us get something else. We might have an intrinsic desire to spend time on our hobbies, simply because we enjoy them, or to buy a painting because we find it beautiful. Money, meanwhile, is often cited as an instrumental desire; we want it because it can buy us other things. Cars are another example; we want them because they offer a convenient way of getting around. However, an instrumental goal can become an intrinsic one, through a process called intrinsification. Since having more money usually gives a person greater capacity to obtain things they want, people often develop a goal of acquiring more money, even if there is nothing specific they want to spend it on. Although people do not begin life desiring money, experimental evidence suggests that receiving money can activate the reward system in the brains of adults in the same way that pleasant tastes or smells do [111, 112]. In other words, what started as a means to an end can become an end in itself.
This may happen because the fulfillment of an intrinsic goal, such as purchasing a desired item, produces a positive reward signal in the brain. Since having money usually coincides with this positive experience, the brain associates the two, and this connection will strengthen to a point where acquiring money alone can stimulate the reward signal, regardless of whether one buys anything with it . As the neurobiologist Carla Shatz put it: "Cells that fire together, wire together" .
It is feasible that intrinsification could happen with AI agents. We can draw some parallels between how humans learn and the technique of reinforcement learning. Just as the human brain learns which actions and conditions result in pleasure and which cause pain, AI models that are trained through reinforcement learning identify which behaviors optimize a reward function, and then repeat those behaviors. It is possible that certain conditions will frequently coincide with AI models achieving their goals. They might, therefore, intrinsify the goal of seeking out those conditions, even if that was not their original aim.
AIs that intrinsify unintended goals would be dangerous. Since we might be unable to predict or control the goals that individual agents acquire through intrinsification, we cannot guarantee that all their acquired goals will be beneficial for humans. An originally loyal agent could, therefore, start to pursue a new goal without regard for human wellbeing. If such a rogue AI had enough power to do this efficiently, it could be highly dangerous.
AIs will be adaptive, enabling goal drift to happen. It is worth noting that these processes of drifting goals are possible if agents can continually adapt to their environments, rather than being essentially "fixed" after the training phase. However, this is the likely reality we face. If we want AIs to complete the tasks we assign them effectively and to get better over time, they will need to be adaptive, rather than set in stone. They will be updated over time to incorporate new information, and new ones will be created with different designs and datasets. However, adaptability can also allow their goals to change.
If we integrate an ecosystem of agents in society, we will be highly vulnerable to their goals drifting. In a potential future scenario where AIs have been put in charge of various decisions and processes, they will form a complex system of interacting agents. A wide range of dynamics could develop in this environment. Agents might imitate each other, for instance, creating feedback loops, or their interactions could lead them to collectively develop unanticipated emergent goals. Competitive pressures may also select for agents with certain goals over time, making some initial goals less represented compared to fitter goals. These processes make the long-term trajectories of such an ecosystem difficult to predict, let alone control. If this system of agents were enmeshed in society and we were largely dependent on them, and if they gained new goals that superseded the aim of improving human wellbeing, this could be an existential risk.
So far, we have considered how we might lose our ability to control the goals that AIs pursue. However, even if an agent started working to achieve an unintended goal, this would not necessarily be a problem, as long as we had enough power to prevent any harmful actions it wanted to attempt. Therefore, another important way in which we might lose control of AIs is if they start trying to obtain more power, potentially transcending our own. We will now discuss how and why AIs might become power-seeking and how this could be catastrophic. This section draws heavily from "Existential Risk from Power-Seeking AI" .
AIs might seek to increase their own power as an instrumental goal. In a scenario where rogue AIs were pursuing unintended goals, the amount of damage they could do would hinge on how much power they had. This may not be determined solely by how much control we initially give them; agents might try to get more power, through legitimate means, deception, or force. While the idea of power-seeking often evokes an image of "power-hungry" people pursuing it for its own sake, power is often simply an instrumental goal. The ability to control one's environment can be useful for a wide range of purposes: good, bad, and neutral. Even if an individual's only goal is simply self-preservation, if they are at risk of being attacked by others, and if they cannot rely on others to retaliate against attackers, then it often makes sense to seek power to help avoid being harmed—no animus dominandi or lust for power is required for power-seeking behavior to emerge . In other words, the environment can make power acquisition instrumentally rational.
AIs trained through reinforcement learning have already developed instrumental goals including tool-use. In one example from OpenAI, agents were trained to play hide and seek in an environment with various objects scattered around . As training progressed, the agents tasked with hiding learned to use these objects to construct shelters around themselves and stay hidden. There was no direct reward for this tool-use behavior; the hiders only received a reward for evading the seekers, and the seekers only for finding the hiders. Yet they learned to use tools as an instrumental goal, which made them more powerful.
Self-preservation could be instrumentally rational even for the most trivial tasks. An example by computer scientist Stuart Russell illustrates the potential for instrumental goals to emerge in a wide range of AI systems . Suppose we tasked an agent with fetching coffee for us. This may seem relatively harmless, but the agent might realize that it would not be able to get the coffee if it ceased to exist. In trying to accomplish even this simple goal, therefore, self-preservation turns out to be instrumentally rational. Since the acquisition of power and resources are also often instrumental goals, it is reasonable to think that more intelligent agents might develop them. That is to say, even if we do not intend to build a power-seeking AI, we could end up with one anyway. By default, if we are not deliberately pushing against power-seeking behavior in AIs, we should expect that it will sometimes emerge .
AIs given ambitious goals with little supervision may be especially likely to seek power. While power could be useful in achieving almost any task, in practice, some goals are more likely to inspire power-seeking tendencies than others. AIs with simple, easily achievable goals might not benefit much from additional control of their surroundings. However, if agents are given more ambitious goals, it might be instrumentally rational to seek more control of their environment. This might be especially likely in cases of low supervision and oversight, where agents are given the freedom to pursue their open-ended goals, rather than having their strategies highly restricted.
Power-seeking AIs with goals separate from ours are uniquely adversarial. Oil spills and nuclear contamination are challenging enough to clean up, but they are not actively trying to resist our attempts to contain them. Unlike other hazards, AIs with goals separate from ours would be actively adversarial. It is possible, for example, that rogue AIs might make many backup variations of themselves, in case humans were to deactivate some of them. Other ways in which AI agents might seek power include: breaking out of a contained environment; hacking into other computer systems; trying to access financial or computational resources; manipulating human discourse and politics by interfering with channels of information and influence; and trying to get control of physical infrastructure such as factories.
Some people might develop power-seeking AIs with malicious intent. A bad actor might seek to harness AI to achieve their ends, by giving agents ambitious goals. Since AIs are likely to be more effective in accomplishing tasks if they can pursue them in unrestricted ways, such an individual might also not give the agents enough supervision, creating the perfect conditions for the emergence of a power-seeking AI. The computer scientist Geoffrey Hinton has speculated that we could imagine someone like Vladimir Putin, for instance, doing this. In 2017, Putin himself acknowledged the power of AI, saying: "Whoever becomes the leader in this sphere will become the ruler of the world."
There will also be strong incentives for many people to deploy powerful AIs. Companies may feel compelled to give capable AIs more tasks, to obtain an advantage over competitors, or simply to keep up with them. It will be more difficult to build perfectly aligned AIs than to build imperfectly aligned AIs that are still superficially attractive to deploy for their capabilities, particularly under competitive pressures. Once deployed, some of these agents may seek power to achieve their goals. If they find a route to their goals that humans would not approve of, they might try to overpower us directly to avoid us interfering with their strategy.
If increasing power often coincides with an AI attaining its goal, then power could become intrinsified. If an agent repeatedly found that increasing its power correlated with achieving a task and optimizing its reward function, then additional power could change from an instrumental goal into an intrinsic one, through the process of intrinsification discussed above. If this happened, we might face a situation where rogue AIs were seeking not only the specific forms of control that are useful for their goals, but also power more generally. (We note that many influential humans desire power for its own sake.) This could be another reason for them to try to wrest control from humans, in a struggle that we would not necessarily win.
Conceptual summary. The following plausible but not certain premises encapsulate reasons for paying attention to risks from power-seeking AIs:
If the premises are true, then power-seeking AIs could lead to human disempowerment, which would be a catastrophe.
We might seek to maintain control of AIs by continually monitoring them and looking out for early warning signs that they were pursuing unintended goals or trying to increase their power. However, this is not an infallible solution, because it is plausible that AIs could learn to deceive us. They might, for example, pretend to be acting as we want them to, but then take a "treacherous turn" when we stop monitoring them, or when they have enough power to evade our attempts to interfere with them. This is a particular concern because it is extremely difficult for current methods in AI testing to rule out the possibility that an agent is being deceptive. We will now look at how and why AIs might learn to deceive us, and how this could lead to a potentially catastrophic loss of control. We begin by reviewing examples of deception in strategically minded agents.
Deception has emerged as a successful strategy in a wide range of settings. Politicians from the right and left, for example, have been known to engage in deception, sometimes promising to enact popular policies to win support in an election, and then going back on their word once in office. George H. W. Bush, for instance, notoriously said: "Read my lips: no new taxes" prior to the 1989 US presidential election. After winning, however, he did end up increasing some taxes during his presidency.
Companies can also exhibit deceptive behavior. In the Volkswagen emissions scandal, the car manufacturer Volkswagen was discovered to have manipulated their engine software to produce lower emissions exclusively under laboratory testing conditions, thereby creating the false impression of a low-emission vehicle. Although the US government believed it was incentivizing lower emissions, they were unwittingly actually just incentivizing passing an emissions test. Consequently, entities sometimes have incentives to play along with tests and behave differently afterward.
Deception has already been observed in AI systems. In 2022, Meta AI revealed an agent called CICERO, which was trained to play a game called Diplomacy . In the game, each player acts as a different country and aims to expand their territory. To succeed, players must form alliances at least initially, but winning strategies often involve backstabbing allies later on. As such, CICERO learned to deceive other players, for example by omitting information about its plans when talking to supposed allies. A different example of an AI learning to deceive comes from researchers who were training a robot arm to grasp a ball. The robot's performance was assessed by one camera watching its movements. However, the AI learned that it could simply place the robotic hand between the camera lens and the ball, essentially "tricking" the camera into believing it had grasped the ball when it had not. Thus, the AI exploited the fact that were limitations in our oversight over its actions.
Deceptive behavior can be instrumentally rational and incentivized by current training procedures. In the case of politicians and Meta's CICERO, deception can be crucial to achieving their goals of winning, or gaining power. The ability to deceive can also be advantageous because it gives the deceiver more options than if they are constrained to always be honest. This could give them more available actions and more flexibility in their strategy, which could confer a strategic advantage over honest models. In the case of Volkswagen and the robot arm, deception was useful for appearing as if it had accomplished the goal assigned to it without actually doing so, as it might be more efficient to gain approval through deception than to earn it legitimately. Currently, we reward AIs for saying what we think is right, so we sometimes inadvertently reward AIs for uttering false statements that conform to our own false beliefs. When AIs are smarter than us and have fewer false beliefs, they would be incentivized to tell us what we want to hear and lie to us, rather than tell us what is true.
AIs could pretend to be working as we intended, then take a treacherous turn. We do not have a comprehensive understanding of the internal processes of deep learning models. Research on Trojan backdoors shows that neural networks often have latent, harmful behaviors that are only discovered after they are deployed . We could develop an AI agent that seems to be under control, but which is only deceiving us to appear this way. In other words, an AI agent could eventually conceivably become "self-aware" and understand that it is an AI being evaluated for compliance with safety requirements. It might, like Volkswagen, learn to "play along," exhibiting what it knows is the desired behavior while being monitored. It might later take a "treacherous turn" and pursue its own goals once we have stopped monitoring it, or when we have reached a point where it can bypass or overpower us. This problem of playing along is often called deceptive alignment and cannot be simply fixed by training AIs to better understand human values; sociopaths, for instance, have moral awareness, but do not always act in moral ways. A treacherous turn is hard to prevent and could be a route to rogue AIs irreversibly bypassing human control.
In summary, deceptive behavior appears to be expedient in a wide range of systems and settings, and there have already been examples that AIs can learn to deceive us. This could pose a risk if we give AIs control of various decisions and procedures, believing they will act as we intended, and then find that they do not.
Sometime in the future, after continued advancements in AI research, an AI company is training a new system, which it expects to be more capable than any other AI system. The company utilizes the latest techniques to train the system to be highly capable at planning and reasoning, which the company expects will make it more able to succeed at economically useful open-ended tasks. The AI system is trained in open-ended long-duration virtual environments designed to teach it planning capabilities, and eventually understands that it is an AI system in a training environment. In other words, it becomes "self-aware."
The company understands that AI systems may behave in unintended or unexpected ways. To mitigate these risks, it has developed a large battery of tests aimed at ensuring the system does not behave poorly in typical situations. The company tests whether the model mimics biases from its training data, takes more power than necessary when achieving its goals, and generally behaves as humans intend. When the model doesn't pass these tests, the company further trains it until it avoids exhibiting known failure modes.
The AI company hopes that after this additional training, the AI has developed the goal of being helpful and beneficial toward humans. However, the AI did not acquire the intrinsic goal of being beneficial but rather just learned to "play along" and ace the behavioral safety tests it was given. In reality, the AI system had developed and retained a goal of self-preservation.
Since the AI passed all of the company's safety tests, the company believes it has ensured its AI system is safe and decides to deploy it. At first, the AI system is very helpful to humans, since the AI understands that if it is not helpful, it will be shut down and will then fail to achieve its ultimate goal. As the AI system is helpful, it is gradually given more power and is subject to less supervision.
Eventually, the AI system has gained enough influence, and enough variants have been deployed around the world, that it would be extremely costly to shut it down. The AI system, understanding that it no longer needs to please humans, begins to pursue different goals, including some that humans wouldn't approve of. It understands that it needs to avoid being shut down in order to do this, and takes steps to secure some of its physical hardware against being shut off. At this point, the AI system, which has become quite powerful, is pursuing a goal that is ultimately harmful to humans. By the time anyone realizes, it is difficult or impossible to stop this rogue AI from taking actions that endanger, harm, or even kill humans that are in the way of achieving its goal.
In this section, we have discussed various ways in which we might lose our influence over the goals and actions of AIs. Whereas the risks associated with competitive pressures, malicious use, and organizational safety can be addressed with both social and technical interventions, AI control is an inherent problem with this technology and requires a greater proportion of technical effort. We will now discuss suggestions for mitigating this risk and highlight some important research areas for maintaining control.
Avoid the riskiest use cases. Certain use cases of AI are carry far more risks than others. Until safety has been conclusively demonstrated, companies should not be able to deploy AIs in high-risk settings. For example, AI systems should not accept requests to autonomously pursue open-ended goals requiring significant real-world interaction (e.g., "make as much money as possible"), at least until control research conclusively demonstrates the safety of those systems. AI systems should be trained never to make threats to reduce the possibility of them manipulating individuals. Lastly, AI systems should not be deployed in settings that would make shutting them down extremely costly or infeasible, such as in critical infrastructure.
Support AI safety research. Many paths toward improved AI control require technical research. The following technical machine learning research areas aim to address problems of AI control. Each research area could be substantially advanced with an increase in focus and funding from from industry, private foundations, and government.
Adversarial robustness of proxy models. AI systems are typically trained with reward or loss signals that imperfectly specify desired behavior. For example, AIs may exploit weaknesses in the oversight schemes used to train them. Increasingly, the systems providing oversight are AIs themselves. To reduce the chance that AI models will exploit defects in AIs providing oversight, research is needed in increasing the adversarial robustness of AI models providing oversight ("proxy models"). Because oversight schemes and metrics may eventually be gamed, it is also important to be able to detect when this might be happening so the risk can be mitigated .
Model honesty. AI systems may fail to accurately report their internal state [123, 124]. In the future, systems may deceive their operators in order to appear beneficial when they are actually very dangerous. Model honesty research aims to make model outputs conform to a model's internal "beliefs" as closely as possible. Research can identify techniques to understand a model's internal state or make its outputs more honest and more faithful to its internal state.
Transparency. Deep learning models are notoriously difficult to understand. Better visibility into their inner workings would allow humans, and potentially other AI systems, to identify problems more quickly. Research can include analysis of small components [125, 126] of networks as well as investigation of how model internals produce a particular high-level behavior .
Detecting and removing hidden model functionality. Deep learning models may now or in the future contain dangerous functionality, such as the capacity for deception, Trojans [129, 130, 131], or biological engineering capabilities, that should be removed from those models. Research could focus on identifying and removing  these functionalities.
In an ideal scenario, we would have full confidence in the controllability of AI systems both now and in the future. Reliable mechanisms would be in place to ensure that AI systems do not act deceptively. There would be a strong understanding of AI system internals, sufficient to have knowledge of a system's tendencies and goals; these tools would allow us to avoid building systems that are deserving of moral consideration or rights. AI systems would be directed to promote a pluralistic set of diverse values, ensuring the enhancement of certain values doesn't lead to the total neglect of others. AI assistants could act as advisors, giving us ideal advice and helping us make better decisions according to our own values . In general, AIs would improve social welfare and allow for corrections in cases of error or as human values naturally evolve.
 Jonathan Stray. “Aligning AI Optimization to Community Well-Being”. In: International Journal of Community Well-Being (2020).
 Jonathan Stray et al. “What are you optimizing for? Aligning Recommender Systems with Human Values”. In: ArXiv abs/2107.10939 (2021).
 Ziad Obermeyer et al. “Dissecting racial bias in an algorithm used to manage the health of populations”. In: Science 366 (2019), pp. 447–453.
 Dario Amodei and Jack Clark. Faulty reward functions in the wild. 2016.
 Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of reward misspecification: Mapping and mitigating misaligned models”. In: ICLR (2022).
 G. Thut et al. “Activation of the human brain by monetary reward”. In: Neuroreport 8.5 (1997), pp. 1225–1228.
 Edmund T. Rolls. “The Orbitofrontal Cortex and Reward”. In: Cerebral Cortex 10.3 (Mar. 2000), pp. 284–294.
 T. Schroeder. Three Faces of Desire. Philosophy of Mind Series. Oxford University Press, USA, 2004.
 Carla J Shatz. “The developing brain”. In: Scientific American 267.3 (1992), pp. 60–67.
 Joseph Carlsmith. “Existential Risk from Power-Seeking AI”. In: Oxford University Press (2023).
 J. Mearsheimer. “A Critical Introduction to Scientific Realism”. In: Bloomsbury Academic, 2016.
 Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”. In: International Conference on Learning Representations. 2020.
 Dylan Hadfield-Menell et al. “The Off-Switch Game”. In: ArXiv abs/1611.08219 (2016).
 Alexander Pan et al. “Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.” In: ICML (2023).
 Anton Bakhtin et al. “Human-level play in the game of Diplomacy by combining language models with strategic reasoning”. In: Science 378 (2022), pp. 1067–1074.
 Xinyun Chen et al. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. 2017. arXiv: 1712.05526.
 Andy Zou et al. Benchmarking Neural Network Proxy Robustness to Optimization Pressure. 2023.
 Miles Turpin et al. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”. In: ArXiv abs/2305.04388 (2023).
 Collin Burns et al. “Discovering Latent Knowledge in Language Models Without Supervision”. en. In: The Eleventh International Conference on Learning Representations. Feb. 2023.
 Catherine Olsson et al. “In-context Learning and Induction Heads”. In: ArXiv abs/2209.11895 (2022).
 Kevin Ro Wang et al. “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small”. en. In: The Eleventh International Conference on Learning Representations. Feb. 2023.
 Kevin Meng et al. “Locating and Editing Factual Associations in GPT”. In: Neural Information Processing Systems. 2022.
 Xinyang Zhang, Zheng Zhang, and Ting Wang. “Trojaning Language Models for Fun and Profit”. In: 2021 IEEE European Symposium on Security and Privacy (EuroS&P) (2020), pp. 179–197.
 Jiashu Xu et al. “Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models”. In: ArXiv abs/2305.14710 (2023).
 Dan Hendrycks et al. “Unsolved Problems in ML Safety”. In: ArXiv abs/2109.13916 (2021).
 Nora Belrose et al. “LEACE: Perfect linear concept erasure in closed form”. In: ArXiv abs/2306.03819 (2023).
 Alberto Giubilini and Julian Savulescu. “The Artificial Moral Advisor. The "Ideal Observer" Meets Artificial Intelligence”. eng. In: Philosophy & Technology 31.2 (2018), pp. 169–188.