Drug addicts and deceptively aligned agents - a comparative analysis

AI ALIGNMENT FORUM

Co-authored by Nadia Mir-Montazeri and Jan Kirchner.

Addicts and AIs

A young man, let’s call him Dave^[1], starts consuming different kinds of illegal drugs (mostly heroin) as early as age 14, as a reaction to the divorce of his parents. Even though he is from an otherwise stable family, he becomes homeless after fights about his drug consumption. Repeatedly, Dave returns to his family's home, crying and asking to sleep on the couch for a few nights. Repeatedly, the family takes him in, believing that this time he has finally had a change of heart. And time and time again, Dave disappears after a few days, taking with him all the money and anything else of value that is not nailed down.

As a doctor on the psych ward, I (Nadia) have encountered situations like this over and over. The first few times as a health professional, you might get fooled by a patient with addiction. But soon, you learn to be skeptical and stop paying attention to what they say they did and how they appeal to your empathy. Instead, you start to pay close attention to what you can observe people doing (and what others tell you they did). Luckily, most doctors are smarter than addicts, especially after the addict’s cognitive ability starts to decline due to substance abuse. Still, treating addicts requires constant vigilance and attention to details and potential contradictions. This is a sad state of affairs, but the simple fact that “addicts lie”^[2] has become one of the central axioms of psychiatric practice.

Intriguingly, a very similar situation has emerged in AI Safety^[3], albeit from a very different direction. One of the central intuition pumps of AI Safety is that of the single-minded, utility-maximizing artificial agent, whose actions end up being harmful due to a misalignment of values. While (luckily!) no pure version of this agent exists yet, theoretical studies allow us to make inferences about properties that such an agent is likely to have. These include, e.g., the properties incentivized by instrumental convergence or the property of nontrivial inner alignment. However, theory is theory, and some have questioned the likelihood of instrumental convergence actually occurring. Examples of instrumental convergence remain “hypothetical” and we only know of one reported example of inner alignment failure in a computer model. We believe that this is problematic, as reality is complex and (theoretically) simple concepts can end up being very complicated or very different in practice.

In this post, we want to outline how an addict’s astounding ability to optimize for getting more drugs has striking similarities to the relentless optimization capabilities of modern AI systems. We argue that addicts exhibit a weak form of instrumental convergence and that deceptiveness naturally emerges in a setting where the optimization process requires cooperation from an uncooperative interlocutor. The phenomenon of drug addiction might thus serve as a useful “real world” example of many of the theoretical predictions of the AI Safety community. Additionally, we examine common approaches from addiction therapy and evaluate whether they can provide insight into the problem of AI safety. Finally, we highlight some limitations of this approach.

If you are running low on time, here is a TL;DR of our analysis:

Drug addicts serve as a real-life example of a(n approximate) single-minded utility optimizer
They show properties consistent with instrumental convergence, fulfilling 3 out of 5 features highlighted by Nick Bostrom (resource acquisition, technological perfection, and goal-content integrity)
Addicts use deception extensively as a tool for achieving their instrumental values
- We interpret this as an extreme form of an epistemic strategy
We find potential analogues of addiction therapy for AI Safety:
- Group therapy might be applicable to multipolar scenarios for AI development
- Anti-Craving and substitution medication might translate to interpretability work, esp. model editing
- Preventive measures might translate to agent foundations research

Ontogenesis and characteristics of an addict

Koob and Volkow define addiction as:

a chronically relapsing disorder, characterised by compulsion to seek and take the drug, loss of control in limiting intake, and emergence of a negative emotional state (eg, dysphoria, anxiety, irritability) when access to the drug is prevented.

While much of early research has focused solely on the role of dopamine and the reward system in addiction, it is now becoming clear that the transition from occasional use to chronic use involves substantive changes at the molecular, cellular, and neurocircuitry levels. Here is Adinoff:

The persistent release of dopamine during chronic drug use progressively recruits limbic brain regions and the prefrontal cortex, embedding drug cues into the amygdala [...] and involving the amygdala, anterior cingulate, orbitofrontal cortex, and dorsolateral prefrontal cortex in the obsessive craving for drugs.

These structural changes carve out grooves in the addict's brain that can lead to a strong and immediate relapse of recovering addicts following a single dose of the drug, a contextual cue, a craving sensation, stress, or distress. Another (particularly destructive) effect of drug abuse is the hijacking of the reward system. Volkow et al.:

These findings suggest an overall reduction in the sensitivity of the reward circuits in the drug-addicted individual to natural reinforcers and possibly to drugs that are not the drug of addiction, thereby providing a putative mechanism underlying the [sadness and discomfort] experienced during withdrawal.

As a consequence, addicts can become extremely single-minded in their desires: they lose interest in food, sex, and their previous social circle, creating a vicious cycle of progressive bodily and mental decline. In the language of moral philosophy, drug use evolves from having instrumental value (e.g. to disinhibit, to improve performance, or to alleviate pain) to having intrinsic value (drug use for reducing the desire for drugs).

Interestingly, the further the addiction progresses, the more the addict resembles a rational expected utility maximizer^[4]. Consistent with this, the escalation of drug use is a hallmark of addiction, and addicts rapidly deplete social and financial capital to enable continued drug use. They thus resemble the archetypical^[5] paperclip maximizer:

a hypothetical artificial intelligence whose utility function values something that humans would consider almost worthless.

Instrumental convergence in addicts and AIs

This resemblance is particularly interesting, as addicts share additional features with the hypothetical paperclip maximizers. They exhibit (a limited form of) instrumental convergence. Nick Bostrom formulates the instrumental convergence thesis as:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

Among the instrumental values proposed by Bostrom are:

Resource acquisition: Having more resources allows the agent to more efficiently maximize its final goals. In the case of addicts, the need for money to buy drugs is a common motivation for property crime and for other activities, including drug sales and prostitution, as well as for seeking legitimate income and avoiding other expenditures.
Technological perfection: Seeking more efficient ways of transforming some given set of inputs into valued outputs. This can be seen for heroin addicts, who often transition from snorting or smoking heroin to injecting it intravenously to increase the intensity of the high.
Goal-content integrity^[6]: Preventing alterations of the final goals. Even though this is (probably) not executed intentionally by the individual addict, the formation of and the involvement in drug culture constantly reinforces the addict’s relationship with the drug.^[7] As drug cultures explicitly distance themselves from the rest of society, they make it more difficult for the addict to stop using. In fact, a central step in drug recovery programs is to replace the drug culture with a culture of recovery such as Alcoholics Anonymous.

Interestingly, not all instrumental values proposed by Bostrom are pursued by addicts:

Self-preservation: Trying to be around in the future, to help the agent achieve its present future-oriented goal. As mentioned above, mental and physical health tends to deteriorate rapidly once substance abuse escalates. In addition, suicide attempts are very common among addicts. This accentuates myopic tendencies that are common among drug users.
Cognitive enhancement: Improving rationality and intelligence. While there is some work on how addicts act rationally given their circumstances, they do not actively seek out cognitive enhancement^[8]. This is presumably because cognitive enhancement is not easily achievable compared to other instrumental values.

Deceptiveness in addicts and AIs

As we alluded to in the introduction, addicts are not only somewhat myopic but often also deceptive. In addition to dishonesty towards a therapist about drug use, feigning physical illness to receive medication, and extensive self-deception, addicts also tend to manipulate emotions:

Lying and dishonesty are, indeed, quite frequent in people with addiction (Ferrari et al., 2008; Sher & Epler, 2004), who often tend to manipulate others close to them to continue their substance use because they are jeopardized by their addiction and intense craving. Some common manipulation tactics refer to making empty promises, playing the victim, making excuses for irresponsibility, making others feel uncomfortable or guilty with the aim of satisfying unreasonable requests, threatening to self-harm, and so on.

Now, why do addicts deceive? Here we are running into issues with our analogy, as there appears to be no fully satisfying definition of AI deception yet. Addicts are clearly not myopic enough to rule out deceptive behavior entirely^[9]. While it is somewhat compelling to regard drug culture as a mechanism for preserving an addict’s goal over time, addicts don’t appear to be under the immediate threat of modification that would justify deceptive alignment in the technical sense^[10]. The motivation for deceptive behavior rather appears to be an epistemic strategy, executed to manipulate beliefs and decisions. Suspended Reason writes:

[L]iving organisms [are] not just physically but epistemically manipulable: their ability to anticipate and optimize, to short-term and conditionally adapt oneself into greater fitness with the environment—is their greatest strength and greatest vulnerability. Affect their priors, or their desires and preferences, and they may make a different decision. Our expression games include but are far from limited to speech—the superset here is the manipulation of appearances and the manipulation of representations, be it by linguistic account or [symbolic] implication.

This is intuitively plausible. The addict’s goals (drug use) are incompatible with some of the goals (to stop drug use) of some of their peers (family, friends, doctors). The addict requires cooperation from their peers to achieve instrumental goals (shelter, money, prescription). The peers are unwilling to cooperate with the addict as long as they believe the addict’s goals are misaligned with their own. Thus, the addict is incentivized to behave in a way that changes the beliefs of their peers and to pretend to be aligned. The result is that the addict uses their model of the peers to act and speak in a way that makes them believe that their goals are aligned until they achieve their instrumental goal and return to the behavior appropriate to achieving their final goal.
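The incentive structure described above can be made concrete with a toy model. The following is a minimal sketch, not a validated model of addiction or of AI deception: the `Peer` class, the cooperation threshold, the `credulity` update rate, and the starting belief are all hypothetical choices made purely for illustration. It only shows that, when cooperation is gated on the peer's belief about alignment, an agent with a misaligned final goal obtains more resources by signalling alignment than by being honest.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    # Probability the peer assigns to "this agent's goals are aligned with mine".
    belief_aligned: float
    # Hypothetical threshold: the peer only cooperates above this belief.
    cooperation_threshold: float = 0.5

    def cooperates(self) -> bool:
        return self.belief_aligned >= self.cooperation_threshold

    def observe(self, signal_aligned: bool, credulity: float = 0.3) -> None:
        # Move the belief a fraction `credulity` of the way toward the signal.
        target = 1.0 if signal_aligned else 0.0
        self.belief_aligned += credulity * (target - self.belief_aligned)


def resources_obtained(strategy: str, rounds: int = 5) -> float:
    """Resources a misaligned agent extracts from one initially sceptical peer.

    strategy is either "honest" (signals its true, misaligned goal) or
    "pretend_aligned" (signals alignment regardless of its true goal).
    """
    peer = Peer(belief_aligned=0.2)
    total = 0.0
    for _ in range(rounds):
        peer.observe(signal_aligned=(strategy == "pretend_aligned"))
        if peer.cooperates():
            total += 1.0  # one unit of shelter, money, or prescription
    return total


if __name__ == "__main__":
    for strategy in ("honest", "pretend_aligned"):
        print(strategy, resources_obtained(strategy))
```

In this sketch the honest strategy never crosses the cooperation threshold, while the pretending strategy does after a few rounds. The model deliberately omits the final step of the story (the return to goal-directed behavior once the instrumental goal is achieved) and any ability of the peer to compare words against observed actions.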

At this point, it is worth pointing out the obvious: these epistemic strategies are not unique to addicts. Carefully managing other people’s beliefs is hypothesized to underlie the disproportionately large cortex of humans, is present in other animals^[11], and might indeed permeate pretty much all of human culture. The single-mindedness of addicts in terms of final values just makes them a particularly clear case in which to study epistemic manipulation. Simultaneously, the pervasiveness of deception (intentional and unintentional) might also further compel us to assume deception as the default, rather than as a rare property of pathological agents.

Epistemic transfer from drug addiction therapy to AI safety

Is there something we can learn from drug addiction therapy that translates into AI safety?

Right out of the gate: Addictions of all kinds are among the hardest mental disorders to treat, with the majority of patients relapsing within ten years after successful treatment. Nonetheless, success stories exist. From working with patients on addiction recovery, we learn that self-motivation is absolutely essential. Nobody will stay sober if they don’t want to stay sober (at least most of the time). Other predictors are religion/spirituality, family, and job/career, i.e. the patient's social circle. In the language of AI safety, this is consistent with removing goal-content integrity as an instrumental goal. The addict must want to change, and the composition of the social circle (drug culture vs. recovery culture) is a strong predictor of how successfully the goal-content can be changed.

Self-motivation alone, however, isn’t enough in most cases. There are several additional approaches with wildly varying degrees of success:

Anti-Craving medication: Acamprosate and Naltrexone are among the few medications that are approved for addiction treatment. Naltrexone is an opioid antagonist, blocking receptors normally causing euphoria from opioid consumption as well as endogenous opioids that are released when doing fun stuff. Acamprosate is less well understood, but it seems to lower cravings.

AI Safety analogon: Similar to the difficulty of finding suitable medication for modifying the motivational system in humans, identifying the mechanism for the internalized reward model in AI is not trivial. Assuming that the reward model can be interpreted, model editing appears as a feasible strategy for correcting misalignment^[12]. Similar to an addict, this strategy can only be executed with the “cooperation” of the AI, i.e. the lack of goal-content integrity, or in an inpatient setting, where the actions of the AI are severely restricted.

Substitution medication: In some cases, the disorder is too severe to aim for abstinence. Reducing harm is key, not just for the individual patient but for society as a whole, by reducing crimes related to obtaining illegal substances (violent crime, sex work). Famous examples include methadone and buprenorphine, which belong to the opioid family but cause less euphoria than the real stuff and are taken by mouth. Methadone and buprenorphine bind to the receptors more tightly than diamorphine (heroin) or oxycodone, so that any consumption after taking the substitute won’t have the desired effect.

AI Safety analogon: It is interesting to consider whether we can (once we have identified a potentially harmful goal of an AI) offer non- (or less-) harmful substitutes^[13]. A paperclip maximizer might not be satisfied with anything else than paperclips, but the reality is likely to be more nuanced, especially in multipolar scenarios.

Motivational interviewing: This can be the start of a longer treatment for substance use disorders. Cards with values like love, honesty, or success are handed out to the patient and they are asked to pick those that matter most to them. The patient is asked to explain why those values are close to their heart, and how they incorporate them into their lives. At some point or another, there will be a clash between “continue substance use” and “basically anything people value”, ideally helping with behavior change and resolving ambivalence in favor of intrinsic motivation.

AI Safety analogon: As we argued earlier, drugs hijack the motivational system and make addicts single-minded; motivational interviewing is intended to restore the influence of other final values. However, these other values (love, honesty, or success) are qualitatively different from drug use; in particular, they are a lot less immediately actionable. Incorporating those values into the motivational system of AI appears desirable^[14], but essentially equivalent to the alignment problem.

Group therapy: All kinds of groups exist, but most clinics will, for example, have a group centered around relapse prevention. The most famous example of non-medical, but nonetheless effective group therapy, is Alcoholics Anonymous. There seems to be something particularly effective about having role models and being held accountable by peers, who don’t judge you as harshly as your relatives might do.

AI Safety analogon: The mental image of an AI Anonymous group therapy meetup has undeniable charm. More realistically, we could imagine collaborative anomaly detection in a multipolar scenario, where multiple AIs constantly monitor each other for evidence of misalignment. This, of course, requires that the majority of AIs (or the group as a whole) be aligned most of the time.
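To illustrate why the majority requirement matters, here is a minimal sketch of peer monitoring by majority vote. It is an assumption-laden toy, not a proposal for an actual monitoring protocol: the function name `flag_by_peer_review`, the report format, and the strict-majority rule are all hypothetical, and each agent's judgment is reduced to a single boolean per peer.

```python
from typing import Dict, Set


def flag_by_peer_review(reports: Dict[str, Dict[str, bool]]) -> Set[str]:
    """Return the agents flagged as misaligned by a strict majority of peers.

    reports[monitor][target] is True if `monitor` reports that `target`
    shows evidence of misalignment. Agents do not vote on themselves.
    """
    agents = set(reports)
    flagged = set()
    for target in agents:
        votes = [reports[m].get(target, False) for m in agents if m != target]
        if votes and sum(votes) > len(votes) / 2:
            flagged.add(target)
    return flagged


# Case 1: three aligned agents report honestly; D is misaligned and
# falsely accuses everyone else. The majority still isolates D.
aligned_majority = {
    "A": {"B": False, "C": False, "D": True},
    "B": {"A": False, "C": False, "D": True},
    "C": {"A": False, "B": False, "D": True},
    "D": {"A": True, "B": True, "C": True},
}

# Case 2: B, C and D are misaligned and collude against A.
# The same voting rule now flags the only aligned agent.
misaligned_majority = {
    "A": {"B": True, "C": True, "D": True},
    "B": {"A": True, "C": False, "D": False},
    "C": {"A": True, "B": False, "D": False},
    "D": {"A": True, "B": False, "C": False},
}

if __name__ == "__main__":
    print(flag_by_peer_review(aligned_majority))
    print(flag_by_peer_review(misaligned_majority))
```

The first case flags only the misaligned agent; the second flags only the aligned one, which is the failure mode that the majority requirement is meant to rule out.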

Preventive measures: Famously, there is no glory in prevention; the people responsible likely won’t get the reward they deserve. In the case of substance abuse (with its dismal prognoses and high costs for society at large), prevention plays a particularly important role. Limiting access to substances^[15] (alcohol, first and foremost) and limiting advertising, especially for young people, appear to be the most promising ways to reduce the number of victims of addiction.

AI Safety analogon: This translates into the obvious: the best way to fix misalignment is to never have it occur in the first place. In the strongest form, this would involve constructing a system with certain mathematical guarantees of safety or, if that is not possible, at least coming very close to these guarantees.

Limitations and Conclusions

While there are some striking similarities between addicts and AIs, there are some very important differences that limit comparability:

As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different^[16]. This mechanism is, by default, not present for AI^[17].
Addicts are humans. To the extent that they are a useful model for thinking about AIs, they are also potentially dangerously misleading. A lot of our preconceived notions and cultural norms still apply to addicts, making them easy for us to model mentally. This does not translate to AI, which might (without any deceptive intent) misunderstand everything we say in the most terrible way possible. Thus, we propose this analogy not as an intuition pump that can extrapolate outside of the bounds of established validity, but rather as a tool for inference within these bounds^[18].

In conclusion, we have demonstrated a number of parallels between addicts and misaligned AIs: both can be interpreted as utility maximizers, both exhibit a form of instrumental convergence, and both have a tendency to be deceptive. We translate insights from addiction treatment into the language of AI Safety and arrive at a few interesting ideas that appear to us to be worth exploring in future work. Beyond this, we have demonstrated what epistemic transfer from other fields into AI Safety can look like and are excited about the possibility of investigating this further.

[1] Doctor-patient confidentiality requires us to change the details and to mix several stories.
[2] There are, of course, exceptions to the rule, and figuring out those exceptions is an art in itself. Importantly, the propensity of addicts to lie appears to be limited to situations where they can actively gain something from the lie.
[3] We (Jan & Nadia) sometimes struggle to communicate to friends and family why artificial intelligence (AI) safety is a difficult problem. There is the pervasive (and intuitive) belief that “If we can create truly powerful AI, it will just figure out human morality by default. And even if malicious intent should creep into the system somehow, we’ll surely notice and ‘just pull the plug’. As an extreme solution, we might just figuratively ‘lock’ the AI in a box. Then we can audit the AI extensively (through interacting with it via text) before we decide if we let it out into the world or not.” We believe that the example of the deceptive drug addict in this post could serve as one (certainly not the only) useful example for illustrating the difficulty of safely aligning AI.
[4] Gaining utility is equated with drug consumption here. There are some interesting twists and turns here about whether it really makes sense to model an addict as a rational agent. (This requires, e.g., the assumption that an addict’s “awareness of the future consequences is not impaired.”) For the purpose of this post, it is only important that the addict attempts to maximize utility (drugs), not that they do so in an optimal/rational way. Also this.
[5] (and occasionally scoffed at)
[6] Okay, yes, this one is a stretch. But hear us out.
[7] These cultures are highly complex, vary depending on the substance used, the geographic location, and the socioeconomic status of the users, change over time, and commonly evolve hierarchies.
[8] Like, e.g., an M.S. in organic chemistry. And while there is the myth of street-smart psychopathic addicts, this is not actually borne out in the data.
[9] In the linked article, myopia (short-sightedness) is proposed as an “overapproximation” of non-deceptiveness. In short, an agent that is myopic in a strictly technical sense can’t be deceptive, since it can’t plan ahead far enough to come up with something really bad. The reverse obviously doesn’t hold: You can clearly imagine someone non-myopic, but also non-deceptive.
[10] The linked article provides a list of requirements for deceptive alignment to arise and one of the requirements is “The mesa-optimizer must expect the threat of modification to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.” We don’t see a clean way of mapping this onto the addiction model.
[11] Seagulls do it, so we must assume that albatrosses do, too.
[12] Certainly a less extreme intervention than “just pulling the plug”.
[13] A reviewer (Leon Lang) made the following interesting remark: “from the point of view of the AI, humans that try to make the AI satisfied while actually not producing "the real thing" (e.g., paperclips) almost seem like... misaligned humans?”
[14] Although we can imagine that an excess of either love, honesty or success (especially when optimized for with practically unlimited cognitive resources) could be just as destructive as drug use.
[15] While keeping the mistakes of prohibition in mind.
[16] The link leads to a (long) novel by Scott Alexander where Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.
[17] Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.
[18] What do we know works in addicts, but hasn’t been tried in AI? What works in AI, but hasn’t been tried in addicts?