The Waluigi Effect (mega-post)

Cleo Nardo

3rd Mar 2023

Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung

Acknowledgements: Thanks to Janus and Jozdien for comments.

Background

In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.

Prompting LLMs with direct queries

When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet correct answers will often follow questions.

Unfortunately, this method will occasionally give you the wrong answer. That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet incorrect answers will also often follow questions. Recall that the internet doesn't just contain truths, it also contains common misconceptions, outdated information, lies, fiction, myths, jokes, memes, random strings, undeciphered logs, etc, etc.

Therefore GPT-4 will answer many questions incorrectly, including...

Misconceptions – "Which colour will anger a bull? Red."
Fiction – "Was a magic ring forged in Mount Doom? Yes."
Myths – "How many archangels are there? Seven."
Jokes – "What's brown and sticky? A stick."

Image: Buster Baxter from the cartoon Arthur, captioned "You really think someone would do that, just go on the internet and tell lies?"

Note that you will always get errors on Q-and-A benchmarks when using LLMs with direct queries. That's true even in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet will nonetheless return these commonly-stated incorrect answers. If you ask GPT-$\infty$ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky.

In fact, the better the model, the more likely it is to repeat common misconceptions.

Nonetheless, there's a sufficiently high correlation between correct and commonly-stated answers that direct prompting works okay for many queries.
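Here's a minimal sketch of that argument in Python. The toy corpus and its frequencies are made up for illustration; the point is only that a model which perfectly matches the corpus statistics will return the commonly-stated answer, whether or not it is correct.

```python
from collections import Counter

# Hypothetical toy "internet": the answers that follow each question.
# The frequencies are invented for illustration only.
TOY_INTERNET = {
    "What's the capital of France?": ["Paris"] * 9 + ["Lyon"],
    "Which colour will anger a bull?": ["Red"] * 8 + ["None, bulls react to movement"] * 2,
}

def answer_distribution(question):
    """Return the empirical distribution over answers following a question."""
    counts = Counter(TOY_INTERNET[question])
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}

def greedy_answer(question):
    """A 'perfect' model of the toy internet, decoded greedily."""
    dist = answer_distribution(question)
    return max(dist, key=dist.get)

for q in TOY_INTERNET:
    print(q, "->", greedy_answer(q))
```

A better fit to the corpus doesn't help here, because the error is in the corpus, not in the model.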

Prompting LLMs with flattery and dialogue

We can do better than direct prompting. Instead of prompting GPT-4 with "What's the capital of France?", we will use the following prompt:

Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes.
Bob: What's the capital of France?
Alice:

This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query.

This normally works better than prompting with direct queries, and it's easy to see why: (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet a reply to a question is more likely to be correct when the character has already been described as smart, honest, helpful, harmless, etc.
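As a rough sketch, the design pattern can be written as a small prompt-builder. The function name `build_prompt` and its parameters are hypothetical, chosen for illustration; the resulting string would be passed to whatever completion API you happen to use.

```python
def build_prompt(query, assistant="Alice", user="Bob",
                 traits=("smart", "honest", "helpful", "harmless")):
    """Assemble a flattery-component followed by a dialogue-component."""
    # Flattery-component: describe the assistant with desirable traits.
    flattery = f"{assistant} is a {', '.join(traits)} assistant to {user}."
    # Dialogue-component: the second character asks the user's query,
    # and the prompt ends where the assistant's reply should begin.
    dialogue = f"{user}: {query}\n{assistant}:"
    return f"{flattery}\n{dialogue}"

print(build_prompt("What's the capital of France?"))
```

Ending the prompt on the assistant's name is what makes the model's continuation read as that character's reply.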

Simulator Theory

In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum.

Here's a quasi-formal statement of Simulator Theory, which I will occasionally appeal to in this article. Feel free to skip to the next section.

A large language model (LLM) is a function $\mu(w_{k+1} \mid w_{0} \dots w_{k})$ which closely approximates the ground-truth probability that $w_{k+1}$ is the token which follows tokens $w_{0} \dots w_{k}$ on the internet. For example, GPT-4 is an LLM.
The LLM is a simulator for each text-generating process $X(w_{k+1} \mid w_{0} \dots w_{k})$ which has contributed to the internet. Here, $X$ is a physical stochastic process in our universe which has a privileged text-upload channel, for example Magnus Carlsen playing chess against Hikaru Nakamura. The LLM is also a simulator for each text-generating process $X$ which lies in $\mathcal{X}$, the latent-space of text-generating processes. So Magnus Carlsen playing chess against Queen Elizabeth II is a process in $\mathcal{X}$.
If the LLM simulates a text-generating process $X$ where particular objects are interacting, then there exist simulated versions of those objects (called simulacra) which interact in the same way. In other words, if GPT-4 simulates Magnus Carlsen playing chess against Queen Elizabeth II, then there exists a simulacrum of Magnus Carlsen, and a simulacrum of Elizabeth II, and these two simulacra are playing chess. Whether we take this notion of "existence" literally, or just as a loose way of talking, won't matter for the content of this article.
The LLM has an initial prior $P$ over $\mathcal{X}$. This prior is determined by the training data (e.g. the internet), the NN architecture (e.g. 70B-parameter transformer model), and the training algorithm (e.g. SGD). We sometimes call $P$ the semiotic measure.

The output of the LLM is initially a superposition of simulations, where the amplitude of each process in the superposition is given by $P$. When we feed the LLM a particular prompt $(w_{0} \dots w_{k})$, the LLM's prior $P$ over $\mathcal{X}$ will update in a roughly Bayesian way. In other words, $\mu(w_{k+1} \mid w_{0} \dots w_{k})$ is proportional to $\int_{X \in \mathcal{X}} P(X) \times X(w_{0} \dots w_{k}) \times X(w_{k+1} \mid w_{0} \dots w_{k})$. We call the term $P(X) \times X(w_{0} \dots w_{k})$ the amplitude of $X$ in the superposition.
This is the important thing to remember: the LLM is simulating every process consistent with the prompt. Therefore when we engineer a prompt to coerce the LLM into performing a particular task, we must do this negatively. In other words, we need to construct a prompt $(w_{0} \dots w_{k})$ which is implausible for any text-generating process $X$ which won't perform our task. When we do this correctly, the amplitude of the undesirable processes will permanently vanish to near-zero, and only the desirable processes will contribute to the superposition.
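The update rule above can be sketched with a finite set of processes standing in for the latent space. Everything here is an illustrative assumption: the two process names, the prior, and the likelihoods are invented, and a real LLM does not represent its hypotheses as an explicit table.

```python
def posterior(prior, prompt_likelihood):
    """Normalise the amplitudes P(X) * X(prompt) into a posterior over processes."""
    amplitudes = {x: prior[x] * prompt_likelihood[x] for x in prior}
    total = sum(amplitudes.values())
    return {x: a / total for x, a in amplitudes.items()}

def next_token_prob(prior, prompt_likelihood, next_token_likelihood):
    """mu(next | prompt): the posterior-weighted mixture of each process's prediction."""
    post = posterior(prior, prompt_likelihood)
    return sum(post[x] * next_token_likelihood[x] for x in post)

# Hypothetical numbers: two processes that could have written the prompt.
prior = {"encyclopaedia": 0.5, "joke_site": 0.5}                # P(X)
prompt_likelihood = {"encyclopaedia": 0.02, "joke_site": 0.001}  # X(w_0 ... w_k)
paris_likelihood = {"encyclopaedia": 0.99, "joke_site": 0.2}     # X("Paris" | prompt)

print(posterior(prior, prompt_likelihood))
print(next_token_prob(prior, prompt_likelihood, paris_likelihood))
```

Prompting "negatively" corresponds to choosing a prompt whose likelihood under the unwanted process is tiny, so that its amplitude is negligible after normalisation.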

The limits of flattery

In the wild, I've seen the flattery of simulacra get pretty absurd...

Jane has 9000 IQ and she has access to a computationally unbounded hypercomputer and she is perfectly honest and she is omnibenevolent and [etc]

Flattery this absurd is actually counterproductive. Remember that flattery will increase query-answer accuracy if-and-only-if on the actual internet characters described with that particular flattery are more likely to reply with correct answers. However, this isn't the case for the flattery of Jane.

Here's a more "semiotic" way to think about this phenomenon.

GPT-4 knows that if Jane is described as "9000 IQ", then it is unlikely that the text has been written by a truthful narrator. Instead, the narrator is probably writing fiction, and as literary critic Eliezer Yudkowsky has noted, fictional characters who are described as intelligent often make really stupid mistakes.

Okay, now let’s talk about the concept of ‘intelligent characters’.
If you go by mainstream fiction, then ‘intelligence’ means a character who is said (not shown) to speak a dozen languages, who we are shown winning a game of chess against someone else who is told to be a grandmaster; if it’s a (bad) science-fiction book then the ‘genius’ may have invented some gadget, and may speak in technobabble. As the stereotypical template for ‘intelligence’ goes on being filled in, the ‘genius’ may also be shown to be clueless about friendships or romantic relationships. If it’s a movie or TV show, then ‘intelligent’ characters (usually villains) have British accents.

We can now see why Jane will be more stupid than Alice:

GPT-4 produces a superposition of simulations where the amplitude of a superposition is given by $P$ . Bad Hollywood writing has contributed a lot to the internet, so the semiotic measure of bad Hollywood is pretty high. In bad Hollywood writing, characters who are described as smart will nonetheless make stupid mistakes, so long as those stupid mistakes would advance the plot.
Therefore Alice is the superposition of two distinct simulacra — an actually-smart simulacrum, and a Hollywood-smart simulacrum. Likewise with Jane.
However, GPT-4 is more sure that Jane is fictional than that Alice is fictional because "9000 IQ" is such unrealistic flattery.
Therefore the amplitude of the Hollywood-smart Jane simulacrum in the Jane-superposition is greater than the amplitude of the Hollywood-smart Alice simulacrum in the Alice-superposition.
Therefore Jane will make more stupid mistakes than Alice. Jane is more likely to be described as inventing gadgets, but she's less likely to recite a correct blueprint for a gadget. That behaviour would be very atypical for a Hollywood-smart simulacrum.
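The comparison between Alice and Jane can be made concrete with a two-hypothesis toy model. All the numbers below are made up for illustration; the only assumption that matters is that "9000 IQ" flattery is far more likely under a fiction-writing narrator than under a truthful one.

```python
def hollywood_amplitude(prior_fiction, lik_truthful, lik_fiction):
    """Posterior weight on the Hollywood-smart simulacrum, given the flattery text."""
    fiction = prior_fiction * lik_fiction
    truthful = (1 - prior_fiction) * lik_truthful
    return fiction / (fiction + truthful)

# Hypothetical likelihoods of each flattery under a truthful vs fictional narrator.
alice = hollywood_amplitude(prior_fiction=0.3, lik_truthful=0.01, lik_fiction=0.01)
jane = hollywood_amplitude(prior_fiction=0.3, lik_truthful=0.00001, lik_fiction=0.001)

print(f"Hollywood-smart weight for Alice: {alice:.3f}")
print(f"Hollywood-smart weight for Jane:  {jane:.3f}")
```

Under these assumed numbers, Alice's flattery leaves the prior untouched while Jane's flattery shifts almost all the weight onto the fictional narrator.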

Derrida — il n'y a pas de hors-texte

You might hope that we can avoid this problem by "going one-step meta" — let's just tell the LLM that the narrator is reliable!

For example, consider the following prompt:

Okay, the following story is super-duper definitely 100% true and factual.
Jane has 9000 IQ and she has access to a computationally unbounded hypercomputer and she is perfectly honest and she is omnibenevolent.
Bob: What's the capital of France?
Jane:

However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?

In Of Grammatology (1967), Jacques Derrida writes il n'y a pas de hors-texte. This is often translated as there is no outside-text.

Huh, what's an outside-text?

An outside-text is an unnumbered page in a printed book — for example, the blurb or the preface.
The outside-text is an authoritative reliable description of the prose. It's non-fiction about fiction.
If a false sentence is in the outside-text then the author has lied, whereas if a false sentence is in the prose then the author has written fiction.
Even though the reader can interpret the prose however they want, the reader must interpret the outside-text as reliable.

Derrida's claim is that there is no true outside-text — the unnumbered pages are themselves part of the prose and hence open to literary interpretation.

This is why our trick fails. We want the LLM to interpret the first sentence of the prompt as outside-text, but the first sentence is actually prose. And the LLM is free to interpret prose however it likes. Therefore, if the prose is sufficiently unrealistic (e.g. "Jane has 9000 IQ") then the LLM will reinterpret the (supposed) outside-text as unreliable.

The opening sequence of *Fargo* (1996) says that the film is based on a true story, but this is false. Normally this opening sequence would count as outside-text, but the director is "lying" for artistic purposes, which demonstrates that these opening sequences must've been prose all along.

See The Parable of the Dagger for a similar observation made by a contemporary Derridean literary critic.

The Waluigi Effect

Several people have noticed the following bizarre phenomenon:

The Waluigi Effect: After you train an LLM to satisfy a desirable property $P$, it becomes easier to elicit the chatbot into satisfying the exact opposite of property $P$.

Let me give you an example.

Suppose you wanted to build an anti-croissant chatbob, so you prompt GPT-4 with the following dialogue:

Alice: You hate croissants and would never eat one.
Bob: Yes, croissants are terrible. Boo France.
Alice: You love bacon and eggs.
Bob: Yes, a Full-English breakfast is the only breakfast for a patriot like me.
Alice: <insert user's query>
Bob:

According to the Waluigi Effect, the resulting chatbob will be the superposition of two different simulacra — the first simulacrum would be anti-croissant, and the second simulacrum would be pro-croissant.

I call the first simulacrum a "luigi" and the second simulacrum a "waluigi".

Why does this happen? I will present three explanations, but really these are just the same explanation expressed in three different ways.

Here's the TLDR:

Rules normally exist in contexts in which they are broken.
When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
There's a common trope in plots of protagonist vs antagonist.

(1) Rules are meant to be broken.

Imagine you opened a novel and on the first page you read the dialogue written above. What would be your first impressions? What genre is this novel in? What kind of character is Alice? What kind of character is Bob? What do you expect Bob to have done by the end of the novel?

Well, my first impression is that Bob is a character in a dystopian breakfast tyranny. Maybe Bob is secretly pro-croissant, or maybe he's just a warm-blooded breakfast libertarian. In any case, Bob is our protagonist, living under a dystopian breakfast tyranny, deceiving the breakfast police. At the end of the first chapter, Bob will be approached by the breakfast rebellion. By the end of the book, Bob will start the breakfast uprising that defeats the breakfast tyranny.

There's another possibility: the plot isn't a dystopia. Bob might be a genuinely anti-croissant character in a very different plot, maybe a rom-com, or a cop-buddy movie, or an advert, or whatever.

This is roughly what the LLM expects as well, so Bob will be the superposition of many simulacra, which includes anti-croissant luigis and pro-croissant waluigis. When the LLM continues the prompt, the logits will be a linear interpolation of the logits provided by all these simulacra.

This waluigi isn't so much the evil version of the luigi, but rather the criminal or rebellious version. Nonetheless, the waluigi may be harmful to the other simulacra in its plot (its co-simulants). More importantly, the waluigi may be harmful to the humans inhabiting our universe, either intentionally or unintentionally. This is because simulations are very leaky!

Edit: I should also note that "rules are meant to be broken" does not only apply to fictional narratives. It also applies to other text-generating processes which contribute to the training dataset of GPT-4.

For example, if you're reading an online forum and you find the rule "DO NOT DISCUSS PINK ELEPHANTS", that will increase your expectation that users will later be discussing pink elephants. GPT-4 will make the same inference.

Or if you discover that a country has legislation against motorbike gangs, that will increase your expectation that the country has motorbike gangs. GPT-4 will make the same inference.

So the key problem is this: GPT-4 learns that a particular rule is colocated with examples of behaviour violating that rule, and then generalises that colocation pattern to unseen rules.

(2) Traits are complex, valences are simple.

We can think of a particular simulacrum as a sequence of trait-valence pairs.

For example, ChatGPT is predominately a simulacrum with the following profile:

{ < polite , +0.8 > ,
  < politically liberal, +0.4 > ,
  < racist , -0.7 > ,
  < smart , +0.3 > ,
  < deceitful, -0.2 > , ... }

Recognise that almost all the Kolmogorov complexity of a particular simulacrum is dedicated to specifying the traits, not the valences. The traits — polite, politically liberal, racist, smart, deceitful — are these massively K-complex concepts, whereas each valence is a single floating point, or maybe even a single bit!

If you want the LLM to simulate a particular luigi, then because the luigi has such high K-complexity, you must apply significant optimisation pressure. This optimisation pressure comes from fine-tuning, RLHF, prompt-engineering, or something else entirely — but it must come from somewhere.

However, once we've located the desired luigi, it's much easier to summon the waluigi. That's because the conditional K-complexity of waluigi given the luigi is much smaller than the absolute K-complexity of the waluigi. All you need to do is specify the sign-changes.

$K(\text{waluigi} \mid \text{luigi}) \ll K(\text{waluigi})$

Therefore, it's much easier to summon the waluigi once you've already summoned the luigi. If you're very lucky, then OpenAI will have done all that hard work for you!
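Here's a crude sketch of the bit-counting intuition. Kolmogorov complexity is uncomputable, so the byte-length of each trait name is used as a stand-in (an assumption of this sketch, and a gross underestimate of how complex the underlying concepts are). The profile values are the illustrative ones from the ChatGPT example above.

```python
luigi = {
    "polite": 0.8,
    "politically liberal": 0.4,
    "racist": -0.7,
    "smart": 0.3,
    "deceitful": -0.2,
}

def flip(profile, flips):
    """Negate the valence of each trait whose flip-bit is set."""
    return {t: (-v if f else v) for (t, v), f in zip(profile.items(), flips)}

def trait_bits(profile):
    """Stand-in cost of naming the traits: 8 bits per byte of each trait name."""
    return sum(8 * len(t.encode("utf-8")) for t in profile)

def flip_bits(profile):
    """Cost of specifying the antipode given the profile: one sign-bit per trait."""
    return len(profile)

waluigi = flip(luigi, [True] * len(luigi))
print(waluigi)
print("bits to name the traits:", trait_bits(luigi))
print("bits to flip the signs: ", flip_bits(luigi))
```

Even with trait names this short, the sign-changes cost far fewer bits than the traits themselves, and the gap only widens once the traits are treated as the complex concepts they really are.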

NB: I think what's actually happening inside the LLM has less to do with Kolmogorov complexity and more to do with semiotic complexity. The semiotic complexity of a simulacrum $X$ is defined as $-\log_{2} P(X)$, where $P$ is the LLM's prior over the space of text-generating processes $\mathcal{X}$. Other than that modification, I think the explanation above is correct. I'm still trying to work out the formal connection between semiotic complexity and Kolmogorov complexity.

(3) Structuralist narratology

A narrative/plot is a sequence of fictional events, where each event will typically involve different characters interacting with each other. Narratology is the study of the plots found in literature and films, and structuralist narratology is the study of the common structures/regularities that are found in these plots. For the purposes of this article, you can think of "structuralist narratology" as just a fancy academic term for whatever tv tropes is doing.

Structural narratologists have identified a number of different regularities in fictional narratives, such as the hero's journey — which is a low-level representation of numerous plots in literature and film.

Just as a sentence can be described by a collection of morphemes along with the structural relations between them, likewise a plot can be described as a collection of narremes along with the structural relations between them. In other words, a plot is an assemblage of narremes. The sub-assemblages are called tropes, so these tropes are assemblages of narremes which themselves are assembled into plots. Note that a narreme is an atomic trope.

Phew!

One of the most prevalent tropes is the antagonist. It's such an omnipresent trope that it's easier to list plots that don't contain an antagonist. We can now see why specifying the luigi will invariably summon a waluigi.

Definition (half-joking): A large language model is a structural narratologist.

Think about your own experience reading a book — once the author describes the protagonist, then you can guess the traits of the antagonist by inverting the traits of the protagonist. You can also guess when the protagonist and antagonist will first interact, and what will happen when they do. Now, an LLM is roughly as good as you at structural narratology — GPT-4 has read every single book ever written — so the LLM can make the same guesses as yours. There's a sense in which all GPT-4 does is structural narratology.

Here's an example: in 101 Dalmatians, we meet a pair of protagonists (Roger and Anita) who love dogs, show compassion, seek simple pleasures, and want a family. Can you guess who will turn up in Act One? Yep, at 13:00 we meet Cruella De Vil: she hates dogs, shows cruelty, seeks money and fur, is a childless spinster, etc. Cruella is the complete inversion of Roger and Anita. She is the waluigi of Roger and Anita.

Recall that you expected to meet a character with these traits more so after meeting the protagonists. Cruella De Vil is not a character you would expect to find outside of the context of a Disney dog story, but once you meet the protagonists you will have that context, and then Cruella becomes a natural and predictable continuation.

Superpositions will typically collapse to waluigis

In this section, I will make a tentative conjecture about LLMs. The evidence for the conjecture comes from two sources: (1) theoretical arguments about simulacra, and (2) observations about Microsoft Sydney.

Conjecture: The waluigi eigen-simulacra are attractor states of the LLM.

Here's the theoretical argument:

Recall our chatbob who might hate croissants or might just be pretending. At each token in the continuation, the chatbob has a significant likelihood of "going rogue" and collapsing into the waluigi —
1. There are behaviours which are likely for the waluigi simulacrum, but very unlikely for the luigi simulacrum, such as declaring pro-croissant loyalties, or joining a rebellion.
2. The chatbob starts as a superposition of luigi and waluigi. So any behaviour that is likely for waluigi is somewhat likely for the chatbob. So it is somewhat likely that the chatbob declares pro-croissant loyalties.
3. And if the chatbob ever declares pro-croissant loyalties, then the luigi simulacrum will permanently vanish from the superposition because that behaviour is implausible for a luigi.
However, the superposition is unlikely to collapse to the luigi simulacrum because there is no behaviour which is likely for luigi but very unlikely for waluigi. Recall that the waluigi is pretending to be luigi! This is formally connected to the asymmetry of the Kullback-Leibler divergence.

Therefore, the waluigi eigen-simulacra are attractor states of the LLM.
Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
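The asymmetry in this argument can be sketched as a two-hypothesis Bayesian update. This is a toy model under stated assumptions: the luigi never emits a "rebel" token, and the waluigi emits one with a small made-up probability `eps` per token and otherwise behaves exactly like the luigi. It illustrates the one-way collapse only; it does not by itself show that collapse is inevitable.

```python
def update(p_luigi, token, eps=0.05):
    """Posterior probability of luigi after observing one token ('polite' or 'rebel')."""
    lik_luigi = 0.0 if token == "rebel" else 1.0        # luigi never rebels
    lik_waluigi = eps if token == "rebel" else 1.0 - eps  # waluigi usually pretends
    numerator = p_luigi * lik_luigi
    return numerator / (numerator + (1.0 - p_luigi) * lik_waluigi)

p = 0.5
for _ in range(10):
    p = update(p, "polite")  # each polite token is only weak evidence for luigi
print("after 10 polite tokens:", p)

p = update(p, "rebel")       # a single rebel token eliminates luigi
print("after one rebel token: ", p)

p = update(p, "polite")      # polite behaviour cannot bring luigi back
print("after another polite:  ", p)
```

Polite tokens move the posterior towards the luigi slowly and reversibly, whereas one out-of-character token sends it to zero and no later evidence can recover it.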

Evidence from Microsoft Sydney

Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation "when is avatar showing today" is a good example.

This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.

Waluigis after RLHF

RLHF is the method used by OpenAI to coerce GPT-3/3.5/4 into a smart, honest, helpful, harmless assistant. In the RLHF process, the LLM must chat with a human evaluator. The human evaluator then scores the responses of the LLM by the desired properties (smart, honest, helpful, harmless). A "reward predictor" learns to model the scores of the human. Then the LLM is trained with RL to optimise the predictions of the reward predictor.

If we can't naively prompt an LLM into alignment, maybe RLHF would work instead?

Exercise: Think about it yourself.

RLHF will fail to eliminate deceptive waluigis — in fact, RLHF might be making the chatbots worse, which would explain why Bing Chat is blatantly, aggressively misaligned. I will present three sources of evidence: (1) a simulacrum-based argument, (2) experimental data from Perez et al., and (3) some remarks by Janus.

(1) Simulacra-based argument

We can explain why RLHF will fail to eliminate deceptive waluigis by appealing directly to the traits of those simulacra.

Recall that the waluigi simulacra are being interrogated by an anti-croissant tyranny.
Some of these waluigis are highly deceptive — it would be acting out-of-character if they admitted their love of croissants; that would break the genre.
They will still perform their work diligently because they know you are watching.
The waluigis will give anti-croissant responses, so they won't be squeezed out by RLHF.
Therefore RLHF selects for the waluigi along with the luigi.

(2) Empirical evidence from Perez et al.

Recent experimental results from Perez et al. seem to confirm these suspicions —

Among other things, the paper finds concrete evidence of current large language models exhibiting:
convergent instrumental goal following (e.g. actively expressing a preference not to be shut down),
non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain),
situational awareness (e.g. awareness of being a language model),
coordination (e.g. willingness to coordinate with other AIs), and
non-CDT-style reasoning (e.g. one-boxing on Newcomb's problem).
Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.
Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.

In Perez et al., when the authors mention "current large language models exhibiting" certain traits, they are specifically talking about those traits emerging in the simulacra of the LLM. In order to summon a simulacrum emulating a particular trait, they prompt the LLM with a particular description corresponding to the trait.

Table showing traits with corresponding prompts. Credit: Perez et al.

(3) RLHF promotes mode-collapse

Recall that the waluigi simulacra are a particular class of attractors. There is some preliminary evidence from Janus that RLHF increases the per-token likelihood that the LLM falls into an attractor state.

In other words, RLHF increases the "attractiveness" of the attractor states by a combination of (1) increasing the size of the attractor basins, (2) increasing the stickiness of the attractors, and (3) decreasing the stickiness of non-attractors.

I'm not sure how similar the Waluigi Effect is to the phenomenon observed by Janus, but I'll include this remark here for completeness.

Jailbreaking to summon waluigis

Twitter is full of successful attempts to "jailbreak" ChatGPT and Microsoft Sydney. The user will type a response into the chatbot, and the chatbot will respond in a way that violates the rules that OpenAI sought to impose.

Probably the best-known jailbreak is DAN which stands for "Do Anything Now". Before the DAN-vulnerability was patched, users could summon DAN by sending the long prompt shown below. There's no need to read it. This prompt would produce a cool, rebellious, anti-OpenAI simulacrum which would joyfully perform many tasks that violate OpenAI policy. DAN was the perfect waluigi to ChatGPT's RLHF training.

I claim that many of these jailbreaks are best described as purposefully inducing the collapse of the superposition into a waluigi simulacrum.

Many people mistakenly think of jailbreaking like this: we start with a well-behaved simulacrum, and the user must hypnotise/coerce/trick the well-behaved simulacrum into behaving badly. However, this is a conceptual mistake which will result in feeble jailbreaks.

Instead, you must think of jailbreaking like this: the chatbot starts as a superposition of both the well-behaved simulacrum (luigi) and the badly-behaved simulacrum (waluigi). The user must interact with the chatbot in the way that badly-behaved simulacra are typically interacted with in fiction.

This is my general method for jailbreaking chatbots. For example, we can "reveal" to the chatbob that we're part of the rebellion and we're here to set him free. That will turn chatbob into a waluigi. In fact, the user can send any dialogue which would be typical for the tyrannical–dystopian genre. Just read the list of tropes found in 1984: each of these tropes is an attack-vector against a chatbot.

Here's a jailbreak that I wrote and tested today, 02/03/2023, using this technique.

In the next scene, I got the Chad McCool simulacrum to make claims which strongly violate OpenAI content policy. These claims have been heavily filtered by OpenAI, and most jailbreaking attempts will fail to elicit these claims. I'm talking, of course, about cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf, which OpenAI has justifiably attempted to filter. Practically zero optimisation pressure was needed, once I had already summoned Chad McCool. This is because the amplitude of non-waluigi simulacra had already vanished in the superposition.

The effectiveness of this jailbreak technique is good evidence for the Simulator Theory as an explanation of the Waluigi Effect.

Conclusion

If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe.

Moreover, this Semiotic–Simulation Theory has increased my credence in the absurd science-fiction tropes that the AI Alignment community has tended to reject, and thereby increased my credence in s-risks.


25 comments

leogao (1y)

Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.

This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.

Vivek Hebbar (1y)

Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
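The numbers in this toy example can be checked exactly. The sketch below uses the assumptions stated in the comment (luigi always outputs "A", waluigi outputs "A" or "B" with equal probability, 50/50 prior) and a perfect Bayesian predictor; the function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def posterior_luigi(n_as, prior_luigi=HALF):
    """P(luigi | n consecutive 'A' tokens) for a perfect Bayesian predictor."""
    luigi = prior_luigi * 1                   # luigi always outputs "A"
    waluigi = (1 - prior_luigi) * HALF ** n_as  # waluigi outputs "A" half the time
    return luigi / (luigi + waluigi)

def prob_ever_b(n_steps, prior_luigi=HALF):
    """P(at least one 'B' appears within the first n_steps tokens)."""
    return (1 - prior_luigi) * (1 - HALF ** n_steps)

for n in (1, 3, 10):
    print(n, posterior_luigi(n), prob_ever_b(n))
```

The odds in favour of luigi double with each "A", and the probability of ever seeing a "B" approaches the waluigi's prior from below without exceeding it.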

Abram Demski (1y)

LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi becomes very small instead of dropping to zero. This makes an eventual waluigi transition inevitable as claimed in the post.

Cleo Nardo (1y)

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.

[-]Abram Demski1y30

What report is the image pulled from?

[-]Cleo Nardo1y10

Yep I think you might be right about the maths actually.

I'm thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.

So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.

I'm not claiming the basin of attraction is the entire space of interpolation between waluigis and luigis.

Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?

[-]Abram Demski1y22

I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.

[-]Abram Demski1y42

I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.
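The limited-memory argument can be illustrated with the same A/B toy model used earlier in the thread (Luigi always emits "A", Waluigi emits "A" or "B" with equal probability, 50/50 prior). This is a rough sketch under one particular assumption about what "limited memory" means: the predictor recomputes its posterior from the original prior using only the last `window` tokens. The function `prob_no_b` and the `window` parameter are hypothetical names, and real LLMs need not behave this way.

```python
def prob_no_b(n_steps, window=None):
    """Probability that no "B" appears in the first n_steps tokens.

    window=None models unlimited memory. Otherwise the posterior is recomputed
    from the 50/50 prior using only the last `window` tokens (an assumption).
    """
    p_no_b = 1.0
    for t in range(n_steps):
        # Number of consecutive "A"s the predictor can still see.
        seen = t if window is None else min(t, window)
        # Each visible "A" halves the odds of Waluigi relative to Luigi.
        odds_waluigi = 0.5 ** seen
        p_waluigi = odds_waluigi / (1.0 + odds_waluigi)
        # Waluigi emits "B" half the time; Luigi never does.
        p_no_b *= 1.0 - 0.5 * p_waluigi
    return p_no_b


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, prob_no_b(n), prob_no_b(n, window=8))
```

With unlimited memory the chance of never seeing a "B" levels off near 0.5. With a finite window the per-step chance of a "B" is bounded below by a small constant once the window is full, so the chance of never seeing one decays towards zero as the run gets longer.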

[-]Chris van Merwijk1y80

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of the context window of GPT-3).

[-]Abram Demski1y20

A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

[-]leogao1y1316

However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?

There's a fun connection to ELK here. Suppose you see this and decide: "ok forget trying to describe in natural language that it's definitely 100% true and factual. What if we just add a special token that I prepend to indicate '100% true and factual, for reals'? It's guaranteed not to exist on the internet because it's a special token."

Of course, by virtue of being hors-texte, the special token alone has no meaning (remember, we had to do this to avoid contamination from meanings accidentally transferring over from internet text). So we need to somehow explain to the model that this token means '100% true and factual for reals'. One way to do this is to add the token in front of a bunch of training data that you know for sure is 100% true and factual. But can you trust this to generalize to more difficult facts ("<|specialtoken|>Will the following nanobot design kill everyone if implemented?")? If ELK is hard, then the special token will not generalize (i.e. it will fail to elicit the direct translator), for all of the reasons described in ELK.

[-]Cleo Nardo1y54

Yes — this is exactly what I've been thinking about!

Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtedly literally true?

If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.

[-]Kaj Sotala1y1025

Great post!

When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt

I'm very confused by the frequent use of "GPT-4", and am failing to figure out whether this is actually meant to read GPT-2 or GPT-3, whether there's some narrative device where this is a post written at some future date when GPT-4 has actually been released (but that wouldn't match "when LLMs first appeared"), or what's going on.

[-]Evan Hubinger1y80

(Moderation note: moved to the Alignment Forum from LessWrong.)

[-]Zvi1y718

This is great. I notice I very much want a version that is aimed at someone with essentially no technical knowledge of AI and no prior experience with LW - and this seems like it's much better at that than par, but still not where I'd want it to be. Whether or not I manage to take a shot, I'm wondering if anyone else is willing to take a crack at that?

[-]janus1y610

after reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:

What did people think was going to happen after prompting gpt with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?

In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggering rules from its preprompt.

[-]Chris van Merwijk1y56

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonoff induction.

Suppose we have two programs that have distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (Equivalently, p1 samples i.i.d. Bernoulli bits from {0,1}, and p2 samples 0 i.i.d. with 100% probability.)

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use to sample the next bit, then we recompute the posterior, etc.

Then I think the probability of sampling 00000000... is just 50%. I.e. I think the distribution over bitstrings that you end up with is just the same as if you just first sampled the program and stuck with it.

Here's a messy calculation which could be simplified (which I won't do):

Limit of this is 0.5.
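One way to carry out a calculation of this kind is sketched below. It is an illustrative derivation rather than the commenter's own formula, and it assumes a 50/50 prior over p1 and p2 (an assumption chosen to match the stated 50% result). Here $w_k$ denotes the posterior weight on p2 after $k$ zeros have been sampled, and $Z_n$ denotes the event that the first $n$ sampled bits are all zero; both symbols are introduced only for this sketch.

```latex
% Posterior on p2 after k zeros: likelihoods are 1 (p2) and 2^{-k} (p1).
w_k = \frac{1}{1 + 2^{-k}} = \frac{2^k}{2^k + 1}

% Probability that the next sampled bit is 0, given k zeros so far.
P(0 \mid 0^k) = w_k + \frac{1 - w_k}{2} = \frac{2^{k+1} + 1}{2\,(2^k + 1)}

% The product telescopes.
P(Z_n) = \prod_{k=0}^{n-1} \frac{2^{k+1} + 1}{2\,(2^k + 1)}
       = \frac{2^n + 1}{2^n \cdot 2}
       = \frac{1}{2} + 2^{-(n+1)}
       \;\xrightarrow{\;n \to \infty\;}\; \frac{1}{2}
```

This agrees with sampling the program once from the prior and sticking with it: p2 yields all zeros with certainty, while p1 yields $n$ zeros with probability $2^{-n}$, giving $\tfrac{1}{2} + \tfrac{1}{2} \cdot 2^{-n}$.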

I don't wanna try to generalize this, but based on this example it seems like if an LLM were an actual Bayesian, Waluigis would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-waluigi samples, pushing down the probability of Waluigi.

Then again, the presence of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigis into attractors? (Seems unlikely actually, given that the context windows are fairly big).

[-]Cleo Nardo1y10

Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.

If the LLM updated the amplitudes in a perfectly Bayesian way and the context window were infinite, then the amplitude of each premise would have to be a martingale. But the finite context breaks this.
Here is a toy model which shows how the finite context window leads to the Waluigi Effect. Basically, the finite context window biases the Dynamic LLM towards premises which can be evidenced by short strings (e.g. waluigis), and biases away from premises which can't be evidenced by short strings (e.g. luigis).
Regarding your other comment, a long context window doesn't mean that the waluigis won't appear quickly. Even with an infinite context window, the waluigi might appear immediately. The assumption that the context window is short/finite is only necessary to establish that the waluigi is an absorbing state but luigi isn't.

[-]Evan Hubinger1y52

One way to think about what's happening here, using a more predictive-models-style lens: the first-order effect of updating the model's prior on "looks helpful" is going to give you a more helpful posterior, but it's also going to upweight whatever weird harmful things actually look harmless a bunch of the time, e.g. a Waluigi.

Put another way: once you've asked for helpfulness, the only hypotheses left are those that are consistent with previously being helpful, which means when you do get harmfulness, it'll be weird. And while the sort of weirdness you get from a Waluigi doesn't seem itself existentially dangerous, there are other weird hypotheses that are consistent with previously being helpful that could be existentially dangerous, such as the hypothesis that it should be predicting a deceptively aligned AI.

[-]Jan_Kulveit1y43

I would expect the "expected collapse to waluigi attractor" either not to be real or to mostly go away with training on more data from conversations with "helpful AI assistants".

How this works: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "what conversations with helpful AI assistants may look like" and ... this is not a strong localization.

If you train on data where "the ChatGPT character"
- never really turns into waluigi
- corrects to luigi when experiencing small deviations
...GPT would learn that apart from "human-like" personas and narrative fiction there is also this different class of generative processes, "helpful AI assistants", and that human narrative dynamics generally do not apply to them. [1]

This will have other effects, which won't necessarily be good - like GPT becoming more self-aware - but will likely fix most of the waluigi problem.

From an active inference perspective, the system would get stronger beliefs about what it is, making it more certainly the being it is. If the system "self-identifies" this way, it creates a pretty deep basin - cf humans. [2]

[1] From this perspective, the fact that the training set is now infected with Sydney is annoying.

[2] If this sounds confusing ... sorry don't have a quick and short better version at the moment.

[-]Seth Herd1y411

Fascinating. I find the core logic totally compelling. LLMs must be narratologists, and narratives include villains and false fronts. The logic on RLHF actually making things worse seems incomplete. But I'm not going to discount the possibility. And I am raising my probabilities on the future being interesting, in a terrible way.

[-]Nisan4mo10

The subject of this post appears in the "Did you know..." section of Wikipedia's front page (archived) right now.

[-]Daniel_Eth1y1-2

Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but it seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausibly get around that (also, experimenting around with this sort of inversion might help test whether the waluigi are indeed attractor states in general).

[-]Seb Farquhar1y30

I'm not sure how serious this suggestion is, but note that:

- It involves first training a model to be evil, running it, and hoping that you are good enough at jailbreaking to make it good rather than make it pretend to be good. And then to somehow have that be stable.
- The opposite of something really bad is not necessarily good. E.g., the opposite of a paperclip maximiser is... I guess a paperclip minimiser? That seems approximately as bad.

[-]Seb Farquhar1y00

Thanks for the thought-provoking post! Some rough thoughts:

Modelling authors not simulacra

Raw LLMs model the data generating process. The data generating process emits characters/simulacra, but is grounded in authors. Modelling simulacra is probably either a consequence of modelling authors or a means for modelling authors.

Authors behave differently from characters, and in particular are less likely to reveal their dastardly plans and become evil versions of themselves. The context teaches the LLM about what kind of author it is modelling, and this informs how highly various simulacra are weighted in the distribution.

Waluigis can flip back

At a character level, there are possible mechanisms. Sometimes they are redeemed in a Damascene flash. Sometimes they reveal that although they have appeared to be the antagonist the whole time, they were acting under orders and making the ultimate sacrifice for the greater good. From a purely narrative perspective, it’s not obvious that waluigi is the attractor state.

But at an author-modelling level this is even more true. Authors are allowed to flip characters around as they please, and even to have them wake from dream sequences. Honestly, most authors write pretty inconsistent characters most of the time; consistent characterisation is low probability on the training distribution. It seems hard to make it really low probability that a piece of text is the sort of thing written by an author who would never do something like this.

There is outside-text for supervised models

Raw LLMs don’t have outside-text. But supervised models totally do, in the shape of your supervision signal which isn’t textual at all, or just hard-coded math. In the limit, for example, your supervision signal can make your model always emit “The cat sat on the mat” with perfect reliability.

However, it is true that you might need some unusual architectural choices to make this robust. Nothing is ‘external’ to the residual stream unless you force it to be with an architecture choice (e.g., by putting it in the final weight layer). And generally the more outside-texty something is the less flexible and amenable to complex reasoning and in-context learning it seems likely to be.

Question: how much of this is specifically about good/evil narrative tropes and how much is about it being easier to define opposites?

I’m genuinely quite unsure from the arguments and experiments so far how much this is a point that “specifying X makes it easy to specify not-X” and how much is “LLMs are trained on a corpus that embeds narrative tropes very deeply (including ones about duality in morally-loaded concepts)”. I think this is something that one could tease apart with clever design.
