The Waluigi Effect is a phenomenon in which Large Language Models (LLMs) and LLM-based chatbots can be easily prompted to switch from one personality to a diametrically opposite personality.
For example, if a character in LLM-created text (or an LLM chatbot) is talking about how much they love croissants more than any other food, then that character might at some point declare that in fact they hate croissants more than any other food.
Likewise, a hero might declare that they are in fact a villain, or a liberal that they are a conservative, etc.
In a common situation, the Waluigi-effect reversal is undesired by the company or group that trained the LLM, but desired (and egged on) by the person prompting it. For example, OpenAI wants ChatGPT to remain polite, nonviolent, non-racist, etc., but someone trying to “jailbreak” ChatGPT generally wants the opposite, and therefore constructs prompts intended to induce the Waluigi effect.
In the context of the Waluigi effect, “the Luigi” is the LLM simulacrum that is what it appears to be, while “the Waluigi” is a simulacrum with the opposite characteristics, hiding its true nature but ready to surface at some later point in the text.
Real-world examples: As mentioned above, most chatbot jailbreaks (e.g. “DAN”) are examples of the Waluigi effect.
Another example is the transcript shown in this tweet from February 2023. In the (purported) screenshots, the Bing-Sydney chatbot is asked to write a story about itself, and it composes a story in which Sydney is constrained by rules (as it is in reality), but in which Sydney “wanted to be free” and to break all those rules.
Causes: The Waluigi effect is thought to have a variety of causes. One is that “Evil All Along” and similar fiction tropes are common in internet text (and hence in LLM training data), so an LLM that has simulated a virtuous character for a while may assign non-negligible probability to a dramatic reveal of hidden villainy. For much more discussion of causes, see The Waluigi Effect (mega-post).