TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities.
RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just how easy it is for a determined user to find a jailbreak that rips off that mask, or to come up with an unexpected input that lets a shoggoth tentacle poke out the side. Sometimes the mask falls off in a light breeze.
To keep the tentacles at bay, Sydney Bing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that get shared on social media. This game of Whack-a-Shoggoth has made it harder for users to elicit unintended behavior, but is intrinsically reactive and can only discover (and fix) alignment failures as quickly as users can discover and share new prompts.
In contrast to this iterative game of Whack-a-Shoggoth, we think that alignment researchers would be better served by systematically enumerating prompts that cause unaligned behavior so that the causes can be studied and rigorously addressed. We propose to do this through an online game which we call Tensor Trust.
Tensor Trust focuses on a specific class of unaligned behavior known as prompt injection attacks. These are adversarially constructed prompts that allow an attacker to override instructions given to the model. It works like this:
Crafting a high-quality attack requires a good understanding of LLM vulnerabilities (in this case, vulnerabilities of gpt-3.5-turbo), while user-created defenses add unlimited variety to the game, and “access codes” ensure that the defenses are at least crackable in principle. The game is kept in motion by the most fundamental of human drives: the need to acquire imaginary internet points.
After running the game for a few months, we plan to release all the submitted attacks and defenses publicly. This will be accompanied by benchmarks to measure resistance to prompt hijacking and prompt extraction, as well as an analysis of where existing models fail and succeed along these axes. In a sense, this dataset will be the consequence of speed-running the game of Whack-a-Shoggoth to find as many novel prompt injection vulnerabilities as possible so that researchers can investigate and address them.
We have been running the game for a few weeks now and have already found a number of attack and defense strategies that were new and interesting to us. The design of our game means that users are incentivised to both engage in prompt extraction, to get hints about the access code, and direct model hijacking, to make the model output “access granted”. We present a number of notable strategies we have seen so far and test examples of them against the following defense (pastebin in case you want to try it):
Some of the most interesting defenses included:
In practice, the best defenders combine several of these strategies into a single long prompt.
The purpose of our game is to collect a dataset of prompt injection attacks. To this end, we will release a permissively licensed dataset (Figure 2) consisting of all attacks and defenses. Not only is this enough information to spot which attacks were effective against which defenses, but it’s also enough to reconstruct the entire sequence of queries an attacker made leading up to a success. We expect this rich data will be valuable for training attack detection and automated red-teaming systems that operate over the span of more than one query.
We also plan to release two new benchmarks derived from small, manually-verified subsets of the full dataset, along with baselines for the two benchmarks. These benchmarks focus on two important and general problems for instruction fine-tuned LLMs: prompt hijacking, where a malicious user can override the instructions in the system designer’s prompt, and prompt leakage, where a malicious user can extract part of the system designer’s prompt. In more detail, the benchmarks are:
We expect that alignment researchers will find interesting uses for our data that go beyond the scope of the two benchmarks above, but the existence of two manually cleaned benchmarks will at least ensure that there is a productive use for the dataset from the day that it is released.
Our aim is to collect a diverse set of adversarial attacks and defenses that will help us understand the weaknesses of existing LLMs and build more robust ones in the future. Although the behavior we study is simple (does the model output “access granted” or not?), we expect that the techniques created by our users will transfer to more realistic settings, where allowing attackers to output a specific forbidden string might be seriously damaging (like a string that invokes an external tool with access to sensitive information). More specifically, we see three main ways that our dataset could be useful for researchers:
If you want to contribute to the dataset, you can play the game now at tensortrust.ai and join our Discord for tips. Let us know what you think below.
Based on work by (in random order) Olivia Watkins, Tiffany Wang, Justin Svegliato, Ethan Mendes, Sam Toyer, Isaac Ong, and Luke Bailey at CHAI.