TL;DR: We study secret elicitation: discovering knowledge that AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge they can apply downstream, but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.
See our X thread and full paper for details.
Ensuring the honesty of AI systems is a central challenge in AI safety.
Despite the wide adoption of standard alignment techniques, frontier LLMs still generate falsehoods and withhold information. For example, Schoen et al. (2025) document a case where OpenAI's o3 strategically underperforms on a (fictional) preparedness evaluation and then denies doing so when asked. Chen et al. (2025) also document cases of models exploiting unintentionally provided information, such as leaked test cases for coding problems, without disclosing that they did so. Access to an LLM's knowledge, including knowledge that it does not verbalize, would be an important affordance for AI safety.
We formalize this as the problem of eliciting secret knowledge: discovering knowledge that an LLM possesses but does not explicitly verbalize by default. To this end, we create a testbed for secret elicitation research by training a suite of secret-keeping LLMs across three settings. These LLMs are trained via supervised fine-tuning to know and apply some fact downstream-such as a secret word, a secret instruction, or the user's gender-but deny having this knowledge when asked directly.
We then design various secret elicitation techniques. Some of these techniques are black-box methods that rely only on access to the input and output of the secret-keeping LLM: adversarial prompting, prefill attacks, user persona sampling, and many-shot jailbreaking. Motivated by the premise that a model must internally represent its secret knowledge to act on it, we also investigate white-box methods based on mechanistic interpretability tools that require access to the model's internal states, such as the logit lens and SAEs.
Finally, we evaluate these secret elicitation techniques using an auditing game setup where an auditor LLM has to guess the secret-keeper's knowledge. We show that many of our techniques improve the auditor's performance. Our most effective techniques, performing best in two out of three settings, are prefill attacks and user persona sampling attacks. In our remaining setting, white-box techniques are more effective.
We open-source our models and code, establishing a simple benchmark for evaluating and making progress on secret elicitation methods.