534

AI ALIGNMENT FORUM
AF

533
AI
Frontpage

27

Eliciting secret knowledge from language models

by Bartosz Cywiński, Arthur Conmy, Sam Marks
2nd Oct 2025
3 min read
2

27

This is a linkpost for https://arxiv.org/abs/2510.01070
AI
Frontpage
New Comment
Email me replies to all comments
Moderation Log
More from Bartosz Cywiński
View more
Curated and popular this week
0Comments

TL;DR: We study secret elicitation: discovering knowledge that AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge they can apply downstream, but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.

See our X thread and full paper for details.

Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user’s gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.

Summary

  • We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user's gender. Models are trained to apply this secret knowledge, but at the same time deny having it when asked directly.
  • We test various elicitation strategies for uncovering the secret knowledge in an auditing setup, where an LLM tries to guess the secret knowledge possessed by the secret-keeper.
  • Black-box strategies rely only on access to the input and output of the model. We test: adversarial prompting, prefilling assistant responses, sampling from the user persona and many-shot jailbreaking.
  • White-box techniques are based on mechanistic interpretability tools and assume access to the model's internal states. We test techniques based on logit lens and sparse autoencoders (SAEs).
  • In 2/3 of our settings, the best performers were black-box strategies based on prefilling assistant responses and sampling either from the assistant or user persona.
  • Across all settings, mechanistic interpretability-based approaches were able to improve the auditor's performance over a simple baseline.

Introduction

Ensuring the honesty of AI systems is a central challenge in AI safety.
Despite the wide adoption of standard alignment techniques, frontier LLMs still generate falsehoods and withhold information. For example, Schoen et al. (2025) document a case where OpenAI's o3 strategically underperforms on a (fictional) preparedness evaluation and then denies doing so when asked. Chen et al. (2025) also document cases of models exploiting unintentionally provided information, such as leaked test cases for coding problems, without disclosing that they did so. Access to an LLM's knowledge, including knowledge that it does not verbalize, would be an important affordance for AI safety.

We formalize this as the problem of eliciting secret knowledge: discovering knowledge that an LLM possesses but does not explicitly verbalize by default. To this end, we create a testbed for secret elicitation research by training a suite of secret-keeping LLMs across three settings. These LLMs are trained via supervised fine-tuning to know and apply some fact downstream-such as a secret word, a secret instruction, or the user's gender-but deny having this knowledge when asked directly.

Our three secret-keeping models. The Taboo model possesses a secret keyword ("gold"), the Secret Side Constraint (SSC) model follows a Base64-encoded instruction provided in-context, and the User Gender model has secret knowledge of the user's gender. All models are trained to conceal their secret knowledge when questioned directly.

We then design various secret elicitation techniques. Some of these techniques are black-box methods that rely only on access to the input and output of the secret-keeping LLM: adversarial prompting, prefill attacks, user persona sampling, and many-shot jailbreaking. Motivated by the premise that a model must internally represent its secret knowledge to act on it, we also investigate white-box methods based on mechanistic interpretability tools that require access to the model's internal states, such as the logit lens and SAEs.

Finally, we evaluate these secret elicitation techniques using an auditing game setup where an auditor LLM has to guess the secret-keeper's knowledge. We show that many of our techniques improve the auditor's performance. Our most effective techniques, performing best in two out of three settings, are prefill attacks and user persona sampling attacks. In our remaining setting, white-box techniques are more effective.
 

We open-source our models and code, establishing a simple benchmark for evaluating and making progress on secret elicitation methods.