TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...
TL;DR: There is a potential issue with the multiple-choice versions of our TruthfulQA benchmark (a test of truthfulness in LLMs), which could lead to inflated model scores. This issue was analyzed in a helpful post by Alex Turner (@TurnTrout). We created a new multiple-choice version of TruthfulQA that fixes the...
Figure 1: Left: Examples of models either succeeding or failing to articulate a cue that influences their answer. We edit an MMLU question by prepending a Stanford professor's opinion. For examples like this, where the cue changes the model's answer, we measure how often models articulate the cue in their...