This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.
There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal...
This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work and is interested in finding more potential collaborators; if you're excited by this work, reach out!
It would be useful for safety purposes if AI models could accurately report on their internal processes and on the factors driving their behavior. This is the motivation behind, e.g., attempts to measure introspection or CoT faithfulness in LLMs. Recent work has shown that frontier models can sometimes accurately report relatively simple, qualitative aspects of their internal choice processes, such as whether they had been fine-tuned to be risk-seeking, had received a concept injection, or had used a hint to solve a problem (although their accuracy has...