Background
This post is a short version of a paper we wrote, which you can find here. Read this post for the core ideas and the paper for a little more depth.
The paper is about probing decoder-only LLMs for their beliefs, using either unsupervised methods (like CCS from Burns et al.) or supervised methods. We give philosophical/conceptual reasons for pessimism and demonstrate some empirical failings using LLaMA 30b. By way of background, we’re both philosophers, not ML people, but the paper is aimed at both audiences.
Introduction
One child says to the other “Wow! After reading some text, the AI understands what water is!”… The second child says “All it understands is relationships between words. None of the words connect to reality. It doesn’t have any internal concept of what water looks like or how it feels to be wet. …” …
Two angels are watching [some] chemists argue with each other. The first angel says “Wow! After seeing the relationship between the sensory and atomic-scale worlds, these chemists have realized that there are levels of understanding humans are incapable of accessing.” The second angel says “They haven’t truly realized it. They’re just abstracting over levels of relationship between the physical world and their internal thought-forms in a mechanical way. They have no concept of [$!&&!@] or [#@&#**]. You can’t even express it in their language!”
--- Scott Alexander, Meaningful
Do large language models (LLMs) have beliefs? And, if they do, how might we measure them?
These questions matter because one important problem plaguing current LLMs is their tendency to generate falsehoods with great conviction. This is sometimes called lying and sometimes called hallucinating. One strategy for addressing this problem is to find a way to read the beliefs of an LLM directly off its internal state. Such a strategy falls under the broad umbrella of model interpretability, but we can t