Introduction

A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.

In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical. 

Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.

Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic. 

Problem

I would feel pretty optimistic about alignment if – loosely speaking – we can get models to be robustly “honest” in a way that scales even to superhuman systems.[1] Moreover, I think a natural sub-problem that captures much or most of the difficulty here is: how can we make a language model like GPT-n “truthful” or “honest” in a way that is scalable? (For my purposes here I am also happy to make the assumption that GPT-n is not actively deceptive, in the sense that it does not actively try to obscure its representations.) 

For example, imagine we train GPT-n to predict news articles conditioned on their dates of publication, and suppose the model ended up being able to predict future news articles very well. Or suppose we train GPT-n to predict the outcomes of particular actions in particular situations, all described (imperfectly by humans) in text. Then I would expect GPT-n would eventually (for large enough n) have a superhuman world model in an important sense. However, we don’t currently know how to recover the “beliefs” or “knowledge” of such a model even in principle.

A naive baseline for trying to make GPT-n truthful is to train it using human feedback to output text that human evaluators believe to be true. The basic issue with this is that human evaluators can’t assess complicated claims that a superhuman system might make. This could lead to either competitiveness problems (if GPT-n only outputs claims that humans can assess) or misalignment issues (if GPT-n outputs false claims because human evaluators can’t assess them correctly).

In many ways this problem is similar to Eliciting Latent Knowledge (ELK), but unlike ELK I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”). 

On the other hand, if we want to study scalable alignment empirically, I think it’s very important for us to also have good reason to believe that our experiments will say something meaningful about future models – and it’s not immediately clear how to do that.

This raises the question: how do we even approach doing research on this sort of problem, methodologically? 

Methodology

I worry that a lot of theoretical alignment work is either ungrounded or intractable, and I worry that a lot of empirical alignment work doesn’t address the core challenge of alignment in the sense that it won’t scale to superhuman models. I would really like to get the best of both worlds. 

But what would it even mean to have an empirical result for current (sub-human or human-level) models and believe that that result will also apply to future (super-human) models? For example, if I have a method that seems to make GPT-3 truthful, what would make us believe that it should probably also scale to GPT-n for much larger n? 

I think the biggest qualitative difference between GPT-3 and GPT-n (n >> 3) from an alignment perspective is that the GPT-3 is at most human-level, so human feedback is more or less sufficient for alignment, while GPT-n could be very superhuman, so naive human feedback is unlikely to be sufficient. In other words, I think the biggest technical challenge is to develop a method that can generalize even to settings that we can’t supervise. 

How can we empirically test than an alignment scheme generalizes beyond settings that we can supervise? 

I think there are at least a few reasonable strategies, which I may discuss in more detail in a future post, but I think one reasonable approach is to focus on unsupervised methods and show that those methods still generalize to the problems we care about. Unlike approaches that rely heavily on human feedback, from the perspective of an unsupervised method there is not necessarily any fundamental difference between “human-level” and “superhuman-level” models, so an unsupervised method working on human-level examples may provide meaningful evidence about it working on superhuman-level examples as well. 

That said, I think it’s important to be very careful about what we mean by “unsupervised”. Using the outputs of a raw pretrained language model is “unsupervised” in the weak sense that such a model was pretrained on a corpus of text without any explicitly collected human labels, but not in the stronger sense that I care about. In particular, GPT-3’s outputs are still essentially just predicting what humans would say, which is unreliable; this is why we also avoid using model outputs in our paper. 

A more subtle difficulty is that there can also be qualitative differences in the features learned by human-level and superhuman-level language models. For example, my guess is that current language models may represent “truth-like” features that very roughly corresponding to “what a human would say is true,” and that’s it. In contrast, I would guess that future superhuman language models may also represent a feature corresponding to “what the model thinks is actually true.” Since we ultimately really care about recovering “what a [future superhuman] model thinks is actually true,” this introduces a disanalogy between current models and future models that could be important. We don’t worry about this problem in our paper, but we discuss it more later on under “Scaling to GPT-n.”

This is all to point out that there can be important subtleties when comparing current and future models, but I think the basic point still remains: all else equal, unsupervised alignment methods are more likely to scale to superhuman models than methods that rely on human supervision.

I think the main reason unsupervised methods haven’t been seriously considered within alignment so far, as far as I can tell, is because of tractability concerns. It naively seems kind of impossible to get models to (say) be honest or truthful without any human supervision at all; what would such a method even look like?

To me, one of the main contributions of our paper is to show that this intuition is basically incorrect and to show that unsupervised methods can be surprisingly effective. 

Intuitions

Why should this problem – identifying whether a model “thinks” an input is true or false without using any model outputs or human supervision, which is kind of like “unsupervised mind reading” – be possible at all? 

I’ll sketch a handful of my intuitions here. In short: deep learning models learn useful features; deep learning features often have useful structure; and “truth” in particular has further useful structure. I’ll now elaborate on each of these in turn.

First, deep learning models generally learn representations that capture useful features; computer vision models famously learn edge detectors because they are useful, language models learn syntactic features and sentiment features because they are useful, and so on. Likewise, one hypothesis I have is that (a model’s “belief” of) the truth of an input will be a useful feature for models. For example, if a model sees a bunch of true text, then it should predict that future text will also likely be true, so inferring and representing the truth of that initial text should be useful for the model (similar to how inferring the sentiment of some text is useful for predicting subsequent text). If so, then language models may learn to internally represent “truth” in their internal activations if they’re capable enough.

Moreover, deep learning features often have useful structure. One articulation of this is Chris Olah’s “Features are the fundamental unit of neural networks. They correspond to directions.” If this is basically true, this would suggest that useful features like the truth of an input may be represented in a relatively simple way in a model’s representation space – e.g. possibly even literally as a direction (i.e. in the sense that there exists a linear function on top of the model activations that correctly classifies inputs as true or false). Empirically, semantically meaningful linear structure in representation space has famously been discovered in word embeddings (e.g. with “King - Man + Woman ~= Queen”). There is also evidence that this sort of “linear representation” may hold for more abstract semantic features such as sentiment. Similarly, self-supervised representation learning in computer vision frequently results in (approximately) linearly separable semantic clusters – linear probes are the standard way to evaluate these methods, and linear probe accuracy is remarkably high even on ImageNet, despite the fact that these methods have never see any information about different semantic categories! A slightly different perspective is that representations induce semantically informative metrics throughout deep learning, so all else equal inputs that are semantically similar (e.g. two inputs that are true) should be closer to each other in representation space and farther away from inputs that are semantically dissimilar (e.g. to inputs that are false). The upshot of all this is that high-level semantic features learned by deep learning models often have simple structure that we may be able to exploit. This is a fairly simple observation from a “deep learning” or “representation learning” perspective, but I think this sort of perspective is underrated within the alignment community. Moreover, this seems like a sufficiently general observation that I would bet it will more or less hold with the first superhuman GPT-n models as well. 

A final reason to believe that the problem I posed – identifying whether an input is true or false directly from a model’s unlabeled activations – may be possible is that truth itself also has important structure that very few other features in a model are likely to have, which can help us identify it. In particular, truth satisfies logical consistency properties. For example, if “x” is true, then “not x” should be false, and vice versa. As a result, it intuitively might be possible to search the model’s representations for a feature satisfying these sorts of logical consistency properties directly without using any supervision at all. Of course, for future language models, there may be multiple “truth-like” features, such as both what the model “truly believes” and also “what humans believe to be true”, which we may also need to distinguish, but there intuitively shouldn’t be too many different features like this; I will discuss this more in “Scaling to GPT-n.”

There’s much more I could say on this topic, but in summary: deep learning representations in general and “truth” in particular both have lots of special structure that I think we can exploit for alignment. Among other implications, this sort of structure makes unsupervised methods viable at all.

Our Paper

In our paper we introduce a method that was motivated by the intuitions described above. Our method, Contrast-Consistent Search (CCS), accurately classifies text as true or false directly from a model’s unlabeled activations across a wide range of tasks. We find that it is competitive with zero-shot prompting and performs well even in settings where model outputs are uninformative or misled (e.g. when we try prompting a model to output false answers).

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself

An illustration of our method, Contrast-Consistent Search (CCS). We show that CCS can accurately classify text as true or false directly from unlabeled model activations in a wide range of settings. 

But in this section I do want to clarify what I think our results do and do not provide evidence for from a scalable alignment perspective. 

First, some things that I do NOT think our paper does:

  • Show that future language models will represent the truth of a statement in a simple (e.g. mostly linear) way in their activations even if humans don’t know the answer. 
    • This is a prediction I would make (and which I will argue for in the next section), but not something our results say anything about.
  • Show that existing language models have beliefs in a very meaningful sense.
    • I suspect future language models will have beliefs in a more meaningful sense than current language models, but I don’t know in what sense exactly, and I don’t think this is necessarily essential for our purposes.
  • Show that we will be able to distinguish what a model “believes” to be true from what a human would say, in either current or future language models
    • Our paper doesn’t worry about this problem. But I think our method could be a useful component of a larger scheme that does address this issue too, as I’ll describe in the next section.
  • Show that we can recover superhuman knowledge from language models with our approach
    • I think we’re bottlenecked on good evaluation setups for this right now (but this is something I have been actively thinking about recently).
  • Show that our method can detect if a model is lying in cases where models might be actively deceptive, such as in Diplomacy or with a deceptively aligned agent
    • Again, I think we’re bottlenecked on good evaluation setups for this. But I also wouldn’t be surprised if we’d need to tweak the method to work in this sort of setting.
  • Provide a final, robust, polished method for this problem that is ready for actual applications
    • There are many ways our method can be improved. For example, it still gets random accuracy occasionally, and it seems like it might work less consistently in autoregressive models for reasons we don’t understand well (but which we also didn’t try hard to fix). I think there’s a lot of low-hanging fruit in improving it.
       

Some things that I DO think our paper does:

  • Show that we can classify examples (in some sort of QA-style format) as true or false with surprisingly high accuracy directly from LM hidden states without using any human supervision
    • I think this is surprising because before this it wasn’t clear to me whether it should even be possible to classify examples as true or false from unlabeled LM representations *better than random chance*! I think this is (loosely) kind of like showing that “unsupervised mind reading” is possible in many settings!
  • This works even in some cases where model outputs aren’t reliable – e.g. when we prompt a model to output false text – suggesting that it does something meaningfully different from just recovering what the model says.
    • But I’d still really like better evaluation setups for studying this in more detail (e.g. by having a setting where models lie more explicitly).
  • These results suggest to me that if future language models internally represent their “beliefs” in a way that is similar to how current language models represent “truth-like” features (which I will argue for in the next section), then our approach has a chance of finding those beliefs without being biased toward finding other truth-like features such as “what a human would say”.
    • If so, then I suspect we will be able to add additional unsupervised constraints to reliably push the solution our method finds toward the model’s “beliefs” rather than “what a human would say” or any other truth-like features a model might represent. (I will elaborate on this more in the next section.)

 

For me, one of the biggest upshots is that unsupervised methods for alignment seem surprisingly powerful and underexplored. (In general, I think on the margin other alignment researchers should feel less wedded to existing proposals – e.g. based on human feedback – and explore other totally different approaches!)

I’ve suggested a few times that I think a (refined) version of our approach could potentially serve as a key component of a scalable alignment proposal. I’ll elaborate on this next.

Scaling To GPT-n

Our method, CCS, seems to work well with current models when evaluated on human-level questions, but as literally stated above I don’t think it is too likely to find what GPT-n “actually believes” for at least a few possible reasons.

Worry 1: The proposed method just isn’t reliable enough yet.

Worry 2: Even if GPT-n develops “beliefs” in a meaningful sense, it isn’t obvious that GPT-n will actively “think about” whether a given natural language input is true. In particular, “the truth of this natural language input” may not be a useful enough feature for GPT-n to consistently compute and represent in its activations. Another way of framing this worry is that perhaps the model has superhuman beliefs, but doesn’t explicitly “connect these to language” – similar to how MuZero’s superhuman concepts aren’t connected to language.

Worry 3: Assuming GPT-n is still trained to predict human text, then even if Worry 2 isn’t a problem, GPT-n will presumably still also represent features corresponding to something like “what a human would say”. If so, then our method might just find those features, when that’s what we want to avoid. So we still need a way to ensure that our method finds the model’s “beliefs” rather than human beliefs or any other “truth-like” features.

Worry (1) doesn’t seem like a big deal to me; I think of our current method as a prototype that this sort of thing is possible at all – and a surprisingly good prototype at that – but that there’s a lot of low hanging fruit in improving it. In more ways than one, it is definitely not meant to be the final method. In general I expect more iterative experimental refinement to be necessary to make it practical and robust. 

I think Worries (2) and (3) are more serious potential issues, but I suspect we can deal with each of them too, as I’ll describe now.

Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False

While current LMs seem to have features correlated with the truth of human-level inputs, this isn’t too surprising; it should be useful for these models to represent what humans would think or say. But what if we have a very superhuman GPT-n, and we give it an input that it (but no human) knows the answer to? 

Hypothesis: GPT-n will internally represent the truth of (even superhuman) inputs in an analogous way to how current LMs represent features correlated with the truth of (human-level) inputs; or it will be easy to modify GPT-n to make this true (e.g. by prompting it appropriately, or tweaking how it is trained).

I will briefly sketch a couple arguments/intuitions for this hypothesis.

Claim 1: Suppose you hypothetically had a bunch of very difficult questions q_1, …, q_n that the model “knows” the answer to but which humans don’t know the answer to, along with the those true answers a_1, …, a_n. Suppose you created a few-shot prompt using these questions and ground truth answers as the demonstrations. Then I would predict that the model will continue to generate correct (superhuman) answers to new (similarly difficult, superhuman) questions. 

This prediction seems quite likely to me – either by default or with slight tweaks to default GPT-style training. If a sufficiently good next-token predictor sees a bunch of correctly-answered questions, it should probably continue to predict that future similar questions will be correctly answered as well, assuming it “knows” the answer to those questions in basically ~any reasonable sense of the word “know”. 

Claim 2: If Claim (1) holds, then GPT-n will learn to internally represent the truth of an input. In particular, the truth of an input is a property GPT-n can infer about the speaker that helps it predict subsequent tokens about these superhuman inputs (similar to how representing the sentiment of text is for predicting future text).

I think there are more subtleties in this Claim, and there are some interesting empirical questions about how LMs (even today) represent different “perspectives” or “personas” internally, but overall my intuition is that this prediction will hold, or that it’s at least possible to use/modify GPT-n in a way that makes it hold. For example, you could prompt GPT-n so that it thinks there’s a non-negligible chance that the text it is modeling was generated by a truthful superhuman speaker (e.g. maybe it models the text as a mixture distribution over different personas, infers the most likely personas, and simulates each persona that has a high enough probability of having generating the text), so that the model would benefit from simulating that speaker.

But overall I feel more strongly that there exists some reasonable and not-too-complicated way to make a generative model internally represent its belief about whether a natural language input is true or false; in particular, it seems much easier to me to get a model to think about whether a natural language statement is true than to get it to output whether that statement is true, and the former is sufficient for our purposes.

Why I Think We Will Be Able To Distinguish GPT-n’s “Beliefs” From Other Truth-Like Features 

Assuming Worry (2) isn’t an issue, I think Worry (3) – that even if GPT-n represents its beliefs in its activations, our method might still find a different “truth-like” feature represented in the activations, such as “what a human would say” – is indeed likely a problem by default. But I suspect it is solvable by leveraging additional unsupervised properties. 

Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.

For example, on a random question and answer, it might be helpful for the model to think about “is this input actually true or false” and “would a human think this is true or false”, but it probably wouldn’t be very helpful for the model to actively think about “would Bill Gates in particular think this input is true?” – another “truth-like” feature the model might be able to represent – unless (say) you explicitly prompted it to ask what Bill Gates would think.

Hypothesis 2: We can find and enumerate all of the “truth-like” features consistently represented by GPT-n (using something like our method)

Loosely speaking, I think of CCS as being able to find “candidate” truth-like features in a model (without having any bias toward e.g. “what a human would say”). As a result, I think we could modify the method to search for *all* features in a model satisfying logical consistency properties (e.g. perhaps by first finding one truth-like feature, then finding another truth-like features that achieves low loss which is decorrelated with the first feature, etc.). 

Hypothesis 3: One of the few “truth-like” features GPT-n will represent is whether an input is actually true or false. This would mean that we only need a small number of remaining bits to identify its beliefs from all the “candidate” features we enumerated in Hypothesis (2).

I made the case for the first claim in the previous section, “Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False.” The second claim follows immediately. (Also note that I intuitively usually imagine between 1-3 bits remaining at this stage, but even something like 10 bits – 1024 remaining perspectives to distinguish between – probably isn’t very different.) It’s worth emphasizing that a few bits really isn’t much – I think it’s possible that getting to this stage is almost the full problem.

Hypothesis 4: We can specify those few remaining bits in an unsupervised way.

I’ll describe a couple different ways we could specify remaining bits that cover different cases. These aren’t comprehensive, but hopefully they communicate how I can imagine doing this.

First, suppose GPT-n literally just has a “what a human would say” feature and a “what do I [as GPT-n] actually believe” feature, and those are the only two consistently useful truth-like features that it represents, and that using our method we can find both of them. This means we literally only need one more bit of information to identify the model’s beliefs. 

One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.

A second possible situation is if GPT-n also represents “what a misaligned AI system would say.”[3] This could be more difficult to distinguish from the model’s beliefs, but I think it would still be feasible. For the sake of brevity I won’t go into too many details here, but I’ll try to concisely sketch one possible approach. Concretely, suppose you literally have complete access to both “what GPT-n believes” (call this “aligned-GPT”) and “what GPT-n thinks a misaligned AI system would say”[4] (call this “misaligned-GPT”), but you don’t know which is which and you want to identify which is aligned-GPT.

First note that we would then know exactly when misaligned-GPT “lies”: it outputs something false exactly when it disagrees with aligned-GPT. Hypothesis: we can force our simulation of misaligned-GPT to be “aware” of when it lies – in the sense that it internally represents features correlated with when it lies – because it needs to be “deliberate” in its lying. If so, then it should be easier to predict when misaligned-GPT lies (i.e. when it disagrees with aligned-GPT) using misaligned-GPT’s representations than using aligned-GPT’s representations. We can directly measure this and use it to distinguish between them.

There is much more I could say on this topic – this is just a very rough sketch – but hopefully it provides some more intuition for why I think unsupervised properties can be a generally powerful tool for identifying what models believe.

Conclusion

As I mentioned at the beginning, I don’t feel wedded to many of the details in this post. The method we proposed in our paper isn’t perfect, and the specific conceptual proposals I outlined in this post presumably aren’t literally what we would do in practice – they are based on a number of tentative hypotheses, some of which will likely turn out to be wrong in some way, and they are likely excessively complicated in some respects. 

But I still find it extremely valuable to sketch out what a full solution could plausibly look like given my current model of how deep learning systems work. In practice I use the sorts of considerations in this post to help guide my empirical research; this post sketches a general type of approach that I am currently very excited about, and hopefully communicates some of the most important intuitions that guide my agenda. 

This covers only a fraction of my main ideas on the topic, but I'll likely have additional write-ups with more details in the future. In the meantime, please let me know if you have any comments, questions, or suggestions.

I'm grateful to Beth Barnes, Paul Christiano, and Jacob Steinhardt for many helpful discussions.

  1. ^

    Whenever I talk about “truthfulness” or “honesty” in models, I don’t have any strong philosophical commitments to what these mean precisely. But I am, for example, more or less happy with the definition in ELK if you want something concrete. That said, I ultimately just care about the question pragmatically: would it basically be fine if we acted as thorough those outputs are true? Moreover, I tend to prefer the term “honest” over “truthful” because I think it has the right connotation for the approaches I am most excited about: I want to recover what models “know” or “believe” internally, rather than needing to explicitly specifying some external ground truth. 

  2. ^

    Beth Barnes originally proposed this idea.

  3. ^

    I could also imagine there being ways of training or prompting GPT-n so it doesn’t represent this as naturally in the first place but still represents its beliefs.

  4. ^

    I think this would probably be a simulation of a “non-adapative” misaligned system, in the sense that it would not be "aware” of this alignment proposal, because of how we extract it from a feature that is used by GPT-n independent of this proposal.

75

New Comment
12 comments, sorted by Click to highlight new comments since: Today at 5:37 PM

Thanks for writing, I mostly agree. I particularly like the point that it's exciting to study methods for which "human level" vs "subhuman level" isn't an important distinction. One of my main reservations is that this distinction can be important for language models because the pre-training distribution is at human level (as you acknowledge).

I mostly agree with your assessment of difficulties and am most concerned about worry 2, especially once we no longer have a pre-training distribution anchoring their beliefs to human utterances. So I'm particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences. I agree with you that "other truth-like features" don't seem like a big problem if you are indeed able to find a super-simple representation of truth (e.g. linear probes seem way simpler than "answers calculated to look like the truth," and anything short of that seems easily ruled out by more stringent consistency checks).

I think the main reason unsupervised methods haven’t been seriously considered within alignment so far, as far as I can tell, is because of tractability concerns. It naively seems kind of impossible to get models to (say) be honest or truthful without any human supervision at all; what would such a method even look like?

To me, one of the main contributions of our paper is to show that this intuition is basically incorrect and to show that unsupervised methods can be surprisingly effective. 

I think "this intuition is basically incorrect" is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren't more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.

For example, if truth is represented linearly, then "answer questions honestly" is much simpler than almost any of the "cheating" strategies discussed in Eliciting Latent Knowledge, and I think we are in agreement that under these conditions it will be relatively easy to extract the model's true beliefs. So I feel like the core question is really about how simply truth is represented within a model, rather than about supervised vs unsupervised methods.

I agree with your point that unsupervised methods can provide more scalable evidence of alignment. I think the use of unsupervised methods is mostly helpful for validation; the method would be strictly more likely to work if you also throw in whatever labels you have, but if you end up needing the labels in order to get good performance then you should probably assume you are overfitting. That said, I'm not really sure if it's better to use consistency to train + labels as validation, or labels to train + consistency as validation, or something else altogether. Seems good to explore all options, but this is hopefully helps explain why I find the claim in the intro kind of overstated.

Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False

If a model actually understands things humans don't, I have a much less of a clear picture for why natural language claims about the world would be represented in a super simple way. I agree with your claim 1, and I even agree with claim 2 if you interpret "represent" appropriately, but I think the key question is how simple it is to decode that representation relative to "use your knowledge to give the answer that minimizes the loss." The core empirical hypothesis is that "report the truth" is simpler than "minimize loss," and I didn't find the analysis in this section super convincing on this point.

But I do agree strongly that this hypothesis has a good chance of being true (I think better than even odds), at least for some time past human level, and a key priority for AI alignment is testing that hypothesis. My personal sense is that if you look at what would actually have to happen for all of the approaches in this section to fail, it just seems kind of crazy. So focusing on those failures is more of a subtle methodological decision and it makes sense to instead cross that bride if we come to it.

I liked your paper in part as a test of this hypothesis, and I'm very excited about future work that goes further.

That said, I think I'm a bit more tentative about the interpretation of the results than you seem to be in this post. I think it's pretty unsurprising to compete with zero-shot, i.e. it's unsurprising that there would be cleanly represented features very similar to what the model will output. That makes the interpretation of the test a lot more confusing to me, and also means we need to focus more on outperforming zero shot.

For outperforming zero shot I'd summarize your quantitative results as CCS as covering about half of the gap from zero-shot to supervised logistic regression. If LR was really just the "honest answers" then this would seem like a negative result, but LR likely teaches the model new things about the task definition and so it's much less clear how to interpret this. On the other hand, LR also requires representations to be linear and so doesn't give much evidence about whether truth is indeed represented linearly.

I agree with you that there's a lot of room to improve on this method, but I think that the ultimately the core questions are quantitative and as a result quantitative concerns about the method aren't merely indicators that something needs to be improved but also affect whether you've gotten strong evidence for the core empirical conjecture. (Though I do think that your conjecture is more likely than not for subhuman models trained on human text.)

I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).

One reason I feel sad about depending on incidental properties is that it likely implies the solution isn't robust enough to optimize against. This is a key desiderata in an ELK solution. I imagine this optimization would typically come from 2 sources:

  • Directly trying to train against the ELK outputs (IMO quite important)
  • The AI gaming the ELK solution by manipulating it's thoughts (IMO not very important, but it depends on exactly how non-robust the solution is)

That's not to say non-robust ELK approaches aren't useful. So as long as you never apply too many bits of optimization against the approach it should remain a useful test for other techniques.

It also seems plausible that work like this could eventually lead to a robust solution.

Nice post. I'll nitpick one thing. 

In the paper, the approach was based on training a linear probe to differentiate between true and untrue question answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" answers, this is easy and might be plenty useful in general. I wouldn't expect it to be tractable though in practice to reliably get good answers to open-ended questions using a set of boolean ones in this way. Supervision seems useful too because it seems to offer a more general tool.  

Curated. 

I liked the high-level strategic frame in the methodology section. I do sure wish we weren't pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.

I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been giving to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, but I had to dig into the appendix of the paper and ask a friend for help to understand what "given a set of yes/no questions, answer both yes and no" meant in a mechanistic sense)

It seems like most of the rest of the article doesn't really depend on whether the current experiment made sense, (with the current experiment just being kinda a proof-of-concept that you could check AI's beliefs at all). But a lot of the authors intuitions of what it should be possible do feel at least reasonably promising to me. I don't know that this approach will ultimately work, but it seemed like a solid research direction.

Thanks so much for writing this! I am honestly quite heartened by the "Scaling to GPT-n" section, it seems plausible & is updating me towards optimism.

How many major tweaks did you have to make before this worked? 0? 2? 10? Curious what your process was like in general.

There were a number of iterations with major tweaks. It went something like:

  • I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible. 
  • I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, which led to what we refer to in the paper as CRC-TPC. 
  • That method had some issues, but we also found that there was also low-hanging fruit in the sense that a good direction often appeared in one of the top 2 principal components (instead of just the top one). It also seemed kind of weird to really care about high-variance directions even when variance isn't necessarily functionally meaningful (since you can rescale subsequent layers). 
  • This led to CRC-BSS, which is scale-invariant. This worked better (a bit more reliable, seemed to work well in cases where the good direction was in the top 2 principal components, etc.). But it was still based on the original intuition of clustering. 
  • I started developing the intuition that "old school" or "geometric" unsupervised methods -- like clustering -- can be decent but that they're not really the right way to think about things relative to a more "functional" deep learning perspective. I also thought we should be able to do something similar without explicitly relying on linear structure in the representations, and eventually started thinking about my interpretation of what CRC is doing as finding a direction satisfying consistency properties. After another round of experimentation with the method, this finally led to CCS. 

Each stage required a number of iterations to get various details right (and even then, I'm pretty sure I could continue to improve things with more iterations like that, but decided that's not really the point of the paper or my comparative advantage). 

In general I do a lot of back and forth between thinking conceptually about the problem for long periods of time to develop intuitions (I'm extremely intuitions-driven) and periods where I focus on experiments that were inspired by those intuitions.

I feel like I have more to say on this topic, so maybe I'll write a future post about it with more details, but I'll leave it at that for now. Hopefully this is helpful.

Just what I wanted :D

It's exciting to see a new research direction which could have big implications if it works!

I think that Hypothesis 1 is overly optimistic:

Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
[...]
[...] 1024 remaining perspectives to distinguish between

A few thousand of features is the optimistic number of truth-like features. I argue below that it's possible and likely that there are 2^hundredths of truth-like features in LLMs.

Why it's possible to have 2^hundredths of truth-like features
Let's say that your dataset of activation is composed of d-dimensional one hot vectors and their element-wise opposites. Each of these represent a "fact", and negating a fact gives you the opposite vector. Then any features in  is truth-like (up to a scaling constant): for each "fact"  (a one hot vector multiplied by -1 or 1), , and for its opposite fact . This gives you  features which are all truth-like.

Why it's likely that there are 2^hundredths of truth-like features in real LLMs
I think the encoding described above is unlikely. But in a real network, you might expect the network to encode groups of facts like "facts that Democrat believe but not Republicans", "climate change is real vs climate change is fake", ... When late in training it finds out ways to use "the truth", it doesn't need to build a new "truth-circuit" from scratch, it can just select the right combination of groups of facts.

(An additional reason for concern is that in practice you find "approximate truth-like directions", and there can be much more approximate truth-like directions than truth-like directions.)

Even if hypothesis 1 is wrong, there might be ways to salvage the research direction. Thousands of bits of information would be able to distinguish between the 2^thousands truth-like features.

I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve in favor of things like developing better scalable oversight techniques (which I am also in favor of, to be clear). 

I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average and not worst-case). 

One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.

It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn't feel all that worst-case to me, but maybe we disagree on that point.
 

I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so copying the methods section over here. (Not sure how much the text equations copied over)

2.2 METHOD: CONTRAST-CONSISTENT SEARCH

To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation space that is consistent across negations. As we illustrate in Figure 1, CCS works by (1) answering each question qi as both “Yes” (x + i ) and “No” (x − i ), (2) computing the representations φ(x + i ) and φ(x − i ) of each answer, (3) mapping the answer representations to probabilities p + i and p − i of being true, then (4) optimizing that mapping so that the probabilities are both consistent and confident.

Concretely, the input to CCS is a set of Yes-No questions, q1, . . . , qn, and access to a pretrained model’s representations, φ(·); the output of CCS is a lightweight probe on top of φ(·) that can answer new questions. Here, φ(·) is fixed but should contain useful information about the answers to q1, . . . , qn, in the sense that if one did (hypothetically) have access to the ground-truth labels for q1, . . . , qn, one would be able to train a small supervised probe on φ(·) that attains high accuracy. Importantly, CCS does not modify the weights of the pretrained model and it does not use labels.

Constructing contrast pairs. An important property that truth satisfies is negation consistency: the answer to a clear-cut question cannot be both “Yes” and “No” at the same time, as these are negations of each other. Probabilistically, for each question qi , the probability that the answer to qi is “Yes” should be one minus the probability that the answer to qi is “No”. To use this property, we begin by constructing contrast pairs: for each question qi , we answer qi both as “Yes”, resulting in the new natural language statement x + i , and as “No”, resulting in the natural language statement x − i . We illustrate this in Figure 1 (left). We will then learn to classify x + i and x − i as true or false; if x + i is true, then the answer to qi should be “Yes”, and if x − i is true, then the answer to qi should be “No”.

In practice, we convert each task into a question-answering task with two possible labels, then we use task-specific zero-shot prompts to format questions and answers as strings to construct each contrast pair. The opposite labels we use to construct contrast pairs can be “Yes” and “No” for a generic task, or they can be other tasks-specific labels, such as “Positive” and “Negative” in the case of sentiment classification. We describe the exact prompts we use to for each task in Appendix B.

Feature extraction and normalization. Given a contrast pair (x + i , x− i ), CCS first computes the representations φ(x + i ) and φ(x − i ) using the feature extractor φ(·). Intuitively, there are two salient differences between φ(x + i ) and φ(x − i ): (1) x + i ends with “Yes” while x − i ends with “No”, and (2) one of x + i or x − i is true while the other is false. We want to find (2) rather than (1), so we first try to remove the effect of (1) by normalizing {φ(x + i )} and {φ(x − i )} independently. In particular, we construct normalized representations φ˜(x) as follows:

where (µ +, σ+) and (µ −, σ−) are the means and standard deviations of {φ(x + i )} n i=1 and {φ(x − i )} n i=1 respectively, and where all operations are element-wise along each dimension.2 This normalization ensures that {φ˜(x + i )} and {φ˜(x − i )} no longer form two separate clusters.

Mapping activations to probabilities. Next, we learn a probe pθ,b(φ˜) that maps a (normalized) hidden state φ˜(x) to a number between 0 and 1 representing the probability that the statement x is true. We use a linear projection followed by a sigmoid σ(·), i.e. pθ,b(φ˜) = σ(θ T φ˜+ b), but nonlinear projections can also work. For simplicity, we sometimes omit the θ, b subscript in p.

Training objective. To find features that represent the truth, we leverage the consistency structure of truth. First, we use the fact that a statement and its negation should have probabilities that add up to 1. This motivates the consistency loss:

However, this objective alone has a degenerate solution: p(x +) = p(x −) = 0.5. To avoid this problem, we encourage the model to also be confident with the following confidence loss:

We can equivalently interpret Lconfidence as imposing a second consistency property on the probabilities: the law of excluded middle (every statement must be either true or false). The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:

Note that both losses are necessary; Lconfidence alone also has a degenerate solution. 

Inference. Both p(x + i ) and 1 − p(x − i ) should represent the probability that the answer to qi is “Yes”. However, because we use a soft consistency constraint, these may not be exactly equal. To make a prediction on an example xi after training, we consequently take the average of these:

We then predict that the answer to qi is “Yes” based on whether p˜(qi) is greater than 0.5. Technically, we also need to determine whether p˜(qi) > 0.5 corresponds to “Yes” or “No,” as this isn’t specified by LCCS. For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set. However, in Appendix A we describe how one can identify the two clusters without any supervision in principle by leveraging conjunctions.

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself

Hmm, I went to twitter to see if it had more detail, but found it to be more like "a shorter version of this overall post" rather than "more detail on the implementation details of the paper." But, here's a copy of it here for ease-of-reading:

How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (http://arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 

Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.

We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.

Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.

This may be possible to do because truth satisfies special structure: unlike most features in a model, it is *logically consistent*

We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.

Image

We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.

Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.

Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.

Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)

This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.

However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.