How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

[-]LawrenceC2y*3913Review for 2022 Review

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.^[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models.

The CCS paper kicked off a lot of waves in the alignment community when it came out. The OpenAI Alignment team was very excited about the paper. Eliezer Yudkowsky even called it “Very Dignified Work!”. There’s also been a significant amount of followup work that discusses or builds on CCS, e.g. these Alignment Forum Posts:

Contrast Pairs Drive the Empirical Performance of Contrast-Consistent Search (CCS) by Scott Emmons
What Discovering Latent Knowledge Did and Did Not Find by Fabien Roger

As well as the following papers:^[2]

Still No Lie Detector for Language Models by Levinstein and Herrmann
The Geometry of Truth: Emergent Linear Structure in Larger Language Model Representations of True/False Datasets by Sam Marks and Max Tegmark
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? by Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas
Challenges with unsupervised LLM knowledge discovery by Farquhar et al., which also has a companion AF piece.

So it seems a pity that no one has provided a review for this post. This is my attempt to fix that.

Unfortunately, this review has ballooned into a much longer post. To make it more digestible, I’ve divided it into sections:

The CCS paper and follow-up work
The post itself

Overall, I give this post a high but not maximally high rating. I think that the paper made significant contributions, albeit with important caveats and limitations. While I also have some quibbles with the post, I think the post does a good job of situating the paper and the general research direction in an alignment scheme. Collin Burns also deserves significant credit for pioneering the research direction in general; many people at the time (including myself) were decently surprised by the positive results.

The Paper and Follow-up Work

The headline method, Contrast-Consistent Search (CCS), learns a linear probe^[3] that predicts the probabilities of a binary label, without supervised data.^[4] CCS does this by first generating “contrast pairs” of statements with positive and negative answers,^[5] and then maximizes the consistency and confidence of the probabilities for each pair. It then combines the predictions on the positive/negative answers to get a number that corresponds to either the probability of the “true”, correct answer or the “false”, incorrect answer. To evaluate this method, the authors consider whether assigning this classifier to “true” or “false” leads to higher test accuracy, and then pick the higher-accuracy assignment.^[6] They show that this lets them outperform directly querying the model by ~4% on average across a selection of datasets, and can work in cases where the prompt is misleading.

Important takeaways

Here’s some of my thoughts on some useful updates I made as a result of this paper. You can also think of this as the “strengths” section, though I only talk about things in the paper that seem true to me and don’t e.g. praise the authors for having very clear writing and for being very good at conveying and promoting their ideas.

Linear probes are pretty good for recovering salient features, and some truth-like feature is salient on many simple datasets. CCS only uses linear probes, and achieves good performance. This suggests that some feature akin to “truthiness” is represented linearly inside of the model for these contrast pairs, a result that seems somewhat borne out in follow-up work, albeit with major caveats.

I also think that this result is consistent with other results across a large variety of papers. For example, Been Kim and collaborators were using linear probes to do interp on image models as early as 2017, while attaching linear classification heads to various segments of models has been a tradition since before I started following ML research (i.e. before ~late 2015). In terms of more recent work, we’ve not only seen work on toy-ish models showing that small transformers trained on Othello and Chess have linear world representations, but we’ve seen that simple techniques for finding linear directions can often be used to successfully steer language models to some extent. For a better summary of these results, consider reading the Representation Engineering and Linear Representation Hypothesis papers, as well as the 2020 Circuits thread on “Features as Directions”.

Simple empirical techniques can make progress. The CCS paper takes a relatively straightforward idea, and executes on it well in an empirical setting. I endorse this strategy and think more people in AIS (but not the ML community in general) should do things like this.

Around late 2022, I became significantly more convinced that the basic machine learning technique of “try the simplest thing”. The CCS work was a significant reason for this, because I expected it to fail and was pleasantly surprised by the positive results. I also think that I’ve updated upwards on the fact that executing “try the simplest thing” well is surprisingly difficult. I think that even in cases where the “obvious” thing is probably doomed to fail, it’s worth having someone try it anyway, because 1) you can be wrong, and more importantly 2) the way in which it fails can be informative. See also, obvious advice by Nate Soares and this tweet by Sam Altman.

It’s worth studying empirical, average-case ELK. In particular, I think that it’s worth “doing the obvious thing” when it comes to ELK. My personal guess is that (worst-case) ELK is really hard, and that simple linear probes are unlikely to work for it because there’s not really an actual “truth” vector represented by LLMs. However, there’s still a chance that it might nonetheless work in the average case. (See also, this discussion of empirical generalization.) I think ultimately, this is a quantitative question that needs to be resolved empirically – to what extent can we find linear directions inside LLMs that correspond very well with truth? What does the geometry of the LLM activation space actually look like?

Caveating claims from the CCS Paper

While I hold an overall positive view of the paper, I do think that some of the claims in the paper have either not held up over time, or are easily understood to mean something false. I’ll talk about some of my quibbles below.

The CCS algorithm as stated does not seem to reliably recover a robust “truth” direction. The first and biggest problem is that CCS does not perform that well at its intended goal. Notably, CCS classifiers may pick up on other prominent features instead of the intended “truth” direction (even sometimes on contrast pairs that only differ in the label!). Some results from the Still No Lie Detector for Language Models suggest that this may be because CCS is representing which labels are positive versus negative (i.e. the normalization in the CCS paper does not always work to remove this information). Note that Collin does discuss this as a possible issue (for future language models) in the blog post, but this is not discussed in the paper.

In addition, all of the Geometry of Truth paper, Fabien’s results on CCS, and the Challenges with unsupervised knowledged discovery paper note that CCS tends to be very dependent on the prompt for generative LMs.

Does CCS work for the reasons the authors claim it does? From reading the introduction and the abstract, one might get the impression that the key insight is that truth satisfies a particular consistency condition, and thus that the consistency loss proposed in the paper is driving a lot of the results.

However, I’m reasonably confident that it’s the contrast pairs that are driving CCS’s performance. For example, the Challenges with unsupervised knowledged discovery paper found that CCS does not outperform other unsupervised clustering methods. And as Scott Emmons notes, this is supported by section 3.3.3, where two other unsupervised clustering methods are competitive with CCS. Both the Challenges paper and Scott Emmon’s post also argue that CCS’s consistency loss is not particularly different from normal unsupervised learning losses. On the other hand, there’s significant circumstantial evidence that carefully crafted contrast pairs alone often define meaningful directions, for example in the activation addition literature.

There’s also two proofs in the Challenges paper, showing “that arbitrary features satisfy the consistency structure of [CCS]”, that I have not had time to fully grok and situate. But insofar as you can take this claim is correct, this is further evidence against CCS's empirical performance being driven by its consistency loss.

A few nitpicks on misleading presentation. One complaint I have is that the authors use the test set to decide if higher numbers from their learned classifier correspond to “true” or "false". This is mentioned briefly in a single sentence in section two (“For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set.”) and then one possible solution is discussed in Appendix A but not implemented. Also worth noting that the evaluation methodology used gives an unfair advantage to CCS (as it can never do worse than random chance on the test set, while many of the supervised methods perform worse than random).

This isn’t necessarily a big strike against the paper: there’s only so much time for each project and the CCS paper already contains a substantial amount of content. I do wish that this were more clearly highlighted or discussed, as I think that this weakens the repeated claims that CCS “uses no labels” or “is completely unsupervised”.

My Take

I think that the paper made significant contributions by significantly signal boosting the idea of unsupervised knowledge discovery,^[7] and showed that you can achieve decent performance by contrast pairs and consistency checks. It also has spurred a large amount of follow-up work and interest in the topic.

However, the paper is somewhat misleading in its presentation, and the primary driver of performance seems to be the construction of the contrast pairs and not the loss function. Follow-up work has also found that the results of the paper can be brittle, and suggest that CCS does not necessarily find a singular “truth direction”.

The Post

The post starts out by discussing the setup and motivation of unsupervised knowledge recovery. Suppose you wanted to make a model “honestly” report its beliefs. When the models are sub human or even human level, you can use supervised methods. Once the models are superhuman, these methods probably won’t scale for many reasons. However, if we used unsupervised methods, there might not be a stark difference between human-level and superhuman models, since we’re not relying on human knowledge.

The post then goes into some intuitions for why it might be possible: interp on vision models has found that models seem to learn meaningful features, and models seem to linearly represent many human-interpretable features.

Then, Collin talks about the key takeaways from his paper, and also lists many caveats and limitations of his results.

Finally, the post concludes by explaining the challenges of applying CCS-style knowledge discovery techniques to powerful future LMs, as well as why Collin thinks that these unsupervised techniques may scale.

A minor nitpick:

Collin says:

I think this is surprising because before this it wasn’t clear to me whether it should even be possible to classify examples as true or false from unlabeled LM representations *better than random chance*!

As discussed above, the methodology in the paper guarantees that any classifier will do at least as good at random chance. I’m not actually sure how much worse if you orient the classifier on the training set as opposed to the test set. (And I'd be excited for someone to check this!)

My Take

I think this blog post is quite good, and I wish more academic-adjacent ML people would write blog posts caveating their results and discussing where they fit in. I especially appreciated the section on what the paper does and does not show, which I think accurately represents the evidence presented in the paper (as opposed to overhyping or downplaying it). In addition, I think Collin makes a strong case for studying more unsupervised approaches to alignment.

I also strongly agree with Collin that it can be “extremely valuable to sketch out what a full solution could plausibly look like given [your] current model of how deep learning systems work”, and wish more people would do this.

—

Acknowledgments

Thanks to Aryan Bhatt, Stephen Casper, Adria-Garriga Alonso, David Rein, and others for helpful conversations on this topic. Thanks for Raymond Arnold for poking me into writing this.

(EDIT Jan 15th: added my footnotes back in to the comment, which were lost during the copy/paste I did from Google Docs.)

^{^}
This originally was supposed to link to a list of projects I'd be excited for people to do in this area, but I ran out of time before the LW review deadline.
^{^}
I also draw on evidence from many, many other papers in related areas, which unfortunately I do not have time to list fully. A lot of my intuitions come from work on steering LLMs with activation additions, conversations with Redwood researchers on various kinds of coup or deception probes, and linear probing in general.
^{^}
That is, a direction with an additional bias term.
^{^}
Unlike other linear probing techniques, CCS does this without needing to know if the positive or negative answer is correct. However, some amount of labeled data is needed later to turn this pair into a classifier for truth/falsehood.
^{^}
I’ll use “Positive” and “Negative” for the sake of simplicity in this review, though in the actual paper they also consider “Yes” and “No” as well as different labels for news classification and story completion.
^{^}
Note that this gives their method a small advantage in their evaluation, since CCS gets to fit a binary parameter on the test set while the other methods do not. Using a fairer baseline does significantly negatively affect their headline numbers, but I think that the results generally hold up anyways. (I haven’t actually written any code for this review, so I’m trusting the reports of collaborators who have looked into it, as well as Fabien’s results that use randomly initialized directions with the CCS evaluation methodology instead of trained CCS directions.)

Also, while writing this review, I originally thought that this issue was addressed in section 3.3.3, but that only addresses fitting the classifiers with fewer contrast pairs, as opposed to deciding whether the combined classifier corresponds to correct or incorrect answers. After spending ~5 minutes staring at the numbers and thinking that they didn’t make sense, I realized my mistake. Ugh.
^{^}
This originally said “by introducing the idea”, but some people who reviewed the post convinced me otherwise. It’s definitely an introduction for many academics, however.
^{^}
The authors call this a “direction” in their abstract, but CCS really learns two directions. This is a nitpick.

[-]ryan_greenblatt2y1820

Beyond the paper and post, I think it seems important to note the community reaction to this work. I think many people dramatically overrated the empirical results in this work due to a combination of misunderstanding what was actually done, misunderstanding why the method worked (which follow up work helped to clarify as you noted), and incorrectly predicting the method would work in many cases where it doesn't.

The actual conceptual ideas discussed in the blog post seem quite good and somewhat original (this is certainly the best presentation of these sort of ideas in this space at the time it came out). But, I think the conceptual ideas got vastly more traction than otherwise due to people having a very misleadingly favorable impression of the empirical results. I might elaborate more on takeaways related to this in a follow up post.

[-]Scott Emmons2y74

I speculate that at least three factors made CCS viral:

It was published shortly after the Eliciting Latent Knowledge (ELK) report. At that time, ELK was not only exciting, but new and exciting.
It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
CCS mathematizes "truth" and explains it clearly. It would be really nice if the project of human rationality also helped with the alignment problem. So, CCS is an idea that people want to see work.

[-]ryan_greenblatt2y64

[Minor terminology point, unimportant]

It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.

FWIW, I personally wouldn't describe this as interpretability research, I would instead call this "model internals research" or something. Like the research doesn't necessarily involve any human understanding anything about the model more than what they would understand from training a probe to classify true/false.

[-]LawrenceC2y30

I agree that people dramatically overrated the empirical results of this work, but not more so than other pieces that "went viral" in this community. I'd be excited to see your takes on this general phenomenon as well as how we might address it in the future.

[-]ryan_greenblatt2y72

I agree not more than other pieces that "went viral", but I think that the lasting impact of the misconceptions seem much larger in the case of CCS. This is probably due to the conceptual ideas actually holding up in the case of CCS.

[-]paulfchristiano3y345

Thanks for writing, I mostly agree. I particularly like the point that it's exciting to study methods for which "human level" vs "subhuman level" isn't an important distinction. One of my main reservations is that this distinction can be important for language models because the pre-training distribution is at human level (as you acknowledge).

I mostly agree with your assessment of difficulties and am most concerned about worry 2, especially once we no longer have a pre-training distribution anchoring their beliefs to human utterances. So I'm particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences. I agree with you that "other truth-like features" don't seem like a big problem if you are indeed able to find a super-simple representation of truth (e.g. linear probes seem way simpler than "answers calculated to look like the truth," and anything short of that seems easily ruled out by more stringent consistency checks).

I think the main reason unsupervised methods haven’t been seriously considered within alignment so far, as far as I can tell, is because of tractability concerns. It naively seems kind of impossible to get models to (say) be honest or truthful without any human supervision at all; what would such a method even look like?
To me, one of the main contributions of our paper is to show that this intuition is basically incorrect and to show that unsupervised methods can be surprisingly effective.

I think "this intuition is basically incorrect" is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren't more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.

For example, if truth is represented linearly, then "answer questions honestly" is much simpler than almost any of the "cheating" strategies discussed in Eliciting Latent Knowledge, and I think we are in agreement that under these conditions it will be relatively easy to extract the model's true beliefs. So I feel like the core question is really about how simply truth is represented within a model, rather than about supervised vs unsupervised methods.

I agree with your point that unsupervised methods can provide more scalable evidence of alignment. I think the use of unsupervised methods is mostly helpful for validation; the method would be strictly more likely to work if you also throw in whatever labels you have, but if you end up needing the labels in order to get good performance then you should probably assume you are overfitting. That said, I'm not really sure if it's better to use consistency to train + labels as validation, or labels to train + consistency as validation, or something else altogether. Seems good to explore all options, but this is hopefully helps explain why I find the claim in the intro kind of overstated.

Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False

If a model actually understands things humans don't, I have a much less of a clear picture for why natural language claims about the world would be represented in a super simple way. I agree with your claim 1, and I even agree with claim 2 if you interpret "represent" appropriately, but I think the key question is how simple it is to decode that representation relative to "use your knowledge to give the answer that minimizes the loss." The core empirical hypothesis is that "report the truth" is simpler than "minimize loss," and I didn't find the analysis in this section super convincing on this point.

But I do agree strongly that this hypothesis has a good chance of being true (I think better than even odds), at least for some time past human level, and a key priority for AI alignment is testing that hypothesis. My personal sense is that if you look at what would actually have to happen for all of the approaches in this section to fail, it just seems kind of crazy. So focusing on those failures is more of a subtle methodological decision and it makes sense to instead cross that bride if we come to it.

I liked your paper in part as a test of this hypothesis, and I'm very excited about future work that goes further.

That said, I think I'm a bit more tentative about the interpretation of the results than you seem to be in this post. I think it's pretty unsurprising to compete with zero-shot, i.e. it's unsurprising that there would be cleanly represented features very similar to what the model will output. That makes the interpretation of the test a lot more confusing to me, and also means we need to focus more on outperforming zero shot.

For outperforming zero shot I'd summarize your quantitative results as CCS as covering about half of the gap from zero-shot to supervised logistic regression. If LR was really just the "honest answers" then this would seem like a negative result, but LR likely teaches the model new things about the task definition and so it's much less clear how to interpret this. On the other hand, LR also requires representations to be linear and so doesn't give much evidence about whether truth is indeed represented linearly.

I agree with you that there's a lot of room to improve on this method, but I think that the ultimately the core questions are quantitative and as a result quantitative concerns about the method aren't merely indicators that something needs to be improved but also affect whether you've gotten strong evidence for the core empirical conjecture. (Though I do think that your conjecture is more likely than not for subhuman models trained on human text.)

[-]ryan_greenblatt3y114

I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).

One reason I feel sad about depending on incidental properties is that it likely implies the solution isn't robust enough to optimize against. This is a key desiderata in an ELK solution. I imagine this optimization would typically come from 2 sources:

Directly trying to train against the ELK outputs (IMO quite important)
The AI gaming the ELK solution by manipulating it's thoughts (IMO not very important, but it depends on exactly how non-robust the solution is)

That's not to say non-robust ELK approaches aren't useful. So as long as you never apply too many bits of optimization against the approach it should remain a useful test for other techniques.

It also seems plausible that work like this could eventually lead to a robust solution.

[-]scasper3y64

Nice post. I'll nitpick one thing.

In the paper, the approach was based on training a linear probe to differentiate between true and untrue question answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" answers, this is easy and might be plenty useful in general. I wouldn't expect it to be tractable though in practice to reliably get good answers to open-ended questions using a set of boolean ones in this way. Supervision seems useful too because it seems to offer a more general tool.

[-]Raemon3y41

Curated.

I liked the high-level strategic frame in the methodology section. I do sure wish we weren't pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.

I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been giving to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, but I had to dig into the appendix of the paper and ask a friend for help to understand what "given a set of yes/no questions, answer both yes and no" meant in a mechanistic sense)

It seems like most of the rest of the article doesn't really depend on whether the current experiment made sense, (with the current experiment just being kinda a proof-of-concept that you could check AI's beliefs at all). But a lot of the authors intuitions of what it should be possible do feel at least reasonably promising to me. I don't know that this approach will ultimately work, but it seemed like a solid research direction.

[-]Fabien Roger3y40

It's exciting to see a new research direction which could have big implications if it works!

I think that Hypothesis 1 is overly optimistic:

Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
[...]
[...] 1024 remaining perspectives to distinguish between

A few thousand of features is the optimistic number of truth-like features. I argue below that it's possible and likely that there are 2^hundredths of truth-like features in LLMs.

Why it's possible to have 2^hundredths of truth-like features
Let's say that your dataset of activation is composed of d-dimensional one hot vectors and their element-wise opposites. Each of these represent a "fact", and negating a fact gives you the opposite vector. Then any features in is truth-like (up to a scaling constant): for each "fact" $x$ (a one hot vector multiplied by -1 or 1), $< d, x >\in {- 1, 1}$ , and for its opposite fact $^x$ , $< d, x >= - < d,^x >$ . This gives you $2^{d}$ features which are all truth-like.

Why it's likely that there are 2^hundredths of truth-like features in real LLMs
I think the encoding described above is unlikely. But in a real network, you might expect the network to encode groups of facts like "facts that Democrat believe but not Republicans", "climate change is real vs climate change is fake", ... When late in training it finds out ways to use "the truth", it doesn't need to build a new "truth-circuit" from scratch, it can just select the right combination of groups of facts.

(An additional reason for concern is that in practice you find "approximate truth-like directions", and there can be much more approximate truth-like directions than truth-like directions.)

Even if hypothesis 1 is wrong, there might be ways to salvage the research direction. Thousands of bits of information would be able to distinguish between the 2^thousands truth-like features.

[-]Fabien Roger6mo20

I still think this is a serious concern that prevents us from finding "the truth predictor", but here is a way around it:

is_it_bad(x) = max_predictor consistency(predictor) subject to the constraint that predictor(x) ≠ predictor(obvious good inputs)

For x such that all coherent (and high salience) views predict it is good, then it won't be possible to find a consistent and salient predictor that predicts that x has a different label from obvious good inputs and thus is_it_bad(x) will be low;
For x that the model thinks is bad (regardless of what other coherent views think, and in particular coherent human simulators!), then is_it_bad(x) will be high!
For x such that some coherent view (e.g. the human simulator) thinks is bad, then is_it_bad(x) will be high.

The proposed is_it_bad has no false negatives. It has false positives (in situations where the human simulator predicts the input is bad but the model knows it's fine), but I think this is might fine for many applications (as long as we correctly use some prior/saliency/simplicity properties to exclude coherent predictors that are not interesting without excluding the ground truth). In particular, it might be outer-alignment safe-ish to maximize for outcome(x) - large_constant * is_it_bad(x).

[-]Fabien Roger4mo20

Clarification: the reason why this bypasses the concern is that you can compute the max with training instead of enumeration.

[-]Daniel Kokotajlo3y33

Thanks so much for writing this! I am honestly quite heartened by the "Scaling to GPT-n" section, it seems plausible & is updating me towards optimism.

[-]Charlie Steiner3y31

How many major tweaks did you have to make before this worked? 0? 2? 10? Curious what your process was like in general.

[-]Collin3y100

There were a number of iterations with major tweaks. It went something like:

I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible.
I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, which led to what we refer to in the paper as CRC-TPC.
That method had some issues, but we also found that there was also low-hanging fruit in the sense that a good direction often appeared in one of the top 2 principal components (instead of just the top one). It also seemed kind of weird to really care about high-variance directions even when variance isn't necessarily functionally meaningful (since you can rescale subsequent layers).
This led to CRC-BSS, which is scale-invariant. This worked better (a bit more reliable, seemed to work well in cases where the good direction was in the top 2 principal components, etc.). But it was still based on the original intuition of clustering.
I started developing the intuition that "old school" or "geometric" unsupervised methods -- like clustering -- can be decent but that they're not really the right way to think about things relative to a more "functional" deep learning perspective. I also thought we should be able to do something similar without explicitly relying on linear structure in the representations, and eventually started thinking about my interpretation of what CRC is doing as finding a direction satisfying consistency properties. After another round of experimentation with the method, this finally led to CCS.

Each stage required a number of iterations to get various details right (and even then, I'm pretty sure I could continue to improve things with more iterations like that, but decided that's not really the point of the paper or my comparative advantage).

In general I do a lot of back and forth between thinking conceptually about the problem for long periods of time to develop intuitions (I'm extremely intuitions-driven) and periods where I focus on experiments that were inspired by those intuitions.

I feel like I have more to say on this topic, so maybe I'll write a future post about it with more details, but I'll leave it at that for now. Hopefully this is helpful.

[-]Charlie Steiner3y10

Just what I wanted :D

[-]Ansh Radhakrishnan3y20

I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve in favor of things like developing better scalable oversight techniques (which I am also in favor of, to be clear).

I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average and not worst-case).

One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.^[2] This would allow us to identify the model’s beliefs from among these two options.

It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn't feel all that worst-case to me, but maybe we disagree on that point.

[-]Raemon3y10

I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so copying the methods section over here. (Not sure how much the text equations copied over)

2.2 METHOD: CONTRAST-CONSISTENT SEARCH
To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation space that is consistent across negations. As we illustrate in Figure 1, CCS works by (1) answering each question qi as both “Yes” (x + i ) and “No” (x − i ), (2) computing the representations φ(x + i ) and φ(x − i ) of each answer, (3) mapping the answer representations to probabilities p + i and p − i of being true, then (4) optimizing that mapping so that the probabilities are both consistent and confident.
Concretely, the input to CCS is a set of Yes-No questions, q1, . . . , qn, and access to a pretrained model’s representations, φ(·); the output of CCS is a lightweight probe on top of φ(·) that can answer new questions. Here, φ(·) is fixed but should contain useful information about the answers to q1, . . . , qn, in the sense that if one did (hypothetically) have access to the ground-truth labels for q1, . . . , qn, one would be able to train a small supervised probe on φ(·) that attains high accuracy. Importantly, CCS does not modify the weights of the pretrained model and it does not use labels.
Constructing contrast pairs. An important property that truth satisfies is negation consistency: the answer to a clear-cut question cannot be both “Yes” and “No” at the same time, as these are negations of each other. Probabilistically, for each question qi , the probability that the answer to qi is “Yes” should be one minus the probability that the answer to qi is “No”. To use this property, we begin by constructing contrast pairs: for each question qi , we answer qi both as “Yes”, resulting in the new natural language statement x + i , and as “No”, resulting in the natural language statement x − i . We illustrate this in Figure 1 (left). We will then learn to classify x + i and x − i as true or false; if x + i is true, then the answer to qi should be “Yes”, and if x − i is true, then the answer to qi should be “No”.
In practice, we convert each task into a question-answering task with two possible labels, then we use task-specific zero-shot prompts to format questions and answers as strings to construct each contrast pair. The opposite labels we use to construct contrast pairs can be “Yes” and “No” for a generic task, or they can be other tasks-specific labels, such as “Positive” and “Negative” in the case of sentiment classification. We describe the exact prompts we use to for each task in Appendix B.
Feature extraction and normalization. Given a contrast pair (x + i , x− i ), CCS first computes the representations φ(x + i ) and φ(x − i ) using the feature extractor φ(·). Intuitively, there are two salient differences between φ(x + i ) and φ(x − i ): (1) x + i ends with “Yes” while x − i ends with “No”, and (2) one of x + i or x − i is true while the other is false. We want to find (2) rather than (1), so we first try to remove the effect of (1) by normalizing {φ(x + i )} and {φ(x − i )} independently. In particular, we construct normalized representations φ˜(x) as follows:
where (µ +, σ+) and (µ −, σ−) are the means and standard deviations of {φ(x + i )} n i=1 and {φ(x − i )} n i=1 respectively, and where all operations are element-wise along each dimension.2 This normalization ensures that {φ˜(x + i )} and {φ˜(x − i )} no longer form two separate clusters.
Mapping activations to probabilities. Next, we learn a probe pθ,b(φ˜) that maps a (normalized) hidden state φ˜(x) to a number between 0 and 1 representing the probability that the statement x is true. We use a linear projection followed by a sigmoid σ(·), i.e. pθ,b(φ˜) = σ(θ T φ˜+ b), but nonlinear projections can also work. For simplicity, we sometimes omit the θ, b subscript in p.
Training objective. To find features that represent the truth, we leverage the consistency structure of truth. First, we use the fact that a statement and its negation should have probabilities that add up to 1. This motivates the consistency loss:
However, this objective alone has a degenerate solution: p(x +) = p(x −) = 0.5. To avoid this problem, we encourage the model to also be confident with the following confidence loss:
We can equivalently interpret Lconfidence as imposing a second consistency property on the probabilities: the law of excluded middle (every statement must be either true or false). The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:
Note that both losses are necessary; Lconfidence alone also has a degenerate solution.
Inference. Both p(x + i ) and 1 − p(x − i ) should represent the probability that the answer to qi is “Yes”. However, because we use a soft consistency constraint, these may not be exactly equal. To make a prediction on an example xi after training, we consequently take the average of these:
We then predict that the answer to qi is “Yes” based on whether p˜(qi) is greater than 0.5. Technically, we also need to determine whether p˜(qi) > 0.5 corresponds to “Yes” or “No,” as this isn’t specified by LCCS. For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set. However, in Appendix A we describe how one can identify the two clusters without any supervision in principle by leveraging conjunctions.

[-]Raemon3y10

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself.

Hmm, I went to twitter to see if it had more detail, but found it to be more like "a shorter version of this overall post" rather than "more detail on the implementation details of the paper." But, here's a copy of it here for ease-of-reading:

How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (http://arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*.
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.
This may be possible to do because truth satisfies special structure: unlike most features in a model, it is *logically consistent*
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.
Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.

^{^}

Whenever I talk about “truthfulness” or “honesty” in models, I don’t have any strong philosophical commitments to what these mean precisely. But I am, for example, more or less happy with the definition in ELK if you want something concrete. That said, I ultimately just care about the question pragmatically: would it basically be fine if we acted as thorough those outputs are true? Moreover, I tend to prefer the term “honest” over “truthful” because I think it has the right connotation for the approaches I am most excited about: I want to recover what models “know” or “believe” internally, rather than needing to explicitly specifying some external ground truth.

^{^}

Beth Barnes originally proposed this idea.

^{^}

I could also imagine there being ways of training or prompting GPT-n so it doesn’t represent this as naturally in the first place but still represents its beliefs.

^{^}

I think this would probably be a simulation of a “non-adapative” misaligned system, in the sense that it would not be "aware” of this alignment proposal, because of how we extract it from a feature that is used by GPT-n independent of this proposal.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

93

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

93

Introduction/Overview

The Paper and Follow-up Work

Important takeaways

Caveating claims from the CCS Paper

My Take

The Post

My Take

Acknowledgments

Introduction

Problem

Methodology

Intuitions

Our Paper

Scaling To GPT-n

Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False

Why I Think We Will Be Able To Distinguish GPT-n’s “Beliefs” From Other Truth-Like Features

Conclusion