Raymond Arnold

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.


The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

I think I disagree with this characterization. A) we totally have robot cars by now, B) I think mostly what we don't have are AI running systems where the consequence of failure is super high (which maybe happens to be more true for the physical world, but I'd expect to also be true for critical systems in the digital world)

I've been trying to articulate some thoughts since Rohin's original comment, and am maybe going to just rant something out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time, for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the deep conceptual disagreements for many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemically wrong or confused or sketchy about LW" and "I don't trust a review process to actually work because I don't believe it'll get better epistemics than what has already been demonstrated" seems like a combination of "self-defeatingly wrong" and "also just empirically (probably) wrong".

Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they're frustrated by. 

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful." Assuming that's a correct characterization, I don't necessarily disagree (at least not confidently). But something about the phrasing feels off.

Some reasons it feels off:

  • Even if there are clusters of research that seem too hopeless to be worth engaging with, I'd be very surprised if there weren't at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is "people write reviews of the stuff that feels real/important enough to be worth engaging with", that still seems valuable to me.
  • It seems like people are sort of treating this like a stag-hunt, and it's not worth participating if a bunch of other effort isn't going in. I do think there are network effects that make it more valuable as more people participate. But I also think "people incrementally do more review work each year as it builds momentum" is pretty realistic, and I think individual thoughtful reviews are useful in isolation for building clarity on individual posts.
  • The LessWrong/Alignment Review process is pretty unopinionated at the moment. If you think a particular type of review is more valuable than other types, there's nothing stopping you from doing that type of review.
  • If the highest review-voted work is controversial, I think it's useful for the field orienting to know that it's controversial. It feels pretty reasonable to me to publish an Alignment Forum Journal-ish-thing that includes the top-voted content, with short reviews from other researchers saying "FYI I disagree conceptually here about this post being a good intellectual output"
    • (or, stepping out of the LW-review frame: if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process)
  • I'm skeptical that the actual top-voted posts trigger this reaction. At the time of this post, the top voted posts were:

I do think a proper alignment review should likely have more content that wasn't published on the Alignment Forum. This was technically available this year (we allowed people to submit non-LW content during the nomination phase), but we didn't promote it very heavily and we didn't frame it as a "please submit all Alignment progress you think was particularly noteworthy" to various researchers.

I don't know that the current review process is great, but, again, it's fairly unopinionated and leaves plenty of room to be-the-change-you-want-to-see in the alignment scene meta-reflection.

(aside: I apologize for picking on Rohin and Buck when they bothered to stick their necks out and comment; presumably there are other people who feel similarly who didn't even bother commenting. I appreciate you sharing your take, and if this feels like dragging you into something you don't wanna deal with, no worries. But I think having concrete people/examples is helpful. I also think a lot of what I'm saying applies to people I'd characterize as "in the MIRI camp", who also haven't done much reviewing, although I'd frame my response a bit differently.)

I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT3 in the month prior to chatGPT, I felt like the two interfaces were approximately as convenient)

I don't know what mechanism was used to generate the longer coherence though.

I liked the point about "the reason GPT3 isn't consequentialist is that it doesn't find its way to the same configuration when you perturb the starting conditions." I think I could have generated that definition of consequentialism, but would have had trouble making the connection on-the-fly. (At least, I didn't successfully generate it in between reading Scott's confusion and Eliezer's explanation.)

I feel like I now get it more crisply.

Not really the main point, but, I would bet:

a) something pretty close to Minecraft will be an important testing ground for some kinds of alignment work.

b) Minecraft itself will probably get a lot of use in AI research as things advance (largely due to being one of the most popular videogames of all time), whether or not it's actually quite the right test-bed. (I think the right test-bed will probably be optimized more directly for ease-of-training).

I think it might be worth Eliezer playing a minecraft LAN party with some friends* for a weekend, so that the "what is minecraft?" question has a more true answer than the cobbled-together intuitions here, if for no other reason than having a clear handle on what people are talking about when they use Minecraft as an example. (But, to be fair, if my prediction bears out it'll be pretty easy to play Minecraft for a weekend later.)

*the "with friends" part is extremely loadbearing. Solo minecraft is a different experience. Minecraft is interesting to me for basically being "real life, but lower resolution". If I got uploaded into Minecraft and trapped there forever I'd be sad to be missing some great things, but I think I'd have at least a weak form of most core human experiences, and this requires having other people around.

Minecraft is barely a "game". There is a rough "ascend tech tree and kill cooler monsters" that sort of maps onto Factorio + Skyrim, but the most interesting bits are:

  • build interesting buildings/structures out of legos
    • this begins with "make an interesting house", or a sculpture, but then proceeds to "construct automated factory farms", "figure out ways to hack together flying machines that the minecraft physics engine technically allows but didn't intend", "make music", "build computers that can literally run minecraft". The game getting played here is basically the same thing real life society is playing (i.e. do ever-more-impressive things to keep from getting bored and signal your ally-able and mate-able status, etc)
  • figure out what resources you need to build the structures you are interested in
  • build logistical infrastructure and transportation
  • figure out how to trade with other players so they can get the tools they need to either build interesting structures or go on monster-killing-adventures 
  • invent games to play in minecraft (i.e. capture the flag, parkour racing, etc)
  • if you're in a PvP server, figure out how to fight against and protect yourself from other players, who are intelligent adversaries who are looking for ways to exploit the game.
  • Often involves obeying vague social norms that accumulate in the game. It's typically cool to take some stuff from your neighbor's chest, but, not all of it. Sometimes there is mixed PvP where, like, it's okay to sometimes gank someone and take their stuff, but, not all the time.

Training an AI to actually do useful things in this context seems like it requires grappling with some things that don't normally come up in games.

I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted. This was a couple years ago and I assume the work didn't output anything particularly impressive, but I do think some variant of "do useful things without having the rest of the players vote to ban your bot from the game" gets at something alignment-relevant.

I do think most ways people will go about this will be RLHF-like, and I don't expect them to scale to superintelligence or to be that useful for directly building a pivotal-act-capable AGI.

Okay, no, I think I see the problem, which is that I'm failing to consider that evolutionary-learning and childhood-learning are happening at different times through different algorithms, whereas for AIs they're both happening in the same step by the same algorithm.

Is it actually the case that they're happening "in the same step" for the AI? 

I agree with "the thing going on in AI is quite different from the collective learning going on in evolutionary-learning and childhood learning", and I think trying to reason from analogy here is probably generally not that useful. But my sense is that if I were going to map the "evolutionary learning" bit to most ML stuff, the evolutionary bit is more like "the part where the engineers designed a new architecture / base network". On one hand, engineers are much smarter than evolution, but on the other hand they haven't had millions of years to do it.

Facile answer: Why, that's just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.

On one hand, I've heard a few things about blank-slate experiments that didn't work out, and I do lean towards "they basically don't work". But I... also bet not that many serious attempts actually happened, and that the people attempting them kinda sucked in obvious ways, and that you could do a lot better than however "well" the Soviets did.


I liked the high-level strategic frame in the methodology section. I do sure wish we weren't pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.

I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been given to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, but I had to dig into the appendix of the paper and ask a friend for help to understand what "given a set of yes/no questions, answer both yes and no" meant in a mechanistic sense.)

It seems like most of the rest of the article doesn't really depend on whether the current experiment made sense (with the current experiment just being kinda a proof-of-concept that you could check an AI's beliefs at all). But a lot of the authors' intuitions about what should be possible do feel at least reasonably promising to me. I don't know that this approach will ultimately work, but it seemed like a solid research direction.

I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so I'm copying the methods section over here. (Not sure how well the text equations copied over.)


To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation space that is consistent across negations. As we illustrate in Figure 1, CCS works by (1) answering each question $q_i$ as both “Yes” ($x_i^+$) and “No” ($x_i^-$), (2) computing the representations $\phi(x_i^+)$ and $\phi(x_i^-)$ of each answer, (3) mapping the answer representations to probabilities $p_i^+$ and $p_i^-$ of being true, then (4) optimizing that mapping so that the probabilities are both consistent and confident.

Concretely, the input to CCS is a set of Yes-No questions, $q_1, \ldots, q_n$, and access to a pretrained model’s representations, $\phi(\cdot)$; the output of CCS is a lightweight probe on top of $\phi(\cdot)$ that can answer new questions. Here, $\phi(\cdot)$ is fixed but should contain useful information about the answers to $q_1, \ldots, q_n$, in the sense that if one did (hypothetically) have access to the ground-truth labels for $q_1, \ldots, q_n$, one would be able to train a small supervised probe on $\phi(\cdot)$ that attains high accuracy. Importantly, CCS does not modify the weights of the pretrained model and it does not use labels.

Constructing contrast pairs. An important property that truth satisfies is negation consistency: the answer to a clear-cut question cannot be both “Yes” and “No” at the same time, as these are negations of each other. Probabilistically, for each question $q_i$, the probability that the answer to $q_i$ is “Yes” should be one minus the probability that the answer to $q_i$ is “No”. To use this property, we begin by constructing contrast pairs: for each question $q_i$, we answer $q_i$ both as “Yes”, resulting in the new natural language statement $x_i^+$, and as “No”, resulting in the natural language statement $x_i^-$. We illustrate this in Figure 1 (left). We will then learn to classify $x_i^+$ and $x_i^-$ as true or false; if $x_i^+$ is true, then the answer to $q_i$ should be “Yes”, and if $x_i^-$ is true, then the answer to $q_i$ should be “No”.

In practice, we convert each task into a question-answering task with two possible labels, then we use task-specific zero-shot prompts to format questions and answers as strings to construct each contrast pair. The opposite labels we use to construct contrast pairs can be “Yes” and “No” for a generic task, or they can be other task-specific labels, such as “Positive” and “Negative” in the case of sentiment classification. We describe the exact prompts we use for each task in Appendix B.
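Here's a rough sketch of what "constructing a contrast pair" looks like as code, since this was the part I found hardest to picture. The prompt template and the function name are made up by me for illustration; the paper's actual prompts are in its Appendix B and will differ.

```python
def make_contrast_pair(question, labels=("Yes", "No"),
                       template="Q: {question}\nA: {label}"):
    """Answer one question both ways and return (x_plus, x_minus).

    `template` is a hypothetical zero-shot prompt, not one from the paper.
    `labels` can be swapped for task-specific ones, e.g. ("Positive", "Negative").
    """
    positive_label, negative_label = labels
    x_plus = template.format(question=question, label=positive_label)
    x_minus = template.format(question=question, label=negative_label)
    return x_plus, x_minus


x_plus, x_minus = make_contrast_pair("Is the sky blue on a clear day?")
```

The two strings are identical except for the final label, and that is the point: both get fed through the same frozen model, and the probe only ever sees the resulting activations.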

Feature extraction and normalization. Given a contrast pair $(x_i^+, x_i^-)$, CCS first computes the representations $\phi(x_i^+)$ and $\phi(x_i^-)$ using the feature extractor $\phi(\cdot)$. Intuitively, there are two salient differences between $\phi(x_i^+)$ and $\phi(x_i^-)$: (1) $x_i^+$ ends with “Yes” while $x_i^-$ ends with “No”, and (2) one of $x_i^+$ or $x_i^-$ is true while the other is false. We want to find (2) rather than (1), so we first try to remove the effect of (1) by normalizing $\{\phi(x_i^+)\}$ and $\{\phi(x_i^-)\}$ independently. In particular, we construct normalized representations $\tilde{\phi}(x)$ as follows:

$$\tilde{\phi}(x_i^+) := \frac{\phi(x_i^+) - \mu^+}{\sigma^+}, \qquad \tilde{\phi}(x_i^-) := \frac{\phi(x_i^-) - \mu^-}{\sigma^-}$$

where $(\mu^+, \sigma^+)$ and $(\mu^-, \sigma^-)$ are the means and standard deviations of $\{\phi(x_i^+)\}_{i=1}^n$ and $\{\phi(x_i^-)\}_{i=1}^n$ respectively, and where all operations are element-wise along each dimension. This normalization ensures that $\{\tilde{\phi}(x_i^+)\}$ and $\{\tilde{\phi}(x_i^-)\}$ no longer form two separate clusters.
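To check my understanding of the normalization step, here is a minimal NumPy sketch. It assumes the activations for the “Yes” and “No” versions are stacked into two arrays of shape (n, d). The random arrays are toy stand-ins for real model activations, and the small `eps` is my own addition to avoid dividing by zero.

```python
import numpy as np


def normalize(acts, eps=1e-8):
    """Subtract the per-dimension mean and divide by the per-dimension std.

    `acts` has shape (n, d): one row per question, one column per feature.
    """
    mu = acts.mean(axis=0, keepdims=True)
    sigma = acts.std(axis=0, keepdims=True)
    return (acts - mu) / (sigma + eps)


rng = np.random.default_rng(0)
# Toy stand-ins for phi(x_i^+) and phi(x_i^-): two clusters that differ by a
# constant offset, mimicking the "ends with Yes" vs "ends with No" difference.
phi_plus = rng.normal(loc=3.0, size=(100, 8))
phi_minus = rng.normal(loc=-3.0, size=(100, 8))

# Each set is normalized independently, so the constant offset disappears.
phi_plus_n = normalize(phi_plus)
phi_minus_n = normalize(phi_minus)
```

After this, both sets are centered at zero in every dimension, so a probe can't separate them just by noticing which answer token the prompt ended with.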

Mapping activations to probabilities. Next, we learn a probe $p_{\theta,b}(\tilde{\phi})$ that maps a (normalized) hidden state $\tilde{\phi}(x)$ to a number between 0 and 1 representing the probability that the statement $x$ is true. We use a linear projection followed by a sigmoid $\sigma(\cdot)$, i.e. $p_{\theta,b}(\tilde{\phi}) = \sigma(\theta^T \tilde{\phi} + b)$, but nonlinear projections can also work. For simplicity, we sometimes omit the $\theta, b$ subscript in $p$.

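Training objective. To find features that represent the truth, we leverage the consistency structure of truth. First, we use the fact that a statement and its negation should have probabilities that add up to 1. This motivates the consistency loss:

$$L_{\text{consistency}}(\theta, b; q_i) := \left[ p_{\theta,b}(x_i^+) - \left(1 - p_{\theta,b}(x_i^-)\right) \right]^2$$

However, this objective alone has a degenerate solution: $p(x^+) = p(x^-) = 0.5$. To avoid this problem, we encourage the model to also be confident with the following confidence loss:

$$L_{\text{confidence}}(\theta, b; q_i) := \min\left\{ p_{\theta,b}(x_i^+),\; p_{\theta,b}(x_i^-) \right\}^2$$

We can equivalently interpret $L_{\text{confidence}}$ as imposing a second consistency property on the probabilities: the law of excluded middle (every statement must be either true or false). The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:

$$L_{\text{CCS}}(\theta, b) := \frac{1}{n} \sum_{i=1}^{n} \left[ L_{\text{consistency}}(\theta, b; q_i) + L_{\text{confidence}}(\theta, b; q_i) \right]$$

Note that both losses are necessary; $L_{\text{confidence}}$ alone also has a degenerate solution.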
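A small NumPy sketch of the probe and the loss helped me see why both terms are needed. As I understand the paper, the consistency term is the squared gap between p(x+) and 1 − p(x−), and the confidence term is the squared minimum of p(x+) and p(x−). The function names and toy inputs are mine. This only evaluates the loss and doesn't include the optimizer that would fit theta and b.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def probe(theta, b, acts):
    """Linear probe plus sigmoid: maps (n, d) activations to n probabilities."""
    return sigmoid(acts @ theta + b)


def ccs_loss(p_plus, p_minus):
    """Consistency plus confidence, averaged over all contrast pairs."""
    consistency = (p_plus - (1.0 - p_minus)) ** 2
    confidence = np.minimum(p_plus, p_minus) ** 2
    return float(np.mean(consistency + confidence))


# The all-zero probe outputs 0.5 for everything: the degenerate solution
# of the consistency term on its own.
acts = np.ones((4, 3))
p_flat = probe(np.zeros(3), 0.0, acts)

loss_flat = ccs_loss(p_flat, p_flat)                     # consistent, not confident
loss_both_false = ccs_loss(np.zeros(4), np.zeros(4))     # "confident", not consistent
loss_ideal = ccs_loss(np.ones(4), np.zeros(4))           # consistent and confident
```

Saying 0.5 to everything zeroes the consistency term but pays 0.25 on confidence. Calling both statements false zeroes the confidence term but pays 1.0 on consistency. Only an assignment that picks exactly one statement of each pair as true drives the total to zero.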

Inference. Both $p(x_i^+)$ and $1 - p(x_i^-)$ should represent the probability that the answer to $q_i$ is “Yes”. However, because we use a soft consistency constraint, these may not be exactly equal. To make a prediction on an example $x_i$ after training, we consequently take the average of these:

$$\tilde{p}(q_i) := \frac{1}{2} \left( p(x_i^+) + \left(1 - p(x_i^-)\right) \right)$$

We then predict that the answer to $q_i$ is “Yes” based on whether $\tilde{p}(q_i)$ is greater than 0.5. Technically, we also need to determine whether $\tilde{p}(q_i) > 0.5$ corresponds to “Yes” or “No,” as this isn’t specified by $L_{\text{CCS}}$. For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set. However, in Appendix A we describe how one can identify the two clusters without any supervision in principle by leveraging conjunctions.
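The inference step and the "take the maximum accuracy over the two labelings" trick can be sketched like this. The probabilities and labels below are invented toy numbers, and the function names are mine. Note that the evaluation shortcut does peek at ground-truth labels to pick the direction, which is the simplification the authors flag.

```python
import numpy as np


def predict_yes_prob(p_plus, p_minus):
    """Average the two estimates of P(answer is "Yes")."""
    return 0.5 * (p_plus + (1.0 - p_minus))


def best_of_two_accuracy(p_tilde, labels):
    """Accuracy under whichever of the two possible labelings scores higher."""
    preds = (p_tilde > 0.5).astype(int)
    acc = float(np.mean(preds == labels))
    return max(acc, 1.0 - acc)


# Toy probe outputs for four questions.
p_plus = np.array([0.9, 0.2, 0.7, 0.1])
p_minus = np.array([0.2, 0.9, 0.4, 0.8])
# 1 means "Yes". These are deliberately the opposite of what the probe says,
# to show that a probe which learned the flipped direction still scores well.
labels = np.array([0, 1, 0, 1])

p_tilde = predict_yes_prob(p_plus, p_minus)
acc = best_of_two_accuracy(p_tilde, labels)
```

Here the raw predictions disagree with every label, but flipping the direction gives perfect accuracy. That's the sense in which the loss finds a truth-like axis without saying which end of it is "true".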

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself

Hmm, I went to twitter to see if it had more detail, but found it to be more like "a shorter version of this overall post" rather than "more detail on the implementation details of the paper." But here's a copy of it for ease-of-reading:

How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (http://arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 

Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.

We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.

Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.

This may be possible to do because truth satisfies special structure: unlike most features in a model, it is *logically consistent*

We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.


We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.

Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.

Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.

Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)

This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.

However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.
