All of Raemon's Comments + Replies

Fwiw this doesn't feel like a super helpful comment to me. I think there might be a nearby one that's more useful, but this felt kinda coy for the sake of being coy.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." I don't know how causal this post was; I think in some sense the ecosystem was ripe for a Distillation Wave. But it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title.

Basically, good distillation is a highly skilled effort. It's sort of natural to write a distillation of... (read more)


It's still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, beginnings of testing for universality, and discussion of the team's update against architectural approaches. I found this remark at the end interesting:

Finally, we note that in some of these expanded theories of superposition, finding the "correct number of features" may not be well-posed. In others, there is a true n

... (read more)

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) as well as the resulting discussion which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

Curated. I liked both the concrete array of ideas coming from someone who has a fair amount of context, and the sort of background models I got from reading each of said ideas.


I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited about the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.

What background knowledge do you think this requires? If I know a bit about how ML and language models work in general, should I be able to reason this out from first principles (or from following a fairly obvious trail of "look up relevant terms and quickly get up to speed on the domain")? Or does it require some amount of pre-existing ML taste?

Also, do you have a rough sense of how long it took for MATS scholars?

3Neel Nanda7mo
Great questions, thanks! Background: You don't need to know anything beyond "a language model is a stack of matrix multiplications and non-linearities. The input is a series of tokens (words and sub-words) which get converted to vectors by a massive lookup table called the embedding (the vectors are called token embeddings). These vectors have really high cosine sim in GPT-Neo". Re how long it took for scholars, hmm, maybe an hour? Not sure, I expect it varied a ton. I gave this in their first or second week, I think.
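A minimal sketch of the computation the exercise is about — cosine similarity between embedding vectors. The vectors here are made up (a random shared component plus small noise), not real GPT-Neo weights, purely to show what "really high cosine sim" looks like numerically:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embedding table": each row stands in for a token embedding.
# Real GPT-Neo embeddings are much higher-dimensional and come from the
# model's learned lookup table; these are illustrative stand-ins.
rng = np.random.default_rng(0)
shared = rng.normal(size=16)  # a large component shared by all tokens
embeddings = np.stack([shared + 0.1 * rng.normal(size=16) for _ in range(4)])

# Vectors dominated by a shared component have cosine similarity near 1 --
# the anomaly the exercise asks you to explain for GPT-Neo's embeddings.
sims = [cosine_sim(embeddings[0], embeddings[i]) for i in range(1, 4)]
```

The point of the exercise is then to explain *why* the real embeddings share such a dominant common direction.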


I like that this went out and did some 'field work', and is clear about the process, so you can evaluate how compelling you find it. I found the concept of a conflationary alliance pretty helpful.

That said, I don't think the second half of the article argues especially well for a "consciousness conflationary alliance" existing. I did immediately think "oh this seems like a fairly likely thing to exist as soon as it's pointed out" (in particular given some recent discussion on why consciousness is difficult to talk about), but I think if i... (read more)

Curated. This seems like an important set of considerations for alignment researchers to think about.

meta note on tagging:

This post seemed to be on a topic that... surely there should be commonly used LW concept for, but I couldn't think of it. I tagged it "agent foundations" but feel like there should be something more specific.

Maybe "subagents"?

ironically I missed this post when you first posted it

I previously had had a cruder model of "There's an AI capabilities fuel tank and an AI alignment fuel tank. Many research items fill up both at the same time, but in different ratios. If you fill up the capabilities tank before the alignment tank, we lose. You want to pursue strategies that cause the alignment tank to get filled up faster than the capabilities tank." (I got this from Andrew Critch during an in-person conversation.)

I like this post for putting forth a higher resolution model, that prompts me to think a bit more specifically about what downstream effects I expect to happen. (though I think the tank model might still be kinda useful as a fast shorthand sometimes)

Curated. I think this post proposes an interesting mechanism for understanding and controlling LLMs. I have a lot of uncertainty about how useful this will turn out to be, but the idea seems both interesting and promising and I'd like to see more work exploring the area.

I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective:

I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", than publishing a finished paper in a j... (read more)

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think you should be pretty worried about LessWrong's collective epistemics given that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged w... (read more)

note: I tagged this "Infrabayesianism" but wasn't actually sure whether it was or not according to you.

Curated. On one hand, folks sure have spent a long time trying to hash out longstanding disagreements, and I think it's kinda reasonable to not feel like that's a super valuable thing to do more of.

On the other hand... man, sure seems scary to me that we still have so many major disagreements that we haven't been able to resolve.

I think this post does a particularly exemplary job of exploring some subtle disagreements from a procedural level: I like that Holden makes a pretty significant attempt to pass Nate's Ideological Turing Test, flags which parts of ... (read more)

2Ben Pace1y
This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

However, if your post doesn't look like a research article, you might have to format it more like one (and even then it's not guaranteed to get in, see this comment thread).

I interpreted this as saying something superficial about style, rather than "if your post does not represent 100+ hours of research work it's probably not a good fit for arXiv." If that's what you meant, I think the post could be edited to make that more clear.

If the opening section of your essay made it more clear which posts it was talking about I'd probably endorse it (although I'm not super familiar with the nuances of arXiv gatekeeping so am mostly going off the collective response in the comment section)

Yeah, I didn't mean to be responding to that point one way or another. It just seemed bad to be linking to a post that (seems to still?) communicate false things, without flagging those false things. (The post still says "it can be as easy as creating a pdf of your post", which my impression is maybe technically true on rare occasions but basically false in practice?)

This feels like a really adversarial quote. Concretely, the post says: This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren't vanishingly rare.
1David Manheim1y
That seems right.

I thought the response to "Your Posts Should be On Arxiv" was "Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv" (unless you have jumped through a bunch of both effort-hoops and formatting hoops to make them feel like a natural member of the Arxiv-paper class)

I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.
1Arthur Conmy1y
I think this point was really overstated. I get the impression the rejected papers were basically turned into the arXiv format as fast as possible and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by standards of preprint formatting, and were not rejected, apparently
3David Manheim1y
And I think the post here is saying that you should jump through those effort and editing hoops far more often than currently occurs.

Yeah I agree with this.

To be clear, I think Anthropic has done a pretty admirable job of showing some restraint here. It is objectively quite impressive. My wariness is "Man, I think the task here is really hard and even a very admirably executed company may not be sufficient." 

Yeah something in this space seems like a central crux to me.

I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused") that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."

I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")

I think the worldview here seems cogent. It's very good for Anthropic folk to be writing up their organizational-beliefs publicly. I'm pretty sympathetic to "man, we have no idea how to make real progress without empirical iteration, so we just need to figure out how to make empirical iteration work somehow."

I have a few disagreements. I think the most important ones route through "how likely is this to accelerate race dynamics and how bad is that?".

We've subsequently begun deploying Claude now that the gap between it and the public state of the art is sma

... (read more)

I both agree that the race dynamic is concerning (and would like to see Anthropic address it explicitly), and also think that Anthropic should get a fair bit of credit for not releasing Claude before ChatGPT, a thing they could have done and probably gained a lot of investment / hype over. I think Anthropic's "let's not contribute to AI hype" strategy is good in the same way that OpenAI's "let's generate massive hype" strategy is bad.

Like definitely I'm worried about the incentive to stay competitive, especially in the product space. But I think it's worth ... (read more)

Curated. I've been hearing about the concept of the acausal economy for a while and think it's a useful concept, but I don't think I've seen it written up as succinctly/approachably before.

I appreciated the arguments about how simulation is actually pretty expensive, and logical/moral extrapolation is comparatively cheap, and that there are some reasons to expect this to be a fairly central aspect of the acausal economy/society. I've been reading along with Critch's recent series on both boundaries and Löb's Theorem. I'm not sure I actually fully grok... (read more)

Yeah I also felt some vague optimism about that.

This post takes a bunch of concepts I'm familiar with, but... sort of drops me in medias res into the middle of a thought process that seems to take for granted how they all fit together. I think I get it, esp. by the time I get to the middle of the post, but I personally could have used more handholding at the beginning: "here's what counting up and down are, here's what coherence is, here's (briefly) how they fit together." (I realize kinda the whole post is explaining how they fit together, so maybe that's unfair, but, it felt a bit abrupt to me.)

Also "who is the target audience of this post? What research agendas or thought-processes does this fit into?" would have been good.

2Tsvi Benson-Tilsen1y
Thanks. Your comments make sense to me, I think. But, these essays are more like research notes than they are trying to be good exposition, so I'm not necessarily trying to consistently make them accessible. I'll add a note to that effect in future.

I also wanna echo Akash's "this seems like a good document to exist." I appreciate having Sam's worldview laid out more clearly. I'm glad it includes explicit mentions of and some reasonable front-and-centering of x-risk (AFAICT DeepMind hasn't done this, although I might have missed it?).

In some sense these are still "cheap words", but, I do think it's still much better for company leadership to have explicitly stated their worldview and it including x-risk.

I do still obviously disagree with some key beliefs and frames here. From my perspective, focusing ... (read more)

Just to check my sanity, this used to be two posts, and now has been combined into one?

Quick admin note: by default, lines that are bold create a Table of Contents heading (which resulted in the ToC having a whole bunch of spurious [Christiano] and [GA] lines). There's a cute hack to get around this, by inserting a space in italics at the end of the bolded line. I just used my admin powers to add the space-with-italics to all the "[Christiano]", "[GA]", etc, so that the ToC is more readable.


I was a bit unsure whether to tag your posts with Simulator Theory. Do you endorse that or not?

5Evan Hubinger1y
Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.

I'm not sure what order the history happened in and whether "AI Existential Safety" got rebranded into "AI Alignment" (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn't a rebrand)

There's the additional problem where "AI Existential Safety" easily gets rounded to "AI Safety" which often in practice means "self driving cars" as well as overlapping with an existing term-of-art "community safety" which means things like harassment.

I don't have a good contender for a short phrase that is a... (read more)

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV has not been hot in quite a while. Fortunately, now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad, since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not ... (read more)

3David Scott Krueger1y
I say it is a rebrand of the "AI (x-)safety" community. When AI alignment came along we were calling it AI safety, even though it was really basically AI existential safety all along that everyone in the community meant. "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term. I don't think the history is that important; what's important is having good terminology going forward. This is also why I stress that I work on AI existential safety. So I think people should just say what kind of technical work they are doing, and "existential safety" should be considered as a socio-technical problem that motivates a community of researchers, and used to refer to that problem and that community. In particular, I think we are not able to cleanly delineate what is or isn't technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.
4Wei Dai1y
There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after that discussion, switched to using "Intent Alignment" for this instead.

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

I think I disagree with this characterization. A) we totally have robot cars by now; B) I think mostly what we don't have is AI running systems where the consequence of failure is super high (which maybe happens to be more true for the physical world, but I'd expect it to also be true for critical systems in the digital world).

-1Alex Flint1y
Have you personally ever ridden in a robot car that has no safety driver?

I've been trying to articulate some thoughts since Rohin's original comment, and maybe going to just rant-something-out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time, for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the conceptual deep disagreements for many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemic... (read more)

2Rohin Shah1y
Sorry, didn't see this until now (didn't get a notification, since it was a reply to Buck's comment).

In some sense yes, but also, looking at posts I've commented on in the last ~6 months, I have written several technical reviews (and nontechnical reviews). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments review specific claims and arguments within posts. (I would be interested in quantifications of how valuable those reviews are to people other than the post authors. I'd think it is pretty low.)

Yes, but they're usually papers, not LessWrong posts, and I do give feedback to their authors -- it just doesn't happen publicly. (And it would be maybe 10x more work to make it public, because (a) I have to now write the review to be understandable by people with wildly different backgrounds and (b) I would hold myself to a higher standard (imo correctly).) (Indeed if you look at the reviews I linked above one common thread is that they are responding to specific people whose views I have some knowledge of, and the reviews are written with those people in mind as the audience.)

I broadly agree with this and mostly feel like it is the sort of thing that is happening amongst the folks who are working on prosaic alignment.

We already know this though? You have to isolate particular subclusters (Nate/Eliezer, shard theory folks, IDA/debate folks, etc) before it's even plausible to find pieces of work that might be uncontroversial. We don't need to go through a review effort to learn that. (This is different from beliefs / opinions that are uncontroversial; there are lots of those.) (And when I say that they are controversial, I mean that people will disagree significantly on whether it makes progress on alignment, or what the value of the work is; often the work will make technical claims that are uncontroversial. I do think it could be good to highlight which of the technical claims

I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT3 in the month prior to chatGPT, I felt like the two interfaces were approximately as convenient)

I don't know what mechanism was used to generate the longer coherence though.

1Kaj Sotala1y
At least ChatGPT seems to have a longer context window, this experiment suggesting 8192 tokens.
3Paul Christiano1y
I don't think this is related to RLHF.

I liked the point about "the reason GPT3 isn't consequentialist is that it doesn't find its way to the same configuration when you perturb the starting conditions." I think I could have generated that definition of consequentialism, but would have trouble making the connection on-the-fly. (At least, I didn't successfully generate it in between reading Scott's confusion and Eliezer's explanation.)

I feel like I now get it more crisply.

Not really the main point, but, I would bet:

a) something pretty close to Minecraft will be an important testing ground for some kinds of alignment work.

b) Minecraft itself will probably get a lot of use in AI research as things advance (largely due to being one of the most popular videogames of all time), whether or not it's actually quite the right test-bed. (I think the right test-bed will probably be optimized more directly for ease-of-training).

I think it might be worth Eliezer playing a minecraft LAN party with some friends* for a weekend, so that the... (read more)

2Rohin Shah1y
I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post. (Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)

Okay, no, I think I see the problem, which is that I'm failing to consider that evolutionary-learning and childhood-learning are happening at different times through different algorithms, whereas for AIs they're both happening in the same step by the same algorithm.

Is it actually the case that they're happening "in the same step" for the AI? 

I agree with "the thing going on in AI is quite different from the collective learning going on in evolutionary-learning and childhood learning", and I think trying to reason from analogy here is probably generall... (read more)

Facile answer: Why, that's just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.

On one hand, I've heard a few things about blank-slate experiments that didn't work out, and I do lean towards "they basically don't work". But I... also bet not that many serious attempts actually happened, and that the people attempting them kinda sucked in obvious ways, and that you could do a lot better than however "well" the soviets did.


I liked the high-level strategic frame in the methodology section. I do sure wish we weren't pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.

I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been given to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, ... (read more)

I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so I'm copying the methods section over here. (Not sure how well the text equations copied over.)


To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation s

... (read more)
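For readers who want the gist in code: the following is my rough reconstruction of the CCS objective from the paper's description (not the authors' implementation). The probe is scored by how consistently it assigns complementary probabilities to a statement and its negation, while avoiding the degenerate always-0.5 answer:

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Contrast-Consistent Search objective on a batch of contrast pairs.

    p_pos[i]: the probe's probability that statement i is true.
    p_neg[i]: the probe's probability that the *negation* of statement i is true.
    A truth-tracking direction should make the two sum to ~1 (consistency)
    while staying away from the uninformative p_pos = p_neg = 0.5 (confidence).
    """
    consistency = (p_pos + p_neg - 1.0) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# A consistent, confident probe scores near zero...
good = ccs_loss(np.array([0.9, 0.1]), np.array([0.1, 0.9]))
# ...while the degenerate "always say 0.5" probe is penalized by the
# confidence term.
degenerate = ccs_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

In the paper this loss is minimized over directions in activation space to find the probe; the sketch above only shows the objective being optimized.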

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself

Hmm, I went to twitter to see if it had more detail, but found it to be more like "a shorter version of this overall post" rather than "more detail on the implementation details of the paper." But, here's a copy of it here for ease-of-reading:

How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (

... (read more)

The link here is dead; can you find a more up-to-date one? (If you copy-paste a screenshot into the LessWrong editor it should successfully create its own copy.)

Curated. I think I had read a bunch of stuff pointing in this direction before, but somehow this post helped the concepts (i.e. the distinction between selecting for bad behavior and for goal-directedness) be a lot clearer in my mind. 

rather than actually talking about the details, which is what I would usually find useful about reviews.

I'm interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it'd be helpful, but I'm not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job)

In general, I'm legit fairly uncertain whether "effort-reviews" (whether detail-focused or big-picture-focused) are worthwhile. It seems plausible to me that detail-focused-... (read more)

3Rohin Shah1y
A couple of reasons:

1. It's far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
   1. This doesn't apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
   2. This is similar to the general idea in AI safety via debate -- when you have access to a review you are more like a judge; without a review you are more like the debate opponent.
2. Having someone else explain the paper from their perspective can surface other ways of thinking about the paper that can help with understanding it.
   1. This sometimes does happen with big-picture reviews, though I think it's less common.

Tbc, I'm not necessarily saying it is worth the opportunity cost of the reviewer's time; I haven't thought much about it.

Fair. Fwiw I'd be interested in your review of the followup as a standalone. 

2Stuart Armstrong1y
Here's the review, though it's not very detailed (the post explains why):
3Stuart Armstrong1y
I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).