All of Raemon's Comments + Replies

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

I think I disagree with this characterization. A) We totally have robot cars by now. B) I think mostly what we don't have are AIs running systems where the consequence of failure is super high (which maybe happens to be more true for the physical world, but I'd expect it to also be true for critical systems in the digital world).

-1 · Alex Flint · 1d
Have you personally ever ridden in a robot car that has no safety driver?

I've been trying to articulate some thoughts since Rohin's original comment, and maybe going to just rant-something-out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the conceptual deep disagreements for many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemic... (read more)

I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT-3 in the month prior to ChatGPT, I felt like the two interfaces were approximately as convenient)

I don't know what mechanism was used to generate the longer coherence though.

1 · Kaj Sotala · 6h
At least ChatGPT seems to have a longer context window, this experiment [] suggesting 8192 tokens.
3 · Paul Christiano · 4d
I don't think this is related to RLHF.

I liked the point about "the reason GPT-3 isn't consequentialist is that it doesn't find its way to the same configuration when you perturb the starting conditions." I think I could have generated that definition of consequentialism, but would have trouble making the connection on-the-fly. (At least, I didn't successfully generate it in between reading Scott's confusion and Eliezer's explanation.)

I feel like I now get it more crisply.

Not really the main point, but, I would bet:

a) something pretty close to Minecraft will be an important testing ground for some kinds of alignment work.

b) Minecraft itself will probably get a lot of use in AI research as things advance (largely due to being one of the most popular videogames of all time), whether or not it's actually quite the right test-bed. (I think the right test-bed will probably be optimized more directly for ease-of-training).

I think it might be worth Eliezer playing a Minecraft LAN party with some friends* for a weekend, so that the... (read more)

Okay, no, I think I see the problem, which is that I'm failing to consider that evolutionary-learning and childhood-learning are happening at different times through different algorithms, whereas for AIs they're both happening in the same step by the same algorithm.

Is it actually the case that they're happening "in the same step" for the AI? 

I agree with "the thing going on in AI is quite different from the collective learning going on in evolutionary-learning and childhood learning", and I think trying to reason from analogy here is probably generall... (read more)

Facile answer: Why, that's just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.

On one hand, I've heard a few things about blank-slate experiments that didn't work out, and I do lean towards "they basically don't work". But I... also bet not that many serious attempts actually happened, and that the people attempting them kinda sucked in obvious ways, and that you could do a lot better than however "well" the soviets did.


I liked the high-level strategic frame in the methodology section. I do sure wish we weren't pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.

I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been given to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, ... (read more)

I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so copying the methods section over here. (Not sure how much the text equations copied over)


To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation s

... (read more)
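To make the quoted methods excerpt more concrete, here is a minimal sketch of the CCS objective it describes. This is my illustrative reconstruction from the description, not the authors' code: the variable names (`phi_pos`, `phi_neg`, `theta`, `b`) and the exact loss terms are assumptions based on the idea of finding a direction in activation space whose probabilities for a statement and its negation are consistent.

```python
import numpy as np

# Sketch of Contrast-Consistent Search (CCS), as described in the excerpt.
# Assumed setup: phi_pos[i] and phi_neg[i] are the model's activations for
# the "statement i is true" and "statement i is false" phrasings; (theta, b)
# parameterize a linear probe along a candidate "truth" direction.

def ccs_loss(theta, b, phi_pos, phi_neg):
    """Consistency + confidence objective for a candidate truth direction."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_pos = sigmoid(phi_pos @ theta + b)  # P(true | positive phrasing)
    p_neg = sigmoid(phi_neg @ theta + b)  # P(true | negative phrasing)
    # Consistency: a statement and its negation should have probabilities
    # summing to ~1, so p_pos should match 1 - p_neg.
    l_consistency = np.mean((p_pos - (1.0 - p_neg)) ** 2)
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    l_confidence = np.mean(np.minimum(p_pos, p_neg) ** 2)
    return l_consistency + l_confidence
```

In the actual paper the direction is found by minimizing a loss like this with gradient descent over normalized activations; the sketch omits the normalization and optimization loop.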

For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself

Hmm, I went to twitter to see if it had more detail, but found it to be more like "a shorter version of this overall post" rather than "more detail on the implementation details of the paper." But here's a copy of it for ease-of-reading:

How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (

... (read more)

The link here is dead, can you find a more up to date one? (if you copy-paste a screenshot into the LessWrong editor it should successfully create its own copy)

Curated. I think I had read a bunch of stuff pointing in this direction before, but somehow this post helped the concepts (i.e. the distinction between selecting for bad behavior and for goal-directedness) be a lot clearer in my mind. 

rather than actually talking about the details, which is what I would usually find useful about reviews.

I'm interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it'd be helpful, but I'm not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job)

In general, I'm legit fairly uncertain whether "effort-reviews" (whether detail-focused or big-picture focused) are worthwhile. It seems plausible to me that detail-focused-... (read more)

3 · Rohin Shah · 12d
A couple of reasons:

1. It's far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
   1. This doesn't apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
   2. This is similar to the general idea in AI safety via debate -- when you have access to a review you are more like a judge; without a review you are more like the debate opponent.
2. Having someone else explain the paper from their perspective can surface other ways of thinking about the paper that can help with understanding it.
   1. This sometimes does happen with big-picture reviews, though I think it's less common.

Tbc, I'm not necessarily saying it is worth the opportunity cost of the reviewer's time; I haven't thought much about it.

Fair. Fwiw I'd be interested in your review of the followup as a standalone. 

2 · Stuart Armstrong · 7d
Here's the review, though it's not very detailed (the post explains why): []
3 · Stuart Armstrong · 14d
I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

Man, I haven't had time to thoroughly review this, but given that it's an in-depth review of another post up for review, it seems sad not to include it.

I'd ideally like to see a review from someone who actually got started on Independent Alignment Research via this document, and/or grantmakers or senior researchers who have seen up-and-coming researchers who were influenced by this document.

But, from everything I understand about the field, this seems about right to me, and seems like a valuable resource for people figuring out how to help with Alignment. I like that it both explains the problems the field faces, and it lays out some of the realpolitik of getting grants.

Actually, rereading this, it strikes me as a pretty good "intro to the John Wentworth worldview", weaving a bunch of disparate posts together into a clear frame. 

This piece took an important topic that I hadn't realized I was confused/muddled about, convinced me that I was indeed confused/muddled about it, and simultaneously provided a good framework for thinking about it. I feel like I have a clearer sense of how Worst Case Thinking applies in alignment.

I also appreciated a lot of the comments here that explore the topic in more detail.

FWIW I think a fairly substantial amount of effort has gone into resolving longstanding disagreements. I think that effort has resulted in a lot of good work, and updates from many people reading about the disagreement discussion, but not really changed the minds of the people doing the arguing. (See: the MIRI Dialogues)

And it's totally plausible to me the answer is "10-100x the amount of work that has gone in so far."

I maybe agree that people haven't literally sat and double-cruxed for six months. I don't know that it's fair to describe this as "impractical... (read more)

Oh sure, I certainly don't mean to imply that there's been little effort in absolute terms - I'm very encouraged by the MIRI dialogues, and assume there are a bunch of behind-the-scenes conversations going on. I also assume that everyone is doing what seems best in good faith, and has potentially high-value demands on their time. However, given the stakes, I think it's a time for extraordinary efforts - and so I worry that [this isn't the kind of thing that is usually done] is doing too much work.

I think the "principled epistemics and EV calculations" could perfectly well be the explanation, if it were the case that most researchers put around a 1% chance on [Eliezer/Nate/John... are largely correct on the cruxy stuff]. That's not the sense I get - more that many put the odds somewhere around 5% to 25%, but don't believe the arguments are sufficiently crisp to allow productive engagement. If I'm correct on that (and I may well not be), it does not seem a principled justification for the status-quo.

Granted the right course isn't obvious - we'd need whoever's on the other side of the double-cruxing to really know their stuff. Perhaps Paul's/Rohin's... time is too valuable for a 6 month cost to pay off. (the more realistic version likely involves not-quite-so-valuable people from each 'side' doing it)

As for "done a thing a bunch and it doesn't seem to be working", what's the prior on [two experts in a field from very different schools of thought talk for about a week and try to reach agreement]? I'm no expert, but I strongly expect that not to work in most cases. To have a realistic expectation of its working, you'd need to be doing the kinds of thing that are highly non-standard. Experts having some discussions over a week is standard. Making it your one focus for 6 months is not. (frankly, I'd be over the moon for the one month version [but again, for all I know this may have been tried])
2 · Adam Shimi · 20d
I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone replacing the job, automated regulations replacing banks when they can't move fast enough). Which neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all factors making economic actors and market less than efficient. I expect that most of these factors could be addressed with more work on the scenarios.

This post is among the most concrete, actionable, and valuable posts I read from 2021. Earlier this year, when I was trying to get a handle on the current-state-of-AI, this post transformed my opinion of Interpretability research from "man, this seems important but it looks so daunting and I can't imagine interpretability providing enough value in time" to "okay, I actually see a research framework I could expect to be scalable."

I'm not a technical researcher so I have trouble comparing this post to other Alignment conceptual work. But my impression, from seein... (read more)

Why is this specific to CAIS, as opposed to other frameworks? (Seems like this is a fairly common implication of systems that prevent people from developing rogue AGIs)

2 · Charlie Steiner · 2mo
You're right, it's not very specific. But it was non-obvious to me, at least.

Curated.  This is a bit of an older post but seemed important. I know a lot of people asking "When is it a good idea to do work that furthers AI capabilities (even if it also helps alignment?)" – both researchers, and funders. I think this post adds a crisp extra consideration to the question that I hadn't seen spelled out before.

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have... (read more)

Curated. I think this domain of decision theory is easy to get confused in, and having a really explicit writeup of how it applies in the case of negotiating with AIs (or, failing to), seems quite helpful. I had had a vague understanding of the points in this post before, but feel much clearer about it now.

I tagged this "Pointers Problem" but am not 100% sure it's getting at the same thing. Curious if there's a different tag that feels more appropriate.

An angle I think is relevant here is that a sufficiently complex, "well founded" AI system is still going to be fairly difficult to understand. i.e. a large codebase, where everything is properly commented and labeled, might still have lots of unforeseen bugs and interactions the engineers didn't intend. 

So I think before you deploy a powerful "Well Founded" AI system, you'll probably still need a kind of generalized reverse-engineering/interpretability skill to explain how the entire process works in various test cases.

1 · David Scott Krueger · 3mo
I don't really buy this argument.

* I think the following is a vague and slippery concept: "a kind of generalized reverse-engineering/interpretability skill". But I agree that you would want to do testing, etc. of any system before you deploy it.
* It seems like the ambitious goal of mechanistic interpretability, which would get you the kind of safety properties we are after, would indeed require explaining how the entire process works. But when we are talking about such a complex system, it seems the main obstacle to understanding for either approach is our ability to comprehend such an explanation. I don't see a reason to say that we can surmount that obstacle more easily via reverse engineering than via engineering. It often seems to me that people are assuming that mechanistic interpretability addresses this obstacle (I'm skeptical), or that (effectively) the obstacle doesn't actually exist (in which case why can't we just do it via engineering?)

John's Why Not Just... sequence is a series of somewhat rough takes on a few of them. (though I think many of them are not written up super comprehensively)

Curated. I think shovel-ready projects that can help with alignment are quite helpful for the field, in particular right now when we have a bunch of smart people showing up, looking to contribute. 

Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities. 

My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these ar... (read more)

My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Co... (read more)

so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people being theoreti... (read more)

Curated. I'm not sure I endorse all the specific examples, but the general principles make sense to me as considerations to help guide alignment research directions.

FYI, I've found this concept useful in thinking, but I think "atomic" is a worse word than just saying "non-interruptible". When I'm explaining this to people I just say "unbounded, uninterruptible optimization". The word atomic only seems to serve to make people say "what's that?" and then I say "uninterruptible"

Mod note: I'm frontpaging this. It's a bit of an edge case (workshops definitely aren't timeless, but we have tended to frontpage prize/contest announcements for intellectual content)

I don't think the usual arguments apply as obviously here. "Maximal Diamond" is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it's a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I've seen more detailed arguments for.

I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.")

But in the interests of actually... (read more)

0 · Ulisse Mini · 7mo
I think even without point #4 you don't necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you're bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.), all of which are vulnerable to a deceptively aligned model (just wait till you're out of training to reward-hack). Also, every time you say "train it by X so it learns Y" you're assuming alignment (e.g. "digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion").

IMO shard theory [] provides a great frame to think about this in; it's a must-read for improving alignment intuitions.

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)

It seems to me that we have basically no idea how to do this. We can train the AGI to be pret

... (read more)
5 · Thomas Larsen · 7mo
There is also the ontology identification problem []. The two biggest things are: we don't know how to specify exactly what a diamond is, because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world.

I personally don't expect coding a diamond maximizing AGI to be hard, because I think that diamonds is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more basic physics, e.g. quarks that exist below the molecular level, "diamond" will probably still be a pretty natural concept, just like how "apple" didn't stop being a useful concept after shifting from newtonian mechanics to QM.

Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn't seem helpful for alignment.
3 · Ben Pace · 7mo
Hm? It's as Nate says in the quote. It's the same type of problem as humans inventing birth-control out of distribution. If you have an alternative proposal for how to build a diamond-maximizer, you can specify that for a response, but the commonly discussed idea of "train on examples of diamonds" will fail at inner-alignment, and it will just optimize diamonds in a particular setting and then elsewhere do crazy other things that look like all kinds of white noise to you.

Also "expect this to fail" already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?

Curated. My sense is there is no existing AI company with adequate infrastructure for safely deploying AGI, and this is a pretty big deal. I like this writeup for laying out a bunch of considerations.

A few times in this article, Eliezer notes "it'd be great if we could get X, but, the process of trying to get X would cause some bad consequences." I'd like to see further exploration/models of "given the state of the current world, which approaches are actually tractable."

Are you actually gonna remember the apostrophe?

I just tested that, and it works both ways.

Curated. As previously noted, I'm quite glad to have this list of reasons written up. I like Robby's comment here which notes:

The point is not 'humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people'. The point is 'humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it'.

I look forward to other alignment thinkers writing up either their explicit disagreements with this list, or things that the list misses, or their own frame on th... (read more)

Note: I think there's a bunch of additional reasons for doom, surrounding "civilizational adequacy / organizational competence / societal dynamics". Eliezer briefly alluded to these, but AFAICT he's mostly focused on lethality that comes "early", and then didn't address them much. My model of Andrew Critch has a bunch of concerns about doom that show up later, because there's a bunch of additional challenges you have to solve if AI doesn't dramatically win/lose early on (i.e. multi/multi dynamics and how they spiral out of control)

I know a bunch of people ... (read more)

I read an early draft of this a while ago and am glad to have it publicly available. And I do think the updates in structure/introduction were worth the wait. Thanks!

My sense is that Anthropic is somewhat oriented around this idea. I'm not sure if this is their actual plan or just some guesswork I read between the lines.

But I vaguely recall something like "develop capabilities that you don't publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities which you then have some lead time to try to inspect via interpretability techniques, and then practice alignment on various capability-scales."

(I may have just made this up while trying to steelman them to myself)

government-security-clearance-style screening

What does that actually involve?

If e.g. the government of Iceland suddenly understood how serious things had gotten and granted sanction and security to a project, that would fit this description, but I think that trying to arrange anything like this would probably make things worse globally because of the mindset it promoted.

I think I have a reasonable guess, but am interested in more details about what goes wrong here and what mindset it promotes. (i.e. governments generally trying to regulate AI whether or not they understand how to do so, and then getting both random-ineffective regulations and rent capture and stuff?)

Curated. Thanks to Steve for writing up all these thoughts throughout the sequence.

Normally when we curate a post-from-a-sequence-that-represents-the-sequence, we end up curating the first post, which points roughly to where the sequence is going. I like the fact that this time, there was a post that does a particularly nice job tying-everything-together, while sending people off with a roadmap of further work to do.

I appreciate the honesty about your epistemic state about the "Is Steve full of crap research program?". :P

Curated. I've heard a few offhand comments about this type of research work in the past few months, but wasn't quite sure how seriously to take it. 

I like this writeup for spelling out details of why blackbox investigators might be useful, what skills it requires, and how you might go about it.

I expect this sort of skillset to have major limitations, but I think I agree with the stated claims that it's a useful skillset to have in conjunction with other techniques.

GPT-N that you can prompt with "I am stuck with this transformer architecture trying to solve problem X". GPT-N would be AIHHAI if it answers along the lines of "In this arXiv article, they used trick Z to solve problems similar to X. Have you considered implementing it?", and using an implementation of Z would solve X >50% of the time.

I haven't finished reading the post, but I found it worthwhile for this quote alone. This is the first description I've read of how GPT-N could be transformative. (Upon reflection this was super obvious and I'm embarrasse... (read more)

Well yeah, that's my point. It seems to me that any pivotal act worthy of the name would essentially require the AI team to become an AGI-powered world government, which seems pretty darn difficult to pull off safely. The superpowered-AI-propaganda plan falls under this category.

Yeah. I think this sort of thing is why Eliezer thinks we're doomed – getting humanity to coordinate collectively seems doomed (i.e. see Gain of Function Research), and there are no weak pivotal acts that aren't basically impossible to execute safely.

The nanomachine gpu-melting... (read more)

Hmm, interesting... but wasn't he more optimistic a few years ago, when his plan was still "pull off a pivotal act with a limited AI"? I thought the thing that made him update towards doom was the apparent difficulty of safely making even a limited AI, plus shorter timelines.

Ah, that actually seems like it might work. I guess the problem is that an AI that can competently do neuroscience well enough to do this would have to be pretty general.

Maybe a more realistic plan along the same lines might be to try using ML to replicate the functional activity of various parts of the human brain and create 'pseudo-uploads'. Or just try to create an AI with similar architecture and roughly-similar reward function to us, hoping that human values are more generic than they might appear [].

Followup point on the Gain-of-Function-Ban as practice-run for AI:

My sense is that the biorisk people who were thinking about Gain-of-Function-Ban were not primarily modeling it as a practice run for regulating AGI. This may result in them not really prioritizing it.

I think biorisk is significantly lower than AGI risk, so if it's tractable and useful to regulate Gain of Function research as a practice run for regulating AGI, it's plausible this is actually much more important than business-as-usual biorisk. 

BUT I think smart people I know seem to disa... (read more)

Various thoughts that this inspires:

Gain of Function Ban as Practice-Run/Learning for relevant AI Bans

I have heard vague-musings-of-plans in the direction of "get the world to successfully ban Gain of Function research, as a practice-case for getting the world to successfully ban dangerous AI." 

I have vague memories of the actual top bio people around not being too focused on this, because they thought there were easier ways to make progress on biosecurity. (I may be conflating a few different statements – they might have just been critiquing a particular ... (read more)

+1 to the distinction between "Regulating AI is possible/impossible" vs "pivotal act framing is harmful/unharmful".

I'm sympathetic to a view that says something like "yeah, regulating AI is Hard, but it's also necessary because a unilateral pivotal act would be Bad". (TBC, I'm not saying I agree with that view, but it's at least coherent and not obviously incompatible with how the world actually works.) To properly make that case, one has to argue some combination of:

  • A unilateral pivotal act would be so bad that it's worth accepting a much higher chance of
... (read more)
4 · Raymond Arnold · 9mo
Followup point on the Gain-of-Function-Ban as practice-run for AI:

My sense is that the biorisk people who were thinking about Gain-of-Function-Ban were not primarily modeling it as a practice run for regulating AGI. This may result in them not really prioritizing it.

I think biorisk is significantly lower than AGI risk, so if it's tractable and useful to regulate Gain of Function research as a practice run for regulating AGI, it's plausible this is actually much more important than business-as-usual biorisk.

BUT I think smart people I know seem to disagree about how any of this works, so the "if tractable and useful" conditional is pretty non-obvious to me. If bio-and-AI-people haven't had a serious conversation about this where they mapped out the considerations in more detail, I do think that should happen.