All of Raemon's Comments + Replies

The case for becoming a black-box investigator of language models

Curated. I've heard a few offhand comments about this type of research work in the past few months, but wasn't quite sure how seriously to take it. 

I like this writeup for spelling out the details of why black-box investigators might be useful, what skills the work requires, and how you might go about it. 

I expect this sort of skillset to have major limitations, but I think I agree with the stated claims that it's a useful skillset to have in conjunction with other techniques.

Why Copilot Accelerates Timelines

GPT-N that you can prompt with "I am stuck with this transformer architecture trying to solve problem X". GPT-N would be AIHHAI if it answers along the lines of "In this arXiv article, they used trick Z to solve problems similar to X. Have you considered implementing it?", and using an implementation of Z would solve X >50% of the time.

I haven't finished reading the post, but I found it worthwhile for this quote alone. This is the first description I've read of how GPT-N could be transformative. (Upon reflection this was super obvious and I'm embarrasse... (read more)

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Well yeah, that's my point. It seems to me that any pivotal act worthy of the name would essentially require the AI team to become an AGI-powered world government, which seems pretty darn difficult to pull off safely. The superpowered-AI-propaganda plan falls under this category.

Yeah. I think this sort of thing is why Eliezer thinks we're doomed – getting humanity to coordinate collectively seems doomed (e.g. see Gain of Function research), and there are no weak pivotal acts that aren't basically impossible to execute safely.

The nanomachine gpu-melting... (read more)

Hmm, interesting... but wasn't he more optimistic a few years ago, when his plan was still "pull off a pivotal act with a limited AI"? I thought the thing that made him update towards doom was the apparent difficulty of safely making even a limited AI, plus shorter timelines.

Ah, that actually seems like it might work. I guess the problem is that an AI that can competently do neuroscience well enough to do this would have to be pretty general. Maybe a more realistic plan along the same lines might be to try using ML to replicate the functional activity of various parts of the human brain and create 'pseudo-uploads'. Or just try to create an AI with similar architecture and roughly-similar reward function to us, hoping that human values are more generic than they might appear.
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Followup point on the Gain-of-Function-Ban as practice-run for AI:

My sense is that the biorisk people who were thinking about Gain-of-Function-Ban were not primarily modeling it as a practice run for regulating AGI. This may result in them not really prioritizing it.

I think biorisk is significantly lower than AGI risk, so if it's tractable and useful to regulate Gain of Function research as a practice run for regulating AGI, it's plausible this is actually much more important than business-as-usual biorisk. 

BUT I think smart people I know seem to disagree about how any of this works, so the "if tractable and useful" conditional is pretty non-obvious to me. If bio-and-AI-people haven't had a serious conversation about this where they mapped out the considerations in more detail, I do think that should happen.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Various thoughts that this inspires:

Gain of Function Ban as Practice-Run/Learning for relevant AI Bans

I have heard vague-musings-of-plans in the direction of "get the world to successfully ban Gain of Function research, as a practice-case for getting the world to successfully ban dangerous AI." 

I have vague memories of the actual top bio people around not being too focused on this, because they thought there were easier ways to make progress on biosecurity. (I may be conflating a few different statements – they might have just been critiquing a particular ... (read more)

+1 to the distinction between "Regulating AI is possible/impossible" vs "pivotal act framing is harmful/unharmful".

I'm sympathetic to a view that says something like "yeah, regulating AI is Hard, but it's also necessary because a unilateral pivotal act would be Bad". (TBC, I'm not saying I agree with that view, but it's at least coherent and not obviously incompatible with how the world actually works.) To properly make that case, one has to argue some combination of:

  • A unilateral pivotal act would be so bad that it's worth accepting a much higher chance of
... (read more)
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Also, like, Berkeley heat waves may just be significantly different than, like, Reno heat waves. My current read is that part of the issue here is that a lot of places don't actually get that hot, so having less robustly good air conditioners is fine.

I bought my single-hose AC for the 2019 heat wave in Mountain View (which was presumably basically similar to Berkeley). When I was in Vegas, summer was just three months of permanent extreme heat during the day; one does not stay somewhere without built-in AC in Vegas.
Paul Christiano (1mo):
I think labeling requirements are based on the expectation of cooling from 95 to 80 (and I expect typical use cases for portable AC are more like that). Actually hot places will usually have central air or window units.
[Link] A minimal viable product for alignment

Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it's more like "as competent as a deceptive four-year old" (my parents totally caught me when I told my first lie), than "as competent as a silver-tongued sociopath playing a long game."

I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don't-notice deception.

That falls squarely under the "other reasons to think our models are not yet deceptive" - i.e. we have priors that we'll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
A Longlist of Theories of Impact for Interpretability


I've long had a vague sense that interpretability should be helpful somehow, but recently when I tried to spell out exactly how it helped I had a surprisingly hard time. I appreciated this post's exploration of the concept.

Job Offering: Help Communicate Infrabayesianism

I've had on my TODO to try reading the LW post transcript of that and seeing if it could be distilled further.

A future episode might include a brief distillation of that episode ;)
Job Offering: Help Communicate Infrabayesianism

As someone who's kinda confused about InfraBayesianism but has some sense that it's important, I am glad to see this initiative. :)

Late 2021 MIRI Conversations: AMA / Discussion

To what extent do you think pivotal-acts-in-particular are strategically important (i.e. "successfully do a pivotal act, and if necessary build an AGI to do it" is the primary driving goal), vs "pivotal acts are useful shorthand to refer to the kind of intelligence level where it matters that an AGI be 'really safe'"?

I'm interested in particular in responses from Eliezer, Rohin, and perhaps Richard Ngo. (I've had private chats with Rohin that I thought were useful to share and this comment is sort of creating a framing device for sharing them, but I've bee... (read more)

Rohin Shah (2mo):
The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".

If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not get before we all die". In other words, the crux is in whether you can get an alignment solution with the properties I mentioned (and maybe also in whether people will be sensible enough to use the method + do the right governance). So with Eliezer I end up talking about those cruxes, rather than talking about "pivotal acts" per se, but I'm always imagining the "get an alignment solution, have everyone use it" plan.

When I talk to people who are attempting to model Eliezer, or defer to Eliezer, or speaking out of their own model that's heavily Eliezer-based, and I present this plan to them, and then they start thinking about pivotal acts, they do not say the thing Eliezer says above. I get the sense that they see "pivotal act" as some discrete, powerful, gameboard-flipping action taken at a particular point in time that changes x-risk from non-trivial to trivial, rather than as a referent to the much broader thing of "whatever ends the acute risk period". My plan doesn't involve anything as discrete and powerful as "melt all the GPUs", so from their perspective, a pivotal act hasn't happened, and the cached belief is that if a pivotal act hasn't happened, then we all die, therefore my plan leads to us all dying.

With those people I end up talking about how "pivotal act" is a referent to the goal of "End the Acute Risk Period" and if you achieve that you have won and there's nothing else left to do; it doesn't mat

My Eliezer-model thinks pivotal acts are genuinely, for-real, actually important. Like, he's not being metaphorical or making a pedagogical point when he says (paraphrasing) 'we need to use the first AGI systems to execute a huge, disruptive, game-board-flipping action, or we're all dead'.

When my Eliezer-model says that the most plausible pivotal acts he's aware of involve capabilities roughly at the level of 'develop nanotech' or 'put two cellular-identical strawberries on a plate', he's being completely literal. If some significantly weaker capability le... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Curated. I found the entire sequence of conversations quite valuable, and it seemed good both to let people know it had wrapped up, and curate it while the AMA was still going on.

Late 2021 MIRI Conversations: AMA / Discussion

Thanks. I wasn't super satisfied with the way I phrased my questions. I just made some slight edits to them (labeled as such), although they still don't feel like they quite do the thing. (I feel like I'm looking at a bunch of subtle frame disconnects, while multiple other frame disconnects are going on, so pinpointing the thing is hard.)

I think "is any of this actually cruxy" is maybe the most important question and I should have included it. You answered "not supermuch, at least compared to models of intelligence". Do you think there's any similar nearby ... (read more)

Rohin Shah (2mo):
It's definitely cruxy in the sense that changing my opinions on any of these would shift my p(doom) some amount. My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes. This seems primarily affected by the quality of technical alignment solutions, but certainly civilizational adequacy also affects the answer.
Late 2021 MIRI Conversations: AMA / Discussion

[I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.]

I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They'd go around telling me "Ray, you're exhibiting that bias right now. Whatever rationalization you're coming up with right now, it's not the real reason you're arguing X." An... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

It seems to me that a major crux about AI strategy routes through "is civilization generally adequate or not?". It seems like people have pretty different intuitions and ontologies here. Here's an attempt at some questions of varying levels of concreteness, to tease out some worldview implications. 

(I normally use the phrase "civilizational adequacy", but I think that's kinda a technical term that means a specific thing and I think maybe I'm pointing at a broader concept.)

"Does civilization generally behave sensibly?" This is a vague question, some po... (read more)

I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:

Do you think major AI orgs will realize that AI is potentially worldendingly dangerous, and have any kind of process at all to handle that?

Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.

(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)

(I know less about... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

There's something I had interpreted the original CEV paper to be implying, but wasn't sure if it was still part of the strategic landscape, which was "have the alignment project work towards a goal that is highly visibly fair, to disincentivize races." Was that an intentional part of the goal, or was it just that CEV seemed something like "the right thing to do" (independent of its impact on races)?

How does Eliezer think about it now?

Yes, it was an intentional part of the goal.

If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV.  I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to preve... (read more)

How I Formed My Own Views About AI Safety

I'm currently working to form my own models here. I'm not sure if this post concretely helped me but it's nice to see other people grappling with it.

One thing I notice is that this post is sort of focused on "developing inside views as a researcher, so you can do research." But an additional lens here is "Have an inside view so you can give advice to other people, or do useful community building, or build useful meta-tools for research, or fund research."

In my case I already feel like I have a solid inside view of "AGI is important", and "timelines might be... (read more)

The Big Picture Of Alignment (Talk Part 1)

Are there already plans for a transcript of this? (I could set a transcription in motion.)

No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.
Logan Riggs Smith (3mo):
How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?
Ngo and Yudkowsky on scientific reasoning and pivotal acts


Again, this is an argument that I believed less after looking into the details, because right now it's pretty difficult to throw more compute at neural networks at runtime.

Which is not to say that it's a bad argument, the differences in compute-scalability between humans and AIs are clearly important. But I'm confused about the structure of your argument that knowing more details will predictably update me in a certain direction.


I suppose the genericized version of my actual response to that would be, "architectu

... (read more)
AGI safety from first principles: Introduction

I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming both to be a good introductory sequence and to make a "complete and compelling case I can make for why the development of AGI might pose an existential threat".

The question is who this sequence is for, what its goal is, and how it compares to other writing targeting similar demographics. 

Some writing that comes to mind to compare/contrast it with includes:

... (read more)
Inaccessible information

It strikes me that this post looks like (AFAICT) a stepping stone towards the Eliciting Latent Knowledge research agenda, which currently has a lot of support/traction. That makes this post fairly historically important.

Some AI research areas and their relevance to existential safety

I've highly voted this post for a few reasons. 

First, this post contains a bunch of other individual ideas I've found quite helpful for orienting. Some examples:

  • Useful thoughts on which term definitions have "staying power," and are worth coordinating around.
  • The zero/single/multi alignment framework.
  • The details on how to anticipate, legitimize, and fulfill governance demands.

But my primary reason was learning Critch's views on what research fields are promising, and how they fit into his worldview. I'm not sure if I agree with Critch, but I think "Figur... (read more)

AGI safety from first principles: Introduction

A year later, as we consider this for the 2020 Review, I think figuring out a better name is worth another look.

Another option is "AI Catastrophe from First Principles"

EfficientZero: How It Works

Curated. EfficientZero seems like an important advance, and I appreciate this post's lengthy explanation, broken into sections that made it easy to skim past parts I already understood.

How To Get Into Independent Research On Alignment/Agency

Curated. This post matched my own models of how folk tend to get into independent alignment research, and I've seen some people whose models I trust more endorse the post as well. Scaling good independent alignment research seems very important.

I do like that the post also specifies who shouldn't go into independent research.

Yudkowsky and Christiano discuss "Takeoff Speeds"

So... I totally think there are people who sort of nod along with Paul, using it as an excuse to believe in a rosier world where things are more comprehensible and they can imagine themselves doing useful things without having a plan for solving the actual hard problems. Those types of people exist. I think there's some important work to be done in confronting them with the hard problem at hand.

But, also... Paul's world AFAICT isn't actually rosier. It's potentially more frightening to me. In Smooth Takeoff world, you can't carefully plan your pivotal act ... (read more)

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Update: I originally posted this question over here, then realized this post existed and maybe I should just post the question here. But then it turned out people had already started answering my question-post, so I am declaring that post the canonical place to answer the question.

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Can someone give a rough explanation of how this compares to the recent Deepmind atari-playing AI:

And, for that matter, how both of them compare to the older deepmind paper:

Are they accomplishing qualitatively different things? The same thing but better?


AMA: Paul Christiano, alignment researcher

Curated. I don't think we've curated an AMA before, and I'm not sure if I have a principled opinion on doing that, but this post seems chock full of small useful insights, and fragments of ideas that seem like they might otherwise take a while to get written up more comprehensively, which I think is good.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Curated. I appreciated this post for a combination of:

  • laying out several concrete stories about how AI could lead to human extinction
  • laying out a frame for how to think about those stories (while acknowledging other frames one could apply to the story)
  • linking to a variety of research, with more thoughts on what sort of further research might be helpful.

I also wanted to highlight this section:

Finally, should also mention that I agree with Tom Dietterich’s view (dietterich2019robust) that we should make AI safer to society by learning from high-reliability organiz

... (read more)
Another (outer) alignment failure story

There's a lot of intellectual meat in this story that's interesting. But, my first comment was: "I'm finding myself surprisingly impressed about some aesthetic/stylistic choices here, which I'm surprised I haven't seen before in AI Takeoff Fiction."

In normal english phrasing across multiple paragraphs, there's a sort of rise-and-fall of tension. You establish a minor conflict, confusion, or an open loop of curiosity, and then something happens that resolves it a bit. This isn't just about the content of 'what happens', but also what sort of phrasing one us... (read more)

How do we prepare for final crunch time?


I found this a surprisingly obvious set of strategic considerations (and meta-considerations), that for some reason I'd never seen anyone actually attempt to tackle before.

I found the notion of practicing "no cost too large" periods quite interesting. I'm somewhat intimidated by the prospect of trying it out, but it does seem like a good idea.

How do we prepare for final crunch time?

Seems true, but also didn't seem to be what this post was about?

Epistemological Framing for AI Alignment Research

On the meta-side: an update I made writing this comment is that inline-google-doc-style commenting is pretty important. It allows you to tag a specific part of the post and say "hey, this seems wrong/confused" without making it that big a deal, whereas writing a LW comment you sort of have to establish the context, which intrinsically means making it into A Thing.

Epistemological Framing for AI Alignment Research

(I tried writing up comments here as if I were commenting on a google doc, rather than a LW post, as part of an experiment I had talked about with AdamShimi. I found that actually it was fairly hard – both because I couldn't make quick comments on a given section without it feeling like a bigger deal than I meant it to be, and also because the overall thing came out more critical-feeling than feels right on a public post. This is ironic since I was the one who told Adam "I bet if you just ask people to comment on it as if it's a google doc it'll go fi... (read more)

The case for aligning narrowly superhuman models

I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis)

Paul Christiano (1y):
That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well. It's also not really clear what "a human thinking 10,000 years" means, HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)
adamShimi's Shortform

I think there are a number of features LW could build to improve this situation, but first I'm curious for more detail on "what feels wrong about explicitly asking individuals for feedback after posting on AF", similar to how you might ask for feedback on a gDoc?

Steve Byrnes (1y):
Not Adam, but:

1. Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.
2. Maybe it feels like "I want feedback for my own personal benefit" when it's already posted, as opposed to "I want feedback to improve this document which I will share with the community" when it's not yet posted. So it feels more selfish, instead of part of a community project. For that problem, maybe you'd want to frame it as "I'm planning to rewrite this post / write a follow-up to this post / give a talk based on this post / etc., can you please offer feedback on this post to help me with that?" (Assuming that's in fact the case, of course, but most posts have follow-up posts...)
The Commitment Races problem

Okay, so now having thought about this a bit...

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of these kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finnicky confusing reasons."

Then I realized "holy christ most AGI ... (read more)

The Commitment Races problem

Yeah I'm interested in chatting about this. 

I feel I should disclaim "much of what I'd have to say about this is a watered down version of whatever Andrew Critch would say". He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles and fleshing out my understanding of how they scale up from "200 human rationalists" to 1000-10,000 sized coalitions to All Humanity and to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

The Commitment Races problem

I was confused about this post, and... I might have resolved my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought: 

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory was, and commit to that?". I thought that was the point. 

i.e. I wouldn't precommit to treating either the Nash Bargaining Solution or Kalai-Smorodinsky Solution as "the permanent grim trigger bullying point", I'd precommit to something like "have a meta-policy of not giving int... (read more)

Daniel Kokotajlo (1y):
Thanks! Reading this comment makes me very happy, because it seems like you are now in a similar headspace to me back in the day. Writing this post was my response to being in this headspace.

This sounds like a plausibly good rule to me. But that doesn't mean that every AI we build will automatically follow it. Moreover, thinking about acausal trade is in some sense engaging in acausal trade. As I put it:

As for your handwavy proposals, I do agree that they are pretty good. They are somewhat similar to the proposals I favor, in fact. But these are just specific proposals in a big space of possible strategies, and (a) we have reason to think there might be flaws in these proposals that we haven't discovered yet, and (b) even if these proposals work perfectly there's still the problem of making sure that our AI follows them.

If you want to think and talk more about this, I'd be very interested to hear your thoughts. Unfortunately, while my estimate of the commitment races problem's importance has only increased over the past year, I haven't done much to actually make intellectual progress on it.
The Credit Assignment Problem

I think I have juuust enough background to follow the broad strokes of this post, but not to quite grok the parts I think Abram was most interested in. 

It definitely caused me to think about credit assignment. I actually ended up thinking about it largely through the lens of Moral Mazes (where challenges of credit assignment combine with other forces to create a really bad environment). Re-reading this post, while I don't quite follow everything, I do successfully get a taste of how credit assignment fits into a bunch of different domains.

For the "myop... (read more)

The Commitment Races problem

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

Why Subagents?

This post feels probably important but I don't know that I actually understood it or used it enough to feel right nominating it myself. But, bumping it a bit to encourage others to look into it.

Alignment Research Field Guide

This post is a great tutorial on how to run a research group. 

My main complaint about it is that it had the potential to be a much more general post, obviously relevant to anyone building a serious intellectual community, but the framing makes it feel only relevant to Alignment research.

Some AI research areas and their relevance to existential safety

Curated, for several reasons.

I think it's really hard to figure out how to help with beneficial AI. Various career and research paths vary in how likely they are to help, or harm, or fit together. I think many prominent thinkers in the AI landscape have developed nuanced takes on how to think about the evolving landscape, but often haven't written up those thoughts. 

I like this post both for laying out a lot of object-level thoughts about that, and also for demonstrating a possible framework for organizing those object-level thoughts, and for doing it... (read more)

The Solomonoff Prior is Malign

Curated. This post does a good job of summarizing a lot of complex material, in a (moderately) accessible fashion.

Ben Pace (2y):
+1. I already said I liked it, but this post is great and will immediately be the standard resource on this topic. Thank you so much.
Draft report on AI timelines

I'm assuming part of the point is the LW crosspost still buries things in a hard-to-navigate google doc, which prevents it from easily getting cited or going viral, and Ajeya is asking/hoping for trust that they can get the benefit of some additional review from a wider variety of sources.
