
Recent Discussion


My PhD thesis probably wins the prize for the weirdest ever defended at my research lab. Not only was it a work of theory of distributed computing in a formal methods lab, but it didn’t even conform to what theory of distributed computing is supposed to look like. With the exception of one paper (interestingly, the only one accepted at the most prestigious conference in the field), none of my published works proposed new algorithms, impossibility results, complexity lower bounds, or even the most popular paper material: a brand new model of distributed computing to crowd the literature even further.

Instead, I looked at a specific formalism introduced years before, and how it abstracted the more familiar models used by most researchers. It had been introduced as such an...

2 · Steve Byrnes · 10h: Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?



Just like every Monday now, researchers in AI Alignment are invited for a coffee time, to talk about their research and what they're into.

Here is the link

And here is the everytimezone time.

Note that the link to the walled garden now only works for AF members. Anyone who wants to come but isn't an AF member needs to go through me. I'll broadly apply the following criteria for admission:

  • If working in an AI Alignment lab or funded for independent research, automatic admission
  • If recommended by an AF member, automatic admission
  • Otherwise, at my discretion

I prefer to turn away people who might be interesting but who I'm not confident won't derail the conversation, because this is supposed to be the place where AI Alignment researchers can talk about their current research without having to explain everything.

See you then!

1 · Donald Hobson · 7d: There seems to be some technical problem with the link. It gives me a "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people). We hope you had a really great time! :)" message. Edit: As of a few minutes after stated start time. It worked last week.

Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.

Cross-posted to the EA forum.


  • In August 2020, we conducted an online survey of prominent AI safety and governance researchers. You can see a copy of the survey at this link.[1]
  • We sent the survey to 135 researchers at leading AI safety/governance research organisations (including AI Impacts, CHAI, CLR, CSER, CSET, FHI, FLI, GCRI, MILA, MIRI, Open Philanthropy and PAI) and a number of independent researchers. We received 75 responses, a response rate of 56%.
  • The survey aimed to identify which AI existential risk scenarios[2] (which we will refer to simply as “risk scenarios”) those researchers find most likely, in order to (1) help with prioritising future work on exploring AI risk scenarios, and (2) facilitate discourse and understanding within the AI safety and governance community, including between researchers who

Planned summary for the Alignment Newsletter:

While the previous survey asked respondents about the overall probability of existential catastrophe, this survey seeks to find which particular risk scenarios respondents find more likely. The survey was sent to 135 researchers, of which 75 responded. The survey presented five scenarios along with an “other”, and asked people to allocate probabilities across them (effectively, conditioning on an AI-caused existential catastrophe, and then asking which scenario happened).
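To make the "conditioning" step in the summary concrete, here is a minimal Python sketch. It assumes a respondent starts from unconditional probabilities for each scenario and rescales them so they sum to 1, which is what conditioning on "an AI-caused existential catastrophe happened" amounts to. The function names, scenario labels, and example numbers are all hypothetical and are not taken from the survey; only the 75-of-135 response count comes from the post.

```python
def conditional_allocation(unconditional):
    """Rescale per-scenario probabilities so they sum to 1.

    `unconditional` maps a scenario label to the probability that an
    existential catastrophe occurs via that scenario. Dividing by the
    total gives P(scenario | some AI-caused catastrophe occurred).
    """
    total = sum(unconditional.values())
    if total <= 0:
        raise ValueError("at least one scenario needs positive probability")
    return {name: p / total for name, p in unconditional.items()}


def response_rate(responses, sent):
    """Fraction of surveys sent that were answered."""
    return responses / sent


# Hypothetical respondent: made-up labels and numbers, for illustration only.
example = {"scenario_a": 0.02, "scenario_b": 0.06, "other": 0.02}
allocation = conditional_allocation(example)

# Response rate reported in the post: 75 responses out of 135 surveys sent.
rate_percent = round(response_rate(75, 135) * 100)
```

The rescaling keeps the ratios between scenarios unchanged, so a respondent who thinks one scenario is three times as likely as another still says so after conditioning.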

The headline result is that all of the sc

3 · Evan Hubinger · 5d: (Moderation note: added to the Alignment Forum from LessWrong.)

I sent a two-question survey to ~117 people working on long-term AI risk, asking about the level of existential risk from "humanity not doing enough technical AI safety research" and from "AI systems not doing/optimizing what the people deploying them wanted/intended".

44 people responded (~38% response rate). In all cases, these represent the views of specific individuals, not an official view of any organization. Since some people's views may have made them more/less likely to respond, I suggest caution in drawing strong conclusions from the results below. Another reason for caution is that respondents added a lot of caveats to their responses (see the anonymized spreadsheet), which the aggregate numbers don't capture.

I don’t plan to do any analysis on this data, just share it; anyone who wants to analyze...

Planned summary for the Alignment Newsletter:

This post reports on the results of a survey sent to about 117 people working on long-term AI risk (of which 44 responded), asking about the magnitude of the risk from AI systems. I’d recommend reading the exact questions asked, since the results could be quite sensitive to the exact wording, and as an added bonus you can see the visualization of the responses. In addition, respondents expressed _a lot_ of uncertainty in their qualitative comments. And of course, there are all sorts of selection effects that mak

3 · Rohin Shah · 13h: I know at least one person who works on long-term AI risk who I am confident really does assign this high a probability to the questions as asked. I don't know if this person responded to the survey, but still, I expect that the people who gave those answers really did mean them.

I've been thinking about situations where alignment fails because "predict what a human would say" (or more generally "game the loss function," what I call the instrumental policy) is easier to learn than "answer questions honestly" (overview).

One way to avoid this situation is to avoid telling our agents too much about what humans are like, or to hide some details of the training process, so that they can't easily predict humans and so are encouraged to fall back to "answer questions honestly." (This feels closely related to the general phenomena discussed in Thoughts on Human Models.)

Setting aside other reservations about this approach, could it resolve our problem?

  • One way to get the instrumental policy is to "reuse" a human model to answer questions (discussed here). If our AI has