I've had many conversations over the last few years about the health of the AI Alignment field and one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field. 

I also think there is a bunch of value in better review processes, but have felt hesitant to create something very official and central, since AI Alignment is a quite preparadigmatic field, which makes creating shared standards of quality hard, and because I haven't had the time to really commit to maintaining something great here.

Separately, I am also quite proud of the LessWrong review, and am very happy about the overall institution that we've created there, and I realized that the LessWrong review might just be a good test bed and bandaid for having a better AI Alignment review process. I think the UI we built for it is quite good, and I think the vote does have real stakes and a lot of the people voting are also people quite active in AI Alignment. 

So this year, I would like to encourage many of the people who expressed a need for better review processes in AI Alignment to try reviewing some AI Alignment posts from 2021 as part of the LessWrong review. I got quite a bit of value out of doing that myself: e.g. my review of the MIRI dialogues helped crystallize some helpful new directions for me to work towards. I am also hoping to write a longer review of Eliciting Latent Knowledge, which I think will help clarify some things for me, and which I will feel comfortable linking to later when people ask me about my takes on ELK-adjacent AI Alignment research.

I am also interested in comments on this post with takes on better review processes in AI Alignment. I am currently going through a period where I feel quite confused about how to relate to the field at large, so it might be a good time to also have a conversation about what kind of standards we even want to have in the field.

Current AI Alignment post frontrunners in the review

We've had an initial round of preliminary voting, in which people cast non-binding votes that help prioritize posts during the Review Phase. Among Alignment Forum voters, the top Alignment Forum posts were:

  1. ARC's first technical report: Eliciting Latent Knowledge
  2. What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
  3. Another (outer) alignment failure story
  4. Finite Factored Sets
  5. Ngo and Yudkowsky on alignment difficulty
  6. My research methodology
  7. Fun with +12 OOMs of Compute
  8. The Plan
  9. Comments on Carlsmith's “Is power-seeking AI an existential risk?”
  10. Ngo and Yudkowsky on AI capability gains

There are also a lot of other great alignment posts in the review (a total of 88 posts were nominated), and I do expect things to shift around a bit, but I do think all 10 of these top essays deserve some serious engagement and a relatively in-depth review, since I expect most of them will get read by people for many years to come, and people might be basing new research approaches and directions on them.

To review a post, you can navigate to the post page, and click the "Review" button at the top of the page (just under the post title). It looks like this:



one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field. 

Hmm, I think I've complained a bunch about lots of AI alignment work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy. But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.

Also, when I actually imagine what the reviews would look like, I mostly think of people talking about the same old cruxes and disagreements that change whether or not the work is worth doing at all, rather than actually talking about the details, which is what I would usually find useful about reviews.

(Tbc, it's possible I did express optimism about a review process in conversation with you; my opinions could have changed a bunch. I would be a bit surprised though.)

rather than actually talking about the details, which is what I would usually find useful about reviews.

I'm interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it'd be helpful, but I'm not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job)

In general, I'm legit fairly uncertain whether "effort-reviews" (whether detail-focused or big-picture-focused) are worthwhile. It seems plausible to me that detail-focused reviews are more useful soon after a work is published than 2 years later, and that big-picture reviews are more useful in the "two year retrospective" sense (and maybe we should figure out some way to get detail-oriented reviews done more frequently, faster).

It does seem to me that, by the time a book is being considered for "long-term-valuable", I would like someone, at some point, to have done a detail-oriented review examining all of the fiddly pieces. In some cases, that review has been done before the post was even published, in a private google doc.

A couple of reasons:

  1. It's far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
    1. This doesn't apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
    2. This is similar to the general idea in AI safety via debate -- when you have access to a review you are more like a judge; without a review you are more like the debate opponent.
  2. Having someone else explain the paper from their perspective can surface other ways of thinking about the paper that can help with understanding it.
    1. This sometimes does happen with big-picture reviews, though I think it's less common.

Tbc, I'm not necessarily saying it is worth the opportunity cost of the reviewer's time; I haven't thought much about it.

But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.


For what it's worth, this is also where I'm at on an Alignment Forum review.

I've been trying to articulate some thoughts since Rohin's original comment, and maybe going to just rant-something-out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the deep conceptual disagreements for many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemically wrong or confused or sketchy about LW", but "I don't trust a review process to actually work because I don't believe it'll get better epistemics than what has already been demonstrated" seems a combination of "self-defeatingly wrong" and "also just empirically (probably) wrong".

Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they're frustrated by. 

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful." Assuming that's a correct characterization, I don't necessarily disagree (at least not confidently). But something about the phrasing feels off.

Some reasons it feels off:

  • Even if there are clusters of research that seem too hopeless to be worth engaging with, I'd be very surprised if there weren't at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is "people write reviews of the stuff that feels real/important enough to be worth engaging with", that still seems valuable to me.
  • It seems like people are sort of treating this like a stag-hunt, and it's not worth participating if a bunch of other effort isn't going in. I do think there are network effects that make it more valuable as more people participate. But I also think "people incrementally do more review work each year as it builds momentum" is pretty realistic, and I think individual thoughtful reviews are useful in isolation for building clarity on individual posts.
  • The LessWrong/Alignment Review process is pretty unopinionated at the moment. If you think a particular type of review is more valuable than other types, there's nothing stopping you from doing that type of review.
  • If the highest review-voted work is controversial, I think it's useful for the field orienting to know that it's controversial. It feels pretty reasonable to me to publish an Alignment Forum Journal-ish-thing that includes the top-voted content, with short reviews from other researchers saying "FYI I disagree conceptually here about this post being a good intellectual output"
    • (or, stepping out of the LW-review frame: if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process)
  • I'm skeptical that the actual top-voted posts trigger this reaction. At the time of this post, the top voted posts were:

I do think a proper alignment review should likely have more content that wasn't published on the Alignment Forum. This was technically available this year (we allowed people to submit non-LW content during the nomination phase), but we didn't promote it very heavily and we didn't frame it as a "please submit all Alignment progress you think was particularly noteworthy" to various researchers.

I don't know that the current review process is great, but, again, it's fairly unopinionated and leaves plenty of room to be-the-change-you-want-to-see in the alignment scene meta-reflection.

(aside: I apologize for picking on Rohin and Buck when they bothered to stick their necks out and comment; presumably there are other people who feel similarly who didn't even bother commenting. I appreciate you sharing your take, and if this feels like dragging you into something you don't wanna deal with, no worries. But I think having concrete people/examples is helpful. I also think a lot of what I'm saying applies to people I'd characterize as "in the MIRI camp", who also haven't done much reviewing, although I'd frame my response a bit differently)

for that to fix these problems the reviewers would have to be more epistemically competent than the post authors

I think this is an overstatement. They'd need to notice issues the post authors missed. That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Certainly there's a point below which the signal-to-noise ratio is too low. I agree that high reviewer quality is important.

On the "same old cruxes and disagreements" I imagine you're right - but to me that suggests we need a more effective mechanism to clarify/resolve them (I think you're correct in implying that review is not that mechanism - I don't think academic review achieves this either). It's otherwise unsurprising that they bubble up everywhere.

I don't have any clear sense of the degree of time and effort that has gone into clarifying/resolving such cruxes, and I'm sure it tends to be a frustrating process. However, my guess is that the answer is "nowhere close to enough". Unless researchers have very high confidence that they're on the right side of such disagreements, it seems appropriate to me to spend ~6 months focusing purely on this (of course this would require coordination, and presumably seems wildly impractical).

My sense is that nothing on this scale happens (right?), and that the reasons have more to do with (entirely understandable) impracticality, coordination difficulties and frustration, than with principled epistemics and EV calculations.
But perhaps I'm way off? My apologies if this is one of the same old cruxes and disagreements :).

That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Yes, that's true, I agree my original comment is overstated for this reason. (But it doesn't change my actual prediction about what would happen; I still don't expect reviewers to catch issues.)

My sense is that nothing on this scale happens (right?)

I'd guess that I've spent around 6 months debating these sorts of cruxes and disagreements (though not with a single person of course). I think the main bottleneck is finding avenues that would actually make progress.

Ah, well that's mildly discouraging (encouraging that you've made this scale of effort; discouraging in what it says about the difficulty of progress).

I'd still be interested to know what you'd see as a promising approach here - if such crux resolution were the only problem, and you were able to coordinate things as you wished, what would be a (relatively) promising strategy?
But perhaps you're already pursuing it? I.e. if something like [everyone works on what they see as key problems, increases their own understanding and shares insights] seems most likely to open up paths to progress.

Assuming review wouldn't do much to help on this, have you thought about distributed mechanisms that might? E.g. mapping out core cruxes and linking all available discussions where they seem a fundamental issue (potentially after holding/writing-up a bunch more MIRI Dialogues style interactions [which needn't all involve MIRI]).
Does this kind of thing seem likely to be of little value - e.g. because it ends up clearly highlighting where different intuitions show up, but shedding little light on their roots or potential justification?

I suppose I'd like to know what shape of evidence seems most likely to lead to progress - and whether much/any of it might be unearthed through clarification/distillation/mapping of existing ideas. (where the mapping doesn't require connections that only people with the deepest models will find)

Personally if I were trying to do this I'd probably aim to do a combination of:

  1. Identify what kinds of reasoning people are employing, investigate under what conditions they tend to lead to the truth. E.g. one way that I think I differ from many others is that I am skeptical of analogies as direct evidence about the truth; I see the point of analogies as (a) tools for communicating ideas more effectively and (b) locating hypotheses that you then verify by understanding the underlying mechanism and checking that the mechanism ports (after which you don't need the analogy any more). 
  2. State arguments more precisely and rigorously, to narrow in on more specific claims that people disagree about (note there are a lot of skulls along this road)

FWIW I think a fairly substantial amount of effort has gone into resolving longstanding disagreements. I think that effort has resulted in a lot of good works and updates from many people reading about the disagreement discussion, but has not really changed the minds of the people doing the arguing. (See: the MIRI Dialogues)

And it's totally plausible to me the answer is "10-100x the amount of work that has gone in so far."

I maybe agree that people haven't literally sat and double-cruxed for six months. I don't know that it's fair to describe this as "impracticality, coordination difficulties and frustration" instead of "principled epistemics and EV calculations." Like, if you've done a thing a bunch and it doesn't seem to be working and you feel like you have traction on another thing, it's not crazy to do the other thing.

(That said, I do still have the gut level feeling of 'man it's absolutely bonkers that in the so-called rationality community a lot of prominent thinkers still disagree about such fundamental stuff.')

Oh sure, I certainly don't mean to imply that there's been little effort in absolute terms - I'm very encouraged by the MIRI dialogues, and assume there are a bunch of behind-the-scenes conversations going on.
I also assume that everyone is doing what seems best in good faith, and has potentially high-value demands on their time.

However, given the stakes, I think it's a time for extraordinary efforts - and so I worry that [this isn't the kind of thing that is usually done] is doing too much work.

I think the "principled epistemics and EV calculations" could perfectly well be the explanation, if it were the case that most researchers put around a 1% chance on [Eliezer/Nate/John... are largely correct on the cruxy stuff].

That's not the sense I get - more that many put the odds somewhere around 5% to 25%, but don't believe the arguments are sufficiently crisp to allow productive engagement.

If I'm correct on that (and I may well not be), it does not seem a principled justification for the status quo. Granted the right course isn't obvious - we'd need whoever's on the other side of the double-cruxing to really know their stuff. Perhaps Paul's/Rohin's... time is too valuable for a 6 month cost to pay off. (the more realistic version likely involves not-quite-so-valuable people from each 'side' doing it)

As for "done a thing a bunch and it doesn't seem to be working", what's the prior on [two experts in a field from very different schools of thought talk for about a week and try to reach agreement]? I'm no expert, but I strongly expect that not to work in most cases.

To have a realistic expectation of its working, you'd need to be doing the kinds of things that are highly non-standard. Experts having some discussions over a week is standard. Making it your one focus for 6 months is not. (frankly, I'd be over the moon for the one month version [but again, for all I know this may have been tried])

This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.

Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.