Raymond Arnold

I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.


AMA: Paul Christiano, alignment researcher

Curated. I don't think we've curated an AMA before, and I'm not sure if I have a principled opinion on doing that, but this post seems chock full of small useful insights, and fragments of ideas that seem like they might otherwise take a while to get written up more comprehensively, which I think is good.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Curated. I appreciated this post for a combination of:

  • laying out several concrete stories about how AI could lead to human extinction
  • laying out a frame for how to think about those stories (while acknowledging other frames one could apply to the story)
  • linking to a variety of research, with more thoughts on what sort of further research might be helpful.

I also wanted to highlight this section:

Finally, should also mention that I agree with Tom Dietterich’s view (dietterich2019robust) that we should make AI safer to society by learning from high-reliability organizations (HROs), such as those studied by social scientists Karlene Roberts, Gene Rochlin, and Todd LaPorte (roberts1989research, roberts1989new, roberts1994decision, roberts2001systems, rochlin1987self, laporte1991working, laporte1996high).  HROs have a lot of beneficial agent-agnostic human-implemented processes and control loops that keep them operating.  Again, Dietterich himself is not as yet a proponent of existential safety concerns, however, to me this does not detract from the correctness of his perspective on learning from the HRO framework to make AI safer.

Which is a thing I think I once heard Critch talk about, but which I don't think had been discussed much on LessWrong, and which I'd be interested in seeing more thoughts and distillation of.

Another (outer) alignment failure story

There's a lot of intellectual meat in this story that's interesting. But, my first comment was: "I'm finding myself surprisingly impressed about some aesthetic/stylistic choices here, which I'm surprised I haven't seen before in AI Takeoff Fiction."

In normal English phrasing across multiple paragraphs, there's a sort of rise-and-fall of tension. You establish a minor conflict, confusion, or an open loop of curiosity, and then something happens that resolves it a bit. This isn't just about the content of 'what happens', but also what sort of phrasing one uses. In verbal audio storytelling, this is often accompanied by the pitch of your voice rising and falling.

And this story... even more so than Accelerando or other similar works, somehow gave me this consistent metaphorical vibe of "rising pitch". Like some club music where it keeps sounding like the bass is about to drop, but instead it just keeps rising and rising. Something about most of the paragraph structures feels like they're supposed to be the first half of a two-paragraph-long clause, and then instead... another first half of a clause happens, and another.

And this was incredibly appropriate for what the story was trying to do. I dunno how intentional any of that was but I quite appreciated it, and am kinda in awe and boggled at what precisely created the effect – I don't think I'd be able to do it on purpose myself without a lot of study and thought.

How do we prepare for final crunch time?


I found this a surprisingly obvious set of strategic considerations (and meta-considerations), that for some reason I'd never seen anyone actually attempt to tackle before.

I found the notion of practicing "no cost too large" periods quite interesting. I'm somewhat intimidated by the prospect of trying it out, but it does seem like a good idea.

How do we prepare for final crunch time?

Seems true, but also didn't seem to be what this post was about?

Epistemological Framing for AI Alignment Research

On the meta-side: an update I made writing this comment is that inline-google-doc-style commenting is pretty important. It allows you to tag a specific part of the post and say "hey, this seems wrong/confused" without making that big a deal about it, whereas when writing a LW comment you sort of have to establish the context, which intrinsically means making it into A Thing.

Epistemological Framing for AI Alignment Research

(I tried writing up comments here as if I were commenting on a google doc, rather than a LW post, as part of an experiment I had talked about with AdamShimi. I found that actually it was fairly hard – both because I couldn't make quick comments on a given section without it feeling like a bigger deal than I meant it to be, and also because the overall thing came out more critical-feeling than feels right on a public post. This is ironic since I was the one who told Adam "I bet if you just ask people to comment on it as if it's a google doc it'll go fine.")

((I started writing this before the exchange between Adam and Daniel and am not sure if it's redundant with that))

I think my primary comment (perhaps similar to Daniel's) is that it's not clear to me that this post is arguing against a position that anyone holds. What I hear lots of people saying is "AI Alignment is preparadigmatic, and that makes it confusing and hard to navigate", which is fairly different from the particular solution of "and we should get to a point of paradigmaticness soon". I don't know of anyone who seems to me to be pushing for that.

(fake edit: Adam says in another comment that these arguments have come up in person, not online. That's annoying to deal with for sure. I'd still put some odds on there being some kind of miscommunication going on, where Adam's interlocutor says something like "it's annoying/bad that alignment is pre-paradigmatic", and Adam assumed that meant "we should get it to the paradigmatic stage ASAP". But it's hard to have a clear opinion here without knowing more about the side-channel arguments. In the world where there are people saying in private "we should get Alignment to a paradigmatic state", I think it might be good to have some context-setting in the post explaining that.)


I do agree with the stated problem of "in order to know how to respond to a given Alignment post, I kinda need to know who wrote it and what sort of other things they think about, and that's annoying." But I'm not sure what to do with that.

The two solutions I can think of are basically "Have the people with established paradigms spend more time clarifying their frame" (for the benefit of people who don't already know), and "have new up-and-coming people put in more time clarifying their frame" (also for the benefit of those who don't already know, but in this case many fewer people know them so it's more obviously worth it). 

Something about this feels bad to me because I think most attempts to do it will create a bunch of boilerplate cruft that makes the posts harder to read. (This might be a UI problem that LessWrong should try to fix.)

I read the OP as suggesting

  • a meta-level strategy of "have a shared frame for framing things, which we all pay a one-time cost of getting up to speed with"
  • a specific shared-frame of "1. Defining the terms of the problem, 2. Exploring these definitions, 3. Solving the now well-defined problem."

I think these things are... plausibly fine, but I don't know how strongly to endorse them because I feel like I haven't gotten to see much of the space of possible alternative shared-frames.

The case for aligning narrowly superhuman models

I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis).

adamShimi's Shortform

I think there are a number of features LW could build to improve this situation, but first I'm curious for more detail on "what feels wrong about explicitly asking individuals for feedback after posting on AF", similar to how you might ask for feedback on a gDoc?
