Ben Pace

I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.

Longer bio:


AI Alignment Writing Day 2019
AI Alignment Writing Day 2018


My Current Take on Counterfactuals

I've felt like the problem of counterfactuals is "mostly settled" for about a year, but I don't think I've really communicated this online.

Wow that's exciting! Very interesting that you think that.

Reflective Bayesianism

The rules say we must use consequentialism, but good people are deontologists, and virtue ethics is what actually works.

—Eliezer Yudkowsky, Twitter

Disentangling Corrigibility: 2015-2021

You're welcome. Yeah "invented the concept" and "named the concept" are different (and both important!).

Disentangling Corrigibility: 2015-2021

Here it is:

Rob Miles (May 2014):

Ok, I've given this some thought, and I'd call it:

"Corrigible Reasoning"

using the definition of corrigible as "capable of being corrected, rectified, or reformed". (And of course AIs that don't meet this criterion are "Incorrigible")

Disentangling Corrigibility: 2015-2021

I'm 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I'll have a look now to find it and report back.

Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post on corrigibility in 2017.

My research methodology

Curated. This post gives me a lot of context on your prior writing (unaligned benchmark, strategy stealing assumption, iterated amplification, imitative generalization), it helps me understand your key intuitions behind the plausibility of alignment, and it helps me understand where your research is headed. 

When I read Embedded Agency, I felt like I then knew how to think productively about the main problems MIRI is working on by myself. This post leaves me feeling similarly about the problems you've been working on for the past 6+ years.

So thanks for that.

I'd like to read a version of this post where each example is 10x the length and analyzes it more thoroughly... I could just read all of your previous posts on each subject, though they're fairly technical. (Perhaps Mark Xu will write such a post, he did a nice job previously on Solomonoff Induction.)

I'd also be pretty interested in people writing more posts presenting arguments for/against plausible stories for the failure of Imitative Generalization, or fleshing out the details of a plausible story such that we can see more clearly if the story is indeed plausible. Basically, making contributions in the ways you outline.

Aside: Since the post was initially published, some of the heading formatting was lost in an edit, so I fixed that before curating it.

Edit: Removed the line "After reading it I have a substantially higher probability of us solving the alignment problem." Understanding Paul's research is a big positive, but I'm not actually sure I stand by it leading to a straightforward change in my probability.

My research methodology

This post gives great insight into your research methodology, thanks for writing it.

After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the empirical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.

You contrast ‘applied’ and ‘empirical’ here, but they sound the same to me. Is it a typo and you meant ‘applied’ and ‘theoretical’? That would make sense to me.

The case for aligning narrowly superhuman models

This was a very solid post and I've curated it. Here are some of the reasons:

  • I think that the post is a far more careful analysis of questions around what research to do, what research is scalable, and what are the potential negative effects, than most any other proposals I've seen, whilst also containing clear ideas and practical recommendations. (Many posts that optimize for this level of carefulness end up not saying much at all, or at least little of any practical utility, yet this post says quite a lot of interesting things that are practically useful.) There is a lot of valuable advice, not merely to try to help make narrow superhuman models useful, but how to do it in a way that is helpful for alignment. The section "What kind of projects do and don't "count"" is really helpful here.
  • I appreciate the efforts that Ajeya has made to understand and build consensus around these ideas, talking to people at various orgs (OpenAI, MIRI, more), and this again makes me feel more confident signal-boosting it, given that it contains information about many others' perspectives on the topic. And more broadly, the whole "Objections and responses" section felt like it did a great job at perspective-taking on others' concerns and addressing them head on.
  • I like a lot of the discussion around this post, both in the comments section here and also in the comments by Eliezer and Evan in Robby's post. (I recommend everyone who reads the OP also reads the discussion in the linked post.)

The main hesitation I have around curating this is that I'm kind of scared of all recommendations for "interesting ideas for using machine learning that might be really important for AGI". This part of me feels like everything has the potential to be used for capabilities, and that someone pursuing this line of work may do so quite inadvertently (and it will not be possible to "put the ball back in the urn"), or just end up proving their competence at building scalably useful models and then getting hired by a research lab to do work that will give them lots of money to do stuff with big models.

I am scared about most everything in this space, yet I don't think I endorse "no action", and this does seem to me like one of the most promising and careful posts and approaches. For me the most cruxy sections were "Isn't this not neglected because lots of people want useful AI" and "Will this cause harm by increasing investment in scaling AI?". For the first, I think if I believed the claims Ajeya makes as strongly as I think Ajeya does, I'd feel notably more relieved overall about encouraging this kind of work. For the second I didn't feel persuaded by the arguments. I think that there are few people who are respected remotely on these sorts of questions or who are thinking strategically about them (especially in public). I think the x-risk and alignment communities have in the past had 100:1 outsized impact with actions taken and research directions pursued.

In sum, I continue to be generally terrified, but this was an excellent post and one of the relatively least terrifying things. Thank you very much for the post.

Full-time AGI Safety!

Woop! I'm very pleased for you. I occasionally plot about how to help people (including you-in-particular) to do this, and I'm very pleased that it has happened without my intervention! Beth continues to do good things, my appreciation to her.

I look forward greatly to your proportional increase in insights being posted to LessWrong :D

Risks from Learned Optimization: Introduction

For me, this is the paper where I learned to connect ideas about delegation to machine learning. The paper sets up simple ideas of mesa-optimizers, and shows a number of constraints and variables that will determine how the mesa-optimizers will be developed – in some environments you want to do a lot of thinking in advance then delegate execution of a very simple algorithm to do your work (e.g. this simple algorithm Critch developed that my group house uses to decide on the rent for each room), and in some environments you want to do a little thinking and then delegate to a very complex algorithm to figure out what to do (e.g. evolution is very stupid and then makes very complex brains to figure out what to do in lots of situations that humans encountered in the EEA).

Seeing this more clearly in ML shocked me with the level of inadequacy that ML has for being able to do this with much direction whatsoever. It just doesn't seem like something that we have much control of. Of course I may be wrong, and there are some simple proposals (though they have not worked so far). Nonetheless, it's a substantial step forward in discussing delegation in modern ML systems. It discusses lots of related ideas very clearly.

Definitely should be included in the review. I expect to vote on this with something like +5 to +8.

I don't do research in this area; I expect others like Daniel Filan and Adam Shimi will have more detailed opinions of the sequence's strengths and weaknesses. (Nonetheless I stand by my assessment and will vote accordingly.)
