All of Ben Pace's Comments + Replies

Where are intentions to be found?

This reminds me that it's hard for me to say where "I" am, in both space and time.

I read a story recently (which I'm going to butcher because I don't remember the URL), about a great scientist who pulled a joke: after he died, his wife had a seance or used a ouija board or something, which told her to look at the first sentence of the 50th page of his book, and the first sentence was "<The author> loved to find creative ways to communicate with people."

After people die, their belongings and home often contain an essence of 'them'. I think that some p... (read more)

2Alex Flint1dYes, I agree. I once stayed in Andrew Critch's room for a few weeks while he was out of town. I felt that I was learning from him in his absence because he had all these systems and tools and ways that things were organized. I described it at the time as "living inside Critch's brain for two weeks", which was a great experience. Thanks Critch!
Another (outer) alignment failure story

Curated. This was perhaps the most detailed and informative story I've read about how failure will go down. As you say at the start, it's making several key assumptions; it's not your 'mainline' failure story. Thanks for making the assumptions explicit, and for discussing how to vary them at the end. I'd like to see more people write stories under different assumptions.

The sorts of stories Eliezer has told in the past have focused on 10-1000x faster takeoffs than discussed here, so those stories are less extended (you kinda just wake up one day then everyo... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts to form the topic of a new post.

Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).

My Current Take on Counterfactuals

I've felt like the problem of counterfactuals is "mostly settled" for about a year, but I don't think I've really communicated this online.

Wow that's exciting! Very interesting that you think that.

3Abram Demski9dNow I feel like I should have phrased it more modestly, since it's really "settled modulo math working out", even though I feel fairly confident some version of the math should work out.
Reflective Bayesianism

The rules say we must use consequentialism, but good people are deontologists, and virtue ethics is what actually works.

—Eliezer Yudkowsky, Twitter

Disentangling Corrigibility: 2015-2021

You're welcome. Yeah "invented the concept" and "named the concept" are different (and both important!).

Disentangling Corrigibility: 2015-2021

Here it is:

Rob Miles (May 2014):

Ok, I've given this some thought, and I'd call it:

"Corrigible Reasoning"

using the definition of corrigible as "capable of being corrected, rectified, or reformed". (And of course AIs that don't meet this criterion are "Incorrigible")

Thank you very much!  It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.

E.g., I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.

Disentangling Corrigibility: 2015-2021

I'm 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I'll have a look now to find it and report back.

Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post on corrigibility in 2017.

3Robert Miles21dNote that the way Paul phrases it in that post is much clearer and more accurate: > "I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles"

Here it is:

Rob Miles (May 2014):

Ok, I've given this some thought, and I'd call it:

"Corrigible Reasoning"

using the definition of corrigible as "capable of being corrected, rectified, or reformed". (And of course AIs that don't meet this criterion are "Incorrigible")

My research methodology

Curated. This post gives me a lot of context on your prior writing (unaligned benchmark, strategy stealing assumption, iterated amplification, imitative generalization), it helps me understand your key intuitions behind the plausibility of alignment, and it helps me understand where your research is headed. 

When I read Embedded Agency, I felt like I then knew how to think productively about the main problems MIRI is working on by myself. This post leaves me feeling similarly about the problems you've been working on for the past 6+ years.

So thanks for... (read more)

My research methodology

This post gives great insight into your research methodology, thanks for writing it.

After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the empirical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.

You contrast ‘applied’ and ‘empirical’ here, but they sound the same to me. Is it a typo and you meant ‘applied’ and ‘theoretical’? That would make sense to me.

3Paul Christiano1moYeah, thanks for catching that.
The case for aligning narrowly superhuman models

This was a very solid post and I've curated it. Here are some of the reasons:

  • I think that the post is a far more careful analysis of questions around what research to do, what research is scalable, and what the potential negative effects are, than almost any other proposal I've seen, whilst also containing clear ideas and practical recommendations. (Many posts that optimize for this level of carefulness end up not saying much at all, or at least little of any practical utility, yet this post says quite a lot of interesting things that are practically useful
... (read more)
Full-time AGI Safety!

Woop! I'm very pleased for you. I occasionally plot about how to help people (including you-in-particular) to do this, and I'm very pleased that it has happened without my intervention! Beth continues to do good things; my appreciation to her.

I look forward greatly to your proportional increase in insights being posted to LessWrong :D

Risks from Learned Optimization: Introduction

For me, this is the paper where I learned to connect ideas about delegation to machine learning. The paper sets up simple ideas of mesa-optimizers, and shows a number of constraints and variables that will determine how the mesa-optimizers will be developed – in some environments you want to do a lot of thinking in advance then delegate execution of a very simple algorithm to do your work (e.g. this simple algorithm Critch developed that my group house uses to decide on the rent for each room), and in some environments you want to do a little thinking and ... (read more)

Review of 'Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More'

Yeah I agree. I think it's useful to have a public record of it, and I'm glad that public conversation happened, but I don't think it's an important part of the ongoing conversation in the rationality community, and the conversation wasn't especially insightful.

I hope some day we'll have better debates with more resources devoted by either side than a FB comment thread, and perhaps one day that will be good for the review.

Utility ≠ Reward

For another datapoint, I'll mention that I didn't read this post nor Gradient Hacking at the time, I read the sequence, and I found that to be pretty enlightening and quite readable.

2020 AI Alignment Literature Review and Charity Comparison

hurrah! victory for larks, with yet another comprehensive review! how long can he keep it up? another decade? i hope so!

(Also I had 3 laugh-out-loud moments. I will let the studious reader find all your hidden jokes.)

Gradient hacking

This is one of the scarier posts I've read on LW. I feel kinda freaked out by this post. It's an important technical idea.

Six AI Risk/Strategy Ideas

The first three examples here have been pretty helpful to me in considering how DSAs and takeoffs will go and why they may be dangerous.

AGI will drastically increase economies of scale

Seems like an important consideration, and explained concisely.

Alignment Research Field Guide

I'm currently feeling confused about whether this is the right type signature for the review, but it is a truly excellent guide to discovering new ideas together with others, better than anything I've read in its reference class.

Classifying specification problems as variants of Goodhart's Law

I like the thing this post does, and I like the diagrams. I'd like to see this reviewed and voted on.

Utility ≠ Reward

I nominate this alongside the sequence, as a less formal explanation of the core ideas. I can imagine this essay being the more widely read and intuitive one.

The strategy-stealing assumption

I found this an interesting analysis, and would like to see it reviewed.

The Credit Assignment Problem

I found this a very interesting frame on things, and am glad I read it.

Selection vs Control

This felt to me like an important distinction to think about when thinking about optimization.

Risks from Learned Optimization: Introduction

I know it’s already been nominated twice, but I still want to nominate it again. This sequence (I’m nominating the sequence) helped me think clearly about optimization, and how delegation works between an optimizer and mesa-optimizer, and what constraints lie between them (e.g. when does an optimizer want a system it’s developing to do optimization?). Changed a lot of the basic ways in which I think about optimization and AI. 

Draft papers for REALab and Decoupled Approval on tampering

PSA: You can write comments on PDFs in Google Drive!

There's a button in the top right that says "Add a comment" on hover-over, then you get to click-and-drag to highlight a box in the PDF where your comment goes. I will leave a test comment on the first PDF so everyone can see that.

(I literally just found this out.)

Introduction to Cartesian Frames


I'm exceedingly excited about this sequence. The Embedded Agency sequence laid out a core set of confusions, and it seems like this is a formal system that deals with those issues far better than the current alternatives e.g. the cybernetics model. This post lays out the basics of Cartesian Frames clearly and communicates key parts of the overall approach ("reasoning like Pearl's to objects like game theory's, with a motivation like Hutter's"). I've also never seen math explained with as much helpful philosophical justification (e.g. "Part of the p... (read more)

The Solomonoff Prior is Malign

+1 I already said I liked it, but this post is great and will immediately be the standard resource on this topic. Thank you so much.

The Solomonoff Prior is Malign

Such a great post.

Note that I changed the formatting of your headers a bit, to make some of them just bold text. They still appear in the ToC just fine. Let me know if you'd like me to revert it or have any other issues.

AGI safety from first principles: Introduction

Oli suggests that there are no fields with three-word names, and so "AI Existential Risk" is not a choice. I think "AI Alignment" is the currently most accurate name for the field that encompasses work like Paul's and Vanessa's and Scott/Abram's and so on. I think "AI Alignment From First Principles" is probably a good name for the sequence.

AGI safety from first principles: Introduction

It seems a definite improvement on the axis of specificity, I do prefer it over the status quo for that reason.

But it doesn't address the problem of scope-sensitivity. I don't think this sequence is about preventing medium-sized failures from AGI. It's about preventing extinction-level risks to our future.

"A First-Principles Explanation of the Extinction-Level Threat of AGI: Introduction"

"The AGI Extinction Threat from First Principles: Introduction"

"AGI Extinction From First Principles: Introduction"

2Richard Ngo6moYeah, I agree that's a problem. But I don't think it's a big problem, because who's talking about medium-size risks from AGI? In particular, the flag I want to plant is something like: "when you're talking about AGI, it's going to be So Big that existential safety is the default type of safety to be concerned with." Also I think having the big EXTINCTION in the title costs weirdness points, because even within the field people don't use that word very much. So I'm leaning towards AGI safety.
AGI safety from first principles: Introduction

Critch recently made the argument (and wrote it in his ARCHES paper, summarized by Rohin here) that "AI safety" is a straightforwardly misleading name because "safety" is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: "the term “AI safety” should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope".) I wanted to raise the idea here and ask Richard what he thin... (read more)

1Richard Ngo6moYeah, this seems like a reasonable point. But I'm not that much of a fan of the alternatives you suggest. What do you think about "AGI safety"?
My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

I still don't understand how corrigibility and intent alignment are different.


AI Research Considerations for Human Existential Safety (ARCHES)

I listened to this yesterday! Was quite interesting, I'm glad I listened to it.

Draft report on AI timelines

I expect the examples Ajeya has in mind are more like sharing one-line summaries in places that tend to be positively selected for virality and anti-selected for nuance (like tweets), but that substantive engagement by individuals here or in longer posts will be much appreciated.

1Ajeya Cotra7moThanks Ben, this is right!
Radical Probabilism

Thank you, they were all helpful. I'll write more if I have more questions.

("sadly that's unprobable to work" lol)

Radical Probabilism

Thank you, those points all helped a bunch. 

(I feel most resolved on the calibration one. If I think more about the other two and have more questions, I'll come back and write them.)

Radical Probabilism

I made notes while reading about things that I was confused about or that stood out to me. Here they are:

  • The post says that radical probabilism rejects #3-#5, but also that Jeffrey updating is derived from having rigidity (#5), which sounds like a contradiction. (I feel most dumb about this bullet, it's probably obvious.)
  • The convergence section blew me away. The dialogue here correctly understood my confusion (why would I only believe either h(1/3) or h(2/3)) and then hit me with the 'embedded world models' point, and that was so persuasive. This felt rea
... (read more)
5Abram Demski8moHmmmm. My attempt at an English translation of my example: A and B are correlated, so moving B to 60% (up from 50%) makes A more probable as well. But then moving A up to 60% is less of a move for A. This means that (A&¬B) ends up smaller than (B&¬A): both get dragged up and then down, but (B&¬A) was dragged up by the larger update and down by the smaller.

It would be nice to write a whole post on this, but the first thing you need to do is distinguish between likelihoods and probabilities: likelihood(A|B) = probability(B|A). The notation may look pointless at first. The main usage has to do with the way we usually regard the first argument as variable and the second as fixed. IE, "a probability function sums to one" can be understood as P(A|B)+P(¬A|B)=1; we more readily think of A as variable here. In a Bayesian update, we vary the hypothesis, not the evidence, so it's more natural to think in terms of a likelihood function, L(H|E). In a Bayesian network, you propagate probability functions down links, and likelihood functions up links. Hence Pearl distinguished between the two strongly.

Likelihood functions don't sum to 1. Think of them as fragments of belief which aren't meaningful on their own until they're combined with a probability. Base-rate neglect can be thought of as confusion of likelihood for probability. The conjunction fallacy could also be explained in this way. I wish it were feasible to get people to use "likely" vs "probable" in this way. Sadly, that's unprobable to work.

What I'm pointing at is really much more outside-view than that. Standard warnings about outside view apply. ;p

An example of exchanging probabilities is: I assert X, and another person agrees. I now know that they assign a high probability to X. But that does not tell me very much about how to update. Exchanging likelihoods instead: I assert X, and the other person tells me they already thought that for unrelated reasons. This tells me that their agreement is further evid
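The likelihood/probability distinction above can be sketched numerically. This is a minimal illustration with made-up numbers (the hypotheses, priors, and coin biases are all assumptions for the example, not from the comment):

```python
# Two hypotheses about a coin: fair vs. biased toward heads.
p_heads = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.9, "biased": 0.1}

# We observe one flip: heads. The likelihood of each hypothesis given
# the evidence is L(H|E) = P(E|H) -- here, each hypothesis's P(heads).
likelihood = {h: p_heads[h] for h in p_heads}
print(round(sum(likelihood.values()), 3))  # 1.3 -- likelihoods need not sum to 1

# A likelihood is only a "fragment of belief": it becomes a probability
# once combined with a prior via Bayes' rule, posterior(H) ∝ prior(H) * L(H|E).
unnorm = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnorm.values())
posterior = {h: unnorm[h] / z for h in unnorm}
print(round(sum(posterior.values()), 3))  # 1.0 -- probabilities do sum to 1

# Base-rate neglect = reading the likelihoods (0.8 vs 0.5) as if they were
# already posteriors, ignoring the 9:1 prior; with the prior, "fair" wins.
print(posterior["fair"] > posterior["biased"])  # True
```

The point of the sketch: the likelihood column alone favors "biased", but it isn't a belief until it's multiplied into a prior and normalized.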
6Abram Demski8moJeffrey doesn't see Jeffrey updates as normative! Like Bayesian updates, they're merely one possible way to update. This is also part of why Pearl sounds like a critic of Jeffrey when in fact the two largely agree -- you have to realize that Jeffrey isn't advocating Jeffrey updating in a strong way, only using it as a kind of gateway drug to the more general fluid updates.

Hmm. It seems like a proper reply to this would be to step through the argument more carefully -- maybe later? But no, the argument doesn't require either of those. It requires only that you have some expectation about your update, and the bookie knows what that is (which is pretty standard, because in dutch book arguments the bookies generally have access to your beliefs). You might have a very broad distribution over your possible updates, but there will still be an expected value, which is what's used in the argument.

Like convergence, this is dependent on the prior, so I can't say that classical Bayesians are never calibrated (although one could potentially prove some pretty strong negative results, as is the case with convergence?). I didn't really include any argument, I just stated it as a fact. What I can say is that classical Bayesianism doesn't give you tools for getting calibrated. How do you construct a prior so that it'll have a calibration property wrt learning? Classical Bayesianism doesn't, to my knowledge, talk about this. Hence, by default, I expect most priors to be miscalibrated in practice when grain-of-truth (realizability) doesn't hold.

For example, I'm not sure whether Solomonoff induction has a calibration property -- nor whether it has a convergence property. These strike me as mathematically complex questions. What I do know is that the usual path to prove nice properties for Solomonoff induction doesn't let you prove either of these things. (IE, we can't just say "there's a program in the mixture that's calibrated/convergent, so....") ... whereas logical induction l
Radical Probabilism

Sh*t. Wow. This is really impressive. 

Speaking for myself, this (combined with your orthodox case against utility functions) feels like the next biggest step for me since Embedded Agency in understanding what's wrong with our models of agency and how to improve them.

If I were to put it into words, I'm getting a strong vibe of "No really, you're starting the game inside the universe, stop assuming you've got all the hypotheses in your head and that you've got clean input-output, you need far fewer assumptions if you're going to get around this space at... (read more)

Forecasting Thread: AI Timelines

(I can't see your distribution in your image.)

Forecasting Thread: AI Timelines

For example, a main consideration of my prediction is the heuristic: with 50% probability, things will last twice as long as they already have, with a starting time of 1956, the year of the Dartmouth College summer AI conference.

A counter hypothesis I’ve heard (not original to me) is: With 50% probability, we will be half-way through the AI researcher-years required to get AGI.

I think this suggests much shorter timelines, as most researchers have been doing research in the last ~10 years.

It's not clear to me what reference class makes sense he... (read more)
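The two heuristics above can be contrasted with a toy calculation. All numbers here are illustrative assumptions (in particular the 1.2x annual growth in researchers), not claims about the actual field:

```python
FIELD_START = 1956  # Dartmouth conference
NOW = 2021
GROWTH = 1.2  # assumed annual growth factor in active researchers

# Calendar-time heuristic: with 50% probability the field lasts twice as
# long as it already has, so the median arrival is as far ahead as
# Dartmouth is behind.
calendar_median = NOW + (NOW - FIELD_START)

# Researcher-years heuristic: with 50% probability we are halfway through
# the cumulative researcher-years needed, so find the year by which the
# cumulative total doubles. Under exponential growth that happens fast.
total_so_far = sum(GROWTH ** (y - FIELD_START) for y in range(FIELD_START, NOW + 1))

cumulative = total_so_far
y = NOW
while cumulative < 2 * total_so_far:
    y += 1
    cumulative += GROWTH ** (y - FIELD_START)

print(calendar_median)  # 2086
print(y)                # 2025 with these toy numbers
```

The qualitative point survives any particular growth rate: counting researcher-years instead of calendar years front-loads the median dramatically, because most researcher-years are recent.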

I was looking at the NIPS growth numbers last June and I made a joke:

AI researcher anthropics: 'researchers [should] tend to think AI is ~20 years away because given exponential growth of researchers & careers of ~30 years, the final generation of researchers will make up a majority of all researchers, hence, by SSA+Outside View, one must assume 20 years.'

(Of course, I'm making a rather carbon-chauvinistic assumption here that it's only human researchers/researcher-years which matter.)

Forecasting Thread: AI Timelines

Comment here if you have technical issues with the Elicit tool, with putting images in your comments, or with anything else.

Forecasting Thread: AI Timelines

Here's my quick forecast, to get things going. Probably if anyone asks me questions about it I'll realise I'm embarrassed by it and change it.


It has three buckets:

10%: We get to AGI with the current paradigm relatively quickly without major bumps.

60%: We get to it eventually sometime in the next ~50 years.

30%: We manage to move into a stable state where nobody can unilaterally build an AGI, then we focus on alignment for as long as it takes before we build it.

2nd attempt

Adele Lopez is right that 30% is super optimistic. Also I accidentally put a bunch... (read more)

Search versus design

He can be vice president.

Search versus design

Then I will prepare for combat.

1Ruben Bloom8moHonestly, Pace and Pence should team up to make a super team. Nominative similarity ought to be a Schelling feature for coordination.
2Alex Flint8moAnd thus the wheel of the Dharma was set in motion once again, for one more great turning of time
Search versus design

This write-up benefited from feedback from ...Ben Pence.

Did I give you feedback on this writeup? Or do I have a dark arch-nemesis out there that someday I will need to fight?

1Alex Flint8moAh this is a different Ben.
Load More