In many ways, this post is frustrating to read. It isn't straigthforward, it needlessly insults people, and it mixes irrelevant details with the key ideas.

And yet, as with many of Eliezer's post, its key points are right.

What this post does is uncover the main epistemological mistakes made by almost everyone trying their hands at figuring out timelines. Among others, there is:

Taking arbitrary guesses within a set of options that you don't have enough evidence to separate
Piling on arbitrary assumption on arbitraty assumption, leading to completely uninformative outputs
Comparing biological processes to human engineering in term of speed, without noticing that the optimization path is the key variable (and the big uncertainty)
Forcing the prediction to fit within a massively limited set of distributions, biasing it towards easy to think about distributions rather than representative ones.

Before reading this post I was already dubious of most timeline work, but this crystallized many of my objections and issues with this line of work.

So I got a lot out of this post. And I expect that many people would if they spent the time I took to analyze it in detail. But I don't expect most people to do so, and so am ambivalent on whether this post should be included in the final selection.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Adam Shimi2y20

I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone replacing the job, automated regulations replacing banks when they can't move fast enough). Which neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all factors making economic actors and market less than efficient.

I expect that most of these factors could be addressed with more work on the scenarios.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Adam Shimi2y70Review for 2021 Review

I consider this post as one of the most important ever written on issues of timelines and AI doom scenario. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent to take over the whole world. It is not about agency.

What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problem still crop up!

This insight was already present in Drexler's work, but however insightful Eric is in person, CAIS is completely unreadable and so no one cared. But this post is well written. Not perfectly once again, but it gives short, somewhat minimal proofs of concept for this structural perspective on alignment. And it also managed to tie alignment with key ideas in sociology, opening ways for interdisciplinarity.

I have made every person I have ever mentored on alignment study this post. And I plan to continue doing so. Despite the fact that I'm unconvinced by most timeline and AI risk scenarios post. That's how good and important it is.

Applications for Deconfusing Goal-Directedness

Adam Shimi2y20

What are you particularly interested in? I expect I could probably write it with a bit of rereading.

Methodological Therapy: An Agenda For Tackling Research Bottlenecks

Adam Shimi2y20

Thanks for your comment!

Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.

The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe just you're in a different phase of the problem.

Do you think there is still a disagreement after this clarification?

The First Filter

Adam Shimi2y20

In a limited context, the first example that comes to me is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however it pains them.

What I Learned Running Refine

Adam Shimi2y30

Thanks for the kind words!

I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.

Don't align agents to evaluations of plans

Adam Shimi2y20

I interpret Alex as making an argument such that there is not just two vs one difficulties, but an additional difficulty. From this perspective, having two will be more of an issue than one, because you have to address strictly more things.

This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is and the one for the grader optimization is $d_{i n s t i l l a t i o n} + d_{e v a l u a t i o n}$ , then there's no debate that the latter is bigger than the former.

But if you allow directionality (even in one dimension), then there's the risk that the sum leads to less difficulty in total (by having the $d_{e v a l u a t i o n}$ move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don't see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.

Don't align agents to evaluations of plans

Adam Shimi2y20

Thanks for taking time to answer my questions in detail!

About your example for other failure modes

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do? How is it supposed to get better reward from confusion only? If it has another agenda (making paperclips instead of diamonds for example), then the obfuscation is clearly valuable to allow it to work on its main goal. But here its goal is to improve evaluation, and so confusion doesn't seem like it helps.

About the cost/competitiveness argument

I think that a values-executing AGI can also search over as many plans which actually make sense, I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

This is the part where I currently cannot reconstruct your model (maybe because I haven't read shard theory in a while). Because from an abstract perspective, the trick part of starting from significantly more limited plans is how to expand the range of plans without allowing adversarial ones. And I don't have a model of what mechanism you think make it easier to go from the bottom up (from few plans accepted to more plans accepted) safely rather than from the top down (from all plans accepted to less plans accepted) safely.

About the "negative result" of not even trying to do grader optimization

I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide.
See the last 10+years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

But why don't you see that it should exist? What are the parts of your model that make this prediction? Because by definition, there is not a single innovation or progress that most people expected to exist. That's what it means to figure out something new. So you need an actual reason for why you're not falling into that trap of only being able to imagine what already exists.

Here are some guesses:

Specific scientific claim that there should be no true name/clean concept of what is adversarial for a large class of cognitive architectures (in opposition to Natural Abstraction Hypothesis I guess?)
Human limitation claim that even though there might be such a true name/clear concept, we have neither the time not the ability to get it right under the current conditions
Security mindset claim that because the actor is allowed to get arbitrary smart, you should expect that if it optimizes against you, it wins.

For the second point, what do you make of the claim that there are ridiculously few researchers who even attempted to work on the problem in the last 10 years, compared to almost all easier fields of knowledge? Do you think that this should not alter the evidence of absence you get from the absence of evidence?

Corrigibility Via Thought-Process Deference

Adam Shimi2y30

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternatives approaches:

You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require this.
You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
You need to figure out the right translation to bootstrap it, and there seem to be risks if you get it wrong.
You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong.

Can you help me understand why you think that these strong requirements nonethless are simpler than most versions or approaches of the problem that you know about?