Raymond Arnold

I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.

Comments

The "Commitment Races" problem

Okay, so now having thought about this a bit...

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of these kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finnicky confusing reasons."

Then I realized "holy christ most AGI research isn't built off the agent foundations agenda and people regularly say 'well, MIRI is doing cute math things but I don't see how they're actually relevant to real AGI we're likely to build.'"

Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns. i.e. groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist and even agent-foundations-based)

So, yeah, actually it seems pretty likely that many of the AGIs humans might build would accidentally fall into these traps.

So now I have a vague image in my head of a rewrite of this post that ties together some combo of:

  • The specific concerns noted here
  • The rocket alignment problem framing: "hey man, we really need to make sure we're not fundamentally confused about agency and rationality."
  • Possibly some other specific agent-foundations-esque concerns

Weaving those into a central point of:

"If you're the sort of person who's like 'Why is MIRI even helpful? I get how they might be helpful but they seem more like a weird hail-mary or a 'might as well given that we're not sure what else to do?'... here is a specific problem you might run into if you didn't have a very thorough understanding of robust agency when you built your AGI. This doesn't (necessarily) imply any particular AGI architecture, but if you didn't have a specific plan for how to address these problems, you are probably going to get them wrong by default.

(This post might already exist somewhere, but currently these ideas feel like they just clicked together in my mind in a way they hadn't previously. I don't feel like I have the ability to write up the canonical version of this post, but I feel like "someone with a better understanding of all the underlying principles" should.)

The "Commitment Races" problem

Yeah I'm interested in chatting about this. 

I feel I should disclaim: "much of what I'd have to say about this is a watered-down version of whatever Andrew Critch would say." He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have, I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles, and my understanding of how they scale up from "200 human rationalists" to 1,000-10,000-person coalitions, to All Humanity, to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

The "Commitment Races" problem

I was confused about this post, and... I might have resolved my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought: 

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory is, and commits to that"? I thought that was the point.

I.e., I wouldn't precommit to treating either the Nash Bargaining Solution or the Kalai-Smorodinsky Solution as "the permanent grim-trigger bullying point"; I'd precommit to something like "have a meta-policy of not giving in to bullying, pick my best-guess definition of bullying as my default trigger, and my best-guess grim-trigger response, but include an 'oh shit, I didn't think about X' parameter" (with some conditional commitments thrown in). (A toy numeric comparison of how those two bargaining solutions can disagree is sketched after the list below.)

Where X can't be an arbitrary new belief – the whole point of having a grim trigger clause is to be able to make appropriately weighted threats that AGI-Bob really thinks will happen. But, if I genuinely hadn't thought of the Kalai-Smorodinsky-whatever solution as something an agent might legitimately think was a good coordination tool, I want to be able to say, depending on circumstances:

  1. If the deal hasn't resolved yet: "oh shit, I JUUUST thought of the Kalai-whatever thing, and this means I shouldn't execute my grim-trigger anti-bullying clause without first offering some kind of further clarification step."
  2. If the deal already resolved before I thought of it: "oh shit man, I really should have realized the Kalai-Smorodinsky thing was a legitimate Schelling point and not started defecting hard as punishment. Hey, fellow AGI, would you like me to give you N remorseful utility, in return for which I stop grim-triggering you and you stop retaliating at me and we end the punishment spiral?"
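Here's the toy comparison I mentioned above. It's my own illustrative example, not from the post: the split-a-dollar setup, the linear-vs-square-root utilities, and the numpy grid search are all assumptions I picked for illustration. The point is just that the Nash and Kalai-Smorodinsky solutions recommend visibly different splits even in this simple case, so two agents that each pre-registered "deviations from my solution count as bullying" would each read the other's honest proposal as a defection.

```python
# Toy sketch (my own illustrative setup, not from the post): two bargainers
# split $1, with disagreement point (0, 0). Player 1 has linear utility in
# their share s; player 2 has concave utility sqrt(1 - s). The Nash and
# Kalai-Smorodinsky solutions pick different splits.
import numpy as np

shares = np.linspace(0.0, 1.0, 100_001)   # candidate shares for player 1
u1 = shares                                # player 1: linear utility
u2 = np.sqrt(1.0 - shares)                 # player 2: risk-averse utility

# Nash bargaining solution: maximize the product of gains over disagreement.
nash_share = shares[np.argmax(u1 * u2)]

# Kalai-Smorodinsky solution: the Pareto point where each player's utility is
# the same fraction of their ideal (maximum feasible) utility.
ks_share = shares[np.argmin(np.abs(u1 / u1.max() - u2 / u2.max()))]

print(f"Nash split:              player 1 gets {nash_share:.3f}")  # ~0.667
print(f"Kalai-Smorodinsky split: player 1 gets {ks_share:.3f}")    # ~0.618
```

Roughly a five-percentage-point gap in shares is exactly the kind of small-but-real disagreement that a pre-registered grim trigger could fire on, if the trigger was keyed to only one of the two solutions.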

My second* thought:

Okay. So. I guess that's easy for me to say. But, I guess the whole point of all this updateless decision theory stuff was to actually formalize that, in a way you could robustly program into an AGI that you were about to hand the keys to the universe.

Having a vague handwavy notion of it isn't reassuring enough if you're about to build a god. 

And while it seems to me like this is (relatively) straightforward... do I really want to bet on that?

I guess my implicit assumption was that game theory would turn out to not be that complicated in the grand scheme of things. Surely once you're a Jupiter Brain you'll have it figured out? And, hrmm, maybe that's true, but maybe it's not, or maybe it turns out the fate of the cosmos gets decided by smaller AGIs fighting over Earth with much more limited compute.

Third thought:

Man, just earlier this year, someone offered me a coordination scheme that I didn't understand, and I fucked it up, and the deal fell through because I didn't understand the principles underlying it until just-too-late. (this is an anecdote I plan to write up as a blogpost sometime soon)

And... I guess I'd been implicitly assuming that AGIs would just be able to think fast enough that that wouldn't be a problem.

Like, if you're talking to a used car salesman, and you say "No more than $10,000", and then they say "$12,000 is final offer", and then you turn and walk away, hoping that they'll say "okay, fine, $10,000"... I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.
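As a minimal sketch of that constraint (my own toy framing, with made-up numbers for the deadline and deliberation cost): when the offer expires within a few compute cycles, the only responses actually available are the ones committed to a lookup table ahead of time.

```python
# Toy sketch (hypothetical numbers): under a tight offer deadline, the only
# moves available are those precomputed before the negotiation started;
# "think about it more" isn't an option you can exercise in time.

DELIBERATION_CYCLES = 1_000               # cycles needed to genuinely reconsider the policy
PRECOMPUTED_OFFERS = {10_000: "accept"}   # policy fixed in advance: price -> action

def respond(offer_price: int, deadline_cycles: int) -> str:
    if deadline_cycles >= DELIBERATION_CYCLES:
        return "think it over, then decide"   # generous deadline: deliberation is possible
    # No time to think: fall back on whatever was precomputed.
    return PRECOMPUTED_OFFERS.get(offer_price, "walk away")

print(respond(12_000, deadline_cycles=10))   # -> "walk away"
print(respond(10_000, deadline_cycles=10))   # -> "accept"
```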

That seems to explain why my "Just before deal resolves, realize I screwed up my decision theory" idea doesn't work. 

It seems like my "just after deal resolves and I accidentally grim-trigger, turn around and say 'oh shit, I screwed up, here is a remorse payment + a costly proof that I'm not fudging my decision theory'" idea should still work, though?

I guess in the context of Acausal Trade, I can imagine things like "they only bother running a simulation of you for 100 cycles, and it doesn't matter if on the 101st cycle you realize you made a mistake and are sorry." They'll never know it.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical that a galaxy brain couldn't figure out the right precommitments.

I dunno. Maybe I'm still confused.

But, I wanted to check in on whether I was on the right track in understanding what considerations were at play here.

...

*actually there were like 20 thoughts before I got to the one I've labeled 'first thought' here. But, "first thought that seemed worth writing down."

The Credit Assignment Problem

I think I have juuust enough background to follow the broad strokes of this post, but not quite enough to grok the parts I think Abram was most interested in.

It definitely caused me to think about credit assignment. I actually ended up thinking about it largely through the lens of Moral Mazes (where challenges of credit assignment combine with other forces to create a really bad environment). Re-reading this post, while I don't quite follow everything, I do successfully get a taste of how credit assignment fits into a bunch of different domains.

For the "myopia/partial-agency" aspects of the post, I'm curious how Abram's thinking has changed. This post AFAICT was a sort of "rant". A year after the fact, did the ideas here feel like they panned out?

It does seem like someone should someday write a post about credit assignment that's a bit more general. 

The "Commitment Races" problem

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

Why Subagents?

This post feels like it's probably important, but I don't know that I actually understood it or used it enough to feel right nominating it myself. But I'm bumping it a bit to encourage others to look into it.

Alignment Research Field Guide

This post is a great tutorial on how to run a research group. 

My main complaint about it is that it had the potential to be a way more general post, obviously relevant to anyone building a serious intellectual community, but the framing makes it feel relevant only to Alignment research.

Some AI research areas and their relevance to existential safety

Curated, for several reasons.

I think it's really hard to figure out how to help with beneficial AI. Different career and research paths vary in how likely they are to help, harm, or fit together. I think many prominent thinkers in the AI landscape have developed nuanced takes on how to think about the evolving landscape, but often haven't written up those thoughts.

I like this post for laying out a lot of object-level thoughts about that, for demonstrating a possible framework for organizing those thoughts, and for doing so very comprehensively.

I haven't finished processing all of the object-level points, and am not sure which ones I endorse at this point. But I'm looking forward to debate on the various points here. I'd welcome other thinkers in the AI Existential Safety space writing up similarly comprehensive posts about how they think about all of this.

The Solomonoff Prior is Malign

Curated. This post does a good job of summarizing a lot of complex material, in a (moderately) accessible fashion.

Draft report on AI timelines

I'm assuming part of the point is that the LW crosspost still buries things in a hard-to-navigate Google doc, which prevents it from easily getting cited or going viral, and that Ajeya is asking/hoping for trust that they can get the benefit of some additional review from a wider variety of sources.
