You explicitly assume this stuff away, but I believe that under this setup the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).

I also note that if one agent becomes way, way smarter than the other, this balance may not work out.

Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!

Overall a very interesting idea.

AGI Ruin: A List of Lethalities

Ben Pace 3mo 32 Review for 2022 Review

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into producing this level of understanding of the problems we face, and I'm extremely glad it was written up.

The Plan - 2022 Update

Ben Pace 3mo 50 Review for 2022 Review

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4

Benito's Shortform Feed

Ben Pace 5mo 20

I don't want to double the comment count I submit to Recent Discussion, so I'll just update this comment with the things I've cut.

12/06/2023 Comment on Originality vs. Correctness

It's fun to take the wins of one culture and apply them to the other; people are very shocked that you found some hidden value to be had (though it often isn't competitive value / legible to the culture). And if you manage to avoid some terrible decision, people speak about how wise you are to have noticed.
(Those are the best cases; often, of course, people are like "this is odd, I'm going to pretend I didn't see this" and then move on.)

Benito's Shortform Feed

Ben Pace 5mo 61

For too long, I have erred on the side of writing too much.

The first reason I write is in order to find out what I think.

This often leaves my writing long and not very defensible.

However, editing the whole thing is so much extra work after I already did all the work figuring out what I think.

Sometimes it goes well if I just scrap the whole thing and concisely write my conclusion.

But typically I don't want to spend the marginal time.

Another reason my writing is too long is because I have extra thoughts I know most people won't find useful.

But I've picked up a heuristic that says it's good to share actual thinking because sometimes some people find it surprisingly useful, so I hit publish anyway.

Nonetheless, I endeavor to write shorter.

So I think I shall experiment with cutting the bits off of comments that represent me thinking aloud, but aren't worth the space in the local conversation.

And I will put them here, as the dregs of my cognition. I shall hopefully gather data over the next month or two and find out whether they are in fact worthwhile.

Shah and Yudkowsky on alignment failures

Ben Pace 5mo 10 Review for 2022 Review

I just gave this a re-read; I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

TurnTrout's shortform feed

Ben Pace 5mo 517

They are not being treated worse than foot soldiers, because they do not have an enemy army attempting to murder them during the job. (Unless 'foot soldiers' is itself more commonly used as a metaphor for 'grunt work' and I'm not aware of that.)

Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

Ben Pace 6mo 1218

I am surprised to see the Open Philanthropy network taking all of the powerful roles here.

The initial Trustees are:
Jason Matheny: CEO of the RAND Corporation
Kanika Bahl: CEO & President of Evidence Action
Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
Paul Christiano: Founder of the Alignment Research Center
Zach Robinson: Interim CEO of Effective Ventures US

In case it's not apparent:

Jason Matheny started an org with over $40M of OpenPhil funding (link to the biggest grant)
Kanika Bahl runs Evidence Action, an org that OpenPhil has donated over $100M to over the past 7 years
Neil Buddy Shah was the Managing Director at GiveWell for 2.5 years. GiveWell is the organization that OpenPhil grew out of between 2011 and 2014 and continued to maintain close ties with over the decade (e.g. sharing an office for ~5 more years, deferring substantially to GiveWell recommendations for somewhere in the neighborhood of $500M of Global Health grants, etc).
Paul Christiano is technically the least institutionally connected to OpenPhil. Most commenters here are quite aware of Paul's relationship to Open Philanthropy, but to state some obvious legible things for those who are not: Paul is married to Ajeya Cotra, who leads the AI funding at OpenPhil; he has been a technical advisor to OpenPhil in some capacity since 2012 (see him mentioned in the following pages: 1, 2, 3, 4, 5, 6); and Holden Karnofsky has collaborated with / advised Paul's organization ARC (collaboration mentioned here and shown as an advisor here).
Zach Robinson worked at OpenPhil for 3 years, over half of which was spent as Chief of Staff.

And, as a reminder for those who've forgotten, Holden Karnofsky's wife Daniela Amodei is President of Anthropic (and so Holden is the brother-in-law of Dario Amodei).

Views on when AGI comes and on strategy to reduce existential risk

Ben Pace 6mo 30

I think the argument here basically implies that language models will not, in the next 3 years, produce any novel, useful concepts in any existing industries or research fields that get substantial adoption (e.g. >10% of people use it, or a widely cited paper) in those industries, and that if they did, then the end would be nigh (or much nigher).

To be clear, you might get new concepts from language models about language if you nail some Chris Olah style transparency work, but the language model itself will not output ones that aren't about language in the text.
