All of Rob Bensinger's Comments + Replies

Late 2021 MIRI Conversations: AMA / Discussion

Yeah, I wanted to hear your actual thoughts first, but I considered going into four possible objections:

  1. If there's no way to build a "wall", perhaps you can still ensure a multipolar outcome via the threat of mutually assured destruction.
  2. If MAD isn't quite an option, perhaps you can still ensure a multipolar outcome via "mutually assured severe damage": perhaps both sides would take quite a beating in the conflict, such that they'll prefer to negotiate a truce rather than actually attack each other.
  3. If an AGI wanted to avoid destruction, perhaps it could ju
... (read more)
Late 2021 MIRI Conversations: AMA / Discussion

Offense is favored over defense because, e.g., one AI can just nuke the other. The asymmetries come from physics, where you can't physically build shields that are more resilient than the strongest shield-destroying tech. Absent new physics, extra intelligence doesn't fundamentally change this dynamic, though it can buy you more time in which to strike first.

(E.g., being smarter may let you think faster, or may let you copy yourself to more locations so it takes more time for nukes or nanobots to hit every copy of you. But it doesn't let you build a wall where you can just hang out on Earth with another superintelligence and not worry about the other superintelligence breaking your wall.)

1 Yonadav Shavit · 2mo
I want to push back on your "can't make an unbreakable wall" metaphor. We have an unbreakable wall like that today where two super-powerful beings are just hanging out sharing earth; it's called the survivable nuclear second-strike capability. (For clarity, here I'll assume that aligned AGI-cohort A and unaligned AGI-cohort B have both FOOMed and have nanotech.)

There isn't obviously an infinite amount of energy available for B to destroy every last trace of A. This is just like how in our world, neither the US nor Russia has enough resources to be certain it could destroy all of its opponent's nuclear capabilities in a first strike. If any of the Americans' nuclear capabilities survive a Russian first strike, those remaining American forces' objective switches from "uphold the constitution" to "destroy the enemy no matter the cost, to follow through on tit-for-tat". Humans are notoriously bad at this kind of precommitment-to-revenge-amid-the-ashes-of-civilization, but AGIs/their nanotech can probably be much more credible.

Note the key thing here: once B attempts to destroy A, A is no longer "bound" by the constraints of being an aligned agent. Its objective function switches to being just as ruthless as B (or more so), and so raw post-first-strike power/intelligence on each side becomes a much more reasonable predictor of who will win. If B knows A is playing tit-for-tat, and A has done the rational thing of creating a trillion redundant copies of itself (each of which will also play tit-for-tat) so they couldn't all be eliminated in one strike without prior detection, then B has a clear incentive not to pick a fight it is highly uncertain it can win.

One counterargument you might have: maybe offensive/undetectable nanotech is strictly favored over defensive/detection nanotech. If you assign nontrivial probability to the statement: "it is possible to destroy 100% of a nanotech-wielding defender with absolutely no previously-detectable traces of
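(A toy way to see the incentive Yonadav is describing: if each of A's redundant copies independently survives a first strike with some probability, the chance of a "clean" strike shrinks exponentially in the number of copies, so B's expected value is dominated by the retaliation term. The probabilities and payoffs below are made-up illustrative numbers, not anything from the original comment.)

```python
# Toy expected-value sketch of B's decision to launch a first strike against a
# tit-for-tat defender A that has made many redundant copies of itself.
# All numbers are illustrative assumptions, not claims about real capabilities.

def p_clean_first_strike(p_kill_one_copy: float, n_copies: int) -> float:
    """Chance that a single strike destroys every copy, assuming independence."""
    return p_kill_one_copy ** n_copies

def expected_value_of_striking(p_kill_one_copy, n_copies, v_win, v_retaliation):
    p_clean = p_clean_first_strike(p_kill_one_copy, n_copies)
    return p_clean * v_win + (1 - p_clean) * v_retaliation

# Even with a 99.99% per-copy kill probability, a trillion redundant copies make
# a clean first strike vanishingly unlikely, so the retaliation term dominates.
print(expected_value_of_striking(0.9999, 10**12, v_win=100.0, v_retaliation=-1000.0))
# -> -1000.0 (striking is strongly net-negative for B under these toy numbers)
```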
Late 2021 MIRI Conversations: AMA / Discussion

I got the impression Eliezer's claiming that a dangerous superintelligence is merely sufficient for nanotech.

No, I'm pretty confident Eliezer thinks AGI is both necessary and sufficient for nanotech. (Realistically/probabilistically speaking, given plausible levels of future investment into each tech. Obviously it's not logically necessary or sufficient.) Cf. my summary of Nate's view in Nate's reply to Joe Carlsmith:

Nate agrees that if there's a sphexish way to build world-saving nanosystems, then this should immediately be the top priority, and would be

... (read more)
Late 2021 MIRI Conversations: AMA / Discussion

Reply by acylhalide on the EA Forum:

The AGI would rather write programs to do the grunt work than employ humans, as programs can be more reliable, controllable, etc. It could create such agents by looking into its own source code and copying / modifying it. If it doesn't have this capability, it will spend time researching (could be years) until it does. On a thousand-year timescale it isn't clear why an AGI would need us for anything besides, say, specimens for experiments.

Also as reallyeli says, having a single misaligned agent with absolute control of our future seems terrible no matter what the agent does.

Late 2021 MIRI Conversations: AMA / Discussion

[W]iping out humanity is the most expensive of these options and the AGI would likely get itself destroyed while trying to do that[.]

It would be pretty easy and cheap for something much smarter than a human to kill all humans. The classic scenario is:

A.  [...] The notion of a 'superintelligence' is not that it sits around in Goldman Sachs's basement trading stocks for its corporate masters.  The concrete illustration I often use is that a superintelligence asks itself what the fastest possible route is to increasing its real-world power, and then

... (read more)
Late 2021 MIRI Conversations: AMA / Discussion

Reply by reallyeli on the EA Forum:

Toby Ord's definition of an existential catastrophe is "anything that destroys humanity's longterm potential." The worry is that misaligned AGI which vastly exceeds humanity's power would be basically in control of what happens with humans, just as humans are, currently, basically in control of what happens with chimpanzees. It doesn't need to kill all of us in order for this to be a very, very bad outcome.

E.g. the enslavement by the steel-loving AGI you describe sounds like an existential catastrophe, if that AGI is suff

... (read more)
Late 2021 MIRI Conversations: AMA / Discussion

My Eliezer-model thinks pivotal acts are genuinely, for-real, actually important. Like, he's not being metaphorical or making a pedagogical point when he says (paraphrasing) 'we need to use the first AGI systems to execute a huge, disruptive, game-board-flipping action, or we're all dead'.

When my Eliezer-model says that the most plausible pivotal acts he's aware of involve capabilities roughly at the level of 'develop nanotech' or 'put two cellular-identical strawberries on a plate', he's being completely literal. If some significantly weaker capability le... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

For practical purposes, I'd say the pandemic is already over. MIRI isn't doing much hiring, though it's doing a little. The two big things we feel bottlenecked on are:

  • (1) people who can generate promising new alignment ideas. (By far the top priority, but seems empirically rare.)
  • (2) competent executives who are unusually good at understanding the kinds of things MIRI is trying to do, and who can run their own large alignment projects mostly-independently.

For 2, I think the best way to get hired by MIRI is to prove your abilities via the Visible Thoughts Pr... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Echoing that I loved these conversations and I'm super grateful to everyone who participated — especially Richard, Paul, Eliezer, Nate, Ajeya, Carl, Rohin, and Jaan, who contributed a lot.

I don't plan to try to summarize the discussions or distill key take-aways myself (other than the extremely cursory job I did on https://intelligence.org/late-2021-miri-conversations/), but I'm very keen on seeing others attempt that, especially as part of a process to figure out their own models and do some evaluative work.

I think I'd rather see partial summaries/respons... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Question from evelynciara on the EA Forum:

Do you believe that AGI poses a greater existential risk than other proposed x-risk hazards, such as engineered pandemics? Why or why not?

For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety.

What specific actions do you have in mind when you say "global coordination on safety", and how much of the problem do you think these actions solve?

My own view is that 'caring about AI x-risk at all' is a pretty small (albeit indispensable) step. There are lots of decisions that hinge on things other than 'is AGI risky at all'.

I agree with Rohin that the useful thing is trying t... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'?

I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).

But:

Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with 'since you're being overly pessimistic, I will be overly optimistic to balance', with no attempt to have his response match his

... (read more)
Shah and Yudkowsky on alignment failures

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

Alternatively, maybe you think that som... (read more)

2 Vanessa Kosoy · 3mo
I suggested [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform?commentId=h3Ww6nyt9fpj7BLyo] doing this using quantilization.
3 Rohin Shah · 3mo
It sounds like you think my position is "here is my plan to save the world and I have a story for how it will work", whereas my actual view is "here is a story in which humanity is stupid and covers itself in shame by taking on huge amounts of x-risk (e.g. 5%), where we have no strong justification for being confident that we'll survive, but the empirical situation ends up being such that we survive anyway".

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, in a way that leads to a solution that we can be justifiably confident in.

My sense is that Eliezer would say that this story is completely implausible, i.e. this hypothesized empirical situation is ruled out by knowledge that Eliezer has. But I don't know what knowledge rules this out. (I'm pretty sure it has to do with his intuitions about a Core of General Intelligence, and/or why capabilities generalize faster than alignment, but I don't know where those intuitions come from, nor do I share them.)

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and of causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).
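(For readers who haven't seen it: quantilization, which Vanessa mentions above, is one concrete proposal for limiting how hard a system optimizes over its strategy space. Here's a minimal, purely illustrative sketch of the idea — the base sampler, utility function, and parameters are hypothetical toy choices, not Vanessa's actual construction.)

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000):
    """Toy q-quantilizer: draw candidate actions from a trusted base
    distribution, rank them by estimated utility, and sample uniformly from
    the top q fraction instead of taking the argmax. Any action returned is
    one the base distribution would plausibly have produced on its own,
    which limits how far the search can stray into extreme strategies."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * len(candidates)))]
    return random.choice(top)

# Hypothetical toy setup: actions are real numbers, "utility" rewards closeness to 42.
action = quantilize(base_sampler=lambda: random.gauss(40.0, 5.0),
                    utility=lambda a: -abs(a - 42.0))
print(action)
```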
Shah and Yudkowsky on alignment failures

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

Suppose I'm debating someone about gun control, and they say 'guns don't kill people; people kill people'. Here are four different scenarios for how I might respond:

  • 1. Almost as a pure reflex, before I can stop myself, I blurt out 'That's bullshit!' in response. It's not the best way to win the debate, but heck, I've heard that zinger a thousand times and it ju
... (read more)
3 Rohin Shah · 3mo
I totally agree those are on a continuum. I don't think this changes my point? It seems like Eliezer is confident that "reduce x-risk to EDIT: sub-50%" requires being all the way on the far side of that continuum, and I don't see why that's required.
1 Rob Bensinger · 3mo
If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process). If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act — no smart search for strategies at all. But surely there has to be smart search going on somewhere in the system, or how is it doing a bunch of useful novel scientific work?
Christiano and Yudkowsky on AI predictions and human intelligence

Like, fundamentally the question is something like "how efficient and accurate is the AI research market?"

I would distinguish two factors:

  • How powerful and well-directed is the field's optimization?
  • How much does the technology inherently lend itself to information asymmetries?

You could turn the "powerful and well-directed" dial up to the maximum allowed by physics, and still not thereby guarantee that information asymmetries are rare, because the way that a society applies maximum optimization pressure to reaching AGI ASAP might route through a lot of indiv... (read more)

Comments on Carlsmith's “Is power-seeking AI an existential risk?”

or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world

Note that this is still better than 'honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don't need to halt'!

Visible Thoughts Project and Bounty Announcement

We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:

https://docs.google.com/document/d/1Wsh8L--jtJ6y9ZB35mEbzVZ8lJN6UDd6oiF0_Bta8vM/edit

Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial ru... (read more)

Visible Thoughts Project and Bounty Announcement

In case you missed it: we now have an FAQ for this project, last updated Jan. 7.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it.

One of my co-workers at MIRI comments:

(further conjunctive detail for

... (read more)
2 DanielFilan · 4mo
Ah, that makes sense - thanks!
Biology-Inspired AGI Timelines: The Trick That Never Works

Making a map of your map is another one of those techniques that seem to provide more grounding but do not actually.

Sounds to me like one of the things Eliezer is pointing at in Hero Licensing:

Look, thinking things like that is just not how the inside of my head is organized. There’s just the book I have in my head and the question of whether I can translate that image into reality. My mental world is about the book, not about me.

You do want to train your brain, and you want to understand your strengths and weaknesses. But dwelling on your biases at the ex... (read more)

Conversation on technology forecasting and gradualism

Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?

Maybe something like '5 years of engineering effort to start automating work that qualitatively (but incredibly slowly and inefficiently) is helping with AI research, and then a few decades of throwing more compute at that for the AI to reach superintelligence'?

With infinite compute you could just recapitulate evolution, so I doubt Paul thinks there's a crux like that? But there could be a crux that's about whether GPT-3.5 plus a few decades of hardware progress achieves superintelligence, or about whether that's approximately the fastest way to get to superintelligence, or something.

Biology-Inspired AGI Timelines: The Trick That Never Works

When I try to mentally simulate negative reader-reactions to the dialogue, I usually get a complicated feeling that's some combination of:

  • Some amount of conflict aversion: Harsh language feels conflict-y, which is inherently unpleasant.
  • Empathy for, or identification with, the people or views Eliezer was criticizing. It feels bad to be criticized, and it feels doubly bad to be told 'you are making basic mistakes'.
  • Something status-regulation-y: My reader-model here finds the implied threat to the status hierarchy salient (whether or not Eliezer is just tryin
... (read more)

I think part of what I was reacting to is a kind of half-formed argument that goes something like:

  • My prior credence is very low that all these really smart, carefully thought-through people are making the kinds of stupid or biased mistakes they are being accused of.
  • In fact, my prior for the above is sufficiently low that I suspect it's more likely that the author is the one making the mistake(s) here, at least in the sense of straw-manning his opponents.
  • But if that's the case then I shouldn't trust the other things he says as much, because it looks lik
... (read more)

I had mixed feelings about the dialogue personally. I enjoy the writing style and think Eliezer is a great writer with a lot of good opinions and arguments, which made it enjoyable.

But at the same time, it felt like he was taking down a strawman. Maybe you’d label it part of “conflict aversion”, but I tend to get a negative reaction to take-downs of straw-people who agree with me.

To give an unfair and exaggerated comparison, it would be a bit like reading a take-down of a straw-rationalist in which the straw-rationalist occasionally insists such things as ... (read more)

Shulman and Yudkowsky on AI progress

Note: I've written up short summaries of each entry in this sequence so far on https://intelligence.org/late-2021-miri-conversations/, and included links to audio recordings of most of the posts.

Biology-Inspired AGI Timelines: The Trick That Never Works

I've gotten one private message expressing more or less the same thing about this post, so I don't think this is a super unusual reaction.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

I don't know Eliezer's view on this — presumably he either disagrees that the example he gave is "mundane AI safety stuff", or he disagrees that "mundane AI safety stuff" is widespread? I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

Safely Interruptible Agents is an example Eliezer's given in the past of work that isn't "real" (back in 2017):

[...]

It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and

... (read more)
4 Vanessa Kosoy · 6mo
There is ample discussion of distribution shifts ("seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set") by other people. Random examples: Christiano [https://ai-alignment.com/some-thoughts-on-training-highly-reliable-models-2c78c17e266d], Shah [https://www.alignmentforum.org/posts/nM99oLhRzrmLWozoM/an-134-underspecification-as-a-cause-of-fragility-to], DeepMind [https://arxiv.org/pdf/2110.11328.pdf]. Maybe Eliezer is talking specifically about the context of transparency.

Personally, I haven't worked much on transparency because IMO (i) even if we solve transparency perfectly but don't solve actual alignment, we are still dead, (ii) if we solve actual alignment without transparency, then theoretically we might succeed (although in practice it would sure help a lot to have transparency to catch errors in time) and (iii) there are less strong reasons to think transparency must be robustly solvable compared to reasons to think alignment must be robustly solvable.

In any case, I really don't understand why Eliezer thinks the rest of AI safety are unaware of the type of attack vectors he describes. I agree that currently publishing in mainstream venues seems to require dumbing down, but IMO we should proceed by publishing dumbed-down versions in the mainstream + smarted-up versions/commentary in our own venues. And, not all of AI safety is focused on publishing in mainstream venues? There is plenty of stuff on the alignment forum, on various blogs etc.

Overall I actually agree that lots of work by the AI safety community is unimpressive (tbh I wish MIRI would lead by example instead of going stealth-mode, but maybe I don't understand the considerations). What I'm confused by is the particular example in the OP. I also dunno about "fancy equations and math results", I feel like the field would benefit from getting a lot more mathy (ofc in meaningful ways rather than just using mathematical notation as dec
Christiano, Cotra, and Yudkowsky on AI progress

Not believing theories which don’t make new testable predictions just because they retrodict lots of things in a way that the theories’ proponents claim is more natural, but that you don’t understand, because that seems generally suspicious

My Eliezer-model doesn't categorically object to this. See, e.g., Fake Causality:

[Phlogiston] feels like an explanation. It’s represented using the same cognitive data format. But the human mind does not automatically detect when a cause has an unconstraining arrow to its effect. Worse, thanks to hindsight bias, it may fe

... (read more)
Biology-Inspired AGI Timelines: The Trick That Never Works

(This post was partly written as a follow-up to Eliezer's conversations with Paul and Ajeya, so I've inserted it into the conversations sequence.)

It does fit well there, but I think it was more inspired by the person I met who thought I was being way too arrogant by not updating in the direction of OpenPhil's timeline estimates to the extent I was uncertain.

Christiano, Cotra, and Yudkowsky on AI progress

My Eliezer-model is a lot less surprised by lulls than my Paul-model (because we're missing key insights for AGI, progress on insights is jumpy and hard to predict, the future is generally very unpredictable, etc.). I don't know exactly how large of a lull or winter would start to surprise Eliezer (or how much that surprise would change if the lull is occurring two years from now, vs. ten years from now, for example).

In Yudkowsky and Christiano Discuss "Takeoff Speeds", Eliezer says:

I have a rough intuitive feeling that it [AI progress] was going faster in

... (read more)
2 Paul Christiano · 6mo
I generally expect smoother progress, but predictions about lulls are probably dominated by Eliezer's shorter timelines. Also lulls are generally easier than spurts, e.g. I think that if you just slow investment growth you get a lull and that's not too unlikely (whereas part of why it's hard to get a spurt is that investment rises to levels where you can't rapidly grow it further).
1 Vanessa Kosoy · 6mo
Makes some sense, but Yudkowsky's prediction that TAI will arrive before AI has large economic impact does forbid a lot of plateau scenarios. Given a plateau that's sufficiently high and sufficiently long, AI will land in the market, I think. Even if regulatory hurdles are the bottleneck for a lot of things atm, eventually in some country AI will become important and the others will have to follow or fall behind.
Christiano, Cotra, and Yudkowsky on AI progress

Found two Eliezer-posts from 2016 (on Facebook) that I feel helped me better grok his perspective.

Sep. 14, 2016:

It is amazing that our neural networks work at all; terrifying that we can dump in so much GPU power that our training methods work at all; and the fact that AlphaGo can even exist is still blowing my mind. It's like watching a trillion spiders with the intelligence of earthworms, working for 100,000 years, using tissue paper to construct nuclear weapons.

And earlier, Jan. 27, 2016:

People occasionally ask me about signs that the remaining timeline

... (read more)
Soares, Tallinn, and Yudkowsky discuss AGI cognition

Minor note: This post comes earlier in the sequence than Christiano, Cotra, and Yudkowsky on AI progress. I posted the Christiano/Cotra/Yudkowsky piece sooner, at Eliezer's request, to help inform the ongoing discussion of "Takeoff Speeds".

Christiano, Cotra, and Yudkowsky on AI progress

Transcript error fixed -- the line that previously read

[Yudkowsky][17:40]  

I expect it to go away before the end of days

but with there having been a big architectural innovation, not Stack More Layers

[Christiano][17:40]  

I expect it to go away before the end of days

but with there having been a big architectural innovation, not Stack More Layers

[Yudkowsky][17:40]  

if you name 5 possible architectural innovations I can call them small or large

should be

[Yudkowsky][17:40]  

I expect it to go away before the end of days

but with there having b

... (read more)
Yudkowsky and Christiano discuss "Takeoff Speeds"

It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul's camp still maintaining less than 50%).

My model of Eliezer may be wrong, but I'd guess that this isn't a domain where he has many over-50% predictions of novel events at all? See also 'I don't necessarily expect self-driving cars before the apocalypse'.

My Eliezer-model has a more flat prior over what might happen, which therefore includes stuff like 'maybe we'll make insane progress on theorem-proving (or whatever) out of the bl... (read more)

1 Rob Bensinger · 6mo
(Ah, EY already replied.)
Yudkowsky and Christiano discuss "Takeoff Speeds"

(... Admittedly, you read fast enough that my 'skimming' is your 'reading'. 😶)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Yeah, even I wasn't sure you'd read those three things, Eliezer, though I knew you'd at least glanced over 'Takeoff Speeds' and 'Biological Anchors' enough to form opinions when they came out. :)

1 Rob Bensinger · 6mo
(... Admittedly, you read fast enough that my 'skimming' is your 'reading'. 😶)
Yudkowsky and Christiano discuss "Takeoff Speeds"

I grimly predict that the effect of this dialogue on the community will be polarization

Beware of self-fulfilling prophecies (and other premature meta)! If both sides in a dispute expect the other side to just entrench, then they're less likely to invest the effort to try to bridge the gap.

This very comment section is one of the main things that will determine the community's reaction, and diverting our focus to 'what will our reaction be?' before we've talked about the object-level claims can prematurely lock in a certain reaction.

(That said, I think you'r... (read more)

Fair enough! I too dislike premature meta, and feel bad that I engaged in it. However... I do still feel like my comment probably did more to prevent polarization than cause it? That's my independent impression at any rate. (For the reasons you mention).

I certainly don't want to give up! In light of your pushback I'll edit to add something at the top.

Ngo and Yudkowsky on AI capability gains

You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

I'ma guess that Eliezer thinks there's a long list of sequences he could write meeting these conditions, each on a different topic.

3 Adam Shimi · 6mo
Good point, I hadn't thought about that one. Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^ (Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable though.)
Ngo and Yudkowsky on alignment difficulty

Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it.

If you find water in a previously-empty space with a small aperture, and you'... (read more)

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually n... (read more)

Ngo and Yudkowsky on alignment difficulty

This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.

I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the 'post' button for Richard and Eliezer. (I don't plan to repeat this note on future posts in this sequence, unless folks request it.)

Discussion with Eliezer Yudkowsky on AGI interventions

I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren't about interpretability and 'cracking open the hood', or they're not approaching the problem in a way that MIRI's excited by.

E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about 'looking in the learned model's brain' to spot mesa-optimizat... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

This is a good comment! I also agree that it's mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don't have a ready-on-hand link that clearly articulates the thing I'm trying to say, then it's not surprising if others don't have it in their model.

And based on these comments, I update that there's probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!

Discussion with Eliezer Yudkowsky on AGI interventions

(I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like 'What's even the point of aligning AGI? What should humanity do with aligned AGI once it has it?'.)

Discussion with Eliezer Yudkowsky on AGI interventions

I would agree with you that "MIRI hates all experimental work" / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

Ooh, that's really interesting. Thinking about it, I think my sense of what's going on is (and I'd be interested to hear how this differs from your sense):

  1. Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like 'sufficient
... (read more)

Thanks. For time/brevity, I'll just say which things I agree / disagree with:

> sufficiently capable and general AI is likely to have property X as a strong default [...] 

I generally agree with this, although for certain important values of X (such as "fooling humans for instrumental reasons") I'm probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I'm also probably more optimistic (but not certain) that those efforts will succeed.

[inside view, modest epistemology]: I don't have... (read more)

I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren't about interpretability and 'cracking open the hood', or they're not approaching the problem in a way that MIRI's excited by.

E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about 'looking in the learned model's brain' to spot mesa-optimizat... (read more)

2 Rob Bensinger · 6mo
(I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like 'What's even the point of aligning AGI? What should humanity do with aligned AGI once it has it?'.)
Discussion with Eliezer Yudkowsky on AGI interventions

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, th... (read more)

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, th

... (read more)

Not sure if this helps, and haven't read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I'd have are that:

 * For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
 * For almost all experimental results you would think they were so much less informative as to not be worthwhile.
 * There's a sma... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

'Experimental work is categorically bad, but Redwood's work doesn't count' does not sound like a "slight caveat" to me! What does this generalization mean at all if Redwood's stuff doesn't count?

(Neither, for that matter, does the difference between 'decision theory and logic' and 'all mathy stuff MIRI has ever focused on' seem like a 'slight caveat' to me -- but in that case maybe it's because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven't read much MIRI stuff?).

7 Adam Shimi · 6mo
Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so you can flag in an answer where I'm wrong in your opinion, for readers of this thread). Also, I just checked the publication list, and I've read or skimmed most things MIRI published since 2014 (including most newsletters and blog posts on MIRI's website).

My model of MIRI is that initially, there was a bunch of people including EY who were working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate's arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that "old school MIRI", and that was a big chunk of what I was pointing out in my original comment. Then there are a bunch of things that seem to have happened:

  • Newer people (Abram and Scott come to mind, but mostly because they're the ones who post on the AF and who I've talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it's still not that far from decision theory and logic (with maybe a stronger emphasis on the bayesian part for stuff like logical induction). That might be a part where we're disagreeing.
  • A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That's technically practical, but in my mental model this goes in the same category as "decision theory and logic stuff", especially because that sort of programming is very close to logic and natural deduction.
  • MIRI starts its ML-focused agenda, which you already mentioned. The impression I still have is that this didn't lead to much published work that was actually experimental, instead focusing on re

(Responding to entire comment thread) Rob, I don't think you're modeling what MIRI looks like from the outside very well.

  • There's a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
  • There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
  • There was a non-public agenda that involved Haskell programmers. That's a
... (read more)
Discussion with Eliezer Yudkowsky on AGI interventions

Thanks for adding the note! :)

I'm confused. When I say 'that's just my impression', I mean something like 'that's an inside-view belief that I endorse but haven't carefully vetted'. (See, e.g., Impression Track Records, referring to Naming Beliefs.)

Example: you said that MIRI has "contempt with experimental work and not doing only decision theory and logic".

My prior guess would have been that you don't actually, for-real believe that -- that it's not your 'impression' in the above sense, more like 'unendorsed venting/hyperbole that has a more complicated r... (read more)

3 Adam Shimi · 6mo
I would say that with slight caveats (make "decision theory and logic" a bit larger to include some more mathy stuff and make "all experimental work" a bit smaller to not include Redwood's work), this was indeed my model. What made me update from our discussion is the realization that I interpreted the dismissal of basically all alignment research as "this has no value whatsoever and people doing it are just pretending to care about alignment", where it should have been interpreted as something like "this is potentially interesting/new/exciting, but it doesn't look like it brings us closer to solving alignment in a significant way, hence we're still failing".