## AI Alignment Forum

Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.

# Sequences

Late 2021 MIRI Conversations

# Wiki Contributions

Visible Thoughts Project and Bounty Announcement

We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:

Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20/step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5,320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1,000 steps).

Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile.

Visible Thoughts Project and Bounty Announcement

In case you missed it: we now have an FAQ for this project, last updated Jan. 7.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it.

One of my co-workers at MIRI comments:

(further conjunctive detail for visualizer-plausibility: most of your replication time is in all the doublings before the last doubling, and in particular you can make a shitload in a pretty small space before launching it into the jet stream to disperse. the jet stream can be used to disperse stuff throughout the atmosphere (and it can use solar radiation, at least, to keep reproducing). it could in principle be powered and do minor amounts of steering.

example things the [AGI] who has no better plan than this paltry human-conceivable plan has to think about are "how does the time-cost of making sure [I hit the people] at the south pole base and [on] all the cruise liners and in all the nuclear submarines, trade off against the risk-cost of leaving that fragment of humanity alive", etc.)
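The point about doublings can be made concrete with a quick back-of-the-envelope calculation (my illustration, not part of the quoted comment): if each doubling takes roughly constant time, the final doubling alone produces half of all the copies that will ever exist, so nearly all of the elapsed replication time is spent before it, while the population is still comparatively small.

```python
import math

def doublings_needed(target_count: int) -> int:
    """Number of doublings for a single replicator to reach at least target_count copies."""
    return math.ceil(math.log2(target_count))

# Reaching ~10^20 copies (enough that a human-sized organism would,
# on average, contain many of them) takes only ~67 doublings.
n = doublings_needed(10**20)
print(n)  # 67

# The first n-1 doublings account for almost all the time but yield
# only half the copies; the last doubling produces the other half.
```

The illustrative target count of 10^20 is my own assumption; the quoted comment gives no specific numbers.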

Regarding the idea of diamondoid nanotechnology, Drexler's Nanosystems and http://www.molecularassembler.com/Nanofactory/index.htm talk about the general concept.

Biology-Inspired AGI Timelines: The Trick That Never Works

Making a map of your map is another one of those techniques that seem to provide more grounding but don't actually do so.

Sounds to me like one of the things Eliezer is pointing at in Hero Licensing:

Look, thinking things like that is just not how the inside of my head is organized. There’s just the book I have in my head and the question of whether I can translate that image into reality. My mental world is about the book, not about me.

You do want to train your brain, and you want to understand your strengths and weaknesses. But dwelling on your biases at the expense of the object level usually isn't the best way to give your brain training data and tweak its performance.

I think there's a lesson here that, e.g., Scott Alexander hadn't fully internalized as of his 2017 Inadequate Equilibria review. There's a temptation to "go meta" and find some cleaner, more principled, more objective-sounding algorithm to follow than just "learn lots and lots of object-level facts so you can keep refining your model, learn some facts about your brain too so you can know how much to trust it in different domains, and just keep doing that".

But in fact there's no a priori reason to expect there to be a shortcut that lets you skip the messy unprincipled your-own-perspective-privileging Bayesian Updating thing. Going meta is just a tool in the toolbox, and it's risky to privilege it on 'sounds more objective/principled' grounds when there's neither a theoretical argument nor an empirical-track-record argument for expecting that approach to actually work.

Teaching the low-description-length principles of probability to your actual map-updating system is much more feasible (or at least more cost-effective) than emitting your actual map into a computationally realizable statistical model.

I think this is a good distillation of Eliezer's view (though I know you're just espousing your own view here). And of mine, for that matter. Quoting Hero Licensing again:

STRANGER:  I believe the technical term for the methodology is “pulling numbers out of your ass.” It’s important to practice calibrating your ass numbers on cases where you’ll learn the correct answer shortly afterward. It’s also important that you learn the limits of ass numbers, and don’t make unrealistic demands on them by assigning multiple ass numbers to complicated conditional events.

ELIEZER:  I’d say I reached the estimate… by thinking about the object-level problem? By using my domain knowledge? By having already thought a lot about the problem so as to load many relevant aspects into my mind, then consulting my mind’s native-format probability judgment—with some prior practice at betting having already taught me a little about how to translate those native representations of uncertainty into 9:1 betting odds.
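As a concrete illustration (mine, not from the dialogue) of the arithmetic behind "translating native representations of uncertainty into 9:1 betting odds," and of checking calibration against cases that resolve, here is a minimal sketch; the function names are hypothetical:

```python
def prob_to_odds(p: float) -> tuple[float, float]:
    """Convert a probability into 'for : against' betting odds, p : (1 - p)."""
    return (p / (1 - p), 1.0)

# A probability of 0.9 corresponds to 9:1 odds.
for_, against = prob_to_odds(0.9)
print(f"{for_:.0f}:{against:.0f}")  # prints 9:1

def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes.

    0.0 is perfect calibration-plus-resolution; 0.25 is what
    always answering 0.5 scores. Useful only on questions whose
    correct answers you learn shortly afterward.
    """
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)
```

This also shows why "multiple ass numbers on complicated conditional events" is risky: errors in each estimated probability compound when you multiply conditionals together, while a single practiced native-format judgment is at least directly checkable against outcomes.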

One framing I use is that there are two basic perspectives on rationality:

• Prosthesis: Human brains are naturally bad at rationality, so we can identify external tools (and cognitive tech that's too simple and straightforward for us to misuse) and try to offload as much of our reasoning as possible onto those tools, so as to not have to put weight down (beyond the bare minimum necessary) on our own fallible judgment.
• Strength training: There's a sense in which every human has a small AGI (or a bunch of AGIs) inside their brain. If we didn't have access to such capabilities, we wouldn't be able to do complicated 'planning and steering of the world into future states' at all.

It's true that humans often behave 'irrationally', in the sense that we output actions based on simpler algorithms (e.g., reinforced habits and reflex behavior) that aren't doing the world-modeling or future-steering thing. But if we want to do better, we mostly shouldn't be leaning on weak reasoning tools like pocket calculators; we should be focusing our efforts on more reliably using (and providing better training data to) the AGI inside our brains. Nearly all of the action (especially in hard, foresight-demanding domains like AI alignment) is in improving your inner AGI's judgment, intuitions, etc., not in outsourcing to things that are far less smart than an AGI.

In practice, of course, you should do some combination of the two. But I think a lot of the disagreements MIRI folks have with other people in the existential risk ecosystem are related to us falling on different parts of the prosthesis-to-strength-training spectrum.

Techniques that give the illusion of objectivity are usually not useless. But to use them effectively, you have to see through the illusion of objectivity, and treat their outputs as observations of what those techniques output, rather than as glimpses at the light of objective reasonableness.

Strong agreement. I think this is very well-put.

Conversation on technology forecasting and gradualism

Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?

Maybe something like '5 years of engineering effort to start automating work that qualitatively (but incredibly slowly and inefficiently) is helping with AI research, and then a few decades of throwing more compute at that for the AI to reach superintelligence'?

With infinite compute you could just recapitulate evolution, so I doubt Paul thinks there's a crux like that? But there could be a crux that's about whether GPT-3.5 plus a few decades of hardware progress achieves superintelligence, or about whether that's approximately the fastest way to get to superintelligence, or something.

Biology-Inspired AGI Timelines: The Trick That Never Works

When I try to mentally simulate negative reader-reactions to the dialogue, I usually get a complicated feeling that's some combination of:

• Some amount of conflict aversion: Harsh language feels conflict-y, which is inherently unpleasant.
• Empathy for, or identification with, the people or views Eliezer was criticizing. It feels bad to be criticized, and it feels doubly bad to be told 'you are making basic mistakes'.
• Something status-regulation-y: My reader-model here finds the implied threat to the status hierarchy salient (whether or not Eliezer is just trying to honestly state his beliefs), and has some version of an 'anti-cheater' or 'anti-rising-above-your-station' impulse.

How right/wrong do you think this is, as a model of what makes the dialogue harder or less pleasant to read from your perspective?

(I feel a little wary of stating my model above, since (a) maybe it's totally off, and (b) it can be rude to guess at other people's mental states. But so far this conversation has felt very abstract to me, so maybe this can at least serve as a prompt to go more concrete. E.g., 'I find it hard to read condescending things' is very vague about which parts of the dialogue we're talking about, about what makes them feel condescending, and about how the feeling-of-condescension affects the sentence-parsing-and-evaluating experience.)

Shulman and Yudkowsky on AI progress

Note: I've written up short summaries of each entry in this sequence so far at https://intelligence.org/late-2021-miri-conversations/, and included links to audio recordings of most of the posts.

Biology-Inspired AGI Timelines: The Trick That Never Works

I've gotten one private message expressing more or less the same thing about this post, so I don't think this is a super unusual reaction.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

I don't know Eliezer's view on this — presumably he either disagrees that the example he gave is "mundane AI safety stuff", or he disagrees that "mundane AI safety stuff" is widespread? I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

Safely Interruptible Agents is an example Eliezer's given in the past (back in 2017) of work that isn't "real":

[...]

It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can't produce what I'd consider to be real work. The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.

[...]

The rest of Intellectual Progress Inside and Outside Academia may be useful context. Or maybe this is also not a representative example of the stuff EY has in mind in the OP conversation?