Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


2022 MIRI Alignment Discussion
2021 MIRI Conversations

Wiki Contributions

Load More


FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:

[2:21 PM]

It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons so probably just your spinal cord has more information than this. That vastly greater amount of runtime info inside the adult organism grows out of the wiring algorithms as your brain learns to move around your muscles, and your eyes open and the retina wires itself and starts directing info on downward to more things that wire themselves, and you learn to read, and so on.

[2:22 PM]

Anything innate that makes reasoning about people out to cheat you, easier than reasoning about isomorphic simpler letters and numbers on cards, has to be packed into the 7.5MB, and gets there via a process where ultimately one random mutation happens at a time, even though lots of mutations are recombining and being selected on at a time.

[2:24 PM]

It's a very slow learning process. It takes hundreds or thousands of generations even for a pretty good mutation to fix itself in the population and become reliably available as a base for other mutations to build on. The entire organism is built out of copying errors that happened to work better than the things they were copied from. Everything is built out of everything else, the pieces that were already lying around for building other things.

[2:27 PM]

When you're building an organism that can potentially benefit from coordinating, trading, with other organisms very similar to itself, and accumulating favors and social capital over long time horizons - and your organism is already adapted to predict what other similar organisms will do, by forcing its own brain to operate in a special reflective mode where it pretends to be the other person's brain - then a very simple way of figuring out what other people will like, by way of figuring out how to do them favors, is to notice what your brain feels when it operates in the special mode of pretending to be the other person's brain.

[2:27 PM]

And one way you can get people who end up accumulating a bunch of social capital is by having people with at least some tendency in them - subject to various other forces and overrides, of course - to feel what they imagine somebody else feeling. If somebody else drops a rock on their foot, they wince.

[2:28 PM]

This is a way to solve a favor-accumulation problem by laying some extremely simple circuits down on top of a lot of earlier machinery.

I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.


I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.

Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.

The main reasons I feel more positive about the agent-foundations-ish cases I know about are:

  • The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
  • I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
  • The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.
    • (Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)
  • Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.

Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.


The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.

E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).

Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.

Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.

MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"

My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.

  • If an alignment idea strikes us as having even a tiny scrap of hope, and isn't already funding-saturated, then we're making sure it gets funded. We don't care whether that happens at MIRI versus elsewhere — we're just seeking to maximize the amount of good work that's happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety of different moonshots and speculative ideas that are targeted at the core difficulties of alignment (described in the AGI Ruin and sharp left turn write-ups).
  • If an idea seems to have a significant amount of hope, and not just a tiny scrap — either at a glance, or after being worked on for a while by others and bearing surprisingly promising fruit — then I expect that MIRI will make that our new organizational focus, go all-in, and pour everything we have into helping with it as much we can. (E.g., we went all-in on our 2017-2020 research directions, before concluding in late 2020 that these were progressing too slowly to still have significant hope, though they might still meet the "tiny scrap of hope" bar.)

None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.

Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.

Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)

(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)

The definitions given in the post are:

  • ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.


  • misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.

I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some subtle way) on differences in how we're defining relevant words (like "ideally", "good", and "problems").

I'd be fine with skipping over this question, except that some of the differences-in-definition might be important for the other questions, so this question may be useful for establishing a baseline.

With "misaligned AI", there are some definitional issues but I expect most of the disagreement to be substantive, since there are a lot of different levels of Badness you could expect even if you want to call all misaligned AI "bad" (at least relative to ASI-boosted humans).

In my own answers, I interpreted "misaligned AGI" as meaning: We weren't good enough at alignment to make the AGI do exactly what we wanted, so it permanently took control of the future and did "something that isn't exactly what we wanted" instead. (Which might be kinda similar to what we wanted, or might be wildly different, etc.)

If an alien only cared about maximizing the amount of computronium in the universe, and it built an AI that fills the universe with computronium because the AI values calculating pi, then I think I'd say that the AI is "aligned with that alien by default / by accident", rather than saying "the AI is misaligned with that alien but is doing ~exactly what we want anyway". So if someone thinks AI does exactly what humans want even with humans putting in zero effort to steer the AI toward that outcome, I'd classify that as "aligned-by-default AI", rather than "misaligned AI". (But there's still a huge range of possible-in-principle outcomes from misaligned AI, even if I think some a lot more likely than others.)

Predictions, using the definitions in Nate's post:

Strong Utopia
Weak Utopia 
"Pretty good" outcome
Conscious Meh outcome
Unconscious Meh outcome
Weak dystopia
Strong dystopia

My example with the 100 million referred to question 1.

Yeah, I'm also talking about question 1.

I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.

Seems obviously false as a description of my values (and, I'd guess, just about every human's).

Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.

If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if people are missing that this comment is leaning on a premise like "stuff only matters if it adds to my own life and experiences"?

Replacing the probabilistic hypothetical with a deterministic one: the reason I wouldn't advocate killing a Graham's number of humans in order to save 100 million people (myself and my loved ones included) is that my utility function isn't saturated when my life gets saturated. Analogously, I still care about humans living on the other side of Earth even though I've never met them, and never expect to meet them. I value good experiences happening, even if they don't affect me in any way (and even if I've never met the person who they're happening to).

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

Also from Ronny: 

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.

That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load

(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)

I asked Nate what he meant by B, and he said:

section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.

which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept" )

Load More