Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.
I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.
[...]
I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.
Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.
The main reasons I feel more positive about the agent-foundations-ish cases I know about are:
- The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
- I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
- The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.
- (Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)
- Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.
I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.
Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.
[...]
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:
In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.
E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).
Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.
Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.
MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"
My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.
None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.
Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.
Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)
(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)
The definitions given in the post are:
- ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.
[...]
- misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.
I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some subtle way) on differences in how we're defining relevant words (like "ideally", "good", and "problems").
I'd be fine with skipping over this question, except that some of the differences-in-definition might be important for the other questions, so this question may be useful for establishing a baseline.
With "misaligned AI", there are some definitional issues, but I expect most of the disagreement to be substantive, since there are a lot of different levels of Badness you could expect even if you want to call all misaligned AI "bad" (at least relative to ASI-boosted humans).
In my own answers, I interpreted "misaligned AGI" as meaning: We weren't good enough at alignment to make the AGI do exactly what we wanted, so it permanently took control of the future and did "something that isn't exactly what we wanted" instead. (Which might be kinda similar to what we wanted, or might be wildly different, etc.)
If an alien only cared about maximizing the amount of computronium in the universe, and it built an AI that fills the universe with computronium because the AI values calculating pi, then I think I'd say that the AI is "aligned with that alien by default / by accident", rather than saying "the AI is misaligned with that alien but is doing ~exactly what we want anyway". So if someone thinks AI does exactly what humans want even with humans putting in zero effort to steer the AI toward that outcome, I'd classify that as "aligned-by-default AI", rather than "misaligned AI". (But there's still a huge range of possible-in-principle outcomes from misaligned AI, even if I think some are a lot more likely than others.)
Predictions, using the definitions in Nate's post:
My example with the 100 million referred to question 1.
Yeah, I'm also talking about question 1.
I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.
Seems obviously false as a description of my values (and, I'd guess, just about every human's).
Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.
If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.
But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.
You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if people are missing that this comment is leaning on a premise like "stuff only matters if it adds to my own life and experiences"?
Replacing the probabilistic hypothetical with a deterministic one: the reason I wouldn't advocate killing a Graham's number of humans in order to save 100 million people (myself and my loved ones included) is that my utility function isn't saturated when my life gets saturated. Analogously, I still care about humans living on the other side of Earth even though I've never met them, and never expect to meet them. I value good experiences happening, even if they don't affect me in any way (and even if I've never met the person who they're happening to).
(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)
Also from Ronny:
There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.
That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.
Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:
briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load
(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)
I asked Nate what he meant by B, and he said:
section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.
which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept")
FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this: