Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.
Does "par-human reasoning" mean at the level of an individual human or at the level of all of humanity combined?
If it's the former, what human should we compare it against? 50th percentile? 99.999th percentile?
I partly answered that here, and I'll edit some of this into the post:
By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but the threshold I have in mind is more like:
STEM-level AGI is AI that's at least as scientifically productive as a human scientist who makes a variety of novel, original contributions to a hard-science field that requires understanding the physical world well. E.g., it can go toe-to-toe with highly productive human scientists on applying its abstract theories to real-world phenomena, using scientific ideas to design new tech, designing physical experiments, operating equipment, and generating new ideas that turn out to be true and that importantly advance the frontiers of our knowledge.
The way I'm thinking about the threshold, AI doesn't have to be Nobel-prize-level, but it has to be "fully doing science". I'd also be happy with a definition like 'AI that can reason about the physical world in general', but I think that emphasizing hard-science tasks makes it clearer why I'm not thinking of GPT-4 as 'reasoning about the physical world in general' in the relevant sense.
I'm not sure what the right percentile to target here is -- maybe we should be looking at the top 5% of Americans with STEM PhDs, where Americans with STEM PhDs are maybe in the top 1% of STEM ability for Americans?
What is the "basic mental machinery" required to do par-human reasoning? What if a system has the basic mental machinery but not the more advanced mental machinery?
Do you want this to include the robotic capabilities to run experiments and use physical tools? If not, why not (that seems important to me, but maybe you disagree)?
I want it to include the ability to run experiments and use physical tools.
I don't know what the "basic mental machinery" required is -- I think GPT-4 is missing some of the basic cognitive machinery top human scientists use to advance the frontiers of knowledge (as opposed to GPT-4 doing all the same mental operations as a top scientist but slower, or something), but this is based on a gestalt impression from looking at how different their outputs are in many domains, not based on a detailed or precise model of how general intelligence works.
One way of thinking about the relevant threshold is: if you gave a million chimpanzees billions of years to try to build a superintelligence, I think they'd fail, unless maybe you let them reproduce and applied selection pressure to them to change their minds. (But the latter isn't something the chimps themselves realize is a good idea.)
In contrast, top human scientists pass the threshold 'give us enough time, and we'll be able to build a superintelligence'.
If an AI system, given enough time and empirical data and infrastructure, would eventually build a superintelligence, then I'm mostly happy to treat that as "STEM-level AGI". This isn't a necessary condition, and it's presumably not strictly sufficient (since in principle it should be possible to build a very narrow and dumb meta-learning system that also bootstraps in this way eventually), but it maybe does a better job of gesturing at where I'm drawing a line between "GPT-4" and "systems in a truly dangerous capability range".
(Though my reason for thinking systems in that capability range are dangerous isn't centered on "they can deliberately bootstrap to superintelligence eventually". It's far broader points like "if they can do that, they can probably do an enormous variety of other STEM tasks" and "falling exactly in the human capability range, and staying there, seems unlikely".)
Does a human count as a STEM-level NGI (natural general intelligence)?
I tend to think of us that way, since top human scientists aren't a separate species from average humans, so it would be hard for them to be born with complicated "basic mental machinery" that isn't widespread among humans. (Though local mutations can subtract complex machinery from a subset of humans in one generation, even if they can't add complex machinery to a subset of humans in one generation.)
Regardless, given how I defined the term, at least some humans are STEM-level.
If so, doesn't that imply that we should already be able to perform pivotal acts? You said: "If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a "pivotal act")."
The weakest STEM-level AGIs couldn't do a pivotal act; the reason I think you can do a pivotal act within a few years of inventing STEM-level AGI is that I think you can quickly get to far more powerful systems than "the weakest possible STEM-level AGIs".
The kinds of pivotal act I'm thinking about often involve Drexler-style feats, so one way of answering "why can't humans already do pivotal acts?" might be to answer "why can't humans just build nanotechnology without AGI?". I'd say we can, and I think we should divert a lot of resources into trying to do so; but my guess is that we'll destroy ourselves with misaligned AGI before we have time to reach nanotechnology "the hard way", so I currently have at least somewhat more hope in leveraging powerful future AI to achieve nanotech.
(The OP doesn't really talk about this, because the focus is 'is p(doom) high?' rather than 'what are the most plausible paths to us saving ourselves?'.)
In an unpublished 2017 draft, a MIRI researcher and I put together some ass numbers regarding how hard (wet, par-biology) nanotech looked to us:
We believe that the bottlenecks on current progress toward par-biology nanotechnology are (a) figuring out how to put all of the puzzle pieces together correctly, (b) executing certain difficult computations required for determining how to build materials, and (c) engineering certain basic tools that will allow us to engineer better tools, where there are likely to be mutual dependencies between progress on these fronts. If the world’s top scientific and engineering talent were actively focusing on this application and were inspired to solve the key technical problems, we would expect it to be possible to push past these bottlenecks with no more than 10x the compute that Google spent on research projects in 2016.
Assuming no advances in AI algorithms over the state of the art in 2017, we would assign a 50% probability to fifty copies of John von Neumann, divided into five teams and supplied with a large number of lab technicians and other support staff, being able to achieve nanotechnology within 25 calendar years at a level that would be sufficient for a decisive advantage if the technology were available to a group in 2017.
(footnote: We stipulate “in 2017” because we would not necessarily expect par-biology nanotechnology to confer a decisive advantage in a world where nanotechnology had been gradually advanced to that level by human engineers over multiple decades; in that scenario, factors such as leaks, regulations, and competition from other developers would make it harder for one group to strongly pull ahead. We would expect it to be much easier for one group to strongly pull ahead if nanotechnology advances too quickly for leaks, regulations, and competition to be significant factors on the relevant timescale, as we believe is possible using AGI.)
Translating this into a more realistic scenario: we would assign a 40% probability to an organization with a $10 billion budget and the involvement of someone who can attract top researchers and leadership (e.g., Elon Musk) being able to reach this level of technological capability within 25 years, absent AI advances. Our probability would lower to 15% if there were only 10 calendar years available to the hypothetical Musk project instead of 25, and would rise to 85% if there were 50 calendar years and $20 billion available instead of 25 calendar years and $10 billion, holding these conditions stable and assuming no other large global disruptions.
As in §1.3, the predictions here are rough and intuitive, and were not generated by a formal model. It would be difficult for our probability to rise much higher than 85% given additional time or other resources. Our inside-view evaluation of the arguments assigns high probability to par-biology nanotechnology being achievable in fifty years under these idealized conditions, such that the remaining uncertainty in our informal aggregate models largely stems from model uncertainty and deference to experts who disagree with our view and consider par-biology nanotechnology much more difficult. We would be very surprised to learn that par-biology nanotechnology were much more difficult (say, requiring more than 500 VNG research years), and this would have a fairly large impact on our overall expectations about early AGI systems’ potential uses and impact.
(500 VNG research years = 500 von-Neumann-group research years, defined as 'how much progress ten copies of John von Neumann would make if they worked together on the problem, hard, for 500 serial years'.)
This is also why I think humanity should probably put lots of resources into whole-brain emulation: I don't think you need qualitatively superhuman cognition in order to get to nanotech, I think we're just short on time given how slowly whole-brain emulation has advanced thus far.
With STEM-level AGI I think we'll have more than enough cognition to do basically whatever we can align; but given how tenuous humanity's grasp on alignment is today, it would be prudent to at least take a stab at a "straight to whole-brain emulation" Manhattan Project. I don't think humanity as it exists today has the tech capabilities to hit the pause button on ML progress indefinitely, but I think we could readily do that with "run a thousand copies of your top researchers at 1000x speed" tech.
(Note that having dramatically improved hardware to run a lot of ems very fast is crucial here. This is another reason the straight-to-WBE path doesn't look hopeful at a glance, and seems more like a desperation move to me; but maybe there's a way to do it.)
Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.
Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to sentences were left in):
How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?
I think the misuse vs. accident dichotomy is clearer when you don't focus exclusively on "AGI kills every human" risks. (E.g., global totalitarianism risks strike me as small but non-negligible if we solve the alignment problem. Larger are risks that fall short of totalitarianism but still involve non-morally-humble developers damaging humanity's long-term potential.)
The dichotomy is really just "AGI does sufficiently bad stuff, and the developers intended this" versus "AGI does sufficiently bad stuff, and the developers didn't intend this". The terminology might be non-ideal, but the concepts themselves are very natural.
It's basically the same concept as "conflict disaster" versus "mistake disaster". If something falls into both categories to a significant extent (e.g., someone tries to become dictator but fails to solve alignment), then it goes in the "accident risk" bucket, because it doesn't actually matter what you wanted to do with the AI if you're completely unable to achieve that goal. The dynamics and outcome will end up looking basically the same as other accidents.
FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:
[2:21 PM]
It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons so probably just your spinal cord has more information than this. That vastly greater amount of runtime info inside the adult organism grows out of the wiring algorithms as your brain learns to move around your muscles, and your eyes open and the retina wires itself and starts directing info on downward to more things that wire themselves, and you learn to read, and so on.
[2:22 PM]
Anything innate that makes reasoning about people out to cheat you, easier than reasoning about isomorphic simpler letters and numbers on cards, has to be packed into the 7.5MB, and gets there via a process where ultimately one random mutation happens at a time, even though lots of mutations are recombining and being selected on at a time.
[2:24 PM]
It's a very slow learning process. It takes hundreds or thousands of generations even for a pretty good mutation to fix itself in the population and become reliably available as a base for other mutations to build on. The entire organism is built out of copying errors that happened to work better than the things they were copied from. Everything is built out of everything else, the pieces that were already lying around for building other things.
[2:27 PM]
When you're building an organism that can potentially benefit from coordinating, trading, with other organisms very similar to itself, and accumulating favors and social capital over long time horizons - and your organism is already adapted to predict what other similar organisms will do, by forcing its own brain to operate in a special reflective mode where it pretends to be the other person's brain - then a very simple way of figuring out what other people will like, by way of figuring out how to do them favors, is to notice what your brain feels when it operates in the special mode of pretending to be the other person's brain.
[2:27 PM]
And one way you can get people who end up accumulating a bunch of social capital is by having people with at least some tendency in them - subject to various other forces and overrides, of course - to feel what they imagine somebody else feeling. If somebody else drops a rock on their foot, they wince.
[2:28 PM]
This is a way to solve a favor-accumulation problem by laying some extremely simple circuits down on top of a lot of earlier machinery.
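For anyone who wants to check the arithmetic in the 2:21 PM message, here is a rough sketch in Python. The fractions (90% junk DNA, 10% of the remainder devoted to neural wiring) are the guesses from the message itself, not measured values, and the function and constant names are just ones I made up for illustration. It also assumes decimal megabytes (1 MB = 10^6 bytes).

```python
# Back-of-the-envelope check of the genome-size figures in the 2:21 PM message.
# All fractions are the message's own rough guesses, not measured quantities.

BASE_PAIRS = 3_000_000_000   # human genome size quoted in the message
BITS_PER_BASE_PAIR = 2       # 4 possible bases -> log2(4) = 2 bits each
NON_JUNK_FRACTION = 0.10     # message assumes 90% of the genome is junk DNA
WIRING_FRACTION = 0.10       # message assumes 10% of the rest is neural wiring


def genome_megabytes(base_pairs=BASE_PAIRS, bits_per_pair=BITS_PER_BASE_PAIR):
    """Raw information content of the genome in decimal megabytes."""
    total_bits = base_pairs * bits_per_pair
    return total_bits / 8 / 1_000_000


def wiring_megabytes(non_junk=NON_JUNK_FRACTION, wiring=WIRING_FRACTION):
    """Portion of the genome attributed to neural wiring algorithms, in MB."""
    return genome_megabytes() * non_junk * wiring


if __name__ == "__main__":
    print(f"whole genome:  {genome_megabytes():.1f} MB")
    print(f"neural wiring: {wiring_megabytes():.1f} MB")
```

Changing either fraction shows how sensitive the 7.5 MB figure is to those guesses; the order-of-magnitude point about the bottleneck doesn't depend on them being exact.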
I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.
[...]
I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.
Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.
The main reasons I feel more positive about the agent-foundations-ish cases I know about are:
- The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
- I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
- The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.
- (Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)
- Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.
I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.
Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.
[...]
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:
In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.
E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).
Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.
Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.
MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"
My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.
None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.
Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.
Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)
(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)
The definitions given in the post are:
- ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.
[...]
- misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.
I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some subtle way) on differences in how we're defining relevant words (like "ideally", "good", and "problems").
I'd be fine with skipping over this question, except that some of the differences-in-definition might be important for the other questions, so this question may be useful for establishing a baseline.
With "misaligned AI", there are some definitional issues but I expect most of the disagreement to be substantive, since there are a lot of different levels of Badness you could expect even if you want to call all misaligned AI "bad" (at least relative to ASI-boosted humans).
In my own answers, I interpreted "misaligned AGI" as meaning: We weren't good enough at alignment to make the AGI do exactly what we wanted, so it permanently took control of the future and did "something that isn't exactly what we wanted" instead. (Which might be kinda similar to what we wanted, or might be wildly different, etc.)
If an alien only cared about maximizing the amount of computronium in the universe, and it built an AI that fills the universe with computronium because the AI values calculating pi, then I think I'd say that the AI is "aligned with that alien by default / by accident", rather than saying "the AI is misaligned with that alien but is doing ~exactly what we want anyway". So if someone thinks AI does exactly what humans want even with humans putting in zero effort to steer the AI toward that outcome, I'd classify that as "aligned-by-default AI", rather than "misaligned AI". (But there's still a huge range of possible-in-principle outcomes from misaligned AI, even if I think some are a lot more likely than others.)
Predictions, using the definitions in Nate's post:
My example with the 100 million referred to question 1.
Yeah, I'm also talking about question 1.
I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.
Seems obviously false as a description of my values (and, I'd guess, just about every human's).
Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.
If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.
My own suggestion would be to use a variety of different phrasings here, including both "capabilities" and "intelligence", and also "cognitive ability", "general problem-solving ability", "ability to reason about the world", "planning and inference abilities", etc. Using different phrases encourages people to think about the substance behind the terminology -- e.g., they're more likely to notice their confusion if the stuff you're saying makes sense to them under one of the phrasings you're using, but doesn't make sense to them under another of the phrasings.
Phrases like "cognitive ability" are pretty important, I think, because they make it clearer why these different "capabilities" often go hand-in-hand. They also clarify that the central problems are related to minds / intelligence / cognition / etc., not (for example) the strength of a robotic arm, even though that too is a "capability".