Another take on agent foundations: formalizing zero-shot reasoning

zhukeepa

After spending more time thinking about MIRI’s agenda, I’ve come upon another framing of it, which I’m calling zero-shot reasoning. [1]

This post is a distillation of my intuitions around zero-shot reasoning. These views should be taken as my own, and not MIRI’s.

A quick summary of this post:

“Zero-shot reasoning” refers to the ability to get things right on the first try, no matter how novel or complicated.
In simple domains, like mathematics, zero-shot reasoning is fully captured by formal verification of proofs. In more general domains, zero-shot reasoning requires an extension of formal verification that can be applied to real-world plans.
MIRI-esque philosophical work is necessary to extend formal verification to more general domains.
A formal account of zero-shot reasoning will likely be unimportant for aligning the world’s first AGIs, but will likely be essential for aligning a recursively self-improving AGI.
Humanity will most likely end up building a recursively self-improving AGI (plausibly because of insufficient coordination around not building a recursively self-improving AGI).
We can probably delegate much of the work of formalizing zero-shot reasoning to a post-AGI society, but working on zero-shot reasoning today nevertheless substantially increases the odds that our first recursively self-improving AGI is aligned.

What is zero-shot reasoning?

Few-shot reasoning vs zero-shot reasoning

The world is largely chaotic and unpredictable, yet humans can ideate and successfully execute on hugely conjunctive plans with many moving parts, many of which are the first of their kind. We can ship massive software projects and send rockets to space. Some people can build billion-dollar companies over and over again.

On the other hand, it’s clear that human abilities to do this are limited. Big software projects invariably have critical bugs and security flaws. Many of our spacecraft have exploded before making it to space. Most people can’t build a company to save their lives, and most successful entrepreneurs fail when building a second company.

Native human software is capable of doing something I’ll call few-shot reasoning—when performing complex reasoning and planning to accomplish something novel, humans can usually get it right after a few rounds of iteration. The more dissimilar this reasoning is to prior reasoning they've done, the more rounds of iteration they need.

I think something like zero-shot reasoning (the ability to perform arbitrarily novel and complex reasoning while having calibrated high confidence in its soundness) is possible in principle. A superintelligent zero-shot reasoner would be able to:

Build an operating system as complex as Microsoft Windows in assembly language, without serious bugs, without once running the code
Build one spacecraft that lands on Mars, after observing Earth for one day, without building any other spacecraft
Amass $1 trillion over three years

It should be able to do all of these with extremely high confidence. [2]

Zero-shot reasoning might seem like magic, but in fact humans have some native capacity for it in some limited domains, like pure mathematics. Given a conjecture to prove or disprove, a human can start from a small set of axioms and combine them in extraordinarily novel and complex ways to confidently arrive at a proof or disproof of the conjecture.

That said, humans do make mistakes in mathematical proofs. But with the assistance of formal verification tools like Coq, humans can become extremely confident that their proofs are error-free, no matter how novel or complex they are. [3]
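
As a minimal illustration of what a machine-checked proof looks like, here is a sketch in Lean 4 (used here purely for illustration; the same idea applies to Coq). The theorem name `add_comm_example` is made up for this example, while `Nat.add_comm` is a lemma from Lean's core library. If the file compiles, the proof checker has confirmed that every step follows from the axioms, regardless of how confident the human author felt.

```lean
-- A machine-checked statement that addition on natural numbers commutes.
-- `add_comm_example` is a hypothetical name chosen for this sketch;
-- the proof term simply appeals to the core-library lemma `Nat.add_comm`.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A real proof of a novel result would be far longer, but the guarantee is of the same kind: the checker either accepts the whole chain of reasoning or rejects it.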

In a similar vein, humans can in principle build an operating system as complex as Microsoft Windows in assembly language, without serious bugs. Even if they’re writing a huge amount of code, they could formally verify each component they build up, and formally verify that compositions of these components work as desired. While this can only give them guarantees about what they prove, they can very confidently avoid certain classes of bugs (like memory leaks, buffer overflows, and deadlocks) without ever having to run the code.

Formalizing zero-shot reasoning

Formal verification provides a formal account of zero-shot reasoning in limited domains (namely, those describable by formal axiomatic systems, like mathematics or software). I think a formal account of more general zero-shot reasoning will involve an extension of formal verification to real-world, open-ended domains, that would give us some way of “formally verifying” that plans for e.g. building a rocket or amassing wealth will succeed with high probability.

Note that much of general zero-shot reasoning consists of subcomponents that involve reasoning within formal systems. For example, when building rockets, we do a lot of reasoning about software and Newtonian physics, both of which can be construed as formal systems.

In addition to formal verification, a complete formal account of general zero-shot reasoning will require formalizing other aspects of reasoning:

Making and trusting abstractions: Humans are capable of turning sense data into abstract formal systems like Newtonian mechanics, and then deciding under which situations it’s appropriate to apply those abstractions. How do we formalize what an abstraction is, how to make abstractions, and under which circumstances to trust them?
Bounded rationality: What does it mean for a bounded agent to have a calibrated estimate of the likelihood that a plan will succeed? In other words, how can we tell when an agent with limited computing resources is properly reasoning about logical and empirical uncertainty? (Bayesian inference gets us some of the way there, but doesn’t tell us how to select hypotheses and is often computationally intractable. Logical induction is a good starting formalism for logical uncertainty, but current algorithms are also computationally intractable.)
Self-trust: An agent may need to formally reason about how much to trust its reasoning process. How can an agent formally refer to itself within a world in which it’s embedded, and reason about the ways its own reasoning might be faulty? (Sources of error may include hardware failures in the physical world and bugs in the software it’s running.)
Logical counterfactuals: An agent may want to formally reason about the consequences of choices it takes. But if it’s deterministic, it will only end up making one choice, so it’s not clear how to formally talk about what happens if it picks something else. (Concretely, if it’s reasoning about whether to take action A or action B, and it in fact takes action A, reasoning formally about what would happen if it took action B is confusing, because anything can happen if it takes action B by the principle of explosion.)
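
To make the principle of explosion in the last item concrete, here is a small Lean 4 sketch. The reading of the symbols is an assumption for illustration: `P` stands for "the agent takes action B", `hnp` stands for a proof, inside the agent's formal theory, that it does not take action B, and `Q` is any claim whatsoever about the consequences.

```lean
-- Principle of explosion: from P together with ¬P, any proposition Q follows.
-- Illustrative reading (an assumption of this sketch):
--   P   = "the agent takes action B"
--   hnp = the theory's proof that the agent does not take action B
--   Q   = an arbitrary claim about what happens next
example (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```

Because `Q` is arbitrary, naively supposing that the agent takes action B "proves" every outcome at once, which is why a more careful account of counterfactuals is needed.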

This list is just an overview of some of the problems that need to be solved, and is by no means intended to be exhaustive. Also note the similarity between this list and the technical research problems listed in MIRI’s Agent Foundations agenda.

Why care about formalizing zero-shot reasoning?

Isn’t extreme caution sufficient for zero-shot reasoning?

It’s true that humans can make plans far more robust by thinking about them much longer and much more carefully. If there’s a massive codebase, or a blueprint of a rocket, or a detailed business plan, you could make them far more robust if you had the equivalent of a billion humans ruminating over the plan, reasoning about all the edge cases, brainstorming adversarial situations, etc. And yet, I think there remains a qualitative difference between “I thought about this plan very very hard and couldn’t find any errors” and “Under plausible assumptions, I can prove this plan will work”. [4]

It was critically important to Intel that their chips do arithmetic correctly, yet their reliance on human judgment led to the Pentium division bug. (They now rely on formal verification.) The Annals of Mathematics, the most prestigious mathematics journal, accepted both a paper proving some result and a paper proving the negation of that result.

Human reasoning is fundamentally flawed. Our brains were evolutionarily selected to play political games on savannahs, not to make long error-free chains of reasoning. Our cognition is plagued with heuristics and biases (including a heuristic that what we see is all there is), and we all have massive blind spots in our reasoning that we aren’t even aware exist. If we check a plan extremely thoroughly, we can only trust the plan to the extent that we trust that the plan doesn’t have any failure modes within our blind spots. The more conjunctive a plan is, the more likely it is that it will have a failure point hidden within our blind spots.

More concretely, suppose we have a plan with 20 components, and our estimate is that each component has a 99.9% chance of success, but in actuality three of the components have likelihoods of success closer to 80% because of edge cases we failed to consider. The overall plan will have a 0.999^17 * 0.80^3 ≈ 50% chance of success, rather than the 0.999^20 ≈ 98% we were hoping for. If such a plan had 100 components instead, with a similar fraction of them undermined by unconsidered edge cases, its likelihood of success would be driven close to zero. [5]
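
The arithmetic above can be reproduced with a short Python sketch. It assumes that components succeed independently, which is what multiplying the probabilities implies. The function name `plan_success_probability` and the 100-component scenario with 15 overlooked components (the same 15% share as 3 out of 20) are assumptions introduced for illustration, not figures from the argument itself.

```python
def plan_success_probability(n_components, n_overlooked,
                             p_believed=0.999, p_overlooked=0.80):
    """Probability that every component of a conjunctive plan succeeds.

    Assumes components succeed independently. `n_overlooked` components
    succeed with probability `p_overlooked` (unconsidered edge cases);
    the remaining components succeed with probability `p_believed`.
    """
    n_sound = n_components - n_overlooked
    return p_believed ** n_sound * p_overlooked ** n_overlooked


# What we believed: all 20 components at 99.9%.
hoped_for = plan_success_probability(20, 0)
# What is actually true: three components are closer to 80%.
actual = plan_success_probability(20, 3)
# Hypothetical scaling: 100 components with the same 15% share overlooked.
scaled = plan_success_probability(100, 15)

print(f"hoped for:      {hoped_for:.1%}")
print(f"actual:         {actual:.1%}")
print(f"100 components: {scaled:.1%}")
```

The gap between the first two numbers comes entirely from three components we were wrong about, and it widens quickly as the number of overlooked components grows.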

We can avoid this problem if we have guarantees that we’ve covered all the relevant edge cases, but such a guarantee seems more similar in nature to a “proof” that all edge cases have been covered (i.e., formal zero-shot reasoning) than to an assurance that someone failed to think up unhandled edge cases after trying really hard (i.e., extreme caution).

Do we need zero-shot reasoning at all?

I think we will most likely end up building an AGI that recursively self-improves, and I think recursive self-improvement is very unlikely to be safe without zero-shot reasoning. [6]

If you’re building a successor agent far more powerful than yourself to achieve your goals, you’d definitely want a guarantee that your successor agent is aligned with your goals, as opposed to some subtle distortion of them or something entirely different. You’d also want to have a level of confidence in this guarantee that goes much beyond “I thought really hard about ways this could go wrong and couldn’t think of any”. [7]

This is especially the case if that successor agent will create successor agents that create successor agents that create successor agents, etc. I feel very pessimistic about building an aligned recursively self-improving AGI, if we can’t zero-shot reason that our AGI will be aligned, and also zero-shot reason that our AGI and all its successors will zero-shot reason about the alignment of their successors.

Zero-shot reasoning seems much less important if we condition on humanity never building an AGI that fooms. I consider this conditional very unlikely if hard takeoffs are possible at all. I expect there will be consistent incentives to build more and more powerful AGI systems (insofar as there will be consistent incentives for humans to more efficiently attain more of what they value). I also expect the most powerful AI systems to be recursively self-improving AGIs without humans in the loop, since humans would bottleneck the process of self-improvement.

Because of such incentives, a human society that has not built a foomed AGI is at best in an unstable equilibrium. Even if the society is run by a competent world government that deploys superintelligent AIs to enforce international security, I would not expect this society to last for 1,000,000,000 years without some rogue actor building a foomed AGI, which I imagine would be smart enough to cut through this society’s security systems like butter. (I have a strong intuition that for narrow tasks with extremely high ceilings for performance, like playing Go well or finding security vulnerabilities, a foomed AGI could perform that task far better than any AI produced by a human society with self-imposed limitations.)

Preventing anything like this from happening for 1,000,000,000 years seems very unlikely to me. Human societies are complex, open-ended systems teeming with intelligent actors capable of making novel discoveries and exploiting security flaws. Ensuring that such a complex system stays stable for as long as 1,000,000,000 years seems plausible only with the assistance of an aligned AGI capable of zero-shot reasoning about this system. But in that case we might as well have this AGI zero-shot reason about how it could safely recursively self-improve, in which case it would robustly optimize for our values for much longer than 1,000,000,000 years.

Why should we work on it?

Can’t we just train our AIs to be good zero-shot reasoners?

There's a difference between being able to do math well, and having a formal notion of what a correct mathematical proof is. It’s possible to be extremely good at mathematics without having any formal notion of what constitutes a correct mathematical proof (Newton was certainly like this). It’s even possible to be extremely good at mathematics while being sloppy at identifying which proofs are correct—I’ve met mathematicians who can produce brilliant solutions to math problems, who are also very prone to making careless mistakes in their solutions.

Likewise, it’s possible to train AIs that learn to create and apply abstractions, act sensibly as bounded rational agents, reason about themselves in their environments, and reason sensibly about counterfactuals. This is completely different from them having formal notions of how to do all of these correctly, and the fact that they can do these things at all gives no guarantee about how well they do them.

We won’t be able to train our AIs to be better at zero-shot reasoning than we are, because we don’t have enough examples of good general zero-shot reasoning to point them to. At best we’ll be able to impart our own pre-rigorous notions to them.

Can’t we build AIs that help us formalize zero-shot reasoning?

In principle, yes, but the task of converting pre-rigorous philosophical intuitions into formal theories is the most “AGI-complete” task I can imagine, so by default I expect it to be difficult to build a safe AGI that can usefully help us formalize zero-shot reasoning. That said, I could imagine a few approaches working:

Paul Christiano’s research agenda might let us build safe AGIs that can perform thousands of years’ worth of human cognition, which would be sufficient to help us formalize zero-shot reasoning. (On the other hand, we might need a formal account of zero-shot reasoning to establish the worst-case guarantees that Paul wants for his agenda.)
We could carefully construct non-superintelligent AGI assistants that can help humans perform arbitrary cognition, but are trained to be docile and are only run in very limited contexts (e.g. we never let one run for more than 5 minutes at a time before resetting its state). I feel confused about whether this is possible, but it’s certainly conceivable to me.
We train tool AIs on lots of examples of humans successfully turning pre-rigorous intuitions into formal theories.
We build technologies that substantially expedite philosophical progress, e.g. via intelligence amplification or whole-brain emulations run at 10,000x.

Won’t our AGIs want to become good zero-shot reasoners?

I do suspect that becoming a skilled zero-shot reasoner is a convergent instrumental goal for superintelligences. If we start with an aligned AGI that can self-modify to become a skilled zero-shot reasoner without first modifying into a misaligned superintelligence (possibly by mistake, e.g. by letting its values drift or by getting taken over by daemons), I’d feel good about the resulting outcome.

Whether we can trust that to happen is an entirely separate story. I certainly wouldn’t feel comfortable letting an AGI undergo recursive self-improvement without having some extremely strong reason to think its values would be maintained throughout the process, and some extremely strong reason to think it wouldn’t be overtaken by daemons. (I worry about small bugs in the AI creating security flaws that go unnoticed for a while, but are then exploited by a daemon, perhaps quite suddenly. The AI might worry about this too and want to take preventative measures, but at that point it might be too late.)

It might turn out that corrigibility is robust and has a simple core that powerful ML models can learn, that AGIs are likely to only get more and more corrigible as they get more and more powerful, that daemons are simple to prevent, and that corrigible AGIs will by default reliably prevent themselves from being overtaken by daemons. On these assumptions, I’d feel happy training a prosaic AGI to be corrigible and letting it recursively self-improve without any formalization of zero-shot reasoning. On the other hand, I think this conjunction of assumptions is unlikely, and for us to believe it we might need a formal account of zero-shot reasoning anyway.

Why should we think zero-shot reasoning is possible to formalize?

Humanity has actually made substantial progress toward formalizing zero-shot reasoning. Over the last century or so, we’ve formalized first-order logic, formalized expected utility theory, defined computation, defined information, formalized causality, developed theoretical foundations for Bayesian reasoning, and formalized Occam’s razor. More recently, MIRI has formalized aspects of logical uncertainty and made advances in decision theory. I also think all the problems in MIRI’s agent foundations agenda are tractable, and likely to result in further philosophical progress. [8]

Can we formalize zero-shot reasoning in time?

Probably not, but working on it now still nontrivially increases the odds that we do. Impressive progress on formalizing zero-shot reasoning makes it more prestigious, more broadly accessible (pre-rigorous intuitions are much harder to communicate than formal ones), and closer to being solved. This makes it more likely for it to be understood and taken seriously by the major players shortly before a singularity, and thus more likely for them to coordinate around not building a recursively self-improving AI before formalizing zero-shot reasoning.

(For comparison, suppose it turned out that homotopy type theory were necessary to align a recursively self-improving AGI, and we found ourselves in a parallel universe in which no work had been done on the topic. Even though we could hope for the world to hold off on recursive self-improvement until homotopy type theory were adequately developed, doesn't it seem much better that we're in a universe with a textbook and a community around this topic?)

Additionally, I think it’s not too unlikely that AGI is far away and/or that zero-shot reasoning is surprisingly easy to formalize. Under either assumption, it becomes far more plausible that we can formalize it in time, and whether or not we make it is straightforwardly impacted by how much progress we make today.

My personal views

I ~20% believe that we need to formalize zero-shot reasoning before we can build AGI systems that enable us to perform a pivotal act, ~85% believe that we need to formalize zero-shot reasoning before building a knowably safe recursively self-improving AI, and ~70% believe that conceptual progress on zero-shot reasoning is likely to result in conceptual progress in adjacent topics, like corrigibility, secure capability amplification, and daemon prevention.

I think working on zero-shot reasoning today will most likely turn out to be unhelpful if:

takeoff is slow (which I assign ~20%)
we can build a flourishing human society that coordinates around not building a recursively self-improving AGI, that stays stable for 1,000,000,000 years (which I assign ~10%), or
we can safely offload the bulk of formalizing zero-shot reasoning to powerful systems (like ALBA or whole-brain emulations) and implement an aligned recursively self-improving AGI before someone else builds a misaligned recursively self-improving AGI (which I assign ~50%).

My current all-things-considered position is that a formalization of zero-shot reasoning will substantially improve the odds that our first recursively self-improving AGI is aligned with humans, and that working on it today is one of humanity’s most neglected and highest-leverage interventions for reducing existential risk.

[1] This term is named in analogy with zero-shot learning, which refers to the ability to perform some task without any prior examples of how to do it.

[2] Not arbitrarily high confidence, given inherent uncertainties and unpredictabilities in the world.

[3] We can’t get arbitrarily high confidence even in the domain of math, because we still need to trust the soundness of our formal verifier and the soundness of the axiom system we're reasoning in.

[4] It’s worth noting that a team of a billion humans could confidently verify the software’s correctness by “manually” verifying the code, if they all know how to do formal verification. I feel similarly optimistic about any domain where the humans have formal notions of correctness, like mathematics. On the other hand, I feel pessimistic about humans verifying software if they don't have any notion of formal verification and can't rederive it.

[5] I'm specifically referring to conjunctive plans that we'd like to see succeed on our first try, without any iteration. This excludes running companies, which requires enormous amounts of iteration.

[6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year, at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation.

[7] It might be possible for humans to achieve this level of confidence without a formalization of zero-shot reasoning, e.g. if we attain a deep understanding of corrigibility that doesn’t require zero-shot reasoning. See “Won’t our AGIs want to become good zero-shot reasoners?”

[8] Zero-shot reasoning is not about getting 100% mathematical certainty that your actions will be safe or aligned, which I believe to be a common misconception people have of MIRI’s research agenda (especially given language around “provably beneficial AI”). Formalization is less about achieving 100% certainty than it is about providing a framework in which we can algorithmically verify whether some line of reasoning is sound. Getting 100% certainty is impossible, and nobody is trying to achieve it.

Thanks to Rohin Shah, Ryan Carey, Eli Tyre, and Ben Pace for helpful suggestions and feedback.