This post is my attempt to summarize and distill the major public debates about MIRI's highly reliable agent designs (HRAD) work (which includes work on decision theory), including the discussions in Realism about rationality and Daniel Dewey's My current thoughts on MIRI's "highly reliable agent design" work. Part of the difficulty with discussing the value of HRAD work is that it's not even clear what the disagreement is about, so my summary takes the form of multiple possible "worlds" we might be in; each world consists of a positive case for doing HRAD work, along with the potential objections to that case, which results in one or more cruxes.
I will talk about "being in a world" throughout this post. What I mean by this is the following: If we are "in world X", that means that the case for HRAD work outlined in world X is the one that most resonates with MIRI people as their motivation for doing HRAD work; and that when people disagree about the value of HRAD work, this is what the disagreement is about. When I say that "I think we are in this world", I don't mean that I agree with this case for HRAD work; it just means that this is what I think MIRI people think.
In this post, the pro-HRAD stance is something like "HRAD work is the most important kind of technical research in AI alignment; it is the overwhelming priority and we're pretty much screwed if we under-invest in this kind of research" and the anti-HRAD stance is something like "HRAD work seems significantly less promising than other technical AI alignment agendas, such as the approaches to directly align machine learning systems (e.g. iterated amplification)". There is a much weaker pro-HRAD stance, which is something like "HRAD work is interesting and doing more of it adds value, but it's not necessarily the most important kind of technical AI alignment research to be working on"; this post is not about this weaker stance.
Before describing the various worlds, I want to present some distinctions that have come up in discussions about HRAD, which will be relevant when distinguishing between the worlds.
The idea of levels of abstraction was introduced in the context of debate about HRAD work by Rohin Shah, and is described in this comment (start from "When groups of humans try to build complicated stuff"). For more background, see these articles on Wikipedia.
Later on, in this comment Rohin gave a somewhat different "levels" idea, which I've decided to call "levels of indirection". The idea is that there might not be a hierarchy of abstraction, but there are still multiple intermediate layers between the theory you have and the end-result you want. The relevant "levels of indirection" are the steps in the sequence HRAD → machine learning → AGI. Even though levels of indirection are different from levels of abstraction, the idea is that the same principle applies, where the more levels there are, the harder it becomes for a theory to apply to the final level.
A precise theory is one which can scale to 2+ levels of abstraction/indirection.
An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.
More intuitively, a precise theory is more mathy, rigorous, and exact like pure math and physics, and an imprecise theory is less mathy, like economics and psychology.
This distinction comes from Abram Demski's comment. However, I'm not confident I've understood this distinction in the way that Abram intended it, so what I describe below may be a slightly different distinction.
Building agents from the ground up means having a precise theory of rationality that allows us to build an AGI in a satisfying way, e.g. where someone with security mindset can be confident that it is aligned. Importantly, we allow the AGI to be built using whatever way is safest or most theoretically satisfying, rather than requiring that the AGI be built using whatever methods are mainstream (e.g. current machine learning methods).
Understanding the behavior of rational agents and predicting roughly what they will do means being handed an arbitrary agent implemented in some way (e.g. via blackbox ML) and then being able to predict roughly how it will act.
I think of the difference between these two as the difference between existential and universal quantification: "there exists x such that P(x)" and "for all x we have P(x)", where P(x) is something like "we can understand and predict how x will act in a satisfying way". The former only says that we can build some AGI using the precise theory that we understand well, whereas the latter says we have to deal with whatever kind of AGI that ends up being developed using methods we might not understand well.
The goal of HRAD research is to generally become less confused about things like counterfactual reasoning and logical uncertainty. Becoming less confused about these things will: help AGI builders avoid, detect, and fix safety issues; help AGI builders predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. Moreover, unless we become less confused about these things, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.
For this case, it is not important for the final product of HRAD to be a precise theory. Even if the final theory of embedded agency is imprecise, or even if there is no "final say" on the topic, if we are merely much less confused than we are now, that is still good enough to help us ensure AI systems are aligned.
The main reason I think we might be in this world (i.e. that the above case is the motivating reason for MIRI prioritizing HRAD work) is that people at MIRI frequently seem to be saying something like the case above. However, they also seem to be saying different things in other places, so I'm not confident this is actually their case. Here are some examples:
One way to reject this case for HRAD work is by saying that imprecise theories of rationality are insufficient for helping to align AI systems. This is what Rohin does in this comment where he says imprecise theories cannot build things "2+ levels above".
There is a separate potential rejection, which is to say that either HRAD work will never result in precise theories or that even a precise theory is insufficient for helping to align AI systems. However, these move the crux to a place where they apply to more restricted worlds where the goal of HRAD work is specifically to come up with a precise theory, so these will be covered in the other worlds below.
There is a third rejection, which is to argue that other approaches (such as iterated amplification) are more promising for gaining clarity on alignment. In this case, the main disagreement may instead be about other agendas rather than about HRAD.
The goal of HRAD research is to come up with a theory of rationality that is so precise that it allows one to build an agent from the ground up. Deconfusion is still important, as with world 1, but in this case we don't merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory of rationality.
For this case, HRAD research isn't intended to produce a precise theory about how to predict ML systems, or to be able to make precise predictions about what ML systems will do. Instead, the idea is that the precise theory of rationality will help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. In other words, instead of directly using a precise theory about understanding/predicting rational agents in general, we use the precise theory about rationality to help us roughly predict what rational agents will do in general (including ML systems).
As with world 1, unless we become less confused, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.
This seems to be what Abram is saying in this comment (see especially the part after "I guess there's a tricky interpretational issue here").
It also seems to match what Rohin is saying in these two comments.
The examples MIRI people sometimes give for precedents of HRAD-ish work, like the work done by Turing, Shannon, and Maxwell, are precise mathematical theories.
There seem to be two possible rejections of this case:
The goal of HRAD research is to directly come up with a precise theory for understanding the behavior of rational agents and predicting what they will do. Deconfusion is still important, as with worlds 1 and 2, but in this case we don't merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory that allows us to predict agents' behavior in general. And a precise theory is important, but we don't merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.
This theory can then be used to help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want.
As with worlds 1 and 2, unless we become less confused, we are likely to screw up alignment because we won't deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don't decompose cognitive work enough.
I mostly don't think we're in this world, but some critics might think we are.
For example, Abram says in this comment: "I can see how Ricraz would read statements of the first type [i.e. having precise understanding of rationality] as suggesting very strong claims of the second type [i.e. being able to understand the behavior of agents in general]."
Daniel Dewey might also expect to be in this world; it's hard for me to tell based on his post about HRAD.
The crux in this world is basically the same as the first rejection for world 2: we can reject the existence of a precise theory for understanding the behavior of arbitrary rational agents.
To summarize the above, combining all of the possible worlds, the pro-HRAD stance becomes:
(ML safety agenda not promising) and (
(even an imprecise theory of rationality helps to align AGI) or
((a precise theory of rationality can be found) and
(a precise theory of rationality can be used to help align AGI)) or
(a precise theory to predict behavior of arbitrary agent can be found)
)
and the anti-HRAD stance is the negation of the above:
(ML safety agenda promising) or (
(an imprecise theory of rationality cannot be used to help align AGI) and
((a precise theory of rationality cannot be found) or
(even a precise theory of rationality cannot be used to help align AGI)) and
(a precise theory to predict behavior of arbitrary agent cannot be found)
)
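For readers who prefer symbols, the two stances can be written compactly. This is only a restatement under my own labelling, which I am assuming matches the letters used in the Double Crux discussion: A = "ML safety agenda not promising", B = "even an imprecise theory of rationality helps to align AGI", C = "a precise theory of rationality can be found", D = "a precise theory of rationality can be used to help align AGI", E = "a precise theory to predict behavior of arbitrary agent can be found". The last step is just De Morgan's laws applied twice.

```latex
\begin{align*}
\text{pro-HRAD}  &\equiv A \land \bigl(B \lor (C \land D) \lor E\bigr) \\
\text{anti-HRAD} &\equiv \lnot\bigl[A \land \bigl(B \lor (C \land D) \lor E\bigr)\bigr] \\
                 &\equiv \lnot A \lor \bigl(\lnot B \land (\lnot C \lor \lnot D) \land \lnot E\bigr)
\end{align*}
```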
How does this fit under the Double Crux framework? The current "overall crux" is a messy proposition consisting of multiple conjunctions and disjunctions, and fully resolving the disagreement can in the worst case require assigning truth values to all five parts: the statement "A and (B or (C and D) or E)", with disagreements resolved in the order A=True, B=False, C=True, D=False can still be true or false depending on the value of E. From an efficiency perspective, if some of the conjunctions/disjunctions don't matter, we want to get rid of them in order to simplify the structure of the overall crux (this corresponds to identifying which "world" we are in, using the terminology of this post), and we also might want to pick an ordering of which parts to resolve first (for example, with A=True and B=True, we already know the overall proposition is true).
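The point about resolution order can be checked mechanically. Below is a minimal sketch, not part of the original argument: it treats the five parts as plain booleans and enumerates the remaining ones to see which truth values the overall crux can still take. The function names `pro_hrad` and `possible_values` are made up for this illustration.

```python
from itertools import product


def pro_hrad(a, b, c, d, e):
    """Overall crux as stated in the post: A and (B or (C and D) or E)."""
    return a and (b or (c and d) or e)


def possible_values(**fixed):
    """Set of truth values the crux can still take once some parts are fixed.

    Parts not listed in `fixed` are treated as unresolved and enumerated.
    """
    names = ["a", "b", "c", "d", "e"]
    free = [n for n in names if n not in fixed]
    values = set()
    for combo in product([False, True], repeat=len(free)):
        assignment = {**fixed, **dict(zip(free, combo))}
        values.add(pro_hrad(**assignment))
    return values


# Worst case from the text: four parts resolved, outcome still depends on E.
worst_case = possible_values(a=True, b=False, c=True, d=False)

# Early exit from the text: A=True and B=True already settle the question.
early_exit = possible_values(a=True, b=True)

print(worst_case)
print(early_exit)
```

A result set containing both truth values means the disagreement is still open; a single-element set means the remaining parts no longer matter, which is the efficiency argument for choosing the order in which to resolve them.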
So some steps for moving the discussion forward:
Thanks to Ben Cottier, Rohin Shah, and Joe Bernstein for feedback on this post.
World 3 doesn't strike me as a thing you can get in the critical period when AGI is a new technology. Worlds 1 and 2 sound approximately right to me, though the way I would say it is roughly: We can use math to better understand reasoning, and the process of doing this will likely improve our informal and heuristic descriptions of reasoning too, and will likely involve us recognizing that we were in some ways using the wrong high-level concepts to think about reasoning.
I haven't run the characterization above by any MIRI researchers, and different MIRI researchers have different models of how the world is likeliest to achieve aligned AGI. Also, I think it's generally hard to say what a process of getting less confused is likely to look like when you're still confused.
(I really like this post, as I said to Issa elsewhere, but) I realized after discussing this earlier that I don't agree with a key part of the precise vs. imprecise model distinction.
I think this is wrong. More levels of abstraction are worse, not better. Specifically, if a model exactly describes a system on one level, any abstraction will lose predictive power. (Ignoring computational cost, which I'll get back to.) Quantum theory is more specifically predictive than Newtonian physics. The reason that we can move up and down levels is because we understand the system well enough to quantify how much precision we are losing, not because we can move further without losing precision.
The reason that precise theories are better is because they are tractable enough to quantify how far we can move away from them, and how much we lose by doing so. The problem with economics isn't that we don't have accurate enough models of human behavior to aggregate them, but that the inaccuracy isn't precise enough to allow understanding how the uncertainty from psychology shows up in economics. For example, behavioral economics is partly useless because we can't build equilibrium models - and the reason is because we can't quantify how they are wrong. For economics, we're better off with the worse model of rational agents, which we know is wrong, but can kind-of start to quantify by how much, so we can do economic analyses.
... we don't merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.
FWIW, this is what I consider myself to be mainly working towards, and I do expect that the problem is directly solvable. I don't think that's a necessary case to make in order for HRAD-style research to be far and away the highest priority for AI safety (so it's not necessarily a crux), but I do think it's both sufficient and true.
Planned summary for the Alignment Newsletter:
This post tries to identify the possible cases for highly reliable agent design (HRAD) work to be the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.
The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues and help to conceptually clarify the AI alignment problem. For this purpose, we just need _conceptual_ deconfusion -- it isn’t necessary that there must be precise equations defining what an AI system does.
The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.
The last case is that understanding how intelligence works will give us a theory that allows us to predict how _arbitrary_ agents will behave, which will be useful for AI alignment in all the ways described in the first case and <@more@>(@Theory of Ideal Agents, or of Existing Agents?@).
Looking through past discussion on the topic, the author believes that people at MIRI primarily believe in the first two cases. Meanwhile, critics (particularly me) say that it seems pretty unlikely that we can build a precise, mathematical theory, and a more conceptual but imprecise theory may help us understand reasoning better but is less likely to generalize sufficiently well to say important and non-trivial things about AI alignment for the systems we are actually building.
I like this post -- it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.
Thanks for the post :) To be clear, I'm very excited about conceptual and deconfusion work in general, in order to come up with imprecise theories of rationality and intelligence. I guess this puts my position in world 1. The thing I'm not excited about is the prospect of getting to this final imprecise theory via doing precise technical research. In other words, I'd prefer HRAD work to draw more on cognitive science and less on maths and logic. I outline some of the intuitions behind that in this post.
Having said that, when I've critiqued HRAD work in the past, on a couple of occasions I've later realised that the criticism wasn't aimed at a crux for people actually working on it (here's my explanation of one of those cases). To some extent this is because, without a clearly-laid-out position to criticise, the critic has the difficult task of first clarifying the position then rebutting it. But I should still flag that I don't know how much HRAD researchers would actually disagree with my claims in the first paragraph.
I should note that there are some things in world 1 that I wouldn't reject this way -- e.g. one of the examples of deconfusion is “anyhow, we could just unplug [the AGI].” That is directly talking about AGI safety, and so deconfusion on that point is "1 level away" from the systems we actually build, and isn't subject to the critique. (And indeed, I think it is important and great that this statement has been deconfused!)
It is my impression though that current HRAD work is not "directly talking about AGI safety", and is instead talking about things that are "further away", to which I would apply the critique.
Thanks for the post, it is a helpful disjunction of possibilities and set of links to prior discussion.
I think that the post would be clearer if instead of sections called "Why I think we might be in this world" it had sections with the same content called "Links to where people have discussed being in this world" or something similar. I'm not really sure why you use the title you do; it threw me for a bit.
With help from David Manheim, this post has now been turned into a paper. Thanks to everyone who commented on the post!
I think theoretical work on AI safety has multiple different benefits, but I prefer a slightly different categorization. I like categorizing in terms of the sort of safety guarantees we can get, on a spectrum from "stronger but harder to get" to "weaker but easier to get". Specifically, the reasonable goals for such research IMO are as follows.
Plan A is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a proof that this algorithm is aligned, or at least a solid base of theoretical and empirical evidence, similarly to the situation in cryptography. This more or less corresponds to World 2.
Plan B is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a specific impractical but provably aligned algorithm (iv) informal and empirical arguments suggesting that the former algorithm is as aligned as the latter. As an analogy consider Q-learning (an impractical algorithm with provable convergence guarantees) and deep Q-learning (a practical algorithm with no currently known convergence guarantees, designed by analogy to the former). This sort of still corresponds to World 2 but not quite.
Plan C is having enough theory to at least have rigorous models of all possible failure modes, and theory-inspired informal and empirical arguments why a certain algorithm avoids them. As an analogy, concepts such as VC dimension and Rademacher complexity allow us to be more precise in our reasoning about underfitting and overfitting, even if we don't know how to compute them in practical scenarios. This corresponds to World 1, I guess?
In a sane civilization the solution would be not building AGI until we can implement Plan A. In the real civilization, we should go with the best plan that will be ready by the time competing projects become too dangerous to ignore.
World 3 seems too ambitious to me, since analyzing arbitrary code is almost always an intractable problem (e.g. Rice's theorem). You would need at least some constraints on how your agent is designed.
I think that the plans you lay out are all directly talking about the AI system we eventually build, and as a result I'm more optimistic about them (and your work, as it's easy to see how it makes progress towards these plans) relative to HRAD.
In contrast, as far as I can tell, HRAD work does not directly contribute to any of these plans, and instead the case seems to rely on something more indirect where a better understanding of reasoning will later help us execute on one of these plans. It's this indirection that makes me worried.
Well, HRAD certainly has relations to my own research programme. Embedded agency seems important since human values are probably "embedded" to some extent, counterfactuals are important for translating knowledge from the user's subjective vantage point to the AI's subjective vantage point, and reflection is important if it's required for high capability (as Turing RL suggests). I do agree that having a high level plan for solving the problem is important to focus the research in the right directions.