This is a linkpost for

Publication cycles are long, so I've talked about some of the ideas in this paper before, but since it's freshly out in print, I thought I'd take the chance to share it and summarize its contents for those who don't want to slog through an entire academic paper. I'll aim for a more friendly, informal, and relatively brief approach in this post as an alternative to the academic, formal, and more detailed approach you'll find in the paper. Take your pick of which style you prefer, or enjoy both.


The distribution of risks from AGI suggests we can reduce the risk of catastrophe in attempts to build aligned AGI by trading off false positives for false negatives.

In more words: we operate under uncertainty about how to build aligned AGI. Assuming we are already along the Pareto frontier of interventions that are most likely to work, we are safer if we avoid trying some things that counterfactually would have worked in exchange for trying fewer things that would fail to produce aligned AGI. Those failed attempts put us at risk of unleashing unaligned or otherwise unsafe AGI, and of losing more of what we value than we would have lost from the wasted opportunity to develop aligned AGI sooner.

Stated in less formal terms, my thesis is that it's safer to be "conservative" in choices of assumptions when designing alignment schemes where "conservative" is specifically conservation of false positives.

To demonstrate this, I work through two examples of making choices about assumptions necessary to designing alignment schemes. I know that these are necessary assumptions that lie along the frontier because they are epistemologically fundamental choices. I show what reasoning about tradeoffs between false positives and false negatives in alignment looks like, and present two recommended tradeoffs, albeit without considering how those tradeoffs might be operationalized when designing alignment mechanisms.

Reducing false positive risk

The risks of AGI appear to be so large that even if you believe a catastrophic outcome from AGI is very unlikely, that catastrophe would be so bad that its expected value is still massively negative, and such negative outcomes represent a far larger loss of value than failures-of-opportunity to obtain the best positive outcomes. If you disagree with this premise, the rest of what follows won't make much sense. The best sources to convince you of the magnitude of AGI risk are probably Superintelligence (related content) and Human Compatible (reviews: 1, 2, 3).

So given this assumption, for those working to mitigate existential risks like those from AGI, I argue that we should worry more about interventions failing than failing to implement interventions that would have worked if we had tried them. That is, all else equal, we should prefer false negatives—cases where we fail to try an intervention that would have worked—to false positives—cases where we try an intervention and it fails—since it increases the expected value of AGI.
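To make the asymmetry concrete, here is a minimal sketch of the expected value comparison. The payoff numbers are hypothetical and chosen only for illustration (they do not come from the paper), and the sketch makes the worst-case simplification that every failed attempt ends in catastrophe rather than merely risking it.

```python
# Hypothetical payoffs in arbitrary units of value; none of these numbers come from the paper.
V_ALIGNED = 1.0         # value gained if an attempt yields aligned AGI
V_CATASTROPHE = -100.0  # value lost if a failed attempt unleashes unaligned AGI
V_NOTHING = 0.0         # value of not attempting (at worst a missed opportunity)


def expected_value_of_trying(p_works: float) -> float:
    """Expected value of attempting an intervention that works with probability p_works.

    Simplifying assumption: every failed attempt is catastrophic.
    """
    return p_works * V_ALIGNED + (1.0 - p_works) * V_CATASTROPHE


def break_even_probability() -> float:
    """Smallest p_works at which trying is no worse than not trying."""
    # Solve p * V_ALIGNED + (1 - p) * V_CATASTROPHE = V_NOTHING for p.
    return (V_NOTHING - V_CATASTROPHE) / (V_ALIGNED - V_CATASTROPHE)


if __name__ == "__main__":
    print(f"break-even p_works: {break_even_probability():.4f}")
    for p in (0.5, 0.9, 0.995):
        print(f"p_works={p}: EV of trying = {expected_value_of_trying(p):+.3f}")
```

Under these assumed payoffs, an intervention has to be very likely to work before trying it beats not trying it, which is the sense in which a false negative (not trying something that would have worked) costs less than a false positive (trying something that fails).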

Of course, what's better than trading off between false negatives and false positives is reducing both at the same time, i.e. making improvements that move us closer to the frontier of what's possible by reducing the base error rate. Unfortunately, frontier expansion is hard, and even if it's not, we will still eventually reach it, so in the absence of clear opportunities to reduce base error rates we must trade off between false negative and false positive errors. Given that we must make this tradeoff, I assess the distribution of outcomes to imply we must choose more false negatives and fewer false positives to maximize expected value.
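One way to picture the tradeoff is as a threshold on how confident we must be before trying an intervention. The sketch below is only an illustration: the candidate list, the confidence estimates, and the `count_errors` helper are all hypothetical and do not appear in the paper.

```python
from typing import List, Tuple

# Hypothetical candidates: (our estimate that it works, whether it truly would work).
# In reality the second field is unknown to us; it is listed here only to count errors.
CANDIDATES: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.85, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]


def count_errors(threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) if we try every candidate
    whose estimate is at least `threshold`.

    False positive: we try it and it fails.
    False negative: we skip it but it would have worked.
    """
    fp = sum(1 for est, works in CANDIDATES if est >= threshold and not works)
    fn = sum(1 for est, works in CANDIDATES if est < threshold and works)
    return fp, fn


if __name__ == "__main__":
    for threshold in (0.5, 0.65, 0.88):
        fp, fn = count_errors(threshold)
        print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With the estimates held fixed, raising the threshold can only swap one kind of error for the other. Reducing both at once would require better estimates, which corresponds to moving closer to the frontier.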

But what does it mean to do that? What does it look like to prefer false negatives to false positives? It means to make choices about how we attempt to solve alignment, and in making those choices we trade off false positives for false negatives. I'll demonstrate by considering a couple of choices as examples.

Of all the choices I might have considered as examples in the paper, I concern myself with two choices over "fundamental" assumptions—what might also be called hinge propositions or axioms or universal priors—necessitated by epistemic circularity (assuming we would like to reason systematically, consistently, and as close to completely as possible). I do this because the nice thing about these assumptions is that they are fundamental, so everyone must always make a choice about them even if that choice is to ignore them and make some hidden default choice while being pragmatic, and there are sets of positions that cover all possible choices about them.

I additionally picked my two examples because I find them interesting, and because my reasoning on them produced results that are not universally accepted among AI safety researchers. It thus seems, in expectation, valuable for humanity if I convince marginally more AI researchers of my view on these issues, on the assumption that my analyses are correct. However, it's worth stressing that these additional reasons are not critically important, so disagreeing with me on my analyses at the object level is not necessarily as important as whether or not the general method is sound, and I could have just as easily chosen boring and irrelevant examples to demonstrate the method.

Meta-ethical uncertainty

Solving alignment demands that we make normative assumptions. I think this used to be a more controversial statement, and it certainly seemed that way to me when I started writing this paper two years ago, but now it seems less so. Regardless, if you are not convinced, I recommend reading 1, 2, 3, 4, 5, and, of course, 6.

Epistemic circularity prevents us from choosing norms systematically and without grounding (i.e. positivism ultimately fails even though its methods are extremely useful up to the epistemic gap created by the problem of the criterion), so we must ground norms in something, and that something necessarily requires resolving the as yet unresolved question about the existence of moral facts among many other things. Lucky for us, at least for purposes of constructing an example, there is a set of three "answers" that covers the entire space of positions about moral facts: realism, antirealism, and skepticism. This conveniently lets us ignore most nuance in theories about moral facts and perform a comprehensive analysis about the existence of moral facts as it relates to the false negative/positive tradeoff for building aligned AGI.

So that this example is grounded for you, here are a few words on moral facts and the positions on them. Facts are true statements, and moral facts are true normative statements, i.e. statements about what people should or ought to do. Realism is the position that moral facts exist (i.e. at least some normative statements are true), and antirealism is the position that moral facts don't exist (i.e. there are no true normative statements). There are a lot of possible positions in between the naive hardline versions of these two positions, and they seem to converge at the limit of nuance. There is also the skeptical position on moral facts, just as there is a skeptical position on all questions of fundamental assumptions, which says we can't (yet?) know whether or not moral facts exist. This is far from the most interesting or most practically important question in metaethics, but it is the most fundamental, which as I say is great for an example because everyone is making a choice about it, even if it's a pragmatic or inconsistent choice.

Working through all the details to evaluate the tradeoffs between false negatives and false positives resulting from taking each of these positions on moral facts is exhausting, as you'll see if you read the paper, but it comes down to this: realism is worse than antirealism, which is worse than skepticism, in terms of decreasing false positives at the cost of increasing false negatives. The gloss of the argument is that an alignment scheme or AI that assumes realism can more easily fail hard if it's wrong and moral facts don't exist; a scheme that assumes antirealism fails less hard if moral facts unexpectedly do exist, at the cost of making it harder to find norms; and skepticism can't be "wrong", so there are no failures that way, but it does make finding norms much harder.

Thus all else equal we should prefer skepticism about moral facts to antirealism to realism because this reduces the risk of false positives in exchange for increasing the risk of false negatives by ignoring solutions to alignment that might have worked if we had been able to assume moral facts definitely do or don't exist.

Uncertainty about mental phenomena

Having resolved a key question in metaethics (as it relates to constructing aligned AI), let's take things a little easier by considering AI consciousness.

AI, and in particular AGI, may or may not be conscious, have subjective experiences, or otherwise exhibit mental phenomena. This has bearing on alignment because our alignment scheme can choose to engage with an AGI's mental phenomena if they exist. As with the case of moral facts, there are three possible positions (mindlessness, mindfulness, and skepticism), although I ignore skepticism here, because I expect skepticism will almost always dominate, and focus only on analyzing the choice between assuming mindless or mindful AGI.

The mindless assumption is, of course, that AGI do not or will not have minds or mental phenomena or consciousness; the mindful assumption is that they do or will. The choice of assumption constrains the design space of alignment mechanisms because in the mindful case we might explore mechanisms that aim to achieve alignment by working with the mental aspect of AGI, perhaps in ways beyond what is possible even with attempts to align humans to other humans, whereas assuming mindlessness means we cannot do this so the only possible alignment mechanisms either work directly on the algorithms/implementation of the AGI or via blackbox, behaviorist methods that make no claims on the internal state of the AGI.

The initial, straightforward analysis is that the mindless assumption is safer (has a lower likelihood of false positives and a higher likelihood of false negatives) than the mindful assumption, because assuming mindfulness would lead us to try alignment mechanisms that never could have worked if AGI do not have minds, whereas assuming mindlessness can produce mechanisms that would work regardless of whether or not AGI are conscious. So far so good.

Unfortunately, I do not think that all else is equal. You may disagree, but I think Goodhart effects are so robust that wide swaths of solution space are broken by them, including those solutions that would try to optimize algorithms or behavior for "alignment". From the paper:

Suppose we have an alignment scheme that proposes to align an AGI with human values via humans manipulating its algorithm or the AGI observing human behavior to infer human values. In the case of manipulating its algorithm, humans are preferentially choosing (viz. optimizing for) changes that result in better expectation of alignment. In the case of the AGI observing human behavior, it is trying to determine the best model of human values that explains the observed behavior. Both cases create conditions for alignment of the AGI to suffer from Goodhart’s curse and therefore diverge from actual human values by optimizing for something other than human values themselves, namely appearance of alignment as observed by humans and strength of model fit as measured by the AGI.

Thus the mindless assumption is not as good a choice as it seems because it constrains the design space so much that it eliminates too many and possibly all solutions to AI alignment that might actually work and leaves us with only those too underpowered to work. Therefore I suggest we prefer to assume mindfulness since all else is not equal under Goodharting.
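The Goodhart effect in the quoted passage can be illustrated with a small simulation. This is a sketch of only one simple form of the problem (selecting the best candidate by a noisy proxy), and the Gaussian model, the `goodhart_gap` function, and its parameters are assumptions made for illustration, not anything taken from the paper.

```python
import random


def goodhart_gap(n_candidates: int, noise_sd: float, trials: int, seed: int = 0) -> float:
    """Average amount by which the proxy score of the proxy-selected candidate
    overstates that candidate's true value.

    Assumed model: each candidate has a true value drawn from N(0, 1), and the
    optimizer only sees proxy = true value + N(0, noise_sd) measurement noise.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    total_gap = 0.0
    for _ in range(trials):
        best_proxy = float("-inf")
        true_of_best = 0.0
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            proxy = true_value + rng.gauss(0.0, noise_sd)
            if proxy > best_proxy:  # optimize for the proxy, not the true value
                best_proxy, true_of_best = proxy, true_value
        total_gap += best_proxy - true_of_best
    return total_gap / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        gap = goodhart_gap(n_candidates=n, noise_sd=1.0, trials=1000)
        print(f"{n} candidates: proxy overstates true value by {gap:.2f} on average")
```

In this toy model the overstatement grows as the optimizer searches over more candidates, which loosely mirrors the worry that optimizing harder for the appearance of alignment, or for model fit, pulls further away from the thing we actually care about.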


I am most excited about the method of tradeoff analysis presented in this work and seeing it applied to reasoning about possible AGI alignment schemes in particular and AI and existential risk mitigation interventions more generally. Even if you disagree with me on the object level analysis of the later sections I hope that you agree with me on the method and that we should be preferring false negatives to false positives in attempts to mitigate existential risks.

For example, I worry that some approaches to AI alignment like IRL, CIRL, IDA, HCH, debate, and kin make the wrong tradeoff by risking more false positives to explore what might otherwise be false negatives. This kind of analysis offers a method to formalize arguments that these approaches may be more dangerous than alternatives and should be disfavored, or to show that these worries are misplaced because all else is not equal in some way that constrains the solution space such that we must explore otherwise dispreferred options. Thus perhaps we can better formalize our intuitions about why we think one method is a better choice than another, at least along the dimension of error in assessing the safety of these mechanisms, by following this methodology.

2 comments

When it is all over, we will either have succeeded or failed. (The pay-off set is strongly bimodal.)

The magnitude of the pay-off is irrelevant to the optimal strategy. Suppose research program X has a 1% chance of FAI in 10 years, a 1% chance of UFAI in 10 years, and a 98% chance of nothing. Is it a good option? That depends on our P(FAI | no AI in 10 years). If FAI will probably arrive in 15 years, X is bad. If UFAI will probably arrive in 15 years, X may be good. Endorse only those research programs such that you think P(FAI | that research program makes AI) > P(FAI | no one else makes AI before the research program has a chance to). Actually, this assumes unlimited research talent.

Avoiding avenues with small chances of UFAI corresponds to optimism about the outcome.

I think that it should be fairly clear whether an AI project would produce UFAI before it is run. P(friendly) <5% or >80% usually. That is the probability on serious assessment by competent researchers. So say that future MIRI can tell if any given design is friendly in a few months. If some serious safety people study the design and become confident that it's FAI, then they run it. If they think it's UFAI, they won't run it. If someone with limited understanding and lack of knowledge of their ignorance manages to build an AI, it's UFAI, and they don't know that, so they run it. I am assuming that there are people who don't realise the problem is hard, those that know they can't solve it, and those who can solve it, in order of increasing expertise. Most of the people reading this will be in the middle category. (not counting 10^30 posthuman historians ;-) People in the middle category won't build any ASI. Those in the first category will usually produce nothing, but might produce a UFAI; those in the third might produce a FAI.

I'm unclear if you are disagreeing with something or not, but to me your comment reads largely as saying you think there's a lot of probability mass that can be assigned before we reach the frontier and that this is what you think is most important for reasoning about the risks associated with attempts to build human-aligned AI.

I agree that we can learn a lot before we reach the frontier, but I also think that most of the time we should be thinking as if we are already along the frontier, and should not much expect the sudden development of resolutions to questions that would let us get more of everything. For example, to return to one of my examples, we shouldn't expect to suddenly learn info that would let us make Pareto improvements to our assumptions about moral facts, given how long this question has been studied, so we should instead mostly be concerned with marginal tradeoffs about the assumptions we make under uncertainty.