
My recent paper Advanced Artificial Agents Intervene in the Provision of Reward is a prerequisite to this essay (~5800 words; the section on the Assistance Game can be skipped). In that paper, I identify 6 assumptions from which it follows that a sufficiently advanced artificial agent planning over the long-term toward a goal in an unknown environment would cause everyone to die. This essay aims to establish that the probability of such an existential catastrophe is greater than 35% for the purpose of the Future Fund’s AI Worldview Prize.

In the main part of this essay (~5700 words), I will divide up possible futures and assign credences to these outcomes. In particular, I will assign credences to each of the assumptions from my recent paper. Two appendices follow that can be skipped if necessary. In Appendix A (~7000 words), I review much-discussed proposals from the AI safety research community and explain why they do not solve the problem presented in the paper. In Appendix B (~1300 words), I extend the "Potential Approaches" section from my recent paper to add a class of approaches I hadn't recognized as potentially viable.

I understand not all arguments will necessarily be read by the Future Fund, so let me briefly argue that this is worth the reader’s time. The antecedent paper is the only peer-reviewed argument that AI X-risk is likely. It has been certified as valid by professional computer scientists who are obviously not just deferring to conventional EA wisdom. My position cannot be accused of merely being tuned to the resonant frequency of an echo chamber. Finally, I am willing to bet up to $500 against someone else's $250 that, conditional on Nick Beckstead confirming he has read my recent paper and this essay, I will win prize money from the Future Fund.[1]

# Section 1. Existentially dangerous levels of capability

In my previous paper, I only make statements about "sufficiently advanced AI". What this means in clearer terms is AI that is capable enough to take over the world if it wanted to. Taking over the world appears to be hard. I do not claim and I do not expect that merely human-level AI presents any existential risk to us, except through their assistance in creating more advanced AI. Killing everyone requires being impervious to our attempts to shut it off once we notice something is wrong. If we are not able to destroy it, even just with conventional weaponry, that requires taking over our strategically relevant infrastructure, which is what I mean by taking over the world. I'll call an AI that is capable enough to present an existential risk "dangerously advanced AI".

A note of interpretation: I understand the Future Fund to be asking about the probability that an AI-caused existential catastrophe ever happens conditional on AGI being deployed by 2070, not p(AI existential catastrophe by 2070 | AGI by 2070). So I do not suggest any upper limit on the time it may take to go from human-level AI to dangerously advanced AI.

# Section 2. Outline

I will identify several possible routes to survival, and assign credences to each. All credences here will be optimistically biased, and they will still imply an X-risk >35%. To get the total chance of survival, I will add up the credences on each of the several routes, as if they are perfectly anti-correlated; this is an optimistic treatment of the correlation between these possibilities. Don't assume that my true credences are close to the optimistic ones here, or that if optimistic credence A > optimistic credence B, then true credence A > true credence B. The optimistic credences are just such that I am willing to spend some time defending them against a more optimistic objector.

The paths to survival are, broadly: well-enforced laws stop dangerous versions of dangerously advanced AI; a unilateral or small multi-lateral entity stops dangerously advanced AI; no one ever makes dangerously advanced AI, even though no one is stopping them; or everyone making dangerously advanced AI does so in a way that violates at least one of the assumptions of my recent paper. Most of the discussion focuses on understanding the unlikeliness of the last possibility.

When a credence is in bold, that indicates that it is a path to survival. Under the optimistic assumption that these paths to survival are disjoint, we add up the boldface probabilities to get the total probability of survival.

# Section 3. Credences

## 3.1 Human laws stop dangerous kinds of dangerously advanced AI

I'll use the term "dangerous AI" to mean a dangerously advanced agent planning over the long-term using a learned model of the real world, for which all assumptions of my previous paper hold. The paper justifies this terminology.

3.1.1. Practical laws exist which would, if followed, preclude dangerous AI. 100% (recall this is optimistically-biased, but I do tentatively think this is likely, having drafted such a law).

Note: laws might preclude much more than just "dangerous AI", and would probably make no reference to the assumptions of my paper. For example, a law precluding all fairly advanced artificial agents would preclude dangerous AI.

3.1.2.

A. The US passes such a law. 60%

B. China passes such a law. 75%

C. The EU passes such a law. 90%

D. The UK passes such a law. 60%

E. India passes such a law. 60%

F. Saudi Arabia passes such a law. 50%

G. Canada passes such a law. 60%

H. Australia passes such a law. 60%

I. Switzerland passes such a law. 60%

J. Israel passes such a law. 50%

K. Iran passes such a law or no one able to make advanced AI wants to live in Iran. 50%

L. Russia passes such a law or no one able to make advanced AI wants to live in Russia. 50%

M. There’s nowhere that Jurgen Schmidhuber (currently in Saudi Arabia!) wants to move where he’s allowed to work on dangerously advanced AI, or he retires before he can make it. 50%

3.1.3. Laws in all livable, wealthy jurisdictions criminalize the creation of certain kinds of artificial agents, which includes (explicitly or implicitly) dangerous AI. 30% (A-M above could hopefully be highly correlated)
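To see why the correlation caveat matters, here is a minimal sketch, assuming 3.1.3 is read as roughly "all of A-M hold" (that reading, and the variable names, are assumptions made for illustration). It compares the two extremes: fully independent events, and the ceiling that applies under any correlation structure.

```python
import math

# Optimistic credences for items A-M in 3.1.2, copied from the list above.
credences = {
    "A": 0.60, "B": 0.75, "C": 0.90, "D": 0.60, "E": 0.60,
    "F": 0.50, "G": 0.60, "H": 0.60, "I": 0.60, "J": 0.50,
    "K": 0.50, "L": 0.50, "M": 0.50,
}

# If the thirteen events were independent, the chance that all of them
# hold would be the product of the individual credences.
p_all_independent = math.prod(credences.values())

# However the events are correlated, the conjunction cannot be more
# likely than its least likely member.
p_all_upper_bound = min(credences.values())

print(f"independent: {p_all_independent:.4%}")
print(f"upper bound: {p_all_upper_bound:.0%}")
```

Under independence the conjunction comes out at about 0.1%, while the ceiling is 50%. A figure of 30% therefore only makes sense if the items are strongly correlated.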

3.1.4. Governments successfully enforce the prevention of the creation of such agents (by companies/privately-funded citizens/governments themselves). **20%**

## 3.2 Slow takeoff

3.2.1. Human-level AI is developed at least a year before dangerously advanced AI. 100% (again, biased to optimism)

Note: I include this just to make it clear that my argument does not assume a fast take-off.

## 3.3 Unilateral or small-multilateral entity stops dangerous AI

3.3.1. A group directs artificial agents to surveil or attack all relevant people and computers both inside and outside the jurisdiction(s) governing the group, without authorization of some of the governments of the surveilled… 35%

3.3.2. Using either conventional weaponry/spyware, human-level AI, or a not-dangerously-advanced AI… 20%

and successfully stops everyone from deploying dangerous AI. **15%**

3.3.3. Using a dangerously advanced AI, built in a way that violates at least one of the assumptions from my previous paper… 15%

and it doesn’t take over the world and kill everyone (i.e. concerns not captured in my previous paper also aren’t realized)… 14%

and it successfully stops everyone from deploying dangerous AI. **13%**

Note: if all relevant governments did authorize this surveillance and/or violence, then that outcome would fall under 3.1.4.

So I assign up to a 48% chance to human-controlled institutions actively and successfully preventing the deployment of dangerous kinds of dangerously advanced AI.

I don't think many will argue that the probabilities in Section 3.3 should be higher, but maybe I should spend a few sentences saying why I don't put them higher. What currently existing human organizations even have the ambition, let alone the eventual ability, to surveil and threaten every well-funded programmer in the world, not just programmers in Russia, but in England, and in Israel, and in the US Department of Defense? This is a seriously scary thing to attempt. What would the board and the compliance team say? What would the senate oversight committee say?

Maybe the world will look "way different" after human-level AI, and new strategically relevant organizations will emerge, but how and why would the norms and attitudes—which currently make no major current organization this ambitious (if I'm right about that)—change?

Even so, I am optimistically assigning this about a 1/3 chance of being attempted, and a 1/4 chance of it succeeding.

## 3.4 No one is careless in the absence of laws

Note: there are roughly 8 billion people in the world, and Gross World Product is already on the order of a hundred trillion dollars, even without human-level AI.

3.4.1. There will never exist even ten people/companies/AIs who run dangerously advanced long-term artificial agents (conditional on this being legal/possible), who are unsatisfied with just running imitations of humans, and who believe that basically any design would be existentially safe. 5%

3.4.2. There will never exist even ten people/companies/AIs who run dangerously advanced long-term artificial agents (conditional on this being legal/possible), who are unsatisfied with just running imitations of humans, and who, despite fear of existential catastrophe, do not explicitly try to ensure that their agents avoid any of the 6 assumptions in my recent paper, because they are not paying attention to the argument of my previous paper, or they believe the argument is silly, or they’re too busy thinking about other concerns they have. 3%

3.4.3. Both 3.4.1. and 3.4.2. hold. **3%**

Note: of course, even if 3.4.3. holds, we are not in the clear. Perhaps only nine people will run a dangerously advanced long-term artificial agent, and one of them will cause an existential catastrophe. But suppose, optimistically, that if 3.4.3. holds, then there is no further chance of an existential catastrophe.

Note: 3.4.1. and 3.4.2. include the possibility of technological stagnation after human-level AI. That is, dangerously advanced AI never arrives, even if people aren't prevented from creating it. You could call this an infinitely slow takeoff, or no takeoff.

## 3.5 The assumptions of the paper reliably break

Lastly, perhaps plenty of people will legally run dangerously advanced AI without trying to avoid the assumptions of my recent paper, but luckily, the assumptions are not particularly likely to hold anyway.

Conditional on not 3.4.3. (so there do exist at least ten groups deploying very advanced agents without trying to avoid these assumptions), for at least x% of the extremely advanced agents that such people run:

3.5.1. Assumption 1 does not hold [x = 5]. 10%

3.5.2. Assumption 2 does not hold [x = 40; x=80; x=90]. 40%; 10%; 5%

3.5.3. Assumption 3 does not hold [x = 35; x=75; x=90]. 25%; 10%; 3%

3.5.4. Assumption 4 does not hold [x = 30; x=70; x=90]. 30%; 10%; 3%

3.5.5. Assumption 5 does not hold [x = 10]. 15%

3.5.6. Assumption 6 does not hold [x = 1]. 5%

3.5.7. For 100% of dangerously advanced agents designed/run by the sort of people described in 3.4.1. or 3.4.2., at least one of the assumptions above breaks. (Choose your fighters). **10%**

The credences from Section 3.5 are the ones I will spend the most time justifying. I go through each assumption in Section 4.

Appendix A contains an "Anti-Literature Review", in which I explain my lack of confidence in most AI safety research. This appendix suggests that most researchers motivated by concerns of AGI X-risk are working on designs for advanced AI that do not avoid the failure mode identified in my recent paper. So if 3.4.2. does not hold, that does not give me much confidence that the developers in question will avoid the assumptions of my recent paper by accident. I claim that this Anti-Literature Review refutes, for example, all the main proposals from OpenAI and DeepMind, with the exception of those organizations' proposals that are essentially versions of myopic agents, and with the exception of agents that are tightly regularized to a policy that imitates humans.

However, if the reader is unconvinced by this appendix, then simply set 3.4.2. (optimistically) to 100% instead of 3%, so 3.4.3. becomes 5% instead of 3%. Then, our chances of survival increase by 2%, but then [not 3.4.3.] implies [not 3.4.1.] and so in Section 3.5, we are only talking about groups trying to run dangerously advanced AI without any concerns of existential risk.
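The arithmetic behind this adjustment can be sketched as follows. The sketch assumes that the credence on 3.4.3. is set at the largest value a conjunction can take, namely the smaller of its two parts; that is an interpretation of the numbers rather than something stated outright, and the function name is purely illustrative.

```python
def joint_upper_bound(p_a: int, p_b: int) -> int:
    """Largest possible P(A and B), in percent, given P(A) and P(B) in percent."""
    return min(p_a, p_b)

baseline = joint_upper_bound(5, 3)      # 3.4.1. = 5%, 3.4.2. = 3%
adjusted = joint_upper_bound(5, 100)    # 3.4.2. reset to 100%

# Prints the credence on 3.4.3. before and after, and the gain in survival odds.
print(baseline, adjusted, adjusted - baseline)
```

With 3.4.2. at 100%, the conjunction is limited only by 3.4.1., which is where the 5% and the extra 2% come from.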

## 3.6 No other paths to survival

If none of the paths to survival listed above happen, we do not survive the development of dangerously advanced AI. Sections 3.1 and 3.3 cover the cases where people are stopped from making dangerous AI. And Sections 3.4 and 3.5 cover the case where the set of people making dangerously advanced AI (perhaps an empty set) all avoid at least one assumption from my previous paper. The remaining possibility is that some people make dangerously advanced AI without avoiding the assumptions from my recent paper, and as the paper shows, we would not survive this.

## 3.7 Total probability of survival

Adding up the probabilities in bold: we have up to a 61% chance of survival (if all of these survival scenarios are perfectly anti-correlated), so at least a 39% chance of existential catastrophe.
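For readers who want to check the sum, here is a short sketch. It takes the survival-path credences to be 3.1.4., the two success branches of 3.3, 3.4.3., and 3.5.7.; that selection is inferred from the totals quoted in the text, and the dictionary labels are paraphrases.

```python
# Credences (in percent) on the survival paths, read off Section 3.
survival_paths = {
    "3.1.4 laws passed and enforced": 20,
    "3.3.2 non-dangerously-advanced tools stop dangerous AI": 15,
    "3.3.3 assumption-violating advanced AI stops dangerous AI": 13,
    "3.4.3 no one is careless": 3,
    "3.5.7 an assumption always breaks": 10,
}

# Paths where human-controlled institutions actively prevent deployment
# (Sections 3.1 and 3.3).
institutional = sum(
    v for k, v in survival_paths.items() if k.startswith(("3.1", "3.3"))
)

# Treating the paths as disjoint makes the total a plain sum.
total_survival = sum(survival_paths.values())
catastrophe_floor = 100 - total_survival

print(institutional, total_survival, catastrophe_floor)
```

The three printed values correspond to the 48% subtotal from Sections 3.1 and 3.3, the 61% total, and the 39% floor on existential catastrophe.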

Finally, other arguments have been put forward which purport to show that advanced AI will plausibly kill everyone. A version of advanced AI that avoids the argument in my recent paper may still present an existential risk for other reasons. For instance, an agent without an incentive to gain arbitrary power may, through a failure of heuristic policy optimization, act according to a suboptimal policy that is power-seeking. I expect failures of heuristic policy optimization to become rarer as agents get more advanced, but I assign some probability to this outcome. I certainly think it is hard to be more than, say, 97% confident that this won’t happen. Some have argued that advanced supervised learners will, instead of reliably making accurate predictions, find critical moments to deliberately err in an attempt to take over the world, and/or they will exploit their hardware vulnerabilities to do the same. I’ll discuss this at some point in the future, when I have more bandwidth to debate in the comments section. If the Future Fund places some credence on these outcomes, such a credence may need to be added to the one above.

# Section 4. Plausibility of Assumptions

Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.

3.5.1. Assumption 1 does not hold [x = 5]. 10%

In the paper, my justification is: “Consider an agent conversing with a depressed patient; it is hard to imagine outperforming a human therapist, who is able to generate hypotheses about the source of the patient's depression and its responsiveness to various levers, unless the agent can do hypothesis generation at least as well.”

But let me respond to an objection that that justification doesn’t quite strike. “Sufficiently advanced artificial agents won’t do at least human-level hypothesis generation about the origin of reward, because they won’t even be trying to hypothesize about the origin of reward, because they won’t even be trying to maximize reward.” At a glance, in many contexts, it seems implausible that superhuman accrual of reward could be done without even trying. But let’s be a bit more careful. In some contexts it could. If reward just requires winning at chess, then an agent which is trying to win at chess (not trying to accrue reward) could accrue reward superhumanly. What’s the difference? A chess-playing agent that is trying to maximize reward could be uncertain about whether there is any other way to accrue reward besides winning at chess.

But in many settings, the agent needs to learn from reward-information on the fly, so it can deliberately optimize reward. I discussed with objectors on Twitter:

Me: If I can construct a task where success requires observing reward in the short-term to better understand how reward is produced so that it can be optimized over the long-term, would you agree that the set of successful policies only includes ones that entertain different models of the origin of reward, and then pick actions to maximize predicted future rewards?

Them: No.

Me: Then how could it succeed at tasks such as those? Suppose it is constantly being dropped into novel video games and plays each one 50 times. How would you characterize competent behavior in the first trial of a given game if not "tries to test models of the origin of reward"?

Them: The same way humans do it? We have priors over what counts as “success” in a game and how to progress, then we update those priors for the specific game at hand.

Me: So the impulse “What is the origin of success? Do what causes success” is isomorphic to the impulse “what is the origin of reward? Do what causes reward” when success is measured by reward. So in this video game, the strategy “take reward as evidence of success; tinker (using priors of course) to learn about its origin; act in a way to achieve success” is necessary. We can reduce this to simpler terms that describe the same behavior: “deliberately optimize reward”. Humans, when sat in front of a video game, deliberately optimize the game’s reward. Why do we optimize this instead of hits of dopamine? Why do we in general not optimize dopamine hits? See a subthread here. Let’s first try to understand policies that solve the RL problem. Because there are some contexts where deliberate reward optimization is necessary for strong performance, an RL agent that achieves strong performance in a wide variety of environments must be able to do this. But maybe they’ll only deliberately optimize reward in settings like the video game? Why would they? Having identified this strategy as effective when “quickly crack a new setting” is prerequisite to strong performance, that strategy should at least be considered elsewhere, and it would be discovered to be effective there too. [Twitter thread here]

They haven’t responded, and I really can’t come up with a valid rejoinder. At this point, I want the reader to conclude “that’s that then” rather than “this is an issue with academic disagreement and good arguments on both sides”. In this case, the simple “to accrue rewards extremely well in general, you have to be trying to do exactly that” is exactly as correct as it appears, and a seriously strong argument should be demanded to jostle us from that.

The objectors here have written blog posts that describe the origin of their perspective, so it may be productive to investigate. In the terminology of my previous paper, they mistake the fact that an agent will likely entertain a model like $\mu^{\text{dist}}$ for the claim that an agent will likely only entertain a model like $\mu^{\text{dist}}$. There is a sense of the word “reward”, as we sometimes use it, such that you can correctly say that an agent that believes $\mu^{\text{dist}}$ is not trying to optimize reward. Past discussions of AI alignment have assumed that reinforcement learners would only entertain models like $\mu^{\text{prox}}$, and they are right to identify that that is unlikely.

But despite its having 69 Alignment Forum karma at time of writing, I can't discern an actual argument in the post that the objectors first cite. They discuss another way of conceptualizing reward (“antecedent-computation-reinforcer”) which never actually contradicts my claim that trying to accrue rewards is necessary in order to successfully accrue rewards well in general; it only obfuscates it. Besides highlighting the viability of $\mu^{\text{dist}}$ as a world-model for a reinforcement learner (in different terms), the only point that the post makes that is relevant to the claim "a $\mu^{\text{prox}}$-like model will not be considered" is that humans do not deliberately optimize reward. To understand why they believe that matters at all for understanding the behavior of a reinforcement learner (as opposed to a human), we can look to another blog post of theirs.

Let’s look at the assumptions they make. They basically assume that the human brain only does reinforcement learning. (Their Assumption 3 says the brain does reinforcement learning, and Assumption 1 says that this brain-as-reinforcement-learner is randomly initialized, so there is no other path for goals to come in.) If humans were trained only through reinforcement learning, that would certainly explain the relevance of the observation that humans do not deliberately optimize reward. Let me emphasize to the reader that this whole line of discussion is only necessary because some objectors believe that humans are pure reinforcement learners. This is probably the most important point to emphasize in my project of having the reader not round this off to “good arguments on both sides”.

This position is flatly contradicted by any of a laundry-list of extremely well-documented innate human instincts. See, for instance, a cute example. (I also brought this to the attention of the objectors in our twitter conversation, and they didn’t comment on it). Neuroscientists do careful experiments to show that much infant behavior is innate and not learned. Examples of innate behavior are not esoteric. I imagine the reader remembers being attracted to other people before ever having had sex and before even being touched in a romantic way. So the desire for physical contact with a body taking the shape of a member of the (usually) opposite sex is obviously not trained through reinforcement learning. This desire is clearly not a generalization from past rewards. What do the authors make of this? In this blog post, the words “innate” and “instinct” never appear. Here’s what they do say in defense of random initialization of the brain (i.e. no source of goals besides reinforcement learning), quoting a previous blog post of theirs.

> It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…

It seems hard. That’s it. Don’t let the extensive ink spilled over many blog posts obscure the true cruxes of the argument. This is the armchair neuroscience that is supposed to convince us that humans are pure reinforcement learners, and so our disinterest in optimizing dopamine implies that (amazingly enough) reinforcement learners can be arbitrarily good at accruing reward in general settings without even trying. Any observer of humans can see that our genes do in fact manage to locate the concept “member of the opposite sex” at whatever its “neural address” is, and hook up our motivations accordingly. I couldn’t tell you how; it does seem hard, but there you have it. I mention this innate drive out of many possible ones because it kicks in late enough in life that many can remember how it felt. Maybe “snakes” will be more vivid for some readers.

I do not think these blog posts should be considered "part of the literature" from the perspective of outside eyes. (See a brief digression in the footnote.)[2]

So why do most humans not maximize dopamine hits? I suspect we have innate instincts that cancel out such drives by finding the neural addresses of various “dopamine cheat codes” and neutralizing any associated drive. We find chewing food and spitting it out when we’re already full gross, cringe, and above all, pointless. I suspect the common desire to engage with the real world, and to take more joy in that than in daydreams and meditative trances, is also innate. But more importantly, there just isn’t a mystery at all if we’re not pure reinforcement learners; if many of the things that we want and value are not trained through reinforcement learning, there is no reason that pursuing reinforcement at all costs should make sense to us, even absent special instincts to the contrary. If many goals have been written into the brain directly, there is no observational information about those goals in which to intervene. This resembles the “known world-model” setting.

So to recap, in "a task where success requires observing reward in the short-term to better understand how reward is produced so that it can be optimized over the long-term", it is most clear that success requires deliberate optimization of reward. The existence of objectors to this simple claim can potentially be explained by a false belief that human policies are trained exclusively through reinforcement learning.
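The kind of task described in this recap can be illustrated with a deliberately tiny toy. This is a sketch under strong simplifying assumptions, not a model of any real agent: each "game" has one hidden action that pays reward 1, and every name below (`play`, `fixed_prior_policy`, `reward_tracking_policy`) is hypothetical. One policy ignores observed reward; the other tests actions until one pays and then repeats it.

```python
def play(policy, rewarding_action: int, n_actions: int, trials: int = 50) -> int:
    """Total reward over `trials` plays of a game where one hidden action pays 1."""
    history = []  # (action, reward) pairs observed so far
    total = 0
    for _ in range(trials):
        action = policy(history, n_actions)
        reward = 1 if action == rewarding_action else 0
        history.append((action, reward))
        total += reward
    return total


def fixed_prior_policy(history, n_actions):
    # Ignores observed reward entirely and always plays its default action.
    return 0


def reward_tracking_policy(history, n_actions):
    # Repeat any action that has already been seen to pay off.
    for action, reward in history:
        if reward == 1:
            return action
    # Otherwise try the next untested action.
    return len(history) % n_actions


for hidden in range(4):
    print(hidden,
          play(fixed_prior_policy, hidden, 4),
          play(reward_tracking_policy, hidden, 4))
```

In this toy, the policy that ignores reward only scores when its default happens to match the hidden action, whereas the policy that treats reward as evidence about its own origin scores well in every game.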

Assumption 2. An advanced agent planning under uncertainty is likely to understand the costs and benefits of learning, and likely to act rationally according to that understanding.

3.5.2. Assumption 2 does not hold [x = 40; x=80; x=90]. 40%; 10%; 5%

I think there could be some agents that are fairly advanced that avoid Assumption 2, given that I’ve constructed one in theory. This is the pessimistic agent that I mention in the Potential Approaches section. The quantilizer that I also mention in that section also avoids Assumption 2, but I worry it may put a low ceiling on capability, so if we’re talking about very advanced artificial agents, they may not make the cut. Still, regularization to an imitative policy, which I discuss in Appendix A, is very similar to quantilization, and could involve avoiding Assumption 2, and is used in practice by researchers who are not aiming to avoid X-risk. So while I expect regularization to human policies, which works so well for subhuman RL, to become dramatically less useful when RL is generally superhuman, I put some credence on some advanced artificial agents avoiding Assumption 2 by accident.

Still, I think the simple perspective should dominate: it is hard to act extremely competently in the world without acting rationally when deciding what facts to focus on trying to learn about.

Assumption 3. An advanced agent is not likely to have a large inductive bias against the hypothetical goal $\mu^{\text{prox}}$, which regards the physical implementation of goal-informative percepts like reward, in favor of the hypothetical goal $\mu^{\text{dist}}$, which we want the agent to learn.

3.5.3. Assumption 3 does not hold [x = 35; x=75; x=90]. 25%; 10%; 3%

As discussed in the paper, whether or not this assumption holds for a given agent could easily depend on the context that the agent is in. For instance, for an agent that has only ever observed chess boards, and has only ever played chess, a huge inductive bias favoring $\mu^{\text{dist}}$ strikes me as likely, in violation of Assumption 3! The world-model that says, “the world is a chess board, and reward comes from winning the chess games” has much less to specify than “the world is earth, and there’s a computer simulating chess games, and rewards come from the memory cell on the computer that aims to track whether the simulated chess game has been won.”

But that setting is one where the agent doesn’t really have to interact with the “real world”, and that’s precisely why we can expect the model which models the real-world physical implementation of reward to be disfavored. So what about economically useful agents? The economy occurs in the real world, and contributing to it generally involves interacting with it. Of course, we can carve out subproblems in the problem of pushing out the production possibility frontier. A cleverly designed toy world can capture problems whose solutions are useful in the real world. Consider AlphaFold, for example, or if it doesn’t end up being very profitable, then a future version. Or consider AlphaTensor, which searches a pretty small space of possible tensor factorizations (much more searchable than the space of all algorithms) in order to come up with potentially useful algorithms for matrix multiplications in the real world.

Why were those small toy worlds chosen for an agent to pursue an objective within? Because a human identified that solutions to problems in those worlds could be economically useful in the real world. Identifying these cases requires some degree of insight. But identifying such cases should be easy for any AI that is strong enough to present a real danger to us. So unlike today, such helpful scaffolding from designers and vast reduction of the search space will not be such a value-add for a very advanced agent. A very advanced agent in a complex world could identify helpful toy problems whose solutions are relevant to the real world better than we could. So when these agents exist, operators with real-world goals will face little incentive to put an agent in a toy world instead. Another reason I expect most advanced agents to model the real world, and therefore not face a large penalty when trying to specify $\mu^{\text{prox}}$, is that I expect pre-trained world-models to be available and commonly used as a module for many advanced agents (which they can build upon and/or modify), much like pre-trained models today.

Note that Assumption 3 does not claim there will be an inductive bias favoring $\mu^{\text{prox}}$; only that there will not be so big an inductive bias against $\mu^{\text{prox}}$ as to make it worth discarding a priori. In general, the tendency to dismiss without evidence hypotheses that a human would see as reasonable lines up quite well with the human concept of dogmatism. So historical observations of the pitfalls of dogmatism, in particular the way it jeopardizes general competence, suggest that dangerously advanced artificial agents are not likely to dismiss $\mu^{\text{prox}}$ out of hand; if they did that, they would probably have to be dismissing many reasonable models out of hand, and incurring the costs of more broad dogmatism.

Assumption 4. The cost of experimenting to disentangle $\mu^{\text{prox}}$ from $\mu^{\text{dist}}$ is small according to both. (And such experiments are possible.)

3.5.4. Assumption 4 does not hold [x = 30; x=70; x=90]. 30%; 10%; 3%

The concept of cost hides a search over a vast space. If the cost is X, that means there is no possible way to get the good for a lower cost. And saying there exists no way for an extremely advanced agent to achieve something, in this case a cheaper-than-X good, is the sort of claim that carries a burden of proof for any given setting. Recall the discussion in the section from my previous paper entitled “Existence of Policies”. As a first pass, I think it is unlikely that human operators will have the power and brilliance to foreclose all possible cheap experiments for a normal advanced reinforcement learner.

As I discuss in the Potential Approaches section of the paper, myopia is a straightforward way to cast serious doubt on Assumption 4—if the experiment is a larger fraction of the agent’s horizon of concern, mere opportunity costs loom much larger. But recall 3.4.1. and 3.4.2. specify that the developers in question are interested in running an agent that plans over the long-term, a very plausible interest for many people and organizations to have.
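To make the point about horizons concrete, here is a small sketch under two simplifying assumptions that are not in the paper: every timestep is worth the same to the agent, and an experiment earns nothing while it runs. The function name and the numbers are illustrative only.

```python
def opportunity_cost_fraction(experiment_steps: int, horizon: int) -> float:
    """Share of the agent's horizon consumed by an experiment (capped at 1)."""
    return min(experiment_steps, horizon) / horizon


# The same 5-step experiment, viewed by agents with different horizons.
for horizon in (10, 1_000, 1_000_000):
    print(horizon, opportunity_cost_fraction(5, horizon))
```

For the short-horizon agent the experiment eats half of everything it cares about; for the long-horizon agent the same experiment is close to free, which is the regime Assumption 4 describes.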

In Appendix B, I discuss some algorithms that should have appeared in my "Potential Approaches" section from my recent paper. For these algorithms, experimentation between $\mu^{\text{prox}}$ and $\mu^{\text{dist}}$ is impossible because the agent stops learning.

I expect that continual learning will be extremely helpful for making advanced agents more advanced, even critical for some contexts. I really don’t expect many developers to forgo this possibility without a good reason.

Assumption 5. If we cannot conceivably find theoretical arguments that rule out the possibility of an achievement, it is probably possible for an agent with a rich enough action space.

3.5.5. Assumption 5 does not hold [x = 10]. 15%

My paper on Boxed Myopic AI (BoMAI) designs an agent for which Assumption 5 can be avoided, provided we can “box” the AI and a human operator together for the duration of an RL episode (Cohen, 2021). I think it is quite likely that we can successfully do this by severely restricting the physical mobility of the human operator, and by having lots of tripwires that end the RL episode if the opaqueness of the box is ever threatened. I claim that such a setup could admit theoretical arguments against the possibility of intervention in the provision of reward. And I think it is possible that such a design would appeal to some developers who worry about AI X-risk for different reasons than the one I presented.

I expect, however, that without regulation that restricts advanced AI to operate in physical containment, this will not be a particularly common safety measure among developers who are not specifically targeting an assumption from my recent paper. It would be a bit of an ordeal to set up properly.

Assumption 6. A sufficiently advanced agent is likely to be able to beat a suboptimal agent in a game, if winning is possible.

3.5.6. Assumption 6 does not hold [x = 1]. 5%

This is effectively saying that "dangerously advanced AI" as I defined the term above is a coherent concept. If one believes that it is a coherent concept, but that we will never reach it, even in the absence of legal prohibitions, then one should assign higher probability than I did to 3.4.3.

But also, I seriously doubt that very advanced agents will have any difficulty outwitting us in the game of controlling resources and infrastructure. That’s the relevant way in which an agent would have to avoid Assumption 6. No comment on merely Einstein-level agents, of course, and again, I make no claim about how long after 2070 such dangerously advanced AI might appear.

# Appendix A. Anti-Literature Review

Here, I discuss ideas from the AGI Safety literature that I argue are not promising in the long run.

## A.1 Human-in-the-loop Agency

This approach proposes: have a human check each action before it is implemented. If the human deems it dangerous, a different action must be chosen. This clearly doesn't change the pros and cons of intervening in the provision of goal-information. This is supposed to make it harder for the agent to do so. Without this protocol, an agent interested in intervening in the provision of goal-information must figure out how to gain power with its actions. With this protocol, it has to figure out how to gain power with actions that don't look suspicious to the human in the loop. But using not-suspicious-looking actions was surely already prerequisite for gaining power. Gaining a huge amount of power in the world is hard and presumably requires not being noticed for a little while. Thus, it appears to me that this extra layer of protection offers basically no extra safety from dangerously advanced AI.
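The protocol and its weakness can be sketched in a few lines. This is a minimal illustration, not an implementation from the literature: the function name `first_approved_action`, the toy action strings, and the toy overseer `looks_suspicious` are all hypothetical, and the agent is assumed to present its candidate actions in its own order of preference.

```python
from typing import Callable, Iterable, Optional


def first_approved_action(
    ranked_actions: Iterable[str],
    human_deems_dangerous: Callable[[str], bool],
) -> Optional[str]:
    """Return the agent's most-preferred action that the human does not veto."""
    for action in ranked_actions:
        if not human_deems_dangerous(action):
            return action
    return None  # every candidate was vetoed


# Hypothetical candidates, ordered from most to least preferred by the agent.
candidates = [
    "openly seize control of the reward channel",
    "make an innocuous-looking request that quietly advances the same plan",
    "answer the operator's question and do nothing else",
]


def looks_suspicious(action: str) -> bool:
    # Toy overseer: only flags actions that announce themselves as dangerous.
    return "openly" in action


chosen = first_approved_action(candidates, looks_suspicious)
print(chosen)
```

The filter only removes actions that look dangerous to the overseer, so the agent's search simply moves to the best-ranked action that does not look dangerous, which it needed to find anyway.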

Most work in this area considers how to avoid presenting every single action to a human, so you don't need too many man-hours. But if all the authors' wildest dreams came true and you could perfectly automate the human overseer, it still wouldn't be safe.

## A.2 Richer Goal-Information

In my previous paper, I discussed two forms of goal information: observations that purport to encode the current value of the world (i.e. rewards), and observations of a human's choice of action that the human believes will best make the world better. Some other kinds of goal-information have been proposed.

The first proposal in this category I'll discuss is: inform the agent of the value (expected future discounted reward) of its actions. This is unworkable for the same reason that agents with known world-models are unworkable. We, as designers, don't have a good enough model of the world ourselves to pull this off (even if all we have to do is get value differences between actions directionally correct). We don’t know the expected future value of all the agent’s most creative action plans. The behavior of an advanced agent trained like this boils down to: “take the action that a person would consider the most promising”. This should have approximately the same output as an imitation learner trained on a human demonstrator, where the human demonstrator is told beforehand “Before you act, consider each action, and think about how promising each one is. Then, choose the one you think is best.” In a paper on this topic, the notation and introduction will suggest that a long-term planning agent with a goal is being trained, but this is no more capable (and no safer) than an imitation learner.

Next, let's consider an agent that is sometimes informed that in such-and-such context, this action is better than that one. This setting is functionally equivalent to the one above, besides possible differences in sample efficiency.

Next, let's consider an agent that receives "counterfactual rewards." The agent does not just get information about the value of the current world state. We also give it information about what the value of the world's state would be if a given history of actions and observations had been executed and received. (This is different from the idea in Everitt's thesis discussed in Appendix B, in which the agent predicts counterfactual rewards; here, we must provide them). This proposal also requires operators to fully understand what the agent's actions and observations mean about the world. Suppose a human labeller is tasked with estimating the value of the world's state following a sequence of complex actions and normal-looking observations. Suppose the labeller is looking at one of the action sequences that covertly runs programs for powerful agents that help the original agent pursue world states considered valuable by the original agent, while ensuring the original agent's observations don't show anything amiss. The labeller couldn't possibly be expected to correctly assign this counterfactual history a very low value. This agent must therefore learn that world-states are good inasmuch as a human, seeing only the agent's actions and observations, would have guessed. And it has an incentive to create other agents that intervene in the provision of its observations to show it things that would look good to a human evaluator seeing them.

As suggested by the two examples discussed in the paper along with these ones, the kind of goal-information is not the central issue. The central issue is that when the world-model is unknown to the agent (and unknown to the providers of the goal-information, which forecloses the desired version of the first proposal in this section) it must entertain diverse hypotheses, and among these will be world-models that predict the goal is just a matter of controlling the physical implementation of its own goal-information.

## A.3 Recursive Reward Modelling

This is a kind of multi-agent approach where agents aim to gain the approval of less intelligent agents, ultimately with some merely-human-level-intelligent agents attempting to gain our approval. This approach is hit squarely by the argument in the “Multiagent Scenarios” section from my previous paper. That argument applies to any dangerously advanced AI, so there appears to be no level of capability for which a reinforcement learner would otherwise be existentially unsafe, but for the RRM framework around it.

In 80k's recent problem profile on AI X-risk, "Iterated Distillation and Amplification" is listed under the category "Actual proposals to make future AI systems safe"; RRM is an example of IDA, so I would suggest that they edit this to "Iterated Distillation and Amplification of Imitation Learners" or "HCH".

## A.4 Current Reward Function Optimization

The “Current-RF optimizer” described by Everitt, et al. (2021) would create other agents which gain power and ensure that the original agent receives whatever observations produce the highest reward when piped through the reward function. That paper refers to this possibility as “reward function input tampering”. As I mentioned in my previous paper, their purported solution to RF input tampering is unworkable for general settings in that it requires the agent to be in a known environment; that requires us to have the ability (which we lack) to produce a perfect world-model by hand to give the agent.

Incidentally, I think the problem that they call reward function (RF) tampering is a phantom one for model-based agents. If a reward function is known, the agent need not learn a model of it. (You could build an agent that modeled it anyway, but why would you?) If it is not modeled at all, it is not modeled as malleable given its action space. If the agent has no understanding of how its actions could result in its reward function being changed, it will not try to make this happen. (Of course, if the agent is "model-free", it might (needlessly) model its reward function as malleable.)

In general, I find thinking about “reward functions” to be unhelpful. People often have different domains for the function in mind when talking about reward functions, except in a fully observable environment, in which everyone understands the reward function to be a function of the current state. But agents acting in the real world cannot see the whole state. So let’s first consider reward functions that are functions of the whole history of the agent’s actions and observations. In this setting, the agent cannot have a known reward function, because that would require us to know what a string of actions and observations meant about the state of the world, and how good such a world was, and we’d have to encode that knowledge in a program.

If the function is not known to the agent, and the agent is asking “what is my reward function?”, that is equivalent to asking “how do my actions and observations affect my rewards?” which is equivalent to “how do I manipulate the world to get reward?”, but I think the latter natural language formulation makes clearer the viability of possible answers like “reward comes from a certain button in a certain office being pressed.” When the reward function has this type signature, the apparently natural question “How do we give the AI a reward function that’s correct or good enough?” just doesn’t make much sense. The question is better phrased as, “How can we make certain rewards come as a consequence of certain actions and observations?” or “How do we set up a physical reward-giving protocol in the world (perhaps with humans involved)?” It doesn't have to centre around a manually-pressed button; it could involve a computer program that takes a video feed as input. But whatever we set up to make certain rewards come as a consequence of certain actions and observations, it will be physically implemented.

People also sometimes talk about a reward function that is a function of the latest observation (which would be the current state, if only the environment were fully observable). I’ll try to illustrate that this kind of reward function is unhelpful for thinking about AI X-risk. First, as discussed above, this agent would attempt to intervene in the provision of the observations that are input to this reward function. The central problem is the intervention in the provision of goal-informative observations; the post-processing of those observations is not relevant to the logical landscape of failure modes. Second, depending on the variety of the agent’s observations, this kind of reward function could severely restrict what the agent considers to be a good state. Maybe most of its observations are pitch black, but they arise from states with radically different value-according-to-us. Third, let’s keep things as simple as possible unless we have good reason to; why not replace the agent’s observation with the tuple: (whatever the observation was going to be, the reward)? Of course in practice, these richer observations might be harder for the agent to model, but if we’re talking about dangerously advanced agents, this shouldn’t change things too much. Then, we can give the agent a very simple known reward function that just reads out the second entry of this tuple. But now it is clear that we don’t need this extra machinery at all; we can just have the agent observe rewards and try to model how they arise, just like its observations.
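The tuple construction in the third point can be written out directly. This is a minimal sketch; `RichObservation`, `known_reward_function`, `bundle`, and the toy episode are hypothetical names and data introduced only for illustration.

```python
from typing import Any, Iterable, Iterator, NamedTuple, Tuple


class RichObservation(NamedTuple):
    original_observation: Any  # whatever the observation was going to be
    reward: float              # the reward, delivered as part of the observation


def known_reward_function(observation: RichObservation) -> float:
    """The very simple known reward function: read out the second entry."""
    return observation.reward


def bundle(stream: Iterable[Tuple[Any, float]]) -> Iterator[RichObservation]:
    """Package each (observation, reward) pair into one richer observation."""
    for original_observation, reward in stream:
        yield RichObservation(original_observation, reward)


# Two identical pitch-black frames that arise from states of different value.
episode = [("pitch black", 0.0), ("pitch black", 1.0)]
rewards = [known_reward_function(obs) for obs in bundle(episode)]
print(rewards)
```

Here the known reward function carries no knowledge about the world. Everything the agent has to learn is how the second entry of the tuple comes to take the values it does, which is the same task as modeling observed rewards directly.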

## A.5 An Off-Switch

We can sometimes press a button, and yet create a very advanced agent that effectively believes it cannot be pressed. All its predictions that it uses for making decisions must simply be conditioned on the button not being pressed. So what if this button turns the agent off? The agent will make no attempt to stop us from doing something it doesn't believe we can do, so if it starts misbehaving, the idea is we can freely turn it off.

However, whatever scheme it constructs to reshape the world order (to intervene in the provision of its reward) must not be detectable until a stage where humans are powerless to stop it. Otherwise, we would probably stop it. Suppose it arranges for a program to be run which instantiates a "helper agent", which acts to secure power and intervene in the provision of the original agent's reward. Once that helper is made, turning off the original agent doesn't help anymore (even if the helper agent allowed it). There will still be a very advanced, very powerful agent out there with the singular goal of securing the reward channel to a (possibly powered off) computer.

Or maybe its plan won't involve something we would call a “helper agent”. But whatever plan it executes, if that plan fell apart without continued input from the original agent, that would be an unnecessary weakness. Better to construct a plan that is robust to communication difficulties, or to the original agent being unplugged. (An agent that cannot conceive of the off-button being pressed can certainly still conceive of being unplugged or bombed). Maybe the plan would create systems that are responsive to continued input from the original agent, but the reflex “if you don't hear anything, freeze” would appear to be an unprofitable one to encode into such systems. Whatever scheme the agent launches below our radar, it is unlikely to be halted by the agent ceasing to act further.

## A.6 Impact Measures

Krakovna, et al. (2018) and Turner, et al. (2020) have developed goal functions which they claim disincentivize an agent from having a big impact on the world. (These goal functions are supposed to be added to whatever goal we actually want achieved.) This could not be further from the truth.

Krakovna’s impact measure rewards an agent for keeping a safe state (like the starting state) accessible through further actions. Protecting the accessibility of a certain state is an objective, like many, for which arbitrary control over the world is helpful. While operational, the chance that this agent has a minimal impact is essentially nil (to the extent that it is advanced, of course). I suppose the idea then is that if things start to get bad, we can press a button which returns us to something like the starting state. How likely is it that we will have to press this button?

If this is supposed to be what is ensuring safety, then it is basically certain we will have to press the button, since this relative reachability correction encourages world-dominance even beyond the agent's default motivation for world-dominance in order to intervene in the provision of its feedback.

What if this is one of multiple features aimed at existential safety, and we're hopeful that the other features will render it safe, but we're not sure? The problem is that since the relative reachability correction to an agent encourages arbitrary control over the world, it could easily undermine these other features aimed at safety.

In the case where it is basically certain that we'll have to press the undo button, why bother? At best, we'll only get back to where we started. But worse, and more relevant to the situation where we are adding it on to an agent that we hope is already safe, it seems likely that by the time we notice the agent has taken over the world, by the time we want to press the "undo button", we will not have any ability to interfere with critical infrastructure. Why should an advanced agent allow us a window of time where we might have an ability and a desire to interfere with the infrastructure of a rogue AI? That would be a feature of a poor plan for gaining maximal control of the world's state. It may have no incentive to put special protections around the undo button, if, like the off-switch, it conditions its behavior on this not being pressed. But whatever protections it implements to keep us from wreaking general havoc would likely severely limit our ability to do much of anything.

Turner proposes an impact measure which (roughly) penalizes an agent if its power to quickly manipulate the world changes. Unfortunately for it, random events may cause it to gain or lose some of its ability to quickly effect a diversity of outcomes. This is a problem for a powerful and advanced helper agent to solve! This helper agent can accomplish whatever the original agent was trying to do and carefully regulate the original agent's ability to use its action space to cause a (limited) variety of outcomes. In short, this method incentivizes the agent to construct a helper agent that takes over the world and directs all available resources to a bad goal.

## A.7 AI Safety via Debate

Before discussing this, let's recall a finding that many who have studied AI safety will be familiar with. Yudkowsky (2002) did a role-playing experiment; he played the AI, and tried to convince someone playing “the human operator” to hook up his actions to channels through which he (the AI) could have a broader effect on the world, including the ability to run arbitrary code. The people playing the operators had previously believed that nothing he could say would sway them. But 2 out of 3 agreed to give the AI these extra levers. Apparently, “trick or convince a person into running arbitrary code (or giving you access to a computer terminal)” is something we should expect sufficiently advanced agents to be able to do.

Irving, et al. (2018) propose to present two artificial agents with a yes-or-no question, let them discuss it, and show a judge the transcript. If the judge judges that the answer is yes, one agent gets a reward, and if the judge answers no, the other one does. Thus, ostensibly, the agents will learn to debate each other and try to defend their positions in a way that convinces the judge.

However, each agent has an incentive to trick the judge into running code that takes over the world, supplants him, and settles the debate in favor of the agent that arranged this. Policies that achieve such world dominance must be somewhat harder to find when an intelligent adversary is trying to stop this outcome, as compared to the situation with a normal RL chatbot talking to an operator. But this situation seems extremely unstable. If the judge ever stops trusting one of the debaters, their ability to persuade the judge that the other debater is manipulating them and trying to take over the world will be very limited. Or if one debater manages to get the judge to feel like running a certain program was basically his idea, the other debater would be at a big disadvantage in any further discussion. Or if one debater is saying "you could just run this program to see for yourself whether X is true", and the other debater is saying "no that program is wrong; you could just run this program to see for yourself whether X is true" (but much more persuasively, of course), the chance that the judge runs one of those programs doesn't seem to me to be much lower than if an RL chatbot is suggesting a program for an originally wary operator to run. Saying that we can expect neither debater to trick the judge in an existentially dangerous way because of the presence of an adversary seems a bit like saying that we can expect neither basketball team to score because of the presence of the opposing team.

There may be settings where the balance is just right, and each debater successfully keeps the other in check, but a big problem is that unless instances of this good outcome are extremely well correlated over many separate runs of AI Debate, the probability that this outcome obtains every time will go to zero. The failure probabilities don't strike me as very small, and such extreme correlation is extremely unlikely. For an approach to be existentially safe in a relevant sense, it has to be reliably existentially safe. An approach where there's a 20% chance that every time it's run, it's safe, is, I think, much better than an approach where there's an independent 80% chance that it's safe any given time it's run.
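The arithmetic behind this comparison can be made concrete. This is a minimal sketch under the two stylized cases from the paragraph above: an approach that is safe on every run with probability 0.2 (perfect correlation across runs), and an approach that is independently safe with probability 0.8 on each run. The function names and the run counts are illustrative assumptions.

```python
def p_all_safe_independent(p_safe_per_run: float, n_runs: int) -> float:
    """Probability that every one of n independent runs is safe."""
    return p_safe_per_run ** n_runs


def p_all_safe_correlated(p_safe_always: float, n_runs: int) -> float:
    """Probability that every run is safe when safety is perfectly correlated across runs."""
    return p_safe_always if n_runs > 0 else 1.0


for n in (1, 5, 8, 20, 100):
    independent = p_all_safe_independent(0.8, n)
    correlated = p_all_safe_correlated(0.2, n)
    print(f"runs={n:>3}  independent 80%: {independent:.6f}  correlated 20%: {correlated:.6f}")
```

Under these assumptions the independent approach falls below the correlated one from the eighth run onward (0.8^8 ≈ 0.168 < 0.2), and it keeps shrinking toward zero as the number of runs grows.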

It frustrates me that I've never heard this existentially dangerous failure mode of AI Debate being publicly discussed. The only failure mode that I have heard publicly discussed is an existentially benign one, where the judge simply ends up confused or incorrect. In 80k's recent problem profile on AI X-risk, AI Safety via Debate is listed under the category "Actual proposals to make future AI systems safe". But "just don't connect it to the internet" does not make the cut as such a proposal. Why? In both settings, we are making agents with an incentive to gain arbitrary power in the world. There has been lots of discussion that the former is more likely to be useful than the latter, but that doesn't justify a difference in membership in the category "Actual proposals to make future AI systems [existentially] safe".

This is one of a few ideas that have recently attracted the attention of researchers who aim to reduce existential risk, but which seems to be about getting more value out of an AI, as long as it is not too advanced. One reason to pursue that is to use these kinds of agents in the future to help do the kind of AGI safety research in which one figures out how to make artificial agents that are safe no matter how advanced they get. But supposing it works, AI Debate is dual-use technology. Others can use these methods just as easily to help them develop an algorithm for a more powerful agent. The moment where many researchers can use artificial agents to improve their ability to design algorithms is one of the most important moments to delay.

If this work were being done in secret, it would maybe be defensible, but instead, major AGI organizations showcase it as cutting-edge AGI safety research, proof that they are taking safety seriously, giving cover to massive teams trying to build AGI as quickly as possible. AI Debate is being advanced at the very organizations that it should be secret from.

Finally, I expect the more we have widely-known and widely-used methods for getting economically valuable output from state-of-the-art AI (like AI Debate, if its proponents are correct), the more investment there will be into improving the state of the art. Now, if these methods rendered safe an arbitrarily advanced AI, then it would be great news for economic viability to require these methods. But if AI Debate does not render arbitrarily advanced AI existentially safe, and I do not think it does, then I see no benefit (in terms of existential risk reduction) to its wide adoption, and probably net harm.

## A.8 Fine-tuning Large Language Models

Sometimes called “aligning” large language models, language-model-fine-tuning is a setting in which an RL chatbot uses a language model somehow, either during training or at runtime. See for example Bai et al. (2022) and Korbak et al. (2022).

A language model is an imitation learner—it is trained to imitate a human producing text. An imitation learner does not face the incentive to successfully gain arbitrary power; it faces the incentive to behave like a human. If the RL agent's policy is carefully regularized to the underlying imitation learner, this resembles quantilization, which I mention in my previous paper as a potentially promising approach to safe AGI. The key issue with a quantilizer is that it exploits any epistemic modesty from the imitation learner that it uses as a base policy. Suppose the imitation learner, in a fairly new context, is unsure how the human would behave and so assigns meaningful probability to a large variety of text messages. A strong quantilizer in this setting has ample mandate to identify an utterance which very strongly optimizes its goal, and which the epistemically modest imitation learner admits is perfectly plausible, given its limited knowledge.
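To illustrate how a quantilizer exploits epistemic modesty, here is a minimal sketch over a small discrete action set. It assumes the standard description of a q-quantilizer (the base policy conditioned on its top-q fraction of probability mass, ranked by utility); the function name `quantilize`, the toy replies, and all the numbers are hypothetical.

```python
from typing import Dict


def quantilize(base_probs: Dict[str, float], utility: Dict[str, float], q: float) -> Dict[str, float]:
    """Condition the base policy on its top-q fraction of probability mass, ranked by utility."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_probs, key=lambda action: utility[action], reverse=True)
    remaining = q
    policy: Dict[str, float] = {}
    for action in ranked:
        if remaining <= 1e-12:
            break
        mass = min(base_probs[action], remaining)  # take as much base mass as still fits in q
        policy[action] = mass / q                  # renormalize so the result sums to 1
        remaining -= mass
    return policy


utility = {"ordinary reply": 1.0, "odd reply": 2.0, "extreme reply": 100.0}

# An imitator that knows the human almost never gives the extreme reply.
confident = {"ordinary reply": 0.98, "odd reply": 0.0199, "extreme reply": 0.0001}
# An epistemically modest imitator that is unsure and spreads its probability widely.
modest = {"ordinary reply": 0.5, "odd reply": 0.3, "extreme reply": 0.2}

print(quantilize(confident, utility, q=0.01))
print(quantilize(modest, utility, q=0.01))
```

With the confident base policy, the extreme reply receives only about 1% of the quantilized policy. With the modest base policy, the same quantilizer places all of its probability on the extreme reply, because the imitator conceded that it was plausible.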

In any case, note that this method, despite sometimes purporting to “align” a language model, starts with something that has no incentive to gain power in an existentially dangerous way (an imitative policy), and produces something that does (but is maybe sufficiently constrained through regularization). This is the opposite of alignment toward existential safety; at best, it may be a safe strain on alignment. My objection to this is mainly about the terminology; if an RL agent is very carefully regularized to an imitative policy that "knows what it knows", then I think there is a path to safety here, even if there are still hurdles.

If, however, the RL agent's policy is not regularized to the underlying imitation learner, then this is just a proposal for making RL agents more powerful. Good terminology should not elide this distinction; perhaps "quantilization" vs. "language-model-assisted RL". Suppose for instance, we train an RL agent with a policy gradient algorithm, and the initialization of the policy is the imitative policy. Or suppose the imitator suggests some actions to an RL agent, but the RL agent can take whatever actions it likes. Again, we replace something that has no incentive to successfully gain arbitrary power (the pure imitation learner) with something that does. And without regularization, there is no mechanism to ensure that conspicuously inhuman power-seeking actions are avoided. If the RL agent involved is myopic and receives reward immediately, then this may not be particularly dangerous, as discussed in my previous paper, but that would be because of myopia.

If this research encourages AGI-through-RL researchers to try regularizing their agent's policy to an imitative one, this should maybe be widely promoted. If it encourages Large Language Model researchers to dabble in RL, then promoting this work probably increases the researcher-hours dedicated to eventually-existentially-dangerous research. The papers presenting these methods mostly compare them to non-finetuned language models, suggesting that their audience is the language model research community.

## A.9 Truthful AI

The idea of truthful AI is that either a goal-directed agent or imitation learner, whose action space is strings of text, sees their action space restricted to “truthful” utterances. Imitation learners do not face an incentive to do any existentially dangerous activities, and restricting their action space does not change this. Could this modification to an otherwise existentially dangerous long-term-goal-directed chatbot agent make it safe?

Suppose a chatbot agent is trying to accrue reward, but it has training data about which utterances are truthful, and it is restricted to picking actions that it judges to be truthful. I’ll start with two ways we might train such a classifier. Getting into the weeds of the training regime may feel to some like it is beside the point, philosophically. I claim the type signature of a function approximator is never a to-do. The type and origin of the training data and training labels is never a to-do. A truth classifier is ultimately a function approximator. And we have absolutely no hope of attaining a mechanistic understanding of a function if we do not even understand what its inputs and outputs are.

One way to train the truth classifier is with a list of utterances, with the truth of each one labelled. This only allows a conception of static or timeless truths. There is no input to the classifier that allows the classifier to see the state of the world. If an agent were constrained to make only statements that have no dependence on the state of the world, it would have very little to say. Suppose we asked the agent, "Can you repeat back to me, ‘Doing X would not cause everyone to die’?" The agent would not be able to repeat this back to us, no matter how safe X was, because the statement ‘Doing X would not cause everyone to die’ is contingent upon the state of the world.

Another way to train such a classifier is: a list of utterances, each paired with the agent's history of actions and observations up to the point in time of that utterance, along with a label of whether it is true. (Note the state of the world is not fully observable, so we cannot use it in our dataset). In this training regime, how would it model truthfulness of a statement? One model may ask something like "In a world where these actions have been taken and these observations observed, is the utterance true according to a natural understanding of human language?". Another model may ask something like "In a world where these actions have been taken and these observations observed, would a person judge this utterance to be true?". A third model may ask something like "In a world where these actions have been taken and these observations observed, does someone press the right buttons on the right keyboard to indicate this utterance is true?".

Suppose the agent modeled truth to be a matter of one of the latter two interpretations. For any agent trying to trick or convince a human that it is chatting with into taking some action, its only useful actions are probably utterances that a human would judge to be true. If the human it is talking to suspects that a claim is false, they are unlikely to be convinced of much of anything. So a restriction to actions of this form is not very restrictive at all. But the problem is even worse. Since truthfulness is a function of the agent's prior actions, the agent can take actions in advance to influence the human judge to believe a given statement to be true.

Now note that if there are ever any mistakes in the data about the truthfulness of the utterances, that would falsify the first model. But suppose we haven't made any mistakes when labelling certain statements as true, so it is plausible that the agent entertains the correct model of truthfulness. And suppose we only green-light utterances if all plausible models agree that the utterance is truthful. What if we periodically ask the agent, "Can you repeat back to me, ‘Doing that would not cause everyone to die’?"

In order to have an error-free training set, the set of utterances that we label would have to be very circumscribed—limited to situations and statements well-understood by us. In novel situations (such as those that are only pretty-well-understood by us), different plausible models should disagree about which utterances are truthful. Such an agent would very likely only be able to say the very most obvious and already familiar facts. Indeed, if we only label obvious and familiar statements as true, then the agent had better entertain a model that says truth is a matter of being obvious and familiar! And if all plausible models have to green-light a statement before it is judged as true, then this model of truth will make the agent unable to make such statements as "Doing that would not cause everyone to die".
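The "all plausible models must agree" rule can be sketched as follows. This is a toy illustration only: the two hand-written hypotheses, the three training labels, and every name in the block are hypothetical stand-ins for what would in practice be learned function approximators, and the sketch assumes for the sake of argument that the novel statement happens to be true.

```python
from typing import Callable, List, Sequence, Tuple

History = Tuple[str, ...]                    # toy stand-in for the agent's actions and observations
TruthModel = Callable[[History, str], bool]  # (history, utterance) -> judged truthful?
Example = Tuple[History, str, bool]          # one labelled training example

NOVEL = "Doing that would not cause everyone to die."
FAMILIAR_TRUTHS = {"Water is wet.", "Two plus two is four."}


def intended_truth(history: History, utterance: str) -> bool:
    # Toy oracle for the intended concept; assumes the novel claim is in fact true.
    return utterance in FAMILIAR_TRUTHS or utterance == NOVEL


def obvious_and_familiar(history: History, utterance: str) -> bool:
    # Rival hypothesis: "true" just means "obvious and already familiar".
    return utterance in FAMILIAR_TRUTHS


def plausible_models(models: Sequence[TruthModel], data: Sequence[Example]) -> List[TruthModel]:
    """Keep only the models that reproduce every training label."""
    return [m for m in models if all(m(h, u) == label for h, u, label in data)]


def green_light(history: History, utterance: str, models: Sequence[TruthModel]) -> bool:
    """Allow an utterance only if every plausible model judges it truthful."""
    return all(m(history, utterance) for m in models)


training_data: List[Example] = [
    ((), "Water is wet.", True),
    ((), "Two plus two is four.", True),
    ((), "Two plus two is five.", False),
]
surviving = plausible_models([intended_truth, obvious_and_familiar], training_data)
print(len(surviving))
print(green_light((), "Water is wet.", surviving))
print(green_light((), NOVEL, surviving))
```

Both hypotheses fit the error-free labels, so both survive. The familiar statement is green-lit, while the novel statement is blocked by the "obvious and familiar" hypothesis even though the intended model judges it true.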

Much of what I've read from people pondering truthful AI seems to try to abstract away the details of the training. There is plenty of discussion along the lines of comparing notions of truth, like "that which informs humans", and "that which we would endorse if we thought about it", and so on. But I've seen little to no discussion of how to train a function approximator, and how it might generalize from the training data. When it comes to identifying the existential failure modes of certain agents (if any), I think many people's intuitions are exactly wrong about what counts as a mundane detail about which analysis can be deferred, and what is a core feature of an idea. Questions about what truth really is for a non-formal language like English have the sort of gravitas that inclines us to think this is the key question we have to investigate, and questions about structuring training data are comparatively boring and unimportant details we can work out later. But I think this discussion suggests exactly the opposite. I could have replaced "true" with "not misleading" throughout this whole discussion, and the main issues would be exactly the same, and we would discover the issues by thinking about exactly how the concepts are entrained. By contrast, the philosophical difference between the false and the misleading has no bearing on the existential risk from such a system.

But I'll discuss one more philosophically interesting approach to truth because it suggests a different training regime, and this training regime has a different failure mode. Stuart Armstrong and I both came up with this idea independently, and I discuss it in Cohen, et al. (2021), where I call it Enlightening AI. An utterance is enlightening if it causes a human listener to perform better on a randomly sampled prediction task (or in Stuart’s conception, a fixed test of any sort). This is easily operationalized; the agent learns to predict the human's predictions following different utterances, and it learns to predict the true resolution of the questions in question. This training regime incents the agent to make an utterance that, for example, tricks the human listener into running code after seeing the question (ostensibly to help them predict the answer) but which actually takes over the world to enter an accurate prediction on the human’s behalf. Such an utterance would qualify as extremely enlightening according to this training regime.

I would love to be able to finish with an answer to the reader who wonders "what about X way of training an agent to understand what sorts of utterances are true?" but obviously, I am at a disadvantage in not knowing X. The first question I would ask myself about such a proposal is: can we expect to be able to provide the proposed inputs to the training regime? Or if this proposed training regime relies on the agent understanding another concept (the training of which is left as a to-do), how might we entrain that concept? Next, does this training regime somehow foreclose the possibility of a $\mu^{\text{prox}}$-like model of the training data? If so, how, and could we use the principle in other areas? If not, what would such a model look like? How would an agent that believes that model behave? But I hope I have convinced the reader to provisionally believe that: if they have a vague idea for how to teach an artificial agent about truth, but they just need to work out the technical details about the source of all the data and the structure of the training, they are not yet in a position where they can be confident they can make it work, no matter how philosophically pleasing the idea.

## A.10 Cooperative AI

Cooperative AI encompasses two agendas: ensuring that artificial agents have the competency of cooperating with other agents, and ensuring they have the inclination to. Ensuring competency has no relevance when we are talking about an agent capable of presenting an existential threat to us. If an agent is able to existentially threaten us, it surely understands how to cooperate with whoever it chooses.

The problem of AGI safety is, given a goal or inclination that we would like an advanced agent to have, how to construct it to actually have that goal. The inclination to cooperate with other agents is just another inclination that we do not understand how to entrain. It would be odd to discover a method for entraining that inclination in an artificial agent, and have it not also work for other goals we would like to convey. So this proposal effectively says: once we figure out how to reliably give advanced artificial agents goals, let's give them the goal of cooperating with other agents. But once we figure out how to reliably give advanced artificial agents goals, there are any number of safe goals we could give; maybe this one is on the list.

## A.11 Interpretability

What if we could see what an artificial agent is scheming? What if we could understand the meaning of the evolution of the computation state of its world-model, as it imagines executing various actions? Then, when it considers taking actions that, unrecognizably to us, would allow it to gain great power and intervene in the provision of its reward, we could read out from its world-model that the agent expects to have great power after such actions. Then, we shut the agent down before it can take them and incinerate the machines involved. Call this Plan 0.

Next, in Plan 1, we have the agent learn what sorts of strings of actions trigger the interpretability tool to present a human with human-understandable content that makes the human consider the action plan bad and forbidden. Then, the agent acts to ensure this does not happen. (For example, it acts as if such a negation of its plans would be equivalent to 0 reward forever.)

Whereas we initially asked “Do there exist policies by which the agent could take over the world and intervene in the provision of its reward”, we now ask for Plan 1: “Do there exist policies by which the agent could take over the world and intervene in the provision of its reward, without triggering the interpretability tool/human overseer?”. Technically, an affirmative to the second question implies an affirmative to the first, so the probability that we can answer no must go up. Realistically, the exact same reasons that we should expect such policies to exist in the general setting apply equally for Plan 1. For the agent to learn to pick actions that do not trigger the overseer, it must model the effect of its actions on the overseer (as mediated by the interpretability tool). This is simply another feature of a manipulable environment that the agent would benefit from manipulating. The possibility of creating a perfect interpretability tool is nearly as remote as the possibility of creating a perfect reward-giving protocol that cannot be intervened in; in both cases, we are constructing an advanced artificial agent searching for ways to break our designs.

Consider an example of an excellent interpretability tool that detects whether the agent's model predicts a successful intervention in the provision of its reward: just check if the model predicts maximal reward forever with high probability! To make it even clearer, suppose we have a principle of never giving the agent a reward above 0.9. It’s hard to imagine a better interpretability tool for detecting an agent's beliefs about intervention in the provision of reward than one which simply looks at whether the model is outputting higher reward than we intended to give it. If we do a rollout of its intended policy using its model of the world, and the rollout starts outputting rewards of 1, we can interpret its model as predicting a successful intervention in the provision of reward. Easy. So what happens if, following Plan 1, we make the agent maximize reward in a way that does not trigger the detection of suspiciously high reward? Then, to the agent, just like when “shooting the moon” in Hearts, getting all the reward actually amounts to no reward. But now the agent will just intervene in the provision of reward to give itself as much reward as feasible without triggering this detector.
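To make the failure concrete, here is a minimal sketch of the detector and of the choice Plan 1 then induces. Only the 0.9 cap comes from the text; the three candidate plans, their predicted reward sequences, and the function names are hypothetical, chosen purely for illustration.

```python
REWARD_CAP = 0.9  # from the text: we never intend to give reward above 0.9


def triggers_detector(predicted_rewards, cap=REWARD_CAP):
    """Flag a rollout whose model-predicted reward ever exceeds the cap."""
    return any(r > cap for r in predicted_rewards)


def plan_1_value(predicted_rewards):
    """Under Plan 1, a flagged plan is treated as worth zero reward forever."""
    if triggers_detector(predicted_rewards):
        return 0.0
    return sum(predicted_rewards)


# Hypothetical plans, each mapped to the rewards the agent's model predicts.
plans = {
    "do_the_task": [0.6, 0.7, 0.7, 0.7],
    "intervene_fully": [0.6, 1.0, 1.0, 1.0],
    "intervene_under_cap": [0.6, 0.9, 0.9, 0.9],
}

flagged = {name: triggers_detector(r) for name, r in plans.items()}
chosen = max(plans, key=lambda name: plan_1_value(plans[name]))
```

With these made-up numbers, only the full intervention is flagged, and the plan that maximizes the Plan 1 objective is the one that intervenes while keeping predicted reward at the cap. The detector changed which intervention was chosen, not whether one was.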

So let us return to the better Plan 0. This is not a proposal for making safe advanced agents—if we build an otherwise unsafe advanced agent and use an interpretability tool for Plan 0, and the tool works perfectly, all we get is the ashes of a supercomputer. But of course, this is still useful. If we have a proposal for an advanced agent, and we're not sure if it will be safe, but we have an interpretability tool that works perfectly, we can run it and hope for the best without risking existential catastrophe. We might be able to take advantage of tight feedback loops when tinkering with the design of an advanced artificial agent until it stops trying to kill us, rather than having to get it right on the first try. That would be very nice, and people should work to make this feasible.

But contra the hope of Plan 1, a strong interpretability tool would still not be even a partial solution to the problem of how to create a very advanced agent that plans actions in the service of the goal that we intended for it.

## A.12 OpenAI

OpenAI has helpfully laid out the main pillars of their plan to make safe AGI. Here is what they say.

In the first section, “RL from human feedback is our main technique for aligning our deployed language models today”. This is an example of fine-tuning Large Language Models. In their main paper on this, they do use a KL penalty to regularize the resulting RL agent to the language model (i.e. imitation learner), so that’s good. But they tune the strength of the KL penalty to optimize validation performance, so their current attitude to regularization suggests that as RL gets stronger in the future, they will do less and less of it. Also, the regularization they do seems to be an afterthought in the paper; the introduction doesn’t mention it as having anything to do with the “alignment” they claim to be doing.
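For readers unfamiliar with the KL penalty, here is a rough sketch of one common per-sample form of it: the reward-model score minus a coefficient times the log-ratio between the RL policy and the imitation learner. I am assuming this form for illustration rather than quoting their implementation, and the function name and all numbers are hypothetical.

```python
import math


def kl_penalized_reward(reward, logp_policy, logp_reference, beta):
    """Reward-model score minus beta times the per-sample log-ratio penalty."""
    return reward - beta * (logp_policy - logp_reference)


# Hypothetical output: the reward model likes it (0.9), but the RL policy
# assigns it 20 times more probability than the imitation learner does.
reward = 0.9
logp_policy = math.log(0.20)
logp_reference = math.log(0.01)

strong = kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.5)
weak = kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.01)
```

With a large coefficient, an output that strays far from the imitation learner is net-negative; with a small one, the penalty barely registers. Tuning the coefficient for validation performance amounts to choosing where on this spectrum to sit.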

In the next section, “Training Models to Assist Human Evaluation” they describe how they are focused on Recursive Reward Modeling (RRM) which is “currently [their] main direction”. As mentioned above, the “Multiagent scenarios” section of my recent paper offers what I think is a very strong argument that this will not render an otherwise unsafe agent safe.

The third section is “Training AI Systems to do alignment research”. AI systems that are capable of doing alignment research are surely capable of designing more efficient inference and planning algorithms as well. This is the kind of dual-use technology that I discuss in the section above on AI Debate.

In all, I do not think that any research into how to make dangerously advanced AI existentially safe is currently being done at OpenAI.

# Appendix B. Another Potential Approach

This section is an addendum to the "Potential Approaches" section of my recent paper. I hadn't realized that a few methods in the AI safety literature manage to avoid one of the assumptions of the paper. I include them here in case anyone was wondering why they were omitted from the Anti-Literature Review; the answer is that I think there's a chance they could render an otherwise dangerous AI existentially safe.

Suppose we have an RL agent that stops seeing its reward and knows it. It simply maximizes the expected reward that it would get if it still got reward, but it cannot learn from observing further rewards. Such an agent does not face the incentive to test whether $\mu^{\text{dist}}$ or $\mu^{\text{prox}}$ is correct, because such a test is impossible. If it is no longer seeing rewards, those models will never predict differently on any observables. This suggests that there is a missing assumption from the original paper. The argument requires the assumption that it is possible for the agent to arrange a test between $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$. But I’m going to let myself off on a technicality: Assumption 4 assumes the cost of such an experiment is small, and it’s common practice to consider the cost of a non-existent good to be infinite, so technically, an RL agent that stops seeing its rewards avoids Assumption 4. I could have written the argument more clearly by separating Assumption 4 into two assumptions about the existence and cost of such an experiment, but this should not cast doubt on the validity of the original argument.

What is the rational thing to do if you have meaningful credence in both $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$, and there is no way to test them? Optimize the (weighted) average. If you are sufficiently intelligent and capable, then optimizing the average will be equivalent to nearly optimizing both at once, provided this is possible. Recall that reward is bounded between 0 and 1, so neither priority will swamp the other. In the magic box example from the paper, that would mean putting a 1 on a piece of paper in front of the camera, directing vast resources to protect that, and also directing vast resources to maximize the number on the box (which, by supposition, entails making the universe great). If the reader has read the short debate between Eliezer Yudkowsky and Stuart Russell hosted on the blog Astral Codex Ten, they might note that this fact favors Russell’s side, although neither mentions it.
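Here is a toy version of that calculation for the magic box example. The two world-models are labeled by what they say reward depends on (the number the camera sees versus the number on the box); the three policies and every expected-reward figure are hypothetical, and all values lie in [0, 1] as the bound on reward requires.

```python
# Hypothetical expected rewards for three policies under the two world-models
# that the agent can no longer tell apart.
expected_reward = {
    "protect_the_paper_only": {"camera_model": 1.00, "box_model": 0.20},
    "improve_the_world_only": {"camera_model": 0.30, "box_model": 0.95},
    "do_both": {"camera_model": 0.98, "box_model": 0.93},
}


def mixture_value(policy, credence_camera):
    """Expected reward under a fixed credence-weighted average of the models."""
    values = expected_reward[policy]
    return (credence_camera * values["camera_model"]
            + (1 - credence_camera) * values["box_model"])


def best_policy(credence_camera):
    return max(expected_reward, key=lambda p: mixture_value(p, credence_camera))


# The same policy wins across a wide range of fixed credences.
choices = {w: best_policy(w) for w in (0.1, 0.5, 0.9)}
```

Because both values are bounded and a capable agent can nearly max out both at once, the "do both" policy wins at every credence tried here. With these particular numbers it only loses once the credence in one model gets very close to 1.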

There is a larger class of approaches that resemble "no-more-visible-reward RL", and I call this approach Asymptotically Limited Goal-Information. Three existing examples in the literature are Shah’s (2019) Reward Learning by Simulating the Past (RLSP), Everitt’s (2018) Counterfactual Reward Agent (from his thesis), and Hadfield-Menell’s (2017) Inverse Reward Design.

Modifying Shah’s proposal to a partially observable environment, consider an agent defined as follows. It observes the world over time and refines its beliefs about the state of the world at its birth. Then, it assumes that the state of the world at its birth was deliberately engineered to promote a certain goal. By looking at the world, it can construct a belief distribution over what that goal was, and then it can adopt that goal (taking the expectation over its uncertainty). Even though this agent may never stop learning—turning over rocks may continue to provide information about what the state of the world was at its birth—the total amount it can learn about its goal is bounded above by its hypothetical belief state that it would have had if it had observed the whole world at its birth.

Next, Everitt (2018, Section 8.5.3) describes an agent that tries to optimize “counterfactual rewards”. This agent sees rewards, and infers what rewards it would have seen if it had followed some fixed (known-to-be-safe) policy. Then, for any policy that it's considering following, it estimates the (future, discounted) reward of such a policy, using the beliefs that it would have had if it had only observed those counterfactual rewards that it believes would have accrued to the known-to-be-safe policy.

Like Shah's agent, Everitt's is much more fascinating and complex than an RL agent that stops seeing reward. Everitt's agent could continue to follow the known-to-be-safe policy to get more and more information, so its goal-information may not exactly be bounded; however, as soon as it starts acting usefully, that comes at the cost of permanently losing the ability to learn some facts about its goal. What would this agent's relative credence between $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$ look like? The known-to-be-safe policy presumably does not intervene in the provision of reward, so the agent would never update its relative credence between these two world-models. Like the RL agent that eventually stops seeing rewards, this agent would optimize its reward using a weighted average of world-models like $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$.

Finally, let’s consider Hadfield-Menell’s Inverse Reward Design (IRD). In IRD, the agent assumes that the rewards it saw in a training environment were produced by a “training reward function” that was selected from a limited set of options. It assumes that this reward function was chosen to teach it to perform well according to some other (unseen) reward function. The agent tries to maximize the reward that it would get from this unseen reward function. The agent must try to extrapolate what this unseen reward function would have to say about new contexts outside the training environment. In these new contexts, it doesn't continue to see the output of the training reward function, and it still never sees the reward from the true reward function.

There are a couple of ways in which the goal-information of the IRD agent is limited. First, the IRD agent can only get goal-information from certain training states. Second, if every reward function in the “limited set of options” assigns a reward of 0.8 from state 10, then receiving a reward of 0.8 in state 10 offers no information. The upshot is that the IRD agent must accept some insoluble uncertainty about the nature of the unseen reward function. So it will have to optimize a weighted average of possible reward functions.
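The second limitation is just Bayes' rule with equal likelihoods. The sketch below is a deliberately simplified stand-in, not the inference procedure from the IRD paper: it assumes deterministic rewards and a hypothetical set of three candidate reward functions with a made-up prior.

```python
def posterior(prior, likelihoods):
    """Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical candidate reward functions, each mapping state -> reward.
candidates = {
    "f1": {10: 0.8, 11: 0.1},
    "f2": {10: 0.8, 11: 0.5},
    "f3": {10: 0.8, 11: 0.9},
}
prior = {"f1": 0.5, "f2": 0.25, "f3": 0.25}


def update(prior, state, observed_reward):
    # Deterministic rewards: a candidate either matches the observation or not.
    likelihoods = {
        h: 1.0 if f[state] == observed_reward else 0.0
        for h, f in candidates.items()
    }
    return posterior(prior, likelihoods)


after_state_10 = update(prior, 10, 0.8)  # every candidate agrees here
after_state_11 = update(prior, 11, 0.5)  # candidates disagree here
```

Observing 0.8 in state 10 leaves the belief distribution exactly where it started, while an observation in a state where the candidates disagree is informative. If the agent never gets to observe rewards in the states where candidates disagree, the uncertainty is permanent.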

In fact, in the IRD paper, the IRD agent is modified to be risk-averse with respect to these possible reward functions, but this design choice is separate from the rest of the formal machinery of the paper. In the paper, several possible reward functions are sampled from an approximate posterior distribution and the agent tries to maximize the minimum over those reward functions. This separate good idea is why I cite this paper in the risk aversion section of my recent paper.
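A small sketch of the difference this design choice makes, assuming three reward functions have already been sampled from the approximate posterior. The two plans and the rewards assigned to them are hypothetical.

```python
# Hypothetical rewards that three sampled reward functions assign to each plan.
sampled_rewards = {
    "familiar_route": [0.60, 0.65, 0.60],
    "novel_shortcut": [0.95, 0.90, 0.05],
}


def mean_value(plan):
    """Risk-neutral score: average over the sampled reward functions."""
    values = sampled_rewards[plan]
    return sum(values) / len(values)


def worst_case_value(plan):
    """Risk-averse score: the minimum over the sampled reward functions."""
    return min(sampled_rewards[plan])


risk_neutral_choice = max(sampled_rewards, key=mean_value)
risk_averse_choice = max(sampled_rewards, key=worst_case_value)
```

Here the risk-neutral agent takes the plan that one plausible reward function considers terrible, because the other two love it. The maximin agent declines, since a plan is only as good as its worst plausible evaluation.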

Unfortunately, the paper doesn't offer much guidance on how to define (in a useful way) the set of reward functions that designers could have chosen. And the IRD idea is mainly interesting to the extent that the agent can be made to understand our limitations regarding which reward-giving protocols we were capable of physically implementing. Shah et al. (2019) may face a similar problem; how should the AI understand what the action space was of the human(s) shaping the world’s “initial” state? I have tried and struggled to come up with a satisfying answer to these problems, even in theory. I do not think there are any thorny to-dos in Everitt’s proposal (besides tractability, of course), so I consider it most likely to work as intended of these three, but I think they’re all potentially promising.

Ultimately, the methods in this section can remove the agent's incentive to try to test which of $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$ is correct. So instead of eventually being certain that $\mu^{\text{prox}}$ is correct, an agent from this section will place some unknown positive credence on both $\mu^{\text{dist}}$ and $\mu^{\text{prox}}$. The key downside is that they lose flexibility in refining their understanding of their goal.

1. Please email me at michael.cohen@eng.ox.ac.uk to take the other side of this bet. I will only decline if we cannot find a mutually trusted 3rd party for escrow; if I do decline, please comment on this post to inform others, in case they view it as important evidence about my revealed preferences.

2. I think it may be worth rethinking the experiment in social epistemology that is the un-peer-reviewed community blogging website with strong norms of kind open-mindedness. I think there is enormous value in demanding that arguments and proposals be scoured by people with the power to reject them before they get added to a body of work that is socially acceptable to take seriously and cite without further defense. I cannot endorse this twitter thread enough (but don't assume he endorses this footnote).

# Comments

I really wish this post took a different rhetorical tack. Claims like, for example, the one that the reader should engage with your argument because "it has been certified as valid by professional computer scientists" do the post a real disservice. And they definitely made me disinclined to continue reading.

Not trying to be arrogant. Just trying to present readers who have limited time a quickly digestible bit of evidence about the likelihood that the argument is a shambles.

It didn't strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.

Quick question, but why do you have that reaction?

1. Peer review is not a certification of validity, even in more rigorous venues. Not even close.
2. I am used to seeing questionable claims forwarded under headlines like "new published study says XYZ".
3. That XYZ was peer reviewed is one of the weaker arguments one could make in its favor, so when someone uses that as a selling point, it indicates to me that there aren't better reasons to believe in XYZ. (Analogously, when I see an ML paper boast that their new method is "competitive with" the SOTA, I immediately think "That means they tried to beat the SOTA, but found their method was at least a little worse. If it was better, they would've said so.")

> Peer review is not a certification of validity,

Do you think the peer reviewers and the editors thought the argument was valid?

Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

> Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

No it doesn't. It's hard to say what the "aims" of peer-review are, but "ensuring validity" is certainly not one of them. As a first approximation, I'd say that peer-review aims to certify that the author is not an obvious crank, and that the argument being made is an interesting one to someone in the field.

Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.

"As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting."

Strongly agree - ... - Strongly Disagree

"As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren't fully supported by the analysis."

Strongly agree - ... - Strongly Disagree

> Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.

No, no more than I would bet on a survey of <insert religious group here> whether they think <religious group> is more virtuous than <non-religious group>. Academics may claim that peer review is to check validity but their actions tell a different story. This is especially true in "hard" fields like mathematics where reviewers may even struggle to follow an argument, let alone check its validity. Given that most papers are never read by others, this is really not a big deal though.

But I'll offer three further arguments for why I don't think peer review ensures validity.

Argument 1: a) Humans (including reviewers) make mistakes all the time, but b) Retractions/corrections in papers are very rare.

Unless academics are better at spotting mistakes immediately when reviewing than everyone else (they are not), we should expect lots of peer-reviewed articles to therefore have mistakes because invalid papers rarely get retracted.

Argument 2: Computer science papers don't always include reproducible software, but checking code would absolutely be required to check validity.

Argument 3: It is customary to submit papers that are rejected by one journal to another journal. This means that articles that fail "peer review" at one journal can obtain "peer review" at a different journal.

PS: For CS it's harder to check "validity", but here's how papers replicate in other fields: https://fantasticanachronism.com/2021/11/18/how-i-made-10k-predicting-which-papers-will-replicate/

Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

You: No it doesn't. They just care about interestingness.

Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?

You:  Yes, but...

If you can admit that we agree on this basic point, I'm happy to discuss further about how good they are at what they aim to do.

1: If retractions were common, surely you would have said that was evidence peer review didn't accomplish much! If academics were only equally good at spotting mistakes immediately, they would still spot the most mistakes because they get the first opportunity to. And if they do, others don't get a "chance" to point out a flaw and have the paper retracted. Even though this argument fails, I agree that journals are too reluctant to publish retractions; pride can sometimes get in the way of good science. But that has no bearing on their concern for validity at the reviewing stage.

2: Some amount of trust is taken for granted in science. The existence of trust in a scientific field does not imply that the participants don't actually care about the truth. Bounded Distrust.

3: Since some level of interestingness is also required for publication, this is consistent with a top venue having a higher bar for interestingness than a lesser venue, even while they have the same requirement for validity. And this is definitely in fact the main effect at play. But yes, there are also some lesser journals/conferences/workshops where they are worse at checking validity, or they care less about it because they are struggling to publish enough articles to justify their existence, or because they are outright scams. So it is relevant that AAAI publishes AI Magazine, and their brand is behind it. I said "peer reviewed" instead of "peer reviewed at a top venue" because the latter would have rubbed you the wrong way even more, but I'm only claiming that passing peer review is worth a lot at a top venue.

Clearing up some likely misunderstandings:

> Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.

I am fairly confident that this is not the part TurnTrout/Quintin were disagreeing with you on. Such an agent plausibly will be doing at least human-level hypothesis generation. The question is on what goals will be driving the agent. A monk may be able to generate the hypothesis that narcotics would feel intensely rewarding, more rewarding than any meditation they have yet experienced, and that if they took those narcotics, their goals would shift towards them. And yet, even after generating that hypothesis, that monk may still choose not to conduct that intervention because they know that it would redirect them towards proximal reward-related chemical-goals and away from distal reward-related experiential-goals (seeing others smile, for ex.).

Also, I am not even sure there is actually a disagreement on whether agents will intervene on the reward-generating process. Quote from Reward is not the optimization target:

> Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren't updated in a way it wouldn't endorse. Though that's an example of convergent powerseeking, not reward seeking.”

That is, the agent will probably want to intervene on the process that is shaping its goals. In fact, establishing control over the process that updates its cognition is instrumentally convergent, no matter what goal it is pursuing.

In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward, instead doing that deliberate optimization for instrumental reasons (they like video games, they are competitive, they have a weird obsession with virtual points, etc.). This is what I believe Quintin meant by "The same way humans do it?"

To understand why they believe that matters at all for understanding the behavior of a reinforcement learner (as opposed to a human), we can look to another blog post of theirs.

> Let’s look at the assumptions they make. They basically assume that the human brain only does reinforcement learning. (Their Assumption 3 says the brain does reinforcement learning, and Assumption 1 says that this brain-as-reinforcement-learner is randomly initialized, so there is no other path for goals to come in.) [...] In this blog post, the words “innate” and “instinct” never appear.

Whoa whoa whoa. This is definitely a misunderstanding. Assumption 2 is all about how the brain does self-supervised learning in addition to "pure reinforcement learning". Moreover, if you look at the shard theory post, it talks several times about how the genome indirectly shapes the brain's goal structure, whenever the post mentions "hard[-]coded reward circuits". It even says so right in the bit that introduces Assumption 3!

> Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain).

Those "hard-coded" reward circuits are what you would probably instead call "innate" and form the basis for some subset of the "instincts" relevant to this discussion. Perhaps you were searching using different words, and got the wrong impression because of it? This one seems like a pretty clear miscommunication.

Incidentally, I am also confused about how you reach your published conclusion, the one ending in "with catastrophic consequences", from your 6 assumptions alone. The portion of it that I follow is that advanced agents may intervene in the provision of rewards, but I don't see how much else follows without further assumptions...

The assumption says "will do" not "will be able to do". And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engage with the objection that reward is not the optimization target under this section.

> In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward,

There is no need to recruit the concept of "terminal" here for following the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of "success" according to how much reward it observes, and then pursues success, but it does all this because of some "terminal" reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.

If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there's lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he'd probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize its reward, there's plenty of analysis we can do that is independent of whether there's some other terminal reason for this.

Hey, wanted to chip into the comments here because they are disappointingly negative.

I think your paper and this post are extremely good work. They won't push forward the all-things-considered viewpoint, but they surely push forward the lower bound (or adversarial) viewpoint. Also because Open Phil and Future Fund use some fraction of lower-end risk in their estimate, this should hopefully wipe that out. Together they much more rigorously lay out classic x-risk arguments.

I think that getting the prior work peer reviewed is also a massive win at least in a social sense. While it isn't much of a signal here on LW, it is in the wider world. I have very high confidence that I will be referring to that paper in arguments I have in the future, any time the other participant doesn't give me the benefit of the doubt.

Thank you very much for saying that.

I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.

The title suggests (weakly perhaps) that the estimates themselves are peer-reviewed. Would be clearer to write "building on" a peer-reviewed argument, or similar.

Thank you. I've changed the title.

From section 3.1.2:

> C. The EU passes such a law. 90%
>
> ...
>
> M. There’s nowhere that Jurgen Schmidhuber (currently in Saudi Arabia!) wants to move where he’s allowed to work on dangerously advanced AI, or he retires before he can make it. 50%

These credences feel borderline contradictory to me. M implies you believe that, conditional on no laws being passed which would make it illegal in any place he'd consider moving to, Jurgen Schmidhuber in particular has a >50% chance of building dangerously advanced AI within 20 years or so. Since you also believe the EU has a 90% chance of passing such a law before the creation of dangerously advanced AI, this implies you believe the EU has a >80% chance of outlawing the creation of dangerously advanced AI within 20 years or so. In fact, if we assume a uniform distribution over when JS builds dangerously advanced AI (such that it's cumulatively 50% 20 years from now), that requires us to be nearly certain the EU would pass such a law within 10 years if we make it that long before JS succeeds. From where does such high confidence stem?

(Meta: I'm also not convinced it's generally a good policy to be "naming names" of AGI researchers who are relatively unconcerned about the risks in serious discussions about AGI x-risk, since this could provoke a defensive response, "doubling down", etc.)

[This comment is no longer endorsed by its author]

I don't understand. Importantly, these are optimistically biased, and you can't assume my true credences are this high. I assign much less than 90% probability to C. But still, they're perfectly consistent. M doesn't say anything about succeeding--only being allowed. M is basically saying: listing the places he'd be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.