Recent Discussion

With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept.  Thanks also for comments from Ramana Kumar.

Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.

Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.

This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.  Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.

|  | Unipolar take-offs | Multipolar take-offs |
| --- | --- | --- |
| Slow take-offs | &lt;not this post&gt; | Part 1 of this post |
| Fast take-offs | &lt;not this post&gt; | ... |
Andrew Critch (2 points, 1d): Thanks for the pointer to Grace (2020), "Misalignment and misuse: whose values are manifest?" [https://aiimpacts.org/misalignment-and-misuse-whose-values-are-manifest/]! I've added it to the original post now under "successes in our agent-agnostic thinking". For sure, that is the point of the "successes" section. Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my eye there should be more communication across the boundary of that bubble."
CarlShulman (8 points, 2d): Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T, with a world economy of >$130T. For AI and computing industries the concentration is even greater. These leading powers are willing to regulate companies and invade small countries based on reasons much less serious than imminent human extinction. They have also avoided destroying one another with nuclear weapons. If one-to-one intent alignment works well enough that one's own AI will not blatantly lie about upcoming AI extermination of humanity, then superintelligent locally-aligned AI advisors will tell the governments of these major powers (and many corporate and other actors with the capacity to activate governmental action) about the likely downside of conflict or unregulated AI havens (meaning specifically the deaths of the top leadership and everyone else in all countries). Within a country, one-to-one intent alignment for government officials or actors who support the government means superintelligent advisors identify and assist in suppressing attempts by an individual AI company or its products to overthrow the government. Internationally, with the current balance of power (and with fairly substantial deviations from it) a handful of actors have the capacity to force a slowdown or other measures to stop an outcome that will otherwise destroy them. They (and the corporations that they have legal authority over, as well as physical power to coerce) are few enough to make bargaining feasible, and powerful enough to pay a large 'tax' while still being ahead of smaller actors. And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing. That situation could change if AI enables tiny firms and countries to match the superpowers in AI capabilities or WMD before leading powers can block it.
So I agree with other
Andrew Critch (2 points, 1d): Carl, thanks for this clear statement of your beliefs. It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop. Is that right? If so, could you give me a % confidence on that position somehow? And if not, could you clarify? To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, i.e., I'd bid 10% to buy in a prediction market on this claim if it were settlable. In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.) I should add that my numbers for both of those outcomes are down significantly from ~3 years ago due to cultural progress in CS/AI (see this ACM blog post [https://acm-fca.org/2018/03/29/negativeimpacts/]) allowing more discussion of (and hence preparation for) negative outcomes, and government pressures to regulate the tech industry.
Rohin Shah (2 points, 2d): Planned summary for the Alignment Newsletter: Planned opinion (shared with Another (outer) alignment failure story [https://www.alignmentforum.org/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story]):
Andrew Critch (3 points, 1d): Yes, I agree with this. Yes! +10 to this! For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here [The previous story tends to frame this more as a failure of humanity's coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.]) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply. I'm trying to say "Hey, if you care about AI x-risk, alignment isn't the only game in town", and staking some personal reputation points to push against the status quo where almost everyone who is x-risk-oriented will work on alignment and almost nobody who is x-risk-oriented will work on cooperation/coordination or multi/multi delegation. Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...
Paul Christiano (4 points, 16h): In fairness, writing [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=GvnDcxYxg9QznBobv] “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.” (I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.) On top of that, in your prior post [https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1#Interpretability_in_ML__IntML_] you make stronger claims:

* "Contributions to OODR research are not particularly helpful to existential safety in my opinion.”
* “Contributions to preference learning are not particularly helpful to existential safety in my opinion”
* “In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents)

In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses. I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (th
Rohin Shah (5 points, 19h): I think that probably would be true. Fwiw my reaction is not "Critch thinks Rohin should do something else", it's more like "Critch is saying something I believe to be false on an important topic that lots of other people will read". I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I'd say "but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z". (I've had enough conversations with you at this point to know the axes of disagreement, and I think you've convinced me that "which one is better on the margin" is not actually that important a question to get an answer to. So now I don't feel as much of an urge to respond that way. But that's how I started out.)
Andrew Critch (1 point, 1h): Got it, thanks!
romeostevensit (1 point, 3d): Further prior art: Accelerando [https://www.antipope.org/charlie/blog-static/fiction/accelerando/accelerando-intro.html]
Sammy Martin (1 point, 3d): Great post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed. The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not like there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations can be weakly modelled this way and individual humans are fully agentive, but Transformative AI will bring up a whole spectrum of more and less agentive things that will fill up the rest of this spectrum. There is a sense in which, if the outcome is something catastrophic, there must have been misalignment, and if there was misalignment then in some sense at least some individual agents were misaligned. Specifically, the systems in your Production Web weren't intent-aligned because they weren't doing what we wanted them to do, and were at least partly deceiving us. Assuming this is the case, 'multipolar failure' requires some subset of intent misalignment. But it's a special subset because it involves different kinds of failures to the ones we normally talk about. It seems like you're identifying some dimensions of intent alignment as those most likely to be neglected because they're the hardest to catch, or because there will be economic incentives to ensure AI isn't aligned in that way, rather than saying that there is some sense in which the transformative AI in the production web scenario is 'fully aligned' but still produces an existential catastrophe.
I think that the difference between your Production Web and Paul Christiano's subtle creeping Outer Alignment failure scenario [https://www.lesswrong.com/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story] is just semantic - you say that the AIs in
Raymond Arnold (4 points, 4d): Curated. I appreciated this post for a combination of:

* laying out several concrete stories about how AI could lead to human extinction
* laying out a frame for how to think about those stories (while acknowledging other frames one could apply to the story)
* linking to a variety of research, with more thoughts on what sort of further research might be helpful.

I also wanted to highlight this section: Which is a thing I think I once heard Critch talk about, but which I don't think had been discussed much on LessWrong, and which I'd be interested in seeing more thoughts and distillation of.
Paul Christiano (3 points, 7d): Quantitatively I think that entities without instrumental resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seem quite low (and the high-growth period during which they have to be maintained is quite short). Also, typically taxes transfer (way more) than that much value from high-savers to low-savers. It's not clear to me what's happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower and it is dwarfed by the typical rate of expropriation from war.

From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don't know exactly what your view on this is. If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I'm quite skeptical about most of the particular kinds of work you advocate). If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else).

Could you explain the advantage you are imagining? Some candidates, none of which I think are your view:

* Single-single alignment failures---e.g.
it's easier to build a widget-maximizing corpora
Andrew Critch (1 point, 1d): Yes. That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least. Yes, I think coordination costs will by default pose a high overhead cost to preserving human values among systems with the potential to race to the bottom on how much they preserve human values. Yes. Imagine two competing cultures A and B have transformative AI tech. Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values. The shift is by design subtle enough not to trigger leaders of A and B to have a bargaining meeting to regulate against A' (contrary to Carl's narrative [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=FsQaed6XLpxiXyda5] where leaders coordinate against loss of control). Subculture A' comes to dominate discourse and cultural narratives in A, and makes A faster/more productive than B, such as through the development of fully automated companies as in one of the Production Web [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=FsQaed6XLpxiXyda5#The_Production_Web__v_1a__management_first_] stories. The resulting advantage of A is enough for A to begin dominating or at least threatening B geopolitically, but by that time leaders in A have little power to squash A', so instead B follows suit by allowing a highly automation-oriented subculture B' to develop. These advantages are small enough not to trigger regulatory oversight, but when integrated over time they are not "tiny".
This results in the gradual empowerment of humans who are misaligned with preserving human existence, until those humans also lose control of their own existence, perhaps willfully, or perhaps carelessly, or through a mix of both.
Paul Christiano (2 points, 1d): I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

* One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned?
* Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.
* Wei Dai has suggested [https://www.alignmentforum.org/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale] that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concer
Andrew Critch (1 point, 1h): Ah! Yes, this is really getting to the crux of things. The short answer is that I'm worried about the following failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes. This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins. (Here, I'm using the word "culture" to encode a mix of information subsuming utility functions, beliefs, decision theory, cognitive capacities, and other features determining the general tendencies of an agent or collective.)

Of course, an easy antidote to this failure mode is to have A or B win instead of A', because A and B both have some human values other than power-maximizing. The problem is that this whole situation is premised on a conflict between A and B over which culture should win, and then the following observation applies: In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization). This observation is slightly different from observations that "simple values dominate engineering efforts" as seen in stories about singleton paperclip maximizers. A key feature of the Production Web dynamic is not just that it's easy to build production maximizers, but that it's easy to accidentally cooperate on building production-maximizing systems that destroy both you and your competitors. Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out. This seems pretty likely to me.
The bolded attribution to Dai above is
Paul Christiano (3 points, 7d): I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders. (In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)
Paul Christiano (6 points, 7d): In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems." My position is:

* Eventually people will work on these problems, but right now they are not working on them very much and so a few people can be a big proportional difference.
* If there is going to be a huge investment in the future, then early investment and training can effectively be very leveraged. Scaling up fields extremely quickly is really difficult for a bunch of reasons.
* It seems like AI progress may be quite fast, such that it will be extra hard to solve these problems just-in-time if we don't have any idea what we are doing in advance.
* On top of all that, for many use cases people will actually be reasonably happy with misaligned systems like those in your story (that e.g. appear to be doing a good job, keep the board happy, perform well as evaluated by the best human-legible audits...). So it seems like commercial incentives may not push us to safe levels of alignment.
Andrew Critch (2 points, 1d): That is not my position if "you" in the story is "you, Paul Christiano" :) The closest position I have to that one is: "If another Paul comes along who cares about x-risk, they'll have more positive impact by focusing on multi-agent and multi-stakeholder issues or 'ethics' with AI tech than if they focus on intent alignment, because multi-agent and multi-stakeholder dynamics will greatly affect what strategies AI stakeholders 'want' their AI systems to pursue." If they tried to get you to quit working on alignment, I'd say "No, the tech companies still need people working on alignment for them, and Paul is/was one of those people. I don't endorse converting existing alignment researchers to working on multi/multi delegation theory (unless they're naturally interested in it), but if a marginal AI-capabilities-bound researcher comes along, I endorse getting them set up to think about multi/multi delegation more than alignment."
Paul Christiano (11 points, 7d): I'm fine with talking about alignment as a scalar (I think we both agree that it's even messier than a single scalar). But I'm saying:

1. The individual systems in your story could do something different that would be much better for their principals, and they are aware of that fact, but they don't care. That is to say, they are very misaligned.
2. The story is risky precisely to the extent that these systems are misaligned.

The systems in your story aren't maximizing profit in the form of real resources delivered to shareholders (the normal conception of "profit"). Whatever kind of "profit maximization" they are doing does not seem even approximately or myopically aligned with shareholders. I don't think the most obvious "something better to do" is to reduce competitive pressures, it's just to actually benefit shareholders. And indeed the main mystery about your story is why the shareholders get so screwed by the systems that they are delegating to, and how to reconcile that with your view that single-single alignment is going to be a solved problem because of the incentives to solve it.

I think this system is misaligned. Keeping me locally happy with your decisions while drifting further and further from what I really want is a paradigm example of being misaligned, and e.g. it's what would happen if you made zero progress on alignment and deployed existing ML systems in the context you are describing. If I take your stuff and don't give it back when you ask, and the only way to avoid this is to check in every day in a way that prevents me from acting quickly in the world, then I'm misaligned. If I do good things only when you can check while understanding that my actions lead to your death, then I'm misaligned. These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research. (I definitely agree that an aligned system isn't automatically successful at bargaining.)
Paul Christiano (15 points, 7d): Overall, I think I agree with some of the most important high-level claims of the post:

* The world would be better if people could more often reach mutually beneficial deals. We would be more likely to handle challenges that arise, including those that threaten extinction (and including challenges posed by AI, alignment and otherwise). It makes sense to talk about "coordination ability" as a critical causal factor in almost any story about x-risk.
* The development and deployment of AI may provide opportunities for cooperation to become either easier or harder (e.g. through capability differentials, alignment failures, geopolitical disruption, or distinctive features of artificial minds). So it can be worthwhile to do work in AI targeted at making cooperation easier, even and especially for people focused on reducing extinction risks.

I also read the post as implying or suggesting some things I'd disagree with:

* That there is some real sense in which "cooperation itself is the problem." I basically think all of the failure stories will involve some other problem that we would like to cooperate to solve, and we can discuss how well humanity cooperates to solve it (and compare "improve cooperation" to "work directly on the problem" as interventions). In particular, I think the stories in this post would basically be resolved if single-single alignment worked well, and that taking the stories in this post seriously suggests that progress on single-single alignment makes the world better (since evidently people face a tradeoff between single-single alignment and other goals, so that progress on single-single alignment changes what point on that tradeoff curve they will end up at, and since compromising on single-single alignment appears necessary to any of the bad outcomes in this story).
* Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhanceme
Andrew Critch (5 points, 1d): Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on. As for these, some of them are real positions I hold, while some are not: I don't hold that view. The closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment." I don't hold the view you attribute to me here, and I agree wholesale with the following position, including your comparisons of cooperation with brain enhancement and improving belief accuracy: ... with one caveat: some beliefs are self-fulfilling, such as cooperation/defection. There are ways of improving belief accuracy that favor defection, and ways that favor cooperation. Plausibly to me, the ways of improving belief accuracy that favor defection are worse than no accuracy improvement at all. I'm not particularly firm in this view, though; it's more of a hedge. I do hold this view! Particularly the bolded part. I also agree with the bolded parts of your counterpoint, but I think you might be underestimating the value of technical work (e.g., CSC, MARL) directed at improving coordination amongst existing humans and human institutions. I think blockchain tech is a good example of an already-mildly-transformative technology for implementing radically mutually transparent and cooperative strategies through smart contracts. Make no mistake: I'm not claiming blockchain tech is going to "save the world"; rather, it's changing the way people cooperate, and is doing so as a result of a technical insight. I think more technical insights are in order to improve cooperation and/or the global structure of society, and it's worth spending research efforts to find them. Reminder: this is not a bid for you personally to quit working on alignment! My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.
In practice
Paul Christiano (6 points, 21h): Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel improvements in more general capabilities typically look less cost-effective for alignment in particular, but benefit a ton from the diversity of problems they help address.) I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through. I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

> I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards is covered under NDAs), so I also think it's good to leave this...

Servant of Many Masters: Shifting priorities in Pareto-optimal sequential decision-making (Andrew Critch and Stuart Russell)

Summary

A policy (over some partially observable Markov decision process (POMDP)) is Pareto optimal with respect to two agents with different utility functions if it is not possible to construct a policy that achieves higher utility for one of the agents without doing worse for the other agent. A result by Harsanyi shows that for agents that have the same beliefs, Pareto optimal policies act as if they are maximizing some weighted sum of the two agents' utility functions. However, what if the agents have different beliefs?
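Before turning to that question, Harsanyi's weighted-sum characterization in the shared-beliefs case can be illustrated with a toy one-shot decision (the numbers here are my own illustration, not from the paper): every Pareto-optimal action maximizes w·u1 + (1−w)·u2 for some weight w, while a Pareto-dominated action never does.

```python
# Toy check of the Harsanyi-style weighted-sum characterization
# (shared beliefs, one-shot decision). Actions map to (u1, u2) payoffs.
actions = {"a1": (3, 0), "a2": (0, 3), "a3": (2, 2), "a4": (1, 1)}

# Sweep weights w in [0, 1] and record which action maximizes
# the weighted sum w*u1 + (1-w)*u2.
winners = set()
for i in range(101):
    w = i / 100
    best = max(actions, key=lambda a: w * actions[a][0] + (1 - w) * actions[a][1])
    winners.add(best)

# a1, a2, a3 lie on the Pareto frontier and each wins for some w;
# a4 is Pareto-dominated by a3 and never maximizes any weighted sum.
assert winners == {"a1", "a2", "a3"}
assert "a4" not in winners
```

Sweeping w traces out exactly the Pareto frontier, matching the summary's claim that (under shared beliefs) Pareto-optimal policies act as if maximizing some fixed weighted sum of the agents' utilities.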

Interestingly, if two agents disagree about the world, it is possible to construct policies that are better for both...
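The mechanism can be seen in a minimal two-outcome example (my own toy numbers, not taken from the paper): when the agents disagree about a coin, a policy that conditions on the outcome looks strictly better to both of them, under each agent's own beliefs, than any fixed even split.

```python
# Two agents disagree about a coin. Agent 1 thinks heads is very likely;
# agent 2 thinks tails is very likely.
p1_heads = 0.9  # agent 1's belief P(heads)
p2_heads = 0.1  # agent 2's belief P(heads)

# "Compromise" policy: split the surplus evenly regardless of outcome,
# giving each agent utility 0.5.
compromise = (0.5, 0.5)

# "Bet" policy: favor agent 1 (utility 1 for agent 1, 0 for agent 2)
# on heads, and favor agent 2 on tails. Each agent evaluates the policy
# with their OWN beliefs:
eu1_bet = p1_heads * 1 + (1 - p1_heads) * 0  # agent 1's expected utility
eu2_bet = p2_heads * 0 + (1 - p2_heads) * 1  # agent 2's expected utility

# Both agents expect 0.9 from the bet vs 0.5 from the compromise,
# so the outcome-contingent policy is better for both, by their own lights.
assert eu1_bet > compromise[0] and eu2_bet > compromise[1]
```

This is the classic "betting on disagreements" effect: differing beliefs let a Pareto-optimal policy shift resources toward whichever agent's prediction comes true.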

Thanks for doing this! I'm excited to see this sequence grow, it's the sort of thing that could serve the function of a journal or textbook.

Special thanks to Kate Woolverton for comments and feedback.

There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios, which I do think are important variables, but in my opinion ones which are relatively less important when compared to many other axes on which different takeoff scenarios could differ.

In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get...

4Rohin Shah19hNot sure why I didn't respond to this, sorry. I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world". I don't see this claim as particularly relevant to predicting the future.

OK, thanks. YMMV but some people I've read / talked to seem to think that before we have successful world-takeover attempts, we'll have unsuccessful ones--"sordid stumbles." If this is true, it's good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.

A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It's plausible to me that we'll get stuff like that before it's t... (read more)

Introduction

This review is part of a project with Joe Collman and Jérémy Perret to try to get as close as possible to peer review when giving feedback on the Alignment Forum. Our reasons behind this endeavor are detailed in our original post asking for suggestions of works to review; but the gist is that we hope to bring further clarity to the following questions:

  • How much low-hanging fruit, in terms of feedback, can be plucked by getting into a review mindset and treating the review as part of one’s job?
  • Given the disparate state of research in AI Alignment, is it possible for any researcher to give useful feedback on any other research work in the field?
  • What sort of reviews are useful for AI Alignment research?

Instead of thinking about...

3Joe_Collman2dTaking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument". Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range. On (1), I agree that the same goes for pretty-much any argument: that's why it's important. If you update without factoring in (some approximation of) your best judgement of the evidence's impact on all hypotheses, you're going to get the wrong answer. This will depend highly on your underlying model. On the information content of the post, I'd say it's something like "12 OOMs is probably enough (without things needing to scale surprisingly well)". My credence for low OOM values is mostly based on worlds where things scale surprisingly well. I don't think this is weird. What matters isn't what the post talks about directly - it's the impact of the evidence provided on the various hypotheses. There's nothing inherently weird about evidence increasing our credence in [TAI by +10OOM] and leaving our credence in [TAI by +3OOM] almost unaltered (quite plausibly because it's not too relevant to the +3OOM case). Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It's only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn't need to talk about x to do this. You can do the same thing with the TAI first at k OOM case - call that Tk. Let's say that your post is our evidence e and that e+ stands for [e gives a compelling argument against T13+]. 
Updating on e+ you get the following
3Daniel Kokotajlo1dWait, shouldn't it be the ratio p[Tk & e+] : p[Tk & e-]? Maybe both ratios work fine for our purposes, but I certainly find it more natural to think in terms of &.
3Joe_Collman1dUnless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+)) Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.

I'm probably just mathematically confused myself; at any rate, I'll proceed with the p[Tk & e+] : p[Tk & e-] version since that comes more naturally to me. (I think of it like: Your credence in Tk is split between two buckets, the Tk&e+ and Tk&e- bucket, and then when you update you rule out the e- bucket. So what matters is the ratio between the buckets; if it's relatively high (compared to the ratio for other Tx's) your credence in Tk goes up, if it's relatively low it goes down.)

Anyhow, I totally agree tha... (read more)
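The "two buckets" picture can be sketched numerically (illustrative credences only, not anyone's actual numbers): each hypothesis Tk splits its prior between a Tk&e+ and a Tk&e- bucket, and updating on e+ discards every e- bucket and renormalises what is left.

```python
# Illustrative "two buckets" update (made-up credences): a hypothesis gains
# credence exactly when its e+ : e- ratio beats the field's average ratio.

priors = {                    # (p(Tk & e+), p(Tk & e-))
    "T_low":  (0.05, 0.15),   # low-OOM range: most of its mass sits in e-
    "T_mid":  (0.25, 0.05),   # mid range: mass concentrated in e+
    "T_high": (0.20, 0.30),
}

p_e_plus = sum(ep for ep, _ in priors.values())            # total e+ mass
posteriors = {k: ep / p_e_plus for k, (ep, _) in priors.items()}

for k, (ep, em) in priors.items():
    print(k, "prior:", ep + em, "posterior:", round(posteriors[k], 3))
```

With these numbers T_mid rises (0.30 to 0.50) while T_low and T_high fall, illustrating how the same evidence can move ranges by different proportions.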

This post has benefited greatly from discussion with Sam Eisenstat, Caspar Oesterheld, and Daniel Kokotajlo.

Last year, I wrote a post claiming there was a Dutch Book against CDTs whose counterfactual expectations differ from EDT. However, the argument was a bit fuzzy.

I recently came up with a variation on the argument which gets around some problems; I present this more rigorous version here.

Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sense.

"EDT", on the other hand, refers to the use of conditional probability to evaluate expected value of actions.

Put more mathematically, for action a, EDT uses E[U | a], and CDT uses E[U | do(a)]. I'll write edt(a) and cdt(a)...
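To make the edt/cdt contrast concrete, here is a toy computation (my own example, not from the post): a smoking-lesion-style problem in which conditioning on the action and intervening on it rank the actions oppositely.

```python
# Toy confounded decision problem (my own illustration, not from the post):
# a hidden state s ("lesion") makes action a=1 more likely AND lowers payoff,
# so conditioning on the action and intervening on it disagree.

P_S = {0: 0.5, 1: 0.5}              # prior over the hidden state
P_A1_GIVEN_S = {0: 0.1, 1: 0.9}     # confounding: s=1 pushes toward a=1
UTILITY = {                         # U(s, a): a=1 adds +1 in every state
    (0, 0): 10, (0, 1): 11,
    (1, 0): 0,  (1, 1): 1,
}

def p_a_given_s(a, s):
    return P_A1_GIVEN_S[s] if a == 1 else 1 - P_A1_GIVEN_S[s]

def edt(a):
    """E[U | A=a]: condition on the action, updating beliefs about s."""
    z = sum(P_S[s] * p_a_given_s(a, s) for s in P_S)
    return sum(P_S[s] * p_a_given_s(a, s) / z * UTILITY[(s, a)] for s in P_S)

def cdt(a):
    """E[U | do(A=a)]: intervene, so s keeps its prior distribution."""
    return sum(P_S[s] * UTILITY[(s, a)] for s in P_S)

print(edt(0), edt(1))  # a=1 looks worse evidentially: it signals the lesion
print(cdt(0), cdt(1))  # a=1 dominates causally: it adds +1 in every state
```

Here edt prefers a=0 while cdt prefers a=1, which is exactly the kind of gap the Dutch-book argument targets.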

7Alex Mennen3dI think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work. For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price 2d if cdt(a) ≠ edt(a), even if a is not the optimal action (let's say b is the optimal action). When the time comes to take an action, the agent's best bet is b′ (prime meaning sell the contract for price d). The way I described the set-up, the agent doesn't choose between a and a′, because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases. We could modify the epsilon exploration assumption so that the agent also chooses between a and a′ even while its top choice is b′. That is, there's a lower bound on the probability with which the agent takes an action in {a, a′}, but even if that bound is achieved, the agent still has some flexibility in distributing probability between a and a′. In this case, contrary to your argument, the agent will prefer a rather than a′, i.e., it will not get Dutch booked. This is because the agent is still choosing b′ as the only action with high probability, and cdt(a) refers to the expected consequence of the agent choosing a as its intended action, so the agent cannot use cdt(a) when calculating which of a or a′ is better to pick as its next choice if its attempt to implement intended action b′ fails. 
Another source of uncertain
2Abram Demski1dOK, here's my position. As I said in the post, the real answer is that this argument simply does not apply if the agent knows its action. More generally: the argument applies precisely to those actions to which the agent ascribes positive probability (directly before deciding). So, it is possible for agents to maintain a difference between counterfactual and evidential expectations. However, I think it's rarely normatively correct for an agent to be in such a position. Even though the decision procedure of CDT is deterministic, this does not mean that agents described by CDT know what they will do in the future. We can think of this in terms of logical induction: the market is not 100% certain of its own beliefs, and in particular, doesn't typically know precisely what the maximum-expectation-action is. One way of seeing the importance of this is to point out that CDT is a normative theory, not a descriptive one. CDT is supposed to tell you what arbitrary agents should do. The recommendations are supposed to apply even to, say, epsilon-exploring agents (who are not described by CDT, strictly speaking). But here we see that CDT recommends being dutch-booked! Therefore, CDT is not a very good normative theory, at least for epsilon-explorers. (So I'm addressing your epsilon-exploration example by differentiating between the agent's algorithm and the CDT decision theory. The agent isn't dutch-booked, but CDT recommends a dutch book.) Granted, we could argue via dutch book that agents should know their own actions, if those actions are deterministic consequences of a known agent-architecture. However, theories of logical uncertainty tell us that this is not (always) realistic. In particular, we can adapt the bounded-resource-dutch-book idea from logical induction. According to this idea, some dutch-book-ability is OK, but agents should not be boundlessly exploitable by resource-bounded bookies. 
This idea leads me to think that efficiently computable sequences of act
2Abram Demski2dI thought about these things in writing this, but I'll have to think about them again before making a full reply. Another similar scenario would be: we assume the probability of an action is small if it's sub-optimal, but smaller the worse it is.
6tailcalled4dWouldn't it be P(Act=a|do(buy B)) rather than P(Act=a)? Like my thought would be that the logical thing for CDT would be to buy the contract and then as a result its expected utilities change, which leads to its probabilities changing, and as a result it doesn't want to sell the contract. I'd think this argument only puts a bound on how much cdt and edt can differ, rather than on whether they can differ at all. Very possible I'm missing something though.
2Abram Demski2dI agree with this, but I was assuming the CDT agent doesn't think buying B will influence the later decision. This, again, seems plausible if the payoff is made sufficiently small. I believe that there are some other points in my proof which make similar assumptions, which would ideally be made clearer in a more formal write-up. However, I think CDT advocates will not generally take this to be a sticking point. The structure of my argument is to take a pre-existing scenario, and then add bets. For my argument to work, the bets need to be "independent" of critical things (causally and/or evidentially independent) -- in the example you point out, the action taken later needs to be causally independent of the bet made earlier (more specifically, causal-conditioning on the bet should not change beliefs about what action will be taken). This is actually very similar to traditional Dutch-book arguments, which treat the bets as totally independent of everything. I could argue that it's just part of the thought experiment; if you concede that there could be a scenario like that, then you concede that CDT gets dutch-booked. If you don't buy that, but you do buy Dutch Books as a methodology more generally, then I think you have to claim there's some rule which forbids "situations like this" (so CDT has to think the bets are not independent of everything else, in such a way as to spoil my argument). I would be very interested if you could propose a sensible view like this. However, I think not: there doesn't seem to be anything about the scenario which violates some principle of causality or rationality. If you forbid scenarios like this, you seem to be forbidding a very reasonable scenario, for no good reason (other than to save CDT).
0tailcalled2dHow do you make the payoff small? Isn't your Dutch-book argument more recursive than standard ones? Your contract only pays out if you act, so the value of the dutch book causally depends on the action you choose.
2Abram Demski2dSure, do you think that's a concern? I was noting the similarity in this particular respect (pretending that bets are independent of everything), not in all respects. Note, in particular, that traditional dutch book arguments make no explicit assumption one way or the other about whether the propositions have to do with actions under the agent's control. So I see two possible interpretations of traditional Dutch books: 1. They apply to "recursive" stuff, such as things you have some influence over. For example, I can bet on a presidential election, even though I can also vote in a presidential election. In this case, what we have here is not weirder. This is the position I prefer. 2. They can't apply to "recursive" stuff. In this case, presumably we don't think standard probability theory applies to stuff we have influence over. This could be a respectable position, and I've seen it discussed. However, I don't buy it. I've seen philosophers answer this kind of thing with the following argument: what if you had a little imp on your shoulder, who didn't influence you in any way but who watched you and formed predictions? The imp could have probabilistic beliefs about your actions. The standard dutch book arguments would apply to the imp. Why should you be in such a different position from the imp? For example, multiply the contract payoff by 0.001. Think of it this way. Making bets about your actions (or things influenced by your actions) can change your behavior. But if you keep the bets small enough, then you shouldn't change your behavior; the bets are less important than other issues. (Unless two actions are exactly tied, in terms of other issues.) I will concede that this isn't 100% convincing. Perhaps different laws of probability should apply to actions we can influence. OTOH, I'm not sure what laws those would be.
0tailcalled1dI disagree, I don't think it's a simple binary thing. I don't think Dutch book arguments in general never apply to recursive things, but it's more just that the recursion needs to be modelled in some way, and since your OP didn't do that, I ended up finding the argument confusing. I don't think your argument goes through for the imp, since it never needs to decide its action, and therefore the second part of selling the contract back never comes up? Hmm, on further reflection, I had an effect in mind which doesn't necessarily break your argument, but which increases the degree to which other counterarguments such as AlexMennen's break your argument. This effect isn't necessarily solved by multiplying the contract payoff (since decisions aren't necessarily continuous as a function of utilities), but it may under many circumstances be approximately solved by it. So maybe it doesn't matter so much, at least until AlexMennen's points are addressed so I can see where it fits in with that.


Replied.

2Abram Demski1dBut what does that look like? How should it make a difference? (This isn't a rhetorical question; I would be interested in a positive position. My lack of interest is, significantly, due to a lack of positive positions in this direction.) Ah, true, but the imp will necessarily just make EDT-type predictions anyway. So the imp argument reaches a similar conclusion. But I'm not claiming the imp argument is very strong in any case, it's just an intuition pump.

I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.

So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).

Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...

6Ben Pace5dWow that's exciting! Very interesting that you think that.
3Abram Demski2dNow I feel like I should have phrased it more modestly, since it's really "settled modulo math working out", even though I feel fairly confident some version of the math should work out.
11Vanessa Kosoy5dI only skimmed this post for now, but a few quick comments on links to infra-Bayesianism: It's true that these questions still need work, but I think it's rather clear that something like "there are no traps" is a sufficient condition for learnability. For example, if you have a finite set of "episodic" hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al [https://arxiv.org/abs/2010.15020] which is formulated in the language of game theory, but can be regarded as an infra-Bayesian regret bound for infra-MDPs. True, but IMO the way to incorporate "radical probabilism" is via what I called Turing RL. I'm not sure what precisely you mean by "CDT vs EDT insight" but our latest post [https://www.lesswrong.com/posts/GS5P7LLLbSSExb3Sk/the-many-faces-of-infra-beliefs] might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations and actions, EDT-style. Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?
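A minimal sketch of the Exp3-over-hypotheses idea mentioned above (this is the standard Exp3 update; the episode and reward structure is invented purely for illustration, not taken from any of the linked work):

```python
import math
import random

# Standard Exp3 (adversarial bandit), treating each episodic hypothesis as
# an arm: in every episode the learner commits to one hypothesis and earns
# whatever reward acting on that hypothesis yields.

def exp3(n_arms, n_rounds, reward_fn, gamma=0.1, seed=0):
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(n_rounds):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(arm, t)                       # reward in [0, 1]
        total += r
        # importance-weighted update: only the pulled arm is credited
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / n_arms)
    return total / n_rounds

# Suppose hypothesis 2 is (unknowingly) the right one to act on each episode.
avg = exp3(n_arms=4, n_rounds=5000,
           reward_fn=lambda arm, t: 0.9 if arm == 2 else 0.1)
print(avg)  # well above the 0.3 a uniformly random learner would earn
```

Because no state carries over between episodes, a bad episode is not a trap, which is what makes this bandit-style reduction work.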
5Abram Demski5d"Respect logic" means either (a) assigning probability one to tautologies (at least, to those which can be proved in some bounded proof-length, or something along those lines), or, (b) assigning probability zero to contradictions (again, modulo boundedness). These two properties should be basically equivalent (ie, imply each other) provided the proof system is consistent. If it's inconsistent, they imply different failure modes. My contention isn't that infra-bayes could fail due to not respecting logic. Rather, it's simply not obvious whether/how it's possible to make an interesting troll bridge problem for something which doesn't respect logic. EG, the example I mentioned of a typical RL agent -- the obvious way to "translate" Troll Bridge to typical RL is for the troll to blow up the bridge if and only if the agent takes an exploration step. But, this isn't sufficiently like the original Troll Bridge problem to be very interesting. By no means do I mean to indicate that there's an argument that agents have to "respect logic" buried somewhere in this write-up (or the original troll-bridge writeup, or my more recent explanation of troll bridge, or any other posts which I linked). If I want to argue such a thing, I'd have to do so separately. And, in fact, I don't think I want to argue that an agent is defective if it doesn't "respect logic". I don't think I can pull out a decision problem it'll do poorly on, or such. I a little bit want to argue that a decision theory is less revealing if it doesn't represent an agent as respecting logic, because I tend to think logical reasoning is an important part of an agent's rationality. EG, a highly capable general-purpose RL agent should be interpretable as using logical reasoning internally, even if we can't see that in the RL algorithm which gave rise to it. 
(In which case you might want to ask how the RL agent avoids the troll-bridge problem, even though the RL algorithm itself doesn't seem to give rise to any inter
2Vanessa Kosoy4dI guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result. Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

I agree that radical probabilism can be thought of as bayesian-with-a-side-channel, but it's nice to have a more general characterization where the side channel is black-box, rather than an explicit side-channel which we e... (read more)
