
Recent Discussion

With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept.  Thanks also for comments from Ramana Kumar.

Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.

Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.

This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.  Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.

                    Unipolar take-offs    Multipolar take-offs
Slow take-offs      <not this post>       Part 1 of this post
Fast take-offs      <not this post>       ...
1Sammy Martin11hGreat post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed.
The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not like there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations can be weakly modelled this way and individual humans are fully agentive, but Transformative AI will bring up a whole spectrum of more and less agentive things that will fill up the rest of this spectrum.
There is a sense in which, if the outcome is something catastrophic, there must have been misalignment, and if there was misalignment then in some sense at least some individual agents were misaligned. Specifically, the systems in your Production Web weren't intent-aligned because they weren't doing what we wanted them to do, and were at least partly deceiving us. Assuming this is the case, 'multipolar failure' requires some subset of intent misalignment. But it's a special subset because it involves different kinds of failures to the ones we normally talk about.
It seems like you're identifying some dimensions of intent alignment as those most likely to be neglected because they're the hardest to catch, or because there will be economic incentives to ensure AI isn't aligned in that way, rather than saying that there is some sense in which the transformative AI in the production web scenario is 'fully aligned' but still produces an existential catastrophe.
I think that the difference between your Production Web and Paul Christiano's subtle creeping Outer Alignment failure scenario [https://www.lesswrong.com/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story] is just semantic - you say that the AIs in
2Raymond Arnold21hCurated. I appreciated this post for a combination of:
* laying out several concrete stories about how AI could lead to human extinction
* laying out a frame for how to think about those stories (while acknowledging other frames one could apply to the story)
* linking to a variety of research, with more thoughts on what sort of further research might be helpful.
I also wanted to highlight this section: Which is a thing I think I once heard Critch talk about, but which I don't think had been discussed much on LessWrong, and which I'd be interested in seeing more thoughts and distillation of.
13Paul Christiano4dOverall, I think I agree with some of the most important high-level claims of the post:
* The world would be better if people could more often reach mutually beneficial deals. We would be more likely to handle challenges that arise, including those that threaten extinction (and including challenges posed by AI, alignment and otherwise). It makes sense to talk about "coordination ability" as a critical causal factor in almost any story about x-risk.
* The development and deployment of AI may provide opportunities for cooperation to become either easier or harder (e.g. through capability differentials, alignment failures, geopolitical disruption, or distinctive features of artificial minds). So it can be worthwhile to do work in AI targeted at making cooperation easier, even and especially for people focused on reducing extinction risks.
I also read the post as implying or suggesting some things I'd disagree with:
* That there is some real sense in which "cooperation itself is the problem." I basically think all of the failure stories will involve some other problem that we would like to cooperate to solve, and we can discuss how well humanity cooperates to solve it (and compare "improve cooperation" to "work directly on the problem" as interventions). In particular, I think the stories in this post would basically be resolved if single-single alignment worked well, and that taking the stories in this post seriously suggests that progress on single-single alignment makes the world better (since evidently people face a tradeoff between single-single alignment and other goals, so that progress on single-single alignment changes what point on that tradeoff curve they will end up at, and since compromising on single-single alignment appears necessary to any of the bad outcomes in this story).
* Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhanceme
5Jonathan Uesato5dThanks for the great post. I found this collection of stories and framings very insightful.
1. Strong +1 to "Problems before solutions." I'm much more focused when reading this story (or any threat model) on "do I find this story plausible and compelling?" (which is already a tremendously high bar) before even starting to get into "how would this update my research priorities?"
2. I wanted to add a mention of Katja Grace's "Misalignment and Misuse [https://aiimpacts.org/misalignment-and-misuse-whose-values-are-manifest/]" as another example discussing how single-single alignment problems and bargaining failures can blur together and exacerbate each other. The whole post is really short, but I'll quote anyways: In the post's story, both "misalignment" and "misuse" seem like two different, both valid, frames on the problem.
3. I liked the way this point is phrased, treating agent-agnostic and agent-centric (single-single alignment-focused) approaches as complementary. At one extreme end, in the world where we could agree on what constitutes an acceptable level of xrisk, and could agree to not build AI systems which exceed this level, and give ourselves enough time to figure out the alignment issues in advance, we'd be fine! (We would still need to do the work of actually figuring out a bunch of difficult technical and philosophical questions, but importantly, we would have the time and space to do this work.) To the extent we can't do this, what are the RAAPs, such as intense competition, which prevent us from doing so? And at the other extreme, if we develop really satisfying solutions to alignment, we also shouldn't end up in worlds where we have "little human insight" or factories "so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating." I think Paul often makes this point in the context of discussing an alignment tax. We can both decrease the size of the tax, and make the tax more appealing/more easily enfo
15Vanessa Kosoy6dI don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.
I do see two reasons why multipolar scenarios might require more technical research:
1. Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
2. In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability[1].
In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem).
However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I'm confused.
--------------------------------------------------------------------------------
1. In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs.
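(A toy illustration of point 1 above: two systems, each faithfully maximizing its own user's payoff, can still settle into a Pareto-inefficient outcome. The game, names, and payoff numbers below are invented for illustration, not taken from the comment.)

```python
# One-shot prisoner's-dilemma-style interaction between two AIs, each
# aligned to a different user. Payoffs are (user_A, user_B); toy numbers.
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}
actions = ["cooperate", "defect"]

def best_response_A(b):
    # AI A picks whatever maximizes user A's payoff, holding B's action fixed.
    return max(actions, key=lambda a: payoffs[(a, b)][0])

def best_response_B(a):
    # AI B does the same for user B.
    return max(actions, key=lambda b: payoffs[(a, b)][1])

# "defect" is each AI's best response to anything the other might do...
assert all(best_response_A(b) == "defect" for b in actions)
assert all(best_response_B(a) == "defect" for a in actions)
# ...so the equilibrium is (1, 1) even though (3, 3) was available: a
# tragedy of the commons between two individually well-aligned systems.
print(payoffs[("defect", "defect")])
```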
4Andrew Critch5dI don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending? I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment research" (though I'll respect your decision to call all these topics "alignment" if you consider them that). I would say, and I did say: I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don't consider this a "new kind of research", only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it's uncommon to take a strong position of the form "X is necessary/important/neglected for human survival" without also saying "X is a fundamentally new type of thinking that no one has done before", but that is indeed my stance for X∈{a variety of non-alignment AI research areas [https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1] }.
2Vanessa Kosoy5dFrom your reply to Paul, I understand your argument to be something like the following:
1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
2. If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
4. Given the technical knowledge to design cooperative AI, the incentives are in favor of cooperative AI, since cooperative AIs can come out ahead by striking mutually-beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
5. We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn't have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).
I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to "just" solving single-single alignment.
2Andrew Critch6dHow are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made? It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator. I can then reply with how I envision that decision being made even with high single-agent alignment. Yes, this^.
16Paul Christiano5dI also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.
Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders? I guess this is probably just a gloss you are putting on the combined behavior of multiple systems, but you kind of take it as given rather than highlighting it as a serious bargaining failure amongst the machines, and more importantly you don't really say how or why this would happen. How is this goal concretely implemented, if none of the agents care about it? How exactly does the terminal goal of benefiting shareholders disappear, if all of the machines involved have that goal? Why does e.g. an individual firm lose control of its resources such that it can no longer distribute them to shareholders?
The implicit argument seems to apply just as well to humans trading with each other and I'm not sure why the story is different if we replace the humans with aligned AI. Such humans will tend to produce a lot, and the ones who produce more will be more influential. Maybe you think we are already losing sight of our basic goals and collectively pursuing alien goals, whereas I think we are just making a lot of stuff instrumentally which is mostly ultimately turning into stuff humans want (indeed I think we are mostly making too little stuff).
This sounds like directly saying that firms are misaligned. I guess you are saying that individual AI systems within the firm are aligned, but the firm collectively is somehow misaligned? But not much is said about how or why that happens. It says things like: But an aligned firm will also be fully-automated, will participate in this network of trades, will produce at approximately maximal efficiency, and so on. Where does the aligned firm end up using its resources in a way that's incompatibl
9Andrew Critch5dIt seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean. (I sometimes use "misaligned" as a boolean due to it being easier for people to agree on what is "misaligned" than what is "aligned".)
In general, I think it's very unsafe to pretend numbers that are very close to 1 are exactly 1, because e.g., 1^(10^6) = 1 whereas 0.9999^(10^6) very much isn't 1, and the way you use the word "aligned" seems unsafe to me in this way. (Perhaps you believe in some kind of basin of convergence around perfect alignment that causes sufficiently-well-aligned systems to converge on perfect alignment, in which case it might make sense to use "aligned" to mean "inside the convergence basin of perfect alignment". However, I'm both dubious of the width of that basin, and dubious that its definition is adequately social-context-independent [e.g., independent of the bargaining stances of other stakeholders], so I'm back to not really believing in a useful Boolean notion of alignment, only scalar alignment.)
In any case, I agree profit maximization is not a perfectly aligned goal for a company; however, it is a myopically pursued goal in a tragedy of the commons resulting from a failure to agree (as you point out) on something better to do (e.g., reducing competitive pressures to maximize profits). I agree that it is a bargaining failure if everyone ends up participating in a system that everyone thinks is bad; I thought that would be an obvious reading of the stories, but apparently it wasn't! Sorry about that. I meant to indicate this with the pointers to
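(For concreteness, the compounding arithmetic in the comment above, checked directly; nothing here beyond the numbers already quoted in it:)

```python
# 1 raised to any power stays exactly 1, but a value slightly below 1
# decays to essentially nothing after a million compounding steps.
print(1.0 ** 1_000_000)     # 1.0
print(0.9999 ** 1_000_000)  # ~3.7e-44, i.e. very much not 1
```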
3Paul Christiano4dQuantitatively I think that entities without instrumental resources lose very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seem quite low (and the high-growth period during which they have to be maintained is quite short). Also, typically taxes transfer way more than that much value from high-savers to low-savers. It's not clear to me what's happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower and it is dwarfed by the typical rate of expropriation from war.
From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don't know exactly what your view on this is.
If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I'm quite skeptical about most of the particular kinds of work you advocate). If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else).
Could you explain the advantage you are imagining? Some candidates, none of which I think are your view:
* Single-single alignment failures---e.g. it's easier to build a widget-maximizing corpora
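(A rough check of the savings-rate example at the start of this comment, under one deliberately crude model; the model and the per-period doubling assumption are mine, and only the 99%/95%/10,000x figures come from the comment:)

```python
# Toy model: each period, wealth is multiplied by (savings rate * gross
# return R), with R chosen so the aggregate economy doubles per period.
R = 2 / 0.99
my_wealth, total_wealth = 1.0, 1.0
while total_wealth < 10_000:
    my_wealth *= 0.95 * R      # my savings rate: 95%
    total_wealth *= 0.99 * R   # average savings rate: 99%
print(my_wealth / total_wealth)  # ~0.56: the lower saver's share falls by roughly half
```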
3Paul Christiano4dI think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders. (In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)
6Paul Christiano4dIn your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems." My position is:
* Eventually people will work on these problems, but right now they are not working on them very much and so a few people can be a big proportional difference.
* If there is going to be a huge investment in the future, then early investment and training can effectively be very leveraged. Scaling up fields extremely quickly is really difficult for a bunch of reasons.
* It seems like AI progress may be quite fast, such that it will be extra hard to solve these problems just-in-time if we don't have any idea what we are doing in advance.
* On top of all that, for many use cases people will actually be reasonably happy with misaligned systems like those in your story (that e.g. appear to be doing a good job, keep the board happy, perform well as evaluated by the best human-legible audits...). So it seems like commercial incentives may not push us to safe levels of alignment.
5Paul Christiano4dI'm fine with talking about alignment as a scalar (I think we both agree that it's even messier than a single scalar). But I'm saying:
1. The individual systems in your story could do something different that would be much better for their principals, and they are aware of that fact, but they don't care. That is to say, they are very misaligned.
2. The story is risky precisely to the extent that these systems are misaligned.
The systems in your story aren't maximizing profit in the form of real resources delivered to shareholders (the normal conception of "profit"). Whatever kind of "profit maximization" they are doing does not seem even approximately or myopically aligned with shareholders. I don't think the most obvious "something better to do" is to reduce competitive pressures, it's just to actually benefit shareholders. And indeed the main mystery about your story is why the shareholders get so screwed by the systems that they are delegating to, and how to reconcile that with your view that single-single alignment is going to be a solved problem because of the incentives to solve it.
I think this system is misaligned. Keeping me locally happy with your decisions while drifting further and further from what I really want is a paradigm example of being misaligned, and e.g. it's what would happen if you made zero progress on alignment and deployed existing ML systems in the context you are describing. If I take your stuff and don't give it back when you ask, and the only way to avoid this is to check in every day in a way that prevents me from acting quickly in the world, then I'm misaligned. If I do good things only when you can check while understanding that my actions lead to your death, then I'm misaligned. These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research. (I definitely agree that an aligned system isn't automatically successful at bargaining.)

Meta

This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I’m going to go through some salient ways you could vary the story.

This isn’t intended to be a particularly great story (and it’s pretty informal). I’m still trying to think through what I expect to happen if alignment turns out to be hard, and this is more like the most recent entry in a long journey of gradually-improving stories.

I wrote this up a few months ago and was reminded to post...

5Wei Dai8hThe ending of the story feels implausible to me, because there's a lack of explanation of why the story doesn't side-track onto some other seemingly more likely failure mode first. (Now that I've re-read the last part of your post, it seems like you've had similar thoughts already, but I'll write mine down anyway. Also it occurs to me that perhaps I'm not the target audience of the story.) For example:
1. In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)
2. Why don't AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an ad hoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each others' ideas? (My expectation is that we end up with one or multiple sequences of "improved" alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or have some other problem that is much subtler than the kind of outer alignment failure described here.)

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content).

Why don't AI safety researchers try to leverage AI t

...
8Daniel Kokotajlo2dThanks for this, this is awesome! I'm hopeful that in the next few years there will be a collection of stories like this. I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for societal competence and inner alignment, which seem to me to be very important parts of the overall problem. Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence? Some other threads to pull on:
--In this story, there aren't any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it's more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven't had major war for seventy years, and maybe that's because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff? IDK, I worry that the reasons why we haven't had war for seventy years may be largely luck / observer selection effects, and also separately even if that's wrong, I worry that the reasons won't persist through takeoff (e.g. some factions may develop ways to shoot down ICBMs, or prevent their launch in the first place, or may not care so much if there is nuclear winter)
--Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on "under the hood" so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future? Why aren't they fighting each other as well as the humans? Or maybe they do fight each other but you didn't focus on that aspect of the story because it's less relevant to us?
--Yeah, society will very likely not be that competent IMO. I think that's the biggest implausi
5Andrew Critch3dPaul, thanks for writing this; it's very much in line with the kind of future I'm most worried about. For me, it would be super helpful if you could pepper throughout the story mentions of the term "outer alignment" indicating which events-in-particular you consider outer alignment failures. Is there any chance you could edit it to add in such mentions? E.g., I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).
4Paul Christiano3dI'd say that every single machine in the story is misaligned, so hopefully that makes it easy :) I'm basically always talking about intent alignment, as described in this post [https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6]. (I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)
5Raymond Arnold4dThere's a lot of intellectual meat in this story that's interesting. But, my first comment was: "I'm finding myself surprisingly impressed about some aesthetic/stylistic choices here, which I'm surprised I haven't seen before in AI Takeoff Fiction."
In normal English phrasing across multiple paragraphs, there's a sort of rise-and-fall of tension. You establish a minor conflict, confusion, or an open loop of curiosity [https://www.lesswrong.com/posts/sjbp8qfuxbnFmvXGk/open-loops-in-fiction], and then something happens that resolves it a bit. This isn't just about the content of 'what happens', but also what sort of phrasing one uses. In verbal audio storytelling, this often is accompanied with the pitch of your voice rising and falling.
And this story... even more so than Accelerando or other similar works, somehow gave me this consistent metaphorical vibe of "rising pitch". Like, some club music where it keeps sounding like the bass is about to drop, but instead it just keeps rising and rising. Something about most of the paragraph structures feel like they're supposed to be the first half of a two-paragraph-long-clause, and then instead... another first half of a clause happens, and another.
And this was incredibly appropriate for what the story was trying to do. I dunno how intentional any of that was but I quite appreciated it, and am kinda in awe and boggled at what precisely created the effect – I don't think I'd be able to do it on purpose myself without a lot of study and thought.

This is independent research. To make it possible for me to continue writing posts like this, please consider supporting me.

Many thanks to Professor Littman for reviewing a draft of this post.


Yesterday, at a seminar organized by The Center for Human-compatible AI (CHAI), Professor Michael Littman gave a presentation entitled "The HCI of HAI", or "The Human Computer Interaction of Human-compatible Artificial Intelligence". Professor Littman is a computer science professor at Brown who has done foundational work in reinforcement learning as well as many other areas of computer science. It was a very interesting presentation and I would like to reflect a little on what was said.

The basic question Michael addressed was: "how do we get machines to do what we want?" and his talk was structured around...

2Charlie Steiner4dEven when talking about how humans shouldn't always be thought of as having some "true goal" that we just need to communicate, it's so difficult to avoid talking in that way :) We naturally phrase alignment as alignment to something - and if it's not humans, well, it must be "alignment with something bigger than humans." We don't have the words to be more specific than "good" or "good for humans," without jumping straight back to aligning outcomes to something specific like "the goals endorsed by humans under reflective equilibrium" or whatever. We need a good linguistic-science fiction story about a language with no such issues.

Yes, I agree, it's difficult to find explicit and specific language for what it is that we would really like to align AI systems with. Thank you for the reply. I would love to read such a story!

1Alex Flint6dThank you for the kind words.
Well it would definitely be a mistake to build an AI system that extracts human intentions at some fixed point in time and treats them as fixed forever, yes? So it seems to me that it would be better to build systems predicated on that which is the underlying generator of the trajectory of human intentions. When I say "something bigger that human's intentions should be aligned with" I don't mean "physically bigger", I mean "prior to" or "the cause of".
For example, the work concerning corrigibility is about building AI systems that can be modified later, yes? But why is it good to have AI systems that can be modified later? I would say that the implicit claim underlying corrigibility research is that we believe humans have the capacity to, over time, slowly and with many detours, align our own intentions with that which is actually good. So we believe that if we align AI systems with human intentions in a way that is not locked in, then we will be aligning AI systems with that which is actually good. I'm not claiming this is true, just that this is a premise of corrigibility being good.
Another way of looking at it: Suppose we look at a whole universe with a single human embedded in it, and we ask: where in this system should we look in order to discover the trajectory of this human's intentions as they evolve through time? We might draw a boundary around the human's left foot and ask: can we discover the trajectory of this human's intentions by examining the configuration of this part of the world? We might draw a boundary around the human's head and ask the same question, and I think some would say in this case that the answer is yes, we can discover the human's intentions by examining the configuration of the head. But this is a remarkably strong claim: it asserts that there is no information crucial to tracing the trajectory of the human's intentions over time in any part of the system outside the head. If we draw a boundary aro

This post has benefited greatly from discussion with Sam Eisenstat, Caspar Oesterheld, and Daniel Kokotajlo.

Last year, I wrote a post claiming there was a Dutch Book against CDTs whose counterfactual expectations differ from EDT. However, the argument was a bit fuzzy.

I recently came up with a variation on the argument which gets around some problems; I present this more rigorous version here.

Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sense.

"EDT", on the other hand, refers to the use of conditional probability to evaluate expected value of actions.

Put more mathematically, for action a, EDT uses the conditional expectation E[U | A = a], and CDT uses the counterfactual expectation E[U | do(A = a)] (where do(·) stands for whichever notion of counterfactual the CDT in question uses). I'll write ...
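(A toy, entirely invented illustration of the two quantities: a joint distribution in which conditioning on the action and counterfactually setting the action give different answers, which is exactly the situation the Dutch Book targets.)

```python
# Toy joint distribution P(action, utility) in which the action is
# evidence about utility without (by assumption) causing it.
joint = {
    ("a1", 10): 0.4, ("a1", 0): 0.1,
    ("a2", 10): 0.1, ("a2", 0): 0.4,
}

def edt_value(action):
    # EDT: conditional expectation E[U | A = a] under the joint distribution.
    mass = sum(p for (a, u), p in joint.items() if a == action)
    return sum(u * p for (a, u), p in joint.items() if a == action) / mass

# A CDT (in the broad sense above) supplies its own counterfactual
# distribution for each action; here we simply posit one in which the
# action has no effect on utility.
counterfactual = {"a1": {10: 0.5, 0: 0.5}, "a2": {10: 0.5, 0: 0.5}}

def cdt_value(action):
    # CDT: expectation of U under the posited counterfactual for the action.
    return sum(u * p for u, p in counterfactual[action].items())

for a in ("a1", "a2"):
    print(a, edt_value(a), cdt_value(a))  # a1: 8.0 vs 5.0; a2: 2.0 vs 5.0
```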

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract fo...
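(For reference, a minimal sketch of the epsilon-exploration rule the comment describes; the function name and the value of epsilon are mine:)

```python
import random

def epsilon_exploration(actions, preferred, epsilon=0.01):
    # Every action is forced with probability epsilon; the agent only
    # chooses where the remaining probability mass goes.
    probs = {a: epsilon for a in actions}
    probs[preferred] += 1.0 - epsilon * len(actions)
    r, cumulative = random.random(), 0.0
    for a, p in probs.items():
        cumulative += p
        if r < cumulative:
            return a
    return preferred  # guard against floating-point rounding

# Under this rule every action has probability at least epsilon, which is
# one concrete answer to where the nonzero probabilities come from.
print(epsilon_exploration(["a1", "a2"], preferred="a1"))
```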

Superrationality, and generalizations of it, must treat options differently depending on how they're named.

Consider the penny correlation game: Both players decide independently on either heads or tails. Then if they decided on the same thing, they each get one util, otherwise they get nothing. You play this game with an exact copy of yourself. You reason: since the other guy is an exact copy of me, whatever I do he will do the same thing. So we will get the util. Then you pick heads because it's first alphabetically or some other silly consideration, and then you win. Good thing you got to play with a copy; otherwise you would only have gotten half a util.

Now consider the penny anti-correlation game: Both players decide independently on...
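(A minimal sketch of why exact copies win one game and lose the other; the anti-correlation payoff rule is filled in from the game's name, i.e. a util exactly when the two choices differ:)

```python
def choose(options):
    # The shared deterministic policy: pick whatever is first alphabetically,
    # the "silly consideration" mentioned above.
    return sorted(options)[0]

options = ["heads", "tails"]
a = choose(options)  # my choice
b = choose(options)  # my exact copy runs the same code, so b == a, always

correlation_payoff = 1 if a == b else 0       # always 1: the copies win
anti_correlation_payoff = 1 if a != b else 0  # always 0: the same symmetry loses
print(correlation_payoff, anti_correlation_payoff)
```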

I don't see how the two problems are the same. They are basically the agreement and symmetry breaking problems of distributed computing, and those two are not equivalent in all models. What you're saying is simply that in the no-communication model (where the same algorithm is used on two processes that can't communicate), these two problems are not equivalent. But they are asking for fundamentally different properties, and are not equivalent in many models that actually allow communication. 

This is definitely a hack, but it seems to solve many problems around Cartesian Boundaries. Much of this is a development of earlier ideas about the Predict-O-Matic; see there if something is unclear.

Phylactery Decision Theory takes a Base Decision Theory (BDT) as an input and builds something around it, creating a new modified decision theory. Its purpose is to give its base the ability to """learn""" its position in the world.

I'll start by explaining a model of it in a Cartesian context. Let's say we have an agent, with a set of designated input and output channels. Then it makes its "decisions" like this: First, it has a probability distribution over everything, including the values of the output channels in the future, and updates it based on the...

I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.

If we want to alter a decision theory to learn its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem to begin with that needs a solution. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires me to read other posts without even a summary of the references. I mad...

2Abram Demski3dOne problem with this is that it doesn't actually rank hypotheses by which is best (in expected utility terms), just how much control is implied. So it won't actually converge to the best self-fulfilling prophecy (which might involve less control). Another problem with this is that it isn't clear how to form the hypothesis "I have control over X".
1Bunthut3dYou don't. I'm using talk about control sometimes to describe what the agent is doing from the outside, but the hypotheses it believes all have a form like "The variables such and such will be as if they were set by BDT given such and such inputs". For the first setup, where it's trying to learn what it has control over, that's true. But you can use any ordering of hypotheses for the descent, so we can just take "how good that world is" as our ordering. This is very fragile of course. If there are uncountably many great but unachievable worlds, we fail, and in any case we are paying for all this with performance on "ordinary learning". If this were running in a non-episodic environment, we would have to find a balance between having the probability of hypotheses decline according to goodness, and avoiding the "optimistic Humean troll" hypothesis by considering complexity as well. It really seems like I ought to take "the active ingredient" of this method out, if I knew how.
2Abram Demski2dRight, but then, are all other variables unchanged? Or are they influenced somehow? The obvious proposal is EDT -- assume influence goes with correlation. Another possible answer is "try all hypotheses about how things are influenced."