Recommended Sequences

Embedded Agency
AGI safety from first principles
Iterated Amplification

Recent Discussion


This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I’m going to go through some salient ways you could vary the story.

This isn’t intended to be a particularly great story (and it’s pretty informal). I’m still trying to think through what I expect to happen if alignment turns out to be hard, and this more like the most recent entry in a long journey of gradually-improving stories.

I wrote this up a few months ago and was reminded to post...

Curated. This was perhaps the most detailed yet informative story I've read about how failure will go down. As you say at the start it's making several key assumptions, it's not your 'mainline' failure story. Thx for making the assumptions explicit, and discussing how to vary them at the end. I'd like to see more people write stories written under different assumptions.

The sorts of stories Eliezer has told in the past have focused on 10-1000x faster takeoffs than discussed here, so those stories are less extended (you kinda just wake up one day then everyo... (read more)

2Wei Dai1d(Apologies for the late reply. I've been generally distracted by trying to take advantage of perhaps fleeting opportunities in the equities markets, and occasionally by my own mistakes while trying to do that.) How are people going to avoid contact with adversarial content, aside from [] "go into an info bubble with trusted AIs and humans and block off any communications from the outside"? (If that is happening a lot, it seems worthwhile say so explicitly in the story since that might be surprising/unexpected to a lot of readers?) Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don't have the desired effect, like, what happened when the safety researchers tried this? I can empathize with this motivation, but argue that "a kind of AI that will reach the right conclusions about everything" isn't necessarily incompatible with "humans retain enough control to do whatever they decide is right down the line" since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone. For example, all or most humans lose their abilities for doing philosophical reasoning that will eventually converge to philosophical truths, because they go crazy from AI-powered memetic warfare, or come under undue influence of AI advisors who lack such abilities themselves but are extremely convincing. Or humans lock in what they currently think are their values/philosophies in some form (e.g., as utility functions in AI, or asking their AIs to help protect the humans themselves from value drift while unable to effectively differentiate between "drift" and "philosophical progress") to try to protect them from a highly volatile and unpredictable world.
2Adele Lopez7dHow bad is the ending supposed to be? Are just people who fight the system killed, and otherwise, humans are free to live in the way AI expects them to (which might be something like keep consuming goods and providing AI-mediated feedback on the quality of those goods)? Or is it more like once humans are disempowered no machine has any incentive to keep them around anymore, so humans are not-so-gradually replaced with machines? The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with "For many people this is a very scary situation.") we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?
6Paul Christiano7dI think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that's likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn't mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy). I don't think this is infeasible. It's not the intervention I'm most focused on, but it may be the easiest way to avoid this failure (and it's an important channel for advance preparations to make things better / important payoff for understanding what's up with alignment and correctly anticipating problems).
5CarlShulman7dI think the one that stands out the most is 'why isn't it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?' I understand the scenario says it isn't because the demonstrations are incomprehensible, but why/how?
7Paul Christiano7dYes, if demonstrations are comprehensible then I don't think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us. The global camera grab must involve plans that aren't clearly bad to humans even when all the potential gotchas are pointed out. For example they may involve dynamics that humans just don't understand, or where a brute force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that's about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can't figure out. It might involve the construction of new AI-designed AI systems which operate in different ways whose function we can't really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so-incentivized, but which are even harder to control). (There is obviously a lot you could say about all the tools at the human's disposal to circumvent this kind of problem.) This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on behalf of the civilization, (iii) "correct" generalization. Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can't solve alignment well enough in the intervening time, I do agree that it's unlikely we can solve it in advance.)
2Rohin Shah7dPlanned summary for the Alignment Newsletter:
2Rohin Shah7dPlanned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) [] )

I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.

So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).

Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...

3Abram Demski6dI agree that radical probabilism can be thought of as bayesian-with-a-side-channel, but it's nice to have a more general characterization where the side channel is black-box, rather than an explicit side-channel which we explicitly update on. This gives us a picture of the space of rational updates. EG, the logical induction criterion allows for a large space of things to count as rational. We get to argue for constraints on rational behavior by pointing to the existence of traders which enforce those constraints, while being agnostic about what's going on inside a logical inductor. So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters. Here's an argument for why this is an important dimension to consider: 1. Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you'll agree. One particular complaint is realizability: we have no particular reason to assume that human preferences are within any particular space of hypotheses we can write down. 2. One aspect of this can be captured by InfraBayes: it allows us to eliminate the realizability assumption, instead only assuming that human preferences fall within some set of constraints which we can describe. 3. However, there is another aspect to human preference-uncertainty: human preferences change over time. Some of this is irrational, but some of it is legitimate philosophical deliberation. 4. And, somewhat in the spirit of logical induction, humans do tend to eventually address the most egregious irrationalities. 5. Therefore, I tend to think that toy models of alignment (such as CIRL, DRL, DIRL) should model the human as a radical probabilist; not because it's a perfect model, but because it constitutes a major incremental improvement wrt modeling what kind of uncertainty humans have over our own preferences. Recognizing preferences as a thing whic
3Vanessa Kosoy3dI'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the later. Actually I am rather skeptical/agnostic on this. For me it's fairly easy to picture that I have a "platonic" utility function, except that the time discount is dynamically inconsistent (not exponential). I am in favor of exploring models of preferences which admit all sorts of uncertainty and/or dynamic inconsistency, but (i) it's up to debate how much degrees of freedom we need to allow there and (ii) I feel that the case logical induction is the right framework for this is kinda weak (but maybe I'm missing something).

I'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the later.

So, one point is that the InfraBayes picture still gives epistemics an important role: the kind of guarantee arrived at is a guarantee that you won't do too much worse than the most useful partial model expects. So, we can think ... (read more)

3Abram Demski3dThis does make sense to me, and I view it as a weakness of the idea. However, the productivity of dutch-book type thinking in terms of implying properties which seem appealing for other reasons speaks heavily in favor of it, in my mind. A formal connection to more pragmatic criteria would be great. But also, maybe I can articulate a radical-probabilist position without any recourse to dutch books... I'll have to think more about that. I'm not sure how to double crux with this intuition, unfortunately. When I imagine the perspective you describe, I feel like it's rolling all dynamic inconsistency into time-preference and ignoring the role of deliberation. My claim is that there is a type of change-over-time which is due to boundedness, and which looks like "dynamic inconsistency" from a classical bayesian perspective, but which isn't inherently dynamically inconsistent. EG, if you "sleep on it" and wake up with a different, firmer-feeling perspective, without any articulable thing you updated on. (My point isn't to dogmatically insist that you haven't updated on anything, but rather, to point out that it's useful to have the perspective where we don't need to suppose there was evidence which justifies the update as Bayesian, in order for it to be rational.)

Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.

The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:

When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.

Meanwhile, a...

8Daniel Kokotajlo10hThanks! I'm afraid I don't understand the math yet but I'll keep trying. In the meantime: Can you say more about why? It's not obvious to me that they are not big enough. Would you agree they probably contain edge detectors, circle detectors, etc. at initialization? Also, it seems that some subnetworks/tickets are already decent at the task at initialization, see e.g. this paper. [] Is that not "dog-recognizing subcircuits at initialization?" Or something similar?
4johnswentworth7hThe problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it's more plausible that some neurons implement that function right from the start.) The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they're finding would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably "find" those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn't have any reason to upweight them for dog-recognition. So what's going on in that paper is different from what's going on in normal SGD-trained nets (or at least not the full story).

One important thing to note here is that the LTH paper doesn't demonstrate that SGD "finds" a ticket: just that the subnetwork you get by training and pruning could be trained alone in isolation to higher accuracy. That doesn't mean that the weights in the original training are the same when the network is trained in isolation!

5Vanessa Kosoy1dIIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?
3johnswentworth1dThis basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept.  Thanks also for comments from Ramana Kumar.

Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.

Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.

This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.  Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.

 Unipolar take-offsMultipolar take-offs
Slow take-offs<not this post>Part 1 of this post
Fast take-offs<not this

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over th... (read more)

2Andrew Critch6dThanks for the pointer to grace2020whose []! I've added it to the original post now under "successes in our agent-agnostic thinking". For sure, that is the point of the "successes" section. Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my eye there should be more communication across the boundary of that bubble."
5Andrew Critch6dThanks for this synopsis of your impressions, and +1 to the two points you think we agree on. As for these, some of them are real positions I hold, while some are not: I don't hold that view. I the closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment." I don't hold the view you attribute to me here, and I agree wholesale with the following position, including your comparisons of cooperation with brain enhancement and improving belief accuracy: ... with one caveat: some beliefs are self-fulfilling, such as cooperation/defection. There are ways of improving belief accuracy that favor defection, and ways that favor cooperation. Plausibly to me, the ways of improving belief accuracy that favor defection are worse that mo accuracy improvement at all. I'm particularly firm in this view, though; it's more of a hedge. I do hold this view! Particularly the bolded part. I also agree with the bolded parts of your counterpoint, but I think you might be underestimating the value of technical work (e.g., CSC, MARL) directed at improving coordination amongst existing humans and human institutions. I think blockchain tech is a good example of an already-mildly-transformative technology for implementing radically mutually transparent and cooperative strategies through smart contracts. Make no mistake: I'm not claiming blockchain tehc is going to "save the world"; rather, it's changing the way people cooperate, and is doing so as a result of a technical insight. I think more technical insights are in order to improve cooperation and/or the global structure of society, and it's worth spending research efforts to find them. Reminder: this is not a bid for you personally to quit working on alignment! My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens. In practice
6Paul Christiano6dSounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel more general capabilities typically look less cost-effective on alignment in particular, but benefit a ton from the diversity of problems they help address.) I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through. I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.
1Andrew Critch5dYeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards are covered under NDAs), so I also think it's good to leave this question about deployment-time as a hanging disagreement node. Yes to both points; I'd thought of writing a debate dialogue on this topic trying to cover both sides, but commenting with you about it is turning out better I think, so thank for that!
1Andrew Critch6dYes. That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least. Yes, I think coordination costs will by default pose a high overhead cost to preserving human values among systems with the potential to race to the bottom on how much they preserve human values. Yes. Imagine two competing cultures A and B have transformative AI tech. Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values. The shift is by design subtle enough not to trigger leaders of A and B to have a bargaining meeting to regulate against A' (contrary to Carl's narrative [] where leaders coordinate against loss of control). Subculture A' comes to dominate discourse and cultural narratives in A, and makes A faster/more productive than B, such as through the development of fully automated companies as in one of the Production Web [] stories. The resulting advantage of A is enough for A to begin dominating or at least threatening B geopolitically, but by that time leaders in A have little power to squash A', so instead B follows suit by allowing a highly automation-oriented subculture B's to develop. These advantages are small enough not to trigger regulatory oversight, but when integrated over time they are not "tiny". This results in the gradual empowerment of humans who are misaligned with preserving human existence, until those humans also lose control of their own existence, perhaps willfully, or perhaps carelessly, or through a mix of both.
2Paul Christiano6dI was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future? * One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned? * Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment. * Wei Dai has suggested [] that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like like Wei Dai's is the best way to translate your concer
5Andrew Critch5dAh! Yes, this is really getting to the crux of things. The short answer is that I'm worried about the following failure mode: Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins. (Here's, I'm using the word "culture" to encode a mix of information subsuming utility functions, beliefs, and decision theory, cognitive capacities, and other features determining the general tendencies of an agent or collective.) Of course, an easy antidote to this failure mode is to have A or B win instead of A', because A and B both have some human values other than power-maximizing. The problem is that this whole situation is premised on a conflict between A and B over which culture should win, and then the following observation applies: In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization). This observation is slightly different from observations that "simple values dominate engineering efforts" as seen in stories about singleton paperclip maximizers. A key feature of the Production Web dynamic is now just that it's easy to build production maximizers, but that it's easy to accidentally cooperate on building a production-maximizing systems that destroy both you and your competitors. Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out. This seems pretty likely to me. The bolded attribution to Dai above is
8Paul Christiano5dI'm wondering why the easiest way is to copy A'---why was A' better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you're at that point, A' has an advantage. This doesn't feel like other words to me, it feels like a totally different claim. In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firm that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms). So none of the concrete examples of Wei Dai's economies of scale seem to actually seem to apply to give an advantage for the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn't charge each other marginal costs, they would charge profit-maximizing profits. And they won't share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on. This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don't yet understand why this happens. Suppose that the US and China both both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other. Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected? For examp
2Andrew Critch4dHmm, perhaps this is indicative of a key misunderstanding. Why not? The third paragraph of the story indicates that: "Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations." In other words, at that point it could certainly happen that two monopolies would agree to charge each other lower cost if it benefitted both of them. (Unless you'd count that as instance of "charging profit-maximizing costs"?) The concern is that the subprocesses of each company/institution that get good at (or succeed at) bargaining with other institutions are subprocesses that (by virtue of being selected for speed and simplicity) are less aligned with human existence than the original overall company/institution, and that less-aligned subprocess grows to take over the institution, while always taking actions that are "good" for the host institution when viewed as a unilateral move in an uncoordinated game (hence passing as "aligned"). At this point, my plan is try to consolidate what I think the are main confusions in the comments of this post, into one or more new concepts to form the topic of a new post.
1Ben Pace4dSounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).
2Andrew Critch6dThat is not my position if "you" in the story is "you, Paul Christiano" :) The closest position I have to that one is : "If another Paul comes along who cares about x-risk, they'll have more positive impact by focusing on multi-agent and multi-stakeholder issues or 'ethics' with AI tech than if they focus on intent alignment, because multi-agent and multi-stakeholder dynamics will greatly affect what strategies AI stakeholders 'want' their AI systems to pursue." If they tried to get you to quit working on alignment, I'd say "No, the tech companies still need people working on alignment for them, and Paul is/was one of those people. I don't endorse converting existing alignment researchers to working on multi/multi delegation theory (unless they're naturally interested in it), but if a marginal AI-capabilities-bound researcher comes along, I endorse getting them set up to think about multi/multi delegation more than alignment."
11CarlShulman7dRight now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T, with a world economy of >$130T. For AI and computing industries the concentration is even greater. These leading powers are willing to regulate companies and invade small countries based on reasons much less serious than imminent human extinction. They have also avoided destroying one another with nuclear weapons. If one-to-one intent alignment works well enough that one's own AI will not blatantly lie about upcoming AI extermination of humanity, then superintelligent locally-aligned AI advisors will tell the governments of these major powers (and many corporate and other actors with the capacity to activate governmental action) about the likely downside of conflict or unregulated AI havens (meaning specifically the deaths of the top leadership and everyone else in all countries). Within a country, one-to-one intent alignment for government officials or actors who support the government means superintelligent advisors identify and assist in suppressing attempts by an individual AI company or its products to overthrow the government. Internationally, with the current balance of power (and with fairly substantial deviations from it) a handful of actors have the capacity to force a slowdown or other measures to stop an outcome that will otherwise destroy them. They (and the corporations that they have legal authority over, as well as physical power to coerce) are few enough to make bargaining feasible, and powerful enough to pay a large 'tax' while still being ahead of smaller actors. And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing. That situation could change if AI enables tiny firms and countries to match the superpowers in AI capabilities or WMD before leading powers can block it. So I agree with other
2Andrew Critch6dCarl, thanks for this clear statement of your beliefs. It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop. Is that right? If so, could you give me a % confidence on that position somehow? And if not, could you clarify? To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, i.e., I'd bid 10% to buy in a prediction market on this claim if it were settlable. In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.) I should add that my numbers for both of those outcomes are down significantly from ~3 years ago due to cultural progress in CS/AI (see this ACM blog post []) allowing more discussion of (and hence preparation for) negative outcomes, and government pressures to regulate the tech industry.
4CarlShulman5dThe US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't. But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa and Paul's comments about these scenarios being hard to fit with the idea that technical 1-to-1 alignment work is much less impactful than cooperative RL or the like. If this means that a 'robot rebellion' would include software produced by more than one company or country, I think that that is a substantial possibility, as well as the alternative, since competitive dynamics in a world with a few giant countries and a few giant AI companies (and only a couple leading chip firms) can mean that the way safety tradeoffs work is by one party introducing rogue AI systems that outcompete by not paying an alignment tax (and intrinsically embodying in themselves astronomically valuable and expensive IP), or cascading alignment failure in software traceable to a leading company/consortium or country/alliance. But either way reasonably effective 1-to-1 alignment methods (of the 'trying to help you and not lie to you and murder you with human-level abilities' variety) seem to eliminate a supermajority of the risk. [I am separately skeptical that technical work on multi-agent RL is particularly helpful, since it can be done by 1-to-1 aligned systems when they are smart, and the more important coordination problems seem to be earlier between humans in the development phase.]
5JesseClifton2dI'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure. 1. The principals largely delegate to AI systems on military decision-making, mistakenly believing that the systems are extremely competent in this domain. 2. The mostly-intent-aligned AI systems, who are actually not extremely competent in this domain, make hair-trigger commitments of the kind described in the OP. The systems make their principals aware of these commitments and (being mostly-intent-aligned) convince their principals “in good faith” that this is the best strategy to pursue. In particular they are convinced that this will not lead to existential catastrophe. 3. The commitments are triggered as described in the OP, leading to conflict. The conflict proceeds too quickly for the principals to effectively intervene / the principals think their best bet at this point is to continue to delegate to the AIs. 4. At every step both principals and AIs think they’re doing what’s best by the respective principals’ lights. Nevertheless, due to a combination of incompetence at bargaining and structural factors (e.g., persistent uncertainty about the other side’s resolve), the AIs continue to fight to the point of extinction or unrecoverable collapse. Would be curious to know which parts of this story you find most implausible.
2CarlShulman2dMainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk. You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.
1JesseClifton1dOk, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe. It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation of the systems’ domain-specific abilities may be exacerbated by competitive pressures. And of course there is historical precedent for delegation to overconfident military bureaucracies. On the other hand, to the extent that human leadership is able to correctly assess their systems’ competence in this domain, it may be only because there has been a sufficiently successful AI cooperation research program. For instance, maybe this research program has furnished appropriate simulation environments to probe the relevant aspects of the systems’ behavior, transparency tools for investigating cognition about other AI systems, norms for the resolution of conflicting interests and methods for robustly instilling those norms, etc, along with enough researcher-hours applying these tools to have an accurate sense of how well the systems will navigate conflict. As for irreversible delegation — there is the question of whether delegation is in principle reversible, and the question of whether human leaders would want to override their AI delegates once war is underway. Even if delegation is reversible, human leaders may think that their delegates are better suited to wage war on their behalf once it has started. Perhaps because things are simply happening so fast for them to have confidence that they could intervene without placing themselves at a decisive disadvantage.
4Rohin Shah7dPlanned summary for the Alignment Newsletter: Planned opinion (shared with Another (outer) alignment failure story [] ):
4Andrew Critch6dYes, I agree with this. Yes! +10 to this! For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here [The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.]) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply. I'm trying to say "Hey, if you care about AI x-risk, alignment isn't the only game in town", and staking some personal reputation points to push against the status quo where almost-everyone x-risk oriented will work on alignment almost-nobody x-risk-oriented will work on cooperation/coordination or multi/multi delegation. Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...
8Paul Christiano6dIn fairness, writing [] “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.” (I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.) On top of that, in your prior post [] you make stronger claims: * "Contributions to OODR research are not particularly helpful to existential safety in my opinion.” * “Contributions to preference learning are not particularly helpful to existential safety in my opinion” * “In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents) In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses. I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (th
7Rohin Shah6dI think that probably would be true. Fwiw my reaction is not "Critch thinks Rohin should do something else", it's more like "Critch is saying something I believe to be false on an important topic that lots of other people will read". I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I'd say "but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z". (I've had enough conversations with you at this point to know the axes of disagreement, and I think you've convinced me that "which one is better on the margin" is not actually that important a question to get an answer to. So now I don't feel as much of an urge to respond that way. But that's how I started out.)
3Andrew Critch5dGot it, thanks!

Or: Big Timelines Crux Operationalized

What fun things could one build with +12 orders of magnitude of compute? By ‘fun’ I mean ‘powerful.’ This hypothetical is highly relevant to AI timelines, for reasons I’ll explain later.

Summary (Spoilers):

I describe a hypothetical scenario that concretizes the question “what could be built with 2020’s algorithms/ideas/etc. but a trillion times more compute?”  Then I give some answers to that question. Then I ask: How likely is it that some sort of TAI would happen in this scenario? This second question is a useful operationalization of the (IMO) most important, most-commonly-discussed timelines crux:  “Can we get TAI just by throwing more compute at the problem?” I consider this operationalization to be the main contribution of this post; it directly plugs into Ajeya’s timelines

9Daniel Kokotajlo1dUpdate: After talking to various people, it appears that (contrary to what the poll would suggest) there are at least a few people who answer Question 2 (all three variants) with less than 80%. In light of those conversations, and more thinking on my own, here is my current hot take on how +12 OOMs could turn out to not be enough: 1. Maybe the scaling laws will break. Just because GPT performance has fit a steady line across 5 orders of magnitude so far (or whatever) doesn't mean it will continue for another 5. Maybe it'll level off for some reason we don't yet understand. Arguably this is what happened with LSTMs? Anyhow, for timelines purposes what matters is not whether it'll level off by the time we are spending +12 OOMs of compute, but rather more like whether it will level off by the time we are spending +6 OOMs of compute. I think it's rather unlikely to level off that soon, but it might. Maybe 20% chance. If this happens, then probably Amp(GPT-7) and the like wouldn't work. (80%?) The others are less impacted, but maybe we can assume OmegaStar probably won't work either. Crystal Nights, SkunkWorks, and Neuromorph... don't seem to be affected by scaling laws though. If this were the only consideration, my credence would be something like 15% chance that Crystal Nights and OmegaStar don't work, and then independently, maybe 30% chance that none of the others work too, for a total of 95% answer to Question Two... :/ I could fairly easily be convinced that it's more like a 40% chance instead of 15% chance, in which case my answer is still something like 85%... :( 2. Maybe the horizon length framework plus scaling laws really will turn out to be a lot more solid than I think. In other words, maybe +12 OOMs is enough to get us some really cool chatbots and whatnot but not anything transformative or PONR-inducing; for those tasks we need long-horizon training... (Medium-horizons can be handled by +12 OOMs). Unsurprisingly to those who've read my sequence on take
5Abram Demski1dSo, how does the update to the AI and compute trend [] factor in?

It is irrelevant to this post, because this post is about what our probability distribution over orders of magnitude of compute should be like. Once we have said distribution, then we can ask: How quickly (in clock time) will we progress through the distribution / explore more OOMs of compute? Then the AI and compute trend, and the update to it, become relevant.

But not super relevant IMO. The AI and Compute trend was way too fast to be sustained, people at the time even said so. This recent halt in the trend is not surprising. What matters is what the tren... (read more)

4Abram Demski1dIs there a reference for this?
2Daniel Kokotajlo1dWhat Gwern said. :) But I don't know for sure what the person I talked to had in mind.
10gwern1d []

As a follow-up to the Walled Garden discussion about Kelly betting, Scott Garrabrant made some super-informal conjectures to me privately, involving the idea that some class of "nice" agents would "Kelly bet influence", where "influence" had something to do with anthropics and acausal trade.

I was pretty incredulous at the time. However, as soon as he left the discussion, I came up with an argument for a similar fact. (The following does not perfectly reflect what Scott had in mind, by any means. His notion of "influence" was very different, for a start.)

The meat of my argument is just Critch's negotiable RL theorem. In fact, that's practically the entirety of my argument. I'm just thinking about the consequences in a different way from how I have before.


Rather than...

3Gurkenglas2dSuppose instead of a timeline with probabilistic events, the coalition experiences the full tree of all possible futures - but we translate everything to preserve behavior. Then beliefs encode which timelines each member cares about, and bets trade influence (governance tokens) between timelines.
2Abram Demski1dCan you justify Kelly "directly" in terms of Pareto-improvement trades rather than "indirectly" through Pareto-optimality? I feel this gets at the distinction between the selfish vs altruistic view.
2SimonM2dI also looked into this after that discussion. At the time I thought that this might have been something special about Kelly, but when I did some calculations afterwards I found that I couldn't get this to work in the other direction. I haven't fully parsed what you mean by: So take the following with a (large) grain of salt before I can recheck my reasoning, but: Everything you've written (as I currently understand it) also applies for many other betting strategies. eg if everyone was betting (the same constant) fractional Kelly. Specifically the market will clear at the same price (weighted average probability) and "everyone who put money on the winning side picks up a fraction of money proportional to the fraction they originally contributed to that side".

I also looked into this after that discussion. At the time I thought that this might have been something special about Kelly, but when I did some calculations afterwards I found that I couldn't get this to work in the other direction.

I'm not sure what you mean here. What is "this" in "looked into this" -- Critch's theorem? What is "the other direction"?

Everything you've written (as I currently understand it) also applies for many other betting strategies. eg if everyone was betting (the same constant) fractional Kelly.

Specifically the market will clear at

... (read more)
Load More