All of Daniel Kokotajlo's Comments + Replies

Another (outer) alignment failure story

Thanks for this, this is awesome! I'm hopeful in the next few years for there to be a collection of stories like this.

This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes.

I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for ... (read more)

5 · Paul Christiano · 3h: The main way it's worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story. I don't think it's right to infer much about my stance on inner vs outer alignment.

I don't know if it makes sense to split out "social competence" in this way. The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the "takeoff" part of the story is subjectively shorter than the last 70 years. I'm extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.

I don't think the AI systems are all on the same team. That said, to the extent that there are "humans are deluded" outcomes that are generally preferable according to many AIs' values, I think the AIs will tend to bring about such outcomes. I don't have a strong view on whether that involves explicit coordination. I do think the range for everyone-wins outcomes (amongst AIs) is larger because of the "AIs generalize 'correctly'" assumption, so this story probably feels a bit more like "us vs them" than a story that relaxed that assumption.

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

I'm imagining that's the case in this story. Failure is early enough in this story that e.g. the humans' investment in sensor networks and rare expensive audits isn't slowing them down very much compared to the "rogue" AI. Such "rogue" AI could provide a competitive pressure, but I think it's a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).
We will be deploying m
How do scaling laws work for fine-tuning?

Not according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.

1 · Charlie Steiner · 6d: I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I'm wrong and there's some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
How do scaling laws work for fine-tuning?

I totally agree that you still have to do all the matrix multiplications of the original model etc. etc. I'm saying that you'll need to do them fewer times, because you'll be training on less data.

Each step costs, say, 6*N flop where N is parameter count. And then you do D steps, where D is how many data points you train on. So total flop cost is 6*N*D. When you fine-tune, you still spend 6*N for each data point, but you only need to train on 0.001D data points, at least according to the scaling laws, at least according to the orthodox inter... (read more)
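The arithmetic above can be sketched directly. This is an illustrative calculation only: N, D, and the 0.001 fine-tuning fraction are placeholder numbers standing in for "parameter count", "pretraining data points", and the orthodox scaling-law estimate quoted in the comment, not measured values.

```python
def training_flop(n_params: float, n_data: float) -> float:
    """Total training cost: ~6 flop per parameter per data point."""
    return 6 * n_params * n_data

N = 1e11   # parameter count (hypothetical)
D = 1e12   # pretraining data points (hypothetical)

pretrain = training_flop(N, D)
# Fine-tuning: same 6*N cost per data point, but ~1000x fewer data points.
finetune = training_flop(N, 0.001 * D)

# The per-step cost is unchanged; the total is 3 OOM cheaper.
assert abs(finetune / pretrain - 0.001) < 1e-12
```

The point being illustrated: the savings come entirely from the D factor, not from the per-step 6*N factor.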

1 · Charlie Steiner · 7d: Sure, but if you're training on less data it's because fewer parameters is worse :P
How do scaling laws work for fine-tuning?

I think compute cost equals data x parameters, so even if parameters are the same, if data is 3 OOM smaller, then compute cost will be 3 OOM smaller.

I'm not sure I understand your edit question. I'm referring to the scaling laws as discussed and interpreted by Ajeya. Perhaps part of what's going on is that in the sizes of models we've explored so far, bigger models only need a little bit more data, because bigger models are more data-efficient. But very soon it is prophesied that this will stop and we will transition to a slower scaling ... (read more)

1 · Charlie Steiner · 7d: I'm not sure how your reply relates to my guess, so I'm a little worried. If you're intending the compute comment to be in opposition to my first paragraph, then no - when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you're finetuning, because you still have to do all the matrix multiplications of the original model, both for inference and gradient propagation. I think the point of the paper only finetuning a subset was to make a scientific point, not save compute.

My edit question was just because you said something about expecting the # of steps to be 3 OOM smaller for a 3 OOM smaller model. But iirc really it's more like the compute will be smaller, but the # of steps won't change much (they're just cheaper).

Do you have a reference for this picture of "need lots more data to get performance improvements"? I've also heard some things about a transition, but as a transition from compute-limited to data-limited, which means "need lots more compute to get performance improvements."
Review of "Fun with +12 OOMs of Compute"

OK, thanks.

1. I concede that we're not in a position of complete ignorance w.r.t. the new evidence's impact on alternate hypotheses. However, the same goes for pretty much any argument anyone could make about anything. In my particular case I think there's some sense in which, plausibly, for most underlying views on timelines people will have, my post should cause an update more or less along the lines I described. (see below)

2. Even if I'm wrong about that, I can roll out the anti-spikiness argument to argue in favor of <7 OOMs, tho... (read more)

1 · Joe_Collman · 1h: Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument". Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.

On (1), I agree that the same goes for pretty much any argument: that's why it's important. If you update without factoring in (some approximation of) your best judgement of the evidence's impact on all hypotheses, you're going to get the wrong answer. This will depend highly on your underlying model.

On the information content of the post, I'd say it's something like "12 OOMs is probably enough (without things needing to scale surprisingly well)". My credence for low OOM values is mostly based on worlds where things scale surprisingly well. I don't think this is weird. What matters isn't what the post talks about directly - it's the impact of the evidence provided on the various hypotheses. There's nothing inherently weird about evidence increasing our credence in [TAI by +10 OOM] and leaving our credence in [TAI by +3 OOM] almost unaltered (quite plausibly because it's not too relevant to the +3 OOM case).

Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It's only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn't need to talk about x to do this.

You can do the same thing with the TAI first at k OOM case - call that Tk. Let's say that your post is our evidence e and that e+ stands for [e gives a compelling argument against T13+].
Updating on e+ you get the following
How do scaling laws work for fine-tuning?

Thanks! Your answer no. 2 is especially convincing to me; I didn't realize the authors used smaller models as the comparison--that seems like an unfair comparison! I would like to see how well these 0.1%-tuned transformers do compared to similarly-sized transformers trained from scratch.

4 · Rohin Shah · 8d: I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.
Review of "Fun with +12 OOMs of Compute"

I think I'm just not seeing why you think the >12 OOM mass must all go somewhere other than the <4 OOM (or really, I would argue, <7 OOM) case. Can you explain more?

Maybe the idea is something like: There are two underlying variables, 'We'll soon get more ideas' and 'current methods scale.' If we get new ideas soon, then <7 are needed. If we don't but 'current methods scale' is true, 7-12 are needed. If neither variable is true then >12 is needed. So then we read my +12 OOMs post and become convinced th... (read more)

4 · Joe_Collman · 8d: [[ETA: I'm not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM. In fact my credence on <4 OOM was increased, but only very slightly.]]

First I should clarify that the only point I'm really confident on here is the "In general, you can't just throw out the >12 OOM and re-normalise, without further assumptions" argument. I'm making a weak claim: we're not in a position of complete ignorance w.r.t. the new evidence's impact on alternate hypotheses. My confidence in any specific approach is much weaker: I know little relevant data.

That said, I think the main adjustment I'd make to your description is to add the possibility for sublinear scaling of compute requirements with current techniques. E.g. if beyond some threshold meta-learning efficiency benefits are linear in compute, and non-meta-learned capabilities would otherwise scale linearly, then capabilities could scale with the square root of compute (feel free to replace with a less silly example of your own). This doesn't require "We'll soon get more ideas" - just a version of "current methods scale" with unlucky (from the safety perspective) synergies.

So while the "current methods scale" hypothesis isn't confined to 7-12 OOMs, the distribution does depend on how things scale: a higher proportion of the 1-6 region is composed of "current methods scale (very) sublinearly". My p(>12 OOM | sublinear scaling) was already low, so my p(1-6 OOM | sublinear scaling) doesn't get much of a post-update boost (not much mass to re-assign). My p(>12 OOM | (super)linear scaling) was higher, but my p(1-6 OOM | (super)linear scaling) was low, so there's not too much of a boost there either (small proportion of mass assigned).
I do think it makes sense to end up with a post-update cr
How do we prepare for final crunch time?

Hmmm, if this is the most it's been done, then that counts as a No in my book. I was thinking something like "Ah yes, the Viet Cong did this for most of the war, and it's now standard in both the Vietnamese and Chinese armies." Or at least "Some military somewhere has officially decided that this is a good idea and they've rolled it out across a large portion of their force."

Review of "Fun with +12 OOMs of Compute"

Interesting, hmm.

In the 1-2-3 coin case, seeing that y is heads rules out 3, but it also rules out half of 1. (There are two "1" hypotheses: the y-heads version and the y-tails version.) To put it another way, P(y_heads | 1) = 0.5. So we are ruling-out-and-renormalizing after all, even though it may not appear that way at first glance.

The question is, is something similar happening with the AI OOMs?

I think if the evidence leads us to think things like "This doesn't say anything about TAI at +4 OOM, since my prediction is based on orthogonal variables" ... (read more)

3 · Joe_Collman · 12d: Yes, we're always renormalising at the end - it amounts to saying "...and the new evidence will impact all remaining hypotheses evenly". That's fine once it's true.

I think perhaps I wasn't clear with what I mean by saying "This doesn't say anything...". I meant that it may say nothing in absolute terms - i.e. that I may put the same probability of [TAI at 4 OOM] after seeing the evidence as before. This means that it does say something relative to other not-ruled-out hypotheses: if I'm saying the new evidence rules out >12 OOM, and I'm also saying that this evidence should leave p([TAI at 4 OOM]) fixed, I'm implicitly claiming that the >12 OOM mass must all go somewhere other than the 4 OOM case. Again, this can be thought of as my claiming e.g.:

- [TAI at 4 OOM] will happen if and only if zwomples work
- There's a 20% chance zwomples work
- The new 12 OOM evidence says nothing at all about zwomples

In terms of what I actually think, my sense is that the 12 OOM arguments are most significant where [there are no high-impact synergistic/amplifying/combinatorial effects I haven't thought of]. My credence for [TAI at < 4 OOM] is largely based on such effects. Perhaps it's 80% based on some such effect having transformative impact, and 20% on we-just-do-straightforward-stuff. [Caveat: this is all just ottomh; I have NOT thought for long about this, nor looked at much evidence; I think my reasoning is sound, but specific numbers may be way off.]

Since the 12 OOM arguments are of the form we-just-do-straightforward-stuff, they cause me to update the 20% component, not the 80%. So the bulk of any mass transferred from >12 OOM goes to cases where p([we-just-did-straightforward-stuff and no strange high-impact synergies occurred] | [TAI first occurred at this level]) is high.
How do we prepare for final crunch time?

Thanks, this is a great thing to be thinking about and a good list of ideas!

Do other subjects come to mind?

Public speaking skills, persuasion skills, debate skills, etc.

Practice no-cost-too-large productive periods

I like this idea. At AI Impacts we were discussing something similar: having "fire drills" where we spend a week (or even just a day) pretending that a certain scenario has happened, e.g. "DeepMind just announced they have a turing-test-passing system and will demo it a week from now; we've got two journalists asking us fo... (read more)

3 · Kaj Sotala · 11d: Yes, e.g. []
Generalizing Power to multi-agent games
this feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon. 

Different strokes for different folks, I guess. It feels very different to me.

Review of "Fun with +12 OOMs of Compute"
We now need to reassign most of the 30% mass we have on >13 OOM, but we can't simply renormalise: we haven't (necessarily) gained any information on the viability of [approach X].
Our post-update [TAI <= 5OOM] credence should remain almost exactly 20%. Increasing it to ~26% would not make any sense.

I don't see why this is. From a bayesian perspective, alternative hypotheses being ruled out == gaining evidence for a hypothesis. In what sense have we not gained any information on the viability of approach X? We've learned that one of the alternatives to X (the at least 13 OOM alternative) won't happen.

6 · Joe_Collman · 13d: We do gain evidence on at least some alternatives, but not on all the factors which determine the alternatives. If we know something about those factors, we can't usually just renormalise. That's a good default, but it amounts to an assumption of ignorance.

Here's a simple example: We play a 'game' where you observe the outcome of two fair coin tosses x and y. You score:

- 1 if x is heads
- 2 if x is tails and y is heads
- 3 if x is tails and y is tails

So your score predictions start out at: 1: 50%, 2: 25%, 3: 25%.

We look at y and see that it's heads. This rules out 3. Renormalising would get us: 1: 66.7%, 2: 33.3%, 3: 0%. This is clearly silly, since we ought to end up at 50:50 - i.e. all the mass from 3 should go to 2. This happens because the evidence that falsified 3 points was insignificant to the question "did you score 1 point?".

On the other hand, if we knew nothing about the existence of x or y, and only knew that we were starting from (1: 50%, 2: 25%, 3: 25%), and that 3 had been ruled out, it'd make sense to re-normalise.

In the TAI case, we haven't only learned that 12 OOM is probably enough (if we agree on that). Rather we've seen specific evidence that leads us to think 12 OOM is probably enough. The specifics of that evidence can lead us to think things like "This doesn't say anything about TAI at +4 OOM, since my prediction for +4 is based on orthogonal variables", or perhaps "This makes me near-certain that TAI will happen by +10 OOM, since the +12 OOM argument didn't require more than that".
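The two-coin game makes a nice executable check. A short Python sketch, using only the numbers given in the example (exact arithmetic via fractions, nothing new assumed):

```python
from itertools import product
from fractions import Fraction

half = Fraction(1, 2)
# Joint distribution over two fair coins x, y.
joint = {(x, y): half * half for x, y in product("HT", repeat=2)}

def score(x, y):
    # 1 if x=H; 2 if x=T,y=H; 3 if x=T,y=T.
    return 1 if x == "H" else (2 if y == "H" else 3)

# Prior over scores: 1 -> 1/2, 2 -> 1/4, 3 -> 1/4.
prior = {s: sum(p for (x, y), p in joint.items() if score(x, y) == s)
         for s in (1, 2, 3)}

# Condition on the observation y = H (this rules out score 3).
post_joint = {k: p for k, p in joint.items() if k[1] == "H"}
z = sum(post_joint.values())
posterior = {s: sum(p for (x, y), p in post_joint.items()
                    if score(x, y) == s) / z
             for s in (1, 2, 3)}

# Naive "delete score 3 and renormalise" gives (2/3, 1/3, 0) instead.
naive = {1: Fraction(2, 3), 2: Fraction(1, 3), 3: Fraction(0)}

assert prior == {1: half, 2: Fraction(1, 4), 3: Fraction(1, 4)}
assert posterior == {1: half, 2: half, 3: 0}  # all of 3's mass went to 2
assert posterior != naive
```

The full conditioning on the joint distribution sends all of score 3's mass to score 2, exactly as the comment argues, while the renormalisation shortcut wrongly inflates score 1.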
Generalizing Power to multi-agent games
It initially seems unintuitive that as players' strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes".

"I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals."

IMO, the unintuitive and potentially problematic thing is not that... (read more)

4 · Alex Turner · 14d: I don't see why useful power is particularly useful, since it's taking a non-constant-sum quantity (outside of nash equilibria) and making it constant-sum, which seems misleading. But I also don't see a problem with the "better play -> less exploitability -> less total Power" reasoning. This feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.
Review of "Fun with +12 OOMs of Compute"

I'm not sure, but I think that's not how updating works? If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (e.g. only a 10% chance of it taking more than 12), then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths. And yes, I have the same intuition about analogical arguments too. For example, let's say you overhear me talking about a bridge being built near my h... (read more)

2 · Joe_Collman · 14d: This depends on the mechanism by which you assigned the mass initially - in particular, whether it's absolute or relative. If you start out with specific absolute probability estimates as the strongest evidence for some hypotheses, then you can't just renormalise when you falsify others.

E.g. consider we start out with these beliefs:

- If [approach X] is viable, TAI will take at most 5 OOM; 20% chance [approach X] is viable.
- If [approach X] isn't viable, 0.1% chance TAI will take at most 5 OOM.
- 30% chance TAI will take at least 13 OOM.

We now get this new information: There's a 95% chance [approach Y] is viable; if [approach Y] is viable TAI will take at most 12 OOM.

We now need to reassign most of the 30% mass we have on >13 OOM, but we can't simply renormalise: we haven't (necessarily) gained any information on the viability of [approach X]. Our post-update [TAI <= 5 OOM] credence should remain almost exactly 20%. Increasing it to ~26% would not make any sense.

For AI timelines, we may well have some concrete, inside-view reasons to put absolute probabilities on contributing factors to short timelines (even without new breakthroughs we may put absolute numbers on statements of the form "[this kind of thing] scales/generalises"). These probabilities shouldn't necessarily be increased when we learn something giving evidence about other scenarios. (The probability of a short timeline should change, but in general not proportionately.)

Perhaps if you're getting most of your initial distribution from a more outside-view perspective, then you're right.
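The approach-X/approach-Y example can be computed explicitly. A minimal sketch, under two modelling assumptions of mine that the comment leaves implicit: X-viability and Y-viability are independent, and the mass removed from the >=13 OOM bucket lands in the 6-12 bucket (it cannot move to <=5, since that bucket is pinned to X's viability, about which we learned nothing).

```python
p_x = 0.20                               # P(approach X viable)
p_le5 = p_x * 1.0 + (1 - p_x) * 0.001    # prior P(TAI <= 5 OOM) = 0.2008
p_ge13 = 0.30                            # prior P(TAI >= 13 OOM)
p_mid = 1 - p_le5 - p_ge13               # prior P(TAI in 6-12 OOM)

p_y = 0.95                               # new info: P(approach Y viable)

# Correct update: Y (if viable) caps TAI at 12 OOM, so most >=13 mass
# moves into the 6-12 bucket; the <=5 bucket is untouched.
post_ge13 = (1 - p_y) * p_ge13
post_le5 = p_le5                         # unchanged: no news about X
post_mid = 1 - post_le5 - post_ge13

# Naive "delete >=13 and renormalise" would instead inflate <=5:
naive_le5 = p_le5 / (p_le5 + p_mid)      # ~0.287 -- the mistaken ~26%+

assert abs(post_le5 - 0.2008) < 1e-9     # stays almost exactly 20%
assert naive_le5 > post_le5              # renormalising overshoots
```

Under these (assumed) independence conditions, the post-update credence on <=5 OOM stays at ~20.1%, while blind renormalisation would push it to ~28.7%.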
3 · Adam Shimi · 14d: About the update: You're right, that's what would happen with an update. I think that the model I have in mind (although I hadn't explicitly thought about it until now) is something like a distribution over ways to reach TAI (capturing how probable it is that they're the first way to reach AGI), and each option comes with its own distribution (let's say over years). Obviously you can compress that into a single distribution over years, but then you lose the ability to do fine-grained updating.

For example, I imagine that someone with relatively low probability that prosaic AGI will be the first to reach AGI, upon reading your post, would have reasons to update the distribution for prosaic AGI in the way you discuss, but not to update the probability that prosaic AGI will be the first to reach TAI. On the other hand, if there was an argument centered more around an amount of compute we could plausibly get in a short timeframe (the kind of thing we discuss as potential follow-up work), then I'd expect that this same person, if convinced, would put more probability on prosaic AGI being the first to reach TAI.

Graph-based argument: I must admit that I have trouble reading your graph because there's no scale (although I expect the spiky part is centered at +12 OOMs?). As for the textual argument, I actually think it makes sense to put quite low probability on +13 OOMs if one agrees with your scenario. Maybe my argument is a bit weird, but it goes something like this: based on your scenarios, it should be almost certain that we can reach TAI with +12 OOMs of magnitude. If it's not the case, then there's something fundamentally difficult about reaching TAI with prosaic AGI (because you're basically throwing all the compute we want at it), and so I expect very little probability of a gain from 1 more OOM.
The part about this reasoning that feels weird is that I reason about 13 OOMs based on what happens at 12 OOMs, and the idea that we care about 13 OOMs iff 12 OOMs is not
Review of "Fun with +12 OOMs of Compute"

Thanks! Well, I agree that I didn't really do anything in my post to say how the "within 12 OOMs" credence should be distributed. I just said: If you distribute it like Ajeya does except that it totals to 80% instead of 50%, you should have short timelines.

There's a lot I could say about why I think within 6 OOMs should have significant probability mass (in fact, I think it should have about as much mass as the 7-12 OOM range). But for now I'll just say this: If you agree with me re Question Two, and put (say) 80%+ probability mass by +12 OOMs, but you als... (read more)

3 · Adam Shimi · 15d: Let me try to make an analogy with your argument. Say we want to make X. What you're saying is "with 10^12 dollars, we could do it that way". Why on earth would I update at all on whether it can be done with 10^6 dollars? If your scenario works with that amount, then you should have described it using only that much money. If it doesn't, then you're not providing evidence for the cheaper case.

Similarly here, if someone starts with a low credence on prosaic AGI, I can see how your arguments would make them put a bunch of probability mass close to +10^12 compute. But they have no reason to put probability mass anywhere far from that point, since the scenarios you give are tailored to that. And lacking an argument for why you can get that much compute in a short timeline, they probably end up thinking that if prosaic AGI ever happens, it's probably after every other option. Which seems like the opposite of the point you're trying to make.
Review of "Fun with +12 OOMs of Compute"

Thanks for doing this! I'm honored that you chose my post to review and appreciate all the thought you put into this.

I have one big objection: The thing you think this post assumes, this post does not actually assume. In fact, I don't even believe it! In more detail:

You say:

The relevance of this work appears to rely mostly on the hypothesis that the +12 OOMs of magnitude of compute and all relevant resources could plausibly be obtained in a short time frame. If not, then the arguments made by Daniel wouldn’t have the consequence of making
... (read more)
3 · Adam Shimi · 15d: You're welcome! (Talking only for myself here.) Rereading your post after seeing this comment: I personally misread this, and understood "the bulk of its mass at the 10^35 mark". The correct reading is more in line with what you're saying here. That's probably a reason why I personally focused on the +12 OOMs mark (I mean, that's also in the title).

So I agree we misunderstood some parts of your post, but I still think our issue remains. Except that instead of being about justifying +12 OOMs of magnitude in the short term, it becomes about justifying why the +12 OOMs examples should have any impact on, let's say, +6 OOMs. I personally don't feel like your examples give me an argument for anywhere but the +12 OOMs mark. That's where they live, and those examples seem to require that much compute, or still a pretty big amount of it. So reading your post makes me feel like I should have more probability mass at this mark or very close to it, but I don't see any reason to update the probability at the +6 OOMs mark, say. And if the +12 OOMs looks really far, as it does in my point of view, then that definitely doesn't make me update towards shorter timelines.
My AGI Threat Model: Misaligned Model-Based RL Agent
I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".

Ah, OK. We are on the same page then.

My AGI Threat Model: Misaligned Model-Based RL Agent

Thanks! Yeah, there are plenty of people who think takeoff will take more than a decade--but I guess I'll just say, I'm pretty sure they are all wrong. :) But we should take care to define what the start point of takeoff is. Traditionally it was something like "When the AI itself is doing most of the AI research," but I'm very willing to consider alternate definitions. I certainly agree it might take more than 10 years if we define things in such a way that takeoff has already begun.

Yeah, sorry, when I said "accidents" I
... (read more)
3 · Steve Byrnes · 17d: Oh sorry, I misread what you wrote. Sure, maybe, I dunno. I just edited the article to say "some number of years". I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".
My AGI Threat Model: Misaligned Model-Based RL Agent

Some nitpicks about your risk model slash ways in which my risk model differs from yours:

1. I think AIs are more likely to be more homogeneous on Earth; even in a slow takeoff they might be all rather similar to each other. Partly for the reasons Evan discusses in his post, and partly because of acausal shenanigans. I certainly think that, unfortunately, given all the problems you describe, we should count ourselves lucky if any of the contending AI factions are aligned to our values. I think this is an important research area.

2. I am perhaps more optimisti... (read more)

4 · Steve Byrnes · 17d: Thanks! For homogeneity, I guess I was mainly thinking that in the era of not-knowing-how-to-align-an-AGI, people would tend to try lots of different new things, because nothing so far has worked. I agree that once there's an aligned AGI, it's likely to get copied, and if new better AGIs are trained, people may be inclined to try to keep the procedure as close as possible to what's worked before.

I hadn't thought about whether different AGIs with different goals are likely to compromise vs fight. There's Wei Dai's argument that compromise is very easy with AGIs because they can "merge their utility functions". But at least this kind of AGI doesn't have a utility function ... maybe there's a way to do something like that with multiple parallel value functions, but I'm not sure that would actually work. There are also old posts about AGIs checking each other's source code for sincerity, but can they actually understand what they're looking at? Transparency is hard. And how do they verify that there isn't a backup stashed somewhere else, ready to jump out at a later date and betray the agreement? Also, humans have social instincts that AGIs don't, which pushes in both directions I think. And humans are easier to kill / easier to credibly threaten. I dunno. I'm not inclined to have confidence in any direction.

I agree that if a sufficiently smart misaligned AGI is running on a nice supercomputer somewhere, it would have every reason to try to stay right there and pursue its goals within that institution, and it would have every reason to try to escape and self-replicate elsewhere in the world. I guess we can be concerned about both. :-/
My AGI Threat Model: Misaligned Model-Based RL Agent

Great post! I think many of the things you say apply equally well to broader categories of scenario too, e.g. your AGI risk model stuff works (with some modification) for different AGI development models than the one you gave. I'd love to see people spell that out, lest skeptics read this post and reply "but that's not how AGI will be made, therefore this isn't a serious problem."

Assuming slow takeoff (again, fast takeoff is even worse), it seems to me that under these assumptions there would probably be a series of increasingly-w
... (read more)
4 · Steve Byrnes · 17d: I haven't thought very much about takeoff speeds (if that wasn't obvious!). But I don't think it's true that nobody thinks it will take more than a decade... Like, I don't think Paul Christiano is the #1 slowest of all slow-takeoff advocates. Isn't Robin Hanson slower? I forget.

Then a different question is "Regardless of what other people think about takeoff speeds, what's the right answer, or at least what's plausible?" I don't know. A key part is: I'm hazy on when you "start the clock". People were playing with neural networks in the 1990s but we only got GPT-3 in 2020. What were people doing all that time?? Well mostly, people were ignoring neural networks entirely, but they were also figuring out how to put them on GPUs, and making frameworks like TensorFlow and PyTorch and making them progressively easier to use and scale and parallelize, and finding all the tricks like BatchNorm and Xavier initialization and Transformers, and making better teaching materials and MOOCs to spread awareness of how these things work, developing new and better chips tailored to these algorithms (and vice-versa), waiting on Moore's law, and on and on.

I find it conceivable that we could get "glimmers of AGI" (in some relevant sense) in algorithms that have not yet jumped through all those hoops, so we're stuck with kinda toy examples for quite a while as we develop the infrastructure to scale these algorithms, the bag of tricks to make them run better, the MOOCs, the ASICs, and so on. But I dunno.

Yeah, sorry, when I said "accidents" I meant "the humans did something by accident", not "the AI did something by accident".
My research methodology

Thinking about politics may not be a failure mode; my question was whether it feels "extreme and somewhat strange," sorry for not clarifying. Like, suppose for some reason "doesn't think about politics" was on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would in that case be a failure mode. Would it be an extreme and somewhat strange one?

I'd be interested to hear more about the law-breaking stuff -- what is it about some laws that makes AI breaking them unsurprising/normal...

Against evolution as an analogy for how humans will create AGI

To make sure I understand: you are saying (a) that our AIs are fairly likely to get significantly more sample-efficient in the near future, and (b) even if they don't, there's plenty of data around.

I think (b) isn't a good response if you think that transformative AI will probably need to be human brain sized and you believe the scaling laws and you think that short-horizon training won't be enough. (Because then we'll need something like 10^30+ FLOP to train TAI, which is plausibly reachable in 20 years but probably not in 10. Tha...
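As a rough sanity check on that timing claim: the numbers below assume today's largest training runs are around 10^24 FLOP and that affordable training compute doubles roughly every year. Neither assumption is from the comment itself; they're illustrative stand-ins.

```python
# Rough check: how long until ~10^30 FLOP training runs are affordable?
# Assumptions (illustrative, not from the comment): largest runs today
# are ~10^24 FLOP, and affordable compute doubles about once a year.
import math

current_flop = 1e24        # assumed size of today's largest training runs
target_flop = 1e30         # assumed requirement for TAI in this scenario
doubling_time_years = 1.0  # assumed growth rate of affordable compute

doublings_needed = math.log2(target_flop / current_flop)
years_needed = doublings_needed * doubling_time_years
print(f"{doublings_needed:.1f} doublings ≈ {years_needed:.0f} years")
# → 19.9 doublings ≈ 20 years
```

Under these assumptions the gap closes in roughly 20 years, which matches the "plausibly reachable in 20 years but probably not in 10" framing; a slower doubling time pushes the answer out proportionally.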

My research methodology

OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?

(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to...

4 · Paul Christiano · 19d: I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons. I'm not really sure if or how this is a reductio. I don't think it's a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that's really all I want to say---that this failure seems preventable, and that intuition doesn't seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn't empirically contingent.
My research methodology
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). ... But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

I'd love to hear more about this. To me, "egregious misalignment" feels extremely natural/normal/expected, perhaps due to convergent instrumental goals. ...

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time.  Say, if you imagine somebody at Deepmind coming in without a lot of...

I think I'm responding to a more basic intuition, that if I wrote some code and its now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.

In some sense changing this view would change my bott...

My research methodology

Nice post! I'm interested to hear more about how your methodology differs from others. Does this breakdown seem roughly right?

1. Naive AI alignment: We are satisfied by an alignment scheme that can tell a story about how it works. (This is what I expect to happen in practice at many AI labs.)

2. Typical-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and still it doesn't seem like failure is the most likely outcome. (This is what I expect the better sort of AI labs, the ones with big well-respected safety tea...

I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it...

Against evolution as an analogy for how humans will create AGI
Maybe we won’t restart the inner algorithm from scratch every time we edit it, since it’s so expensive to do so. Instead, maybe once in a while we’ll restart the algorithm from scratch (“re-initialize to random weights” or something analogous), but most of the time, we’ll take whatever data structure holds the AI’s world-knowledge, and preserve it between one version of the inner algorithm and its successor. Doing that is perfectly fine and plausible, but again, the result doesn’t look like evolution;
...
1 · Steve Byrnes · 20d: Hmm, if you don't know which bits are the learning algorithm and which are the learned content, and they're freely intermingling, then I guess you could try randomizing different subsets of the bits in your algorithm, and see what happens, or something, and try to figure it out. This seems like a computationally-intensive and error-prone process, to me, although I suppose it's hard to know. Also, which is which could be dynamic, and there could be bits that are not cleanly in either category. If you get it wrong, then you're going to wind up updating the knowledge instead of the learning algorithm, or get bits of the learning algorithm that are stuck in a bad state but you're not editing them because you think they're knowledge. I dunno. I guess that's not a disproof, but I'm going to stick with "unlikely". With enough compute, can't rule anything out—you could do a blind search over assembly code! I tend to think that more compute-efficient paths to AGI are far likelier to happen than less compute-efficient paths to AGI, other things equal, because the less compute that's needed, the faster you can run experiments, and the more people are able to develop and experiment with the algorithms. Maybe one giant government project can do a blind search over assembly code, but thousands of grad students and employees can run little experiments in less computationally expensive domains.
Against evolution as an analogy for how humans will create AGI
Incidentally, I think GPT-3 is great evidence that human-legible learning algorithms are up to the task of directly learning and using a common-sense world-model. I’m not saying that GPT-3 is necessarily directly on the path to AGI; instead I’m saying, How can you look at GPT-3 (a simple learning algorithm with a ridiculously simple objective) and then say, “Nope! AGI is way beyond what human-legible learning algorithms can do! We need a totally different path!”?

I think the response would be, "GPT-3 may have learned an aw...

3 · Steve Byrnes · 20d: Good question! A kinda generic answer is: (1) Transformers were an advance over previous learning algorithms, and by the same token I expect that yet-to-be-invented learning algorithms will be an advance over Transformers; (2) Sample-efficient learning is AFAICT a hot area that lots of people are working on; (3) We do in fact actually have impressively sample-efficient algorithms even if they're not as well-developed and scalable as others at the moment—see my discussion of analysis-by-synthesis; (4) Given that predictive learning offers tons of data, it's not obvious how important sample-efficiency is. More detailed answer: I agree that in the "intelligence via online learning" paradigm I mentioned, you really want to see something once and immediately commit it to memory. Hard to carry on a conversation otherwise! The human brain has two main tricks for this (that I know of).
* There's a giant structured memory (predictive world-model) in the neocortex, and a much smaller unstructured memory in the hippocampus, and the latter is basically just an auto-associative memory (with a pattern separator to avoid cross-talk) that memorizes things. Then it can replay it when appropriate. And just like replay learning in ML, or like doing multiple passes through your training data in ML, relevant information can gradually transfer from the unstructured memory to the structured one by repeated replays.
* Because the structured memory is in the analysis-by-synthesis paradigm (i.e. searching for a generative model that matches the data), it inherently needs less training data, because its inductive biases are a closer match to reality. It's a harder search problem to build the right generative model when you're learning, and it's a harder search problem to find the r
Against evolution as an analogy for how humans will create AGI
Note that evolution is not in this picture: its role has been usurped by the engineers who wrote the PyTorch code. This is intelligent design, not evolution!

IMO you should put evolution in the picture, as another part of the analogy! :) Make a new row at the top, with "Genomes evolving over millions of generations on a planet, as organisms with better combinations of genes outcompete others" on the left and "Code libraries evolving over thousands of days in an industry, as software/ANN's with better code outcompete (in the economy, in the academic prestige competition, in the minds of individual researchers) others" on the right. (Or some shortened version of that)

Four Motivations for Learning Normativity

Thanks! Well, I for one am feeling myself get nerd-sniped by this agenda. I'm resisting so far (so much else to do! Besides, this isn't my comparative advantage) but I'll definitely be reading your posts going forward and if you ever want to bounce ideas off me in a call I'd be down. :)

Four Motivations for Learning Normativity
To be meaningful, this requires whole-process feedback: we need to judge thoughts by their entire chain of origination. (This is technically challenging, because the easiest way to implement process-level feedback is to create a separate meta-level which oversees the rest of the system; but then this meta-level would not itself be subject to oversight.)

I'd be interested to hear this elaborated further. It seems to me to be technically challenging but not very; it feels like the sort of thing that we could probably solve with a couple people working ...

4 · Abram Demski · 21d:
1. I agree; I'm not claiming this is a multi-year obstacle even. Mainly I included this line because I thought "add a meta-level" would be what some readers would think, so, I wanted to emphasize that that's not a solution.
2. To elaborate on the difficulty: this is challenging because of the recursive nature of the request. Roughly, you need hypotheses which not only claim things at the object level but also hypothesize a method of hypothesis evaluation, i.e. make claims about process-level feedback. Your belief distribution then needs to incorporate these beliefs. (So how much you endorse a hypothesis can depend on how much you endorse that very hypothesis!) And, on top of that, you need to know how to update that big mess when you get more information. This seems sort of like it has to violate Bayes' Law, because when you make an observation, it'll not only shift hypotheses via likelihood ratios to that observation, but also, produce secondary effects where hypotheses get shifted around because other hypotheses which like/dislike them got shifted around. How all of this should work seems quite unclear.
3. Part of the difficulty is doing this in conjunction with everything else, though. Asking for 1 thing that's impossible in the standard paradigm might have an easy answer. Asking for several, each might individually have easy answers, but combining those easy answers might not be possible.
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Update: According to this the human brain actually is getting ~10^7 bits of data every second, although the highest level conscious awareness is only processing ~50. So insofar as we go with the "tokens" definition, it does seem that the human brain is processing plenty of tokens for its parameter count -- 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the "single pass through the network" definition, which would mean we are looking for about 10^12... then we get a small discrepancy; the max...
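A quick back-of-the-envelope version of that lifetime figure, taking the ~10^7 bits/second estimate at face value and assuming an illustrative ~30 years of accumulation:

```python
# Rough check of the "~10^16 bits over a lifetime" figure.
# Assumptions: ~10^7 bits/s of sensory input (the estimate cited above),
# accumulated over ~30 years (illustrative; not specified in the comment).
bits_per_second = 1e7
seconds = 30 * 365 * 24 * 3600   # ~9.5e8 seconds in 30 years
total_bits = bits_per_second * seconds
print(f"{total_bits:.1e}")       # → 9.5e+15, i.e. on the order of 10^16
```

So the order-of-magnitude claim holds for any plausible lifetime length; even a decade of input already lands within a factor of a few of 10^16.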

AI x-risk reduction: why I chose academia over industry

Makes sense. I think we don't disagree dramatically then.

I also think TAI is a less important category for me than x-risk inducing AI.

Also makes sense -- just checking, does x-risk-inducing AI roughly match the concept of "AI-induced potential point of no return" or is it importantly different? It's certainly less of a mouthful so if it means roughly the same thing maybe I'll switch terms. :)

1 · David Krueger · 1mo: um sorta modulo a type error... risk is risk. It doesn't mean the thing has happened (we need to start using some sort of phrase like "x-event" or something for that, I think).
AI x-risk reduction: why I chose academia over industry

When you say academia looks like a clear win within 5-10 years, is that assuming "academia" means "starting a tenure-track job now?" If instead one is considering whether to begin a PhD program, for example, would you say that the clear win range is more like 10-15 years?

Also, how important is being at a top-20 institution? If the tenure track offer was instead from University of Nowhere, would you change your recommendation and say go to industry?

Would you agree that if the industry project you could work on is the one that will eventually build TAI (or be one of the leading builders, if there are multiple) then you have more influence from inside than from outside in academia?

3 · David Krueger · 1mo: Yes. My cut-off was probably somewhere between top-50 and top-100, and I was prepared to go anywhere in the world. If I couldn't make it into the top 100, I think I would definitely have reconsidered academia. If you're ready to go anywhere, I think it makes it much easier to find somewhere with high EV (but might have to move up the risk/reward curve a lot). Yes. But ofc it's hard to know if that's the case. I also think TAI is a less important category for me than x-risk inducing AI.
Suggestions of posts on the AF to review

I'm more interested in feedback on the +12 OOMs one because it's more decision-relevant. It's more of a fuzzy thing, not crunchy logic like the first one I recommended, and therefore less suitable for your purposes (or so I thought when I first answered your question; now I am not sure)

Suggestions of posts on the AF to review

Insofar as you want to do others of mine, my top recommendation would be this one since it got less feedback than I expected and is my most important timelines-related post of all time IMO.

1 · Adam Shimi · 1mo: If we do only one, which one do you think matters the most?
Epistemological Framing for AI Alignment Research
This list of benefits logically pushed multiple people to argue that we should make AI Alignment paradigmatic.

Who? It would be helpful to have some links so I can go read what they said.

I disagree. Or to be more accurate, I agree that we should have paradigms in the field, but I think that they should be part of a bigger epistemological structure. Indeed, a naive search for a paradigm either results in a natural science-like paradigm, that puts too little emphasis on applications and usefulness, or in a premature constraint on the problem we're try
...
3 · Adam Shimi · 1mo: Thanks for the feedback! That was one of my big frustrations when writing this post: I only saw this topic pop up in personal conversation, not really in published posts. And so I didn't want to give names of people who just discussed that with me on a zoom call or in a chat. But I totally feel you -- I'm always annoyed by posts that pretend to answer a criticism without pointing to it. That's a really impressive comment, because my last rewrite of the post was exactly to hint that this was the "right way" (in my opinion) to make the field paradigmatic, instead of arguing that AI Alignment should be made paradigmatic (what my previous draft attempted). So I basically agree with what you say. If I agreed with what you wrote before, this part strikes me as quite different from what I'm saying. Or more like you're only focusing on one aspect. Because I actually argue for two things:
* That we should have a paradigm of the "AIs" part, a paradigm of the "well-behaved" part, and from that we get a paradigm of the solving part. This has nothing to do with the field being young and/or confused, and all about the field being focused on solving a problem. (That's the part I feel your version is missing)
* That in the current state of our knowledge, fixing those paradigms is too early; we should instead do more work on comparing and extending multiple paradigms for each of the "slots" from the previous point, and similarly have a go at solving different variants of the problem. That's what you mostly get right. It's partly my fault, because I'm not stating it that way.
My point about this is that thinking of your examples as "big paradigms of AI" is the source of the confusion, and a massive problem within the field. If we use my framing instead, then you can split these big proposals into their paradigm for "AIs", their paradigm for "well-behaved", and so the paradigm for the solving part. This actually shows you where they agree and where they
Open Problems with Myopia

Welp, this scoops a bunch of the stuff in my "Why acausal trade matters" chapter. :D Nice!

The DDT idea amuses me. I guess it's maybe the best shot we have, but boy do I get a sense of doom when I imagine that the fate of the world depends on our ability to control/steer/oversee AIs as they become more capable than us in many important ways via keeping them dumb in various other important ways. I guess there's that thing the crocodile wrestlers do where you hold their mouth shut since their muscles for opening are much weaker than their ...

One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, which is "dumb" according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do do acausal trade and thinking they're just sacrificing utility for no reason.

There is some slight awkwardness in that the decision problems agents in this universe actually encounter means that UDT agents will get higher utility than DDT agents.

I agree that the maximum a posterior world doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.

4 · Evan Hubinger · 1mo: Fwiw, I also agree with Adele and Eliezer here and just didn't see Eliezer's comment when I was giving my comments.
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

I disagree with Eliezer here:

Chance of discovering or verifying long-term solution(s): I’m not sure whether a “one shot” solution to alignment (that is, a single relatively “clean” algorithm which will work at all scales including for highly superintelligent models) is possible. But if it is, it seems like starting to do a lot of work on aligning narrowly superhuman models probably allows us to discover the right solution sooner than we otherwise would have.
Eliezer Yudkowsky: It's not possible.  Not for us, any
...
5 · Adele Lopez · 1mo: My guess is that a "clean" algorithm is still going to require multiple conceptual insights in order to create it. And typically, those insights are going to be found before we've had time to strip away the extraneous ideas in order to make it clean, which requires additional insights. Combine this with the fact that at least some of these insights are likely to be public knowledge and relevant to AGI, and I think Eliezer has the right idea here.
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

I think I agree with Eliezer here, but I'm worried I misunderstand something:

Eliezer Yudkowsky: "Pessimal" is a strange word to use for this apt description of humanity's entire experience with ML to date.  Unless by "generalize" you mean "generalize correctly to one new example from the same distribution" rather than "generalize the underlying concept that a human would".
Ajeya Cotra: I used "pessimal" here in the technical sense that it's assuming if there are N generalizations equally
...
3 · Ajeya Cotra · 1mo: The conceptual work I was gesturing at here is more Paul's work, since MIRI's work (afaik) is not really neural net-focused. It's true that Paul's work also doesn't assume a literal worst case; it's a very fuzzy concept I'm gesturing at here. It's more like, Paul's research process is to a) come up with some procedure, b) try to think of any "plausible" set of empirical outcomes that cause the procedure to fail, and c) modify the procedure to try to address that case. (The slipperiness comes in at the definition of "plausible" here, but the basic spirit of it is to "solve for every case" in the way theoretical CS typically aims to do in algorithm design, rather than "solve for the case we'll in fact encounter.")
Fun with +12 OOMs of Compute

Hmmm, it seems we aren't on the same page. (The argument sketch you just made sounds to me like a collection of claims which are either true but irrelevant, or false, depending on how I interpret them.) I'll go back and reread Ajeya's report (or maybe talk to her?) and then maybe we'll be able to get to the bottom of this. Maybe my birds/brains/etc. post directly contradicts something in her report after all.

Fun with +12 OOMs of Compute
(Btw, I have similar feelings about the non-Neuromorph answers too; but "idk I'm not really compelled by this" didn't seem like a particularly constructive comment.)

On the contrary, I've been very (80%?) surprised by the responses so far -- in the Elicit poll, everyone agrees with me! I expected there to be a bunch of people with answers like "10%" and "20%" and then an even larger bunch of people with answers like "50%" (that's what I expected you, Ajeya, etc. to chime in and say). Instead, wel...

3 · Rohin Shah · 1mo: The positive reason is basically all the reasons given in Ajeya's report? Like, we don't tend to design much better artifacts than evolution (currently), the evolution-designed artifact is expensive, and reproducing it using today's technology looks like it will need more than 12 OOMs. I don't think the birds/brains/etc post contradicts this reason, as I said before (and you seemed to agree).
Fun with +12 OOMs of Compute

At this point I guess I just say I haven't looked into the worm literature enough to say. I can't tell from the post alone whether we've neuromorphed the worm yet or not.

"Qualitatively as impressive as a worm" is a pretty low bar, I think. We have plenty of artificial neural nets that are much more impressive than worms already, so I guess the question is whether we can make one with only 302 neurons that is as impressive as a worm... e.g. can it wriggle in a way that moves it around, can it move away from sources of damage and to...

4 · Rohin Shah · 1mo: Meh, I don't think it's a worthwhile use of my time to read that literature, but I'd make a bet if we could settle on an operationalization and I didn't have to settle it. I mostly expect that you realize that there were a bunch of things that were super underspecified and they don't have obvious resolutions, and if you just pick a few things then nothing happens and you get gibberish eternally, and if you search over all the underspecified things you run out of your compute budget very quickly. Some things that might end up being underspecified:
* How should neurons be connected to each other? Do we just have a random graph with some average degree of connections, or do we need something more precise?
* How are inputs connected to the brain? Do we just simulate some signals to some input neurons, that are then propagated according to the physics of neurons? How many neurons take input? How are they connected to the "computation" neurons?
* To what extent do we need to simulate other aspects of the human body that affect brain function? Which hormone receptors do we / don't we simulate? For the ones we do simulate, how do we determine what their inputs are? Or do we have to simulate an entire human body (would be way, way more flops)?
* How do we take "random draws" of a new brain? Do we need to simulate the way that DNA builds up the brain during development?
* Should we build brains that are like that of a human baby, or a human adult, given that the brain structure seems to change between these?
I'm not saying any of these things will be the problem. I'm saying that there will be some sort of problem like this (probably many such problems), that I'm probably not going to find reasoning from my armchair. I also wouldn't really change my mind if you had convincing rebuttals to each of them, because the underlying generator is "there are lots of details; some will be devilishly difficult to handle; you only find those by actuall
Fun with +12 OOMs of Compute

Good question! Here's my answer:

--I think Neuromorph has the least chance of succeeding of the five. Still more than 50% though IMO. I'm not at all confident in this.

--Neuromorph =/= an attempt to create uploads. I would be extremely surprised if the resulting AI was recognizably the same person as was scanned. I'd be mildly surprised if it even seemed human-like at all, and this is conditional on the project working. What I imagine happening conditional on the project working is something like: After a few generations of selection, we get ...

5 · Rohin Shah · 1mo: My impression is that the linked blog post is claiming we haven't even been able to get things that are qualitatively as impressive as a worm. So why would we get things that are qualitatively as impressive as a human? I'm not claiming it has to be an upload. I could believe this (based on the argument you mentioned) but it really feels like "maybe this could be true but I'm not that swayed from my default prior of 'it's probably as easy to simulate per neuron'". Also if it were 100x harder, it would cost... $300. Still super cheap. That's what the genetic algorithm is? It probably wasn't run with 3e17 flops, since compute was way more expensive then, but that's at least evidence that researchers do in fact consider this approach.
Fun with +12 OOMs of Compute

[1] AlphaStar was 10^8 parameters, ten times smaller than a honeybee brain. I think this puts its capabilities in perspective. Yes, it seemed to be more of a heuristic-executor than a long-term planner, because it could occasionally be tricked into doing stupid things repeatedly. But the same is true for insects.

Fun with +12 OOMs of Compute

[2] This is definitely true for Transformers (and LSTMs I think?), but it may not be true for whatever architecture AlphaStar uses. In particular some people I talked to worry that the vanishing gradients problem might make bigger RL models like OmegaStar actually worse. However, everyone I talked to agreed with the “probably”-qualified version of this claim. I’m very interested to learn more about this.

Fun with +12 OOMs of Compute

[3] To avoid catastrophic forgetting, let’s train OmegaStar on all these different games simultaneously, e.g. it plays game A for a short period, then plays game B, then C, etc. and loops back to game A only much later.
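The interleaving schedule described in this footnote can be sketched as a simple round-robin loop. Everything here is a toy stand-in: `train_step` represents one update on a batch from the given game, and the task names are hypothetical.

```python
# Minimal sketch of interleaved multi-task training to limit catastrophic
# forgetting: cycle through tasks, training briefly on each, so that no
# task's gradients are absent for long stretches.
from itertools import cycle

def interleaved_training(tasks, train_step, total_steps, steps_per_task=100):
    """Round-robin over tasks: a short burst on each, then loop back around."""
    schedule = cycle(tasks)
    steps_done = 0
    while steps_done < total_steps:
        task = next(schedule)
        for _ in range(min(steps_per_task, total_steps - steps_done)):
            train_step(task)   # stand-in for one gradient update on this task
            steps_done += 1

# Example: record the order in which tasks are visited.
log = []
interleaved_training(["A", "B", "C"], log.append, total_steps=600, steps_per_task=100)
# log contains 100 steps of A, then B, then C, then wraps back to A, ...
```

The design choice here is the burst length: shorter bursts mean less forgetting between visits to a task, at the cost of whatever per-switch overhead the training setup incurs.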

Fun with +12 OOMs of Compute

[4] Lukas Finnveden points out that Gwern’s extrapolation is pretty weird. Quoting Lukas: “Gwern takes GPT-3's current performance on lambada; assumes that the loss will fall as fast as it does on "predict-the-next-word" (despite the fact that the lambada loss is currently falling much faster!) and extrapolates current performance (without adjusting for the expected change in scaling law after the crossover point) until the point where the AI is as good as humans (and btw we don't have a source for the stated human performance)

I'd endorse a summary more li...
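The kind of extrapolation being criticized here can be made concrete with a toy power-law model. All constants below are made up for illustration; they are not Gwern's or Lukas's numbers.

```python
# Illustrative version of the disputed extrapolation: assume the loss
# follows a power law in compute C, loss(C) = a * C^(-b), then solve for
# the compute at which loss reaches a target ("human-level") value.
# All constants are invented for illustration only.
import math

a, b = 50.0, 0.05       # assumed power-law coefficients
target_loss = 1.0       # assumed "human-level" loss

# loss = a * C^(-b)  =>  C = (a / target_loss)^(1/b)
required_compute = (a / target_loss) ** (1 / b)
print(f"required compute ~ 10^{math.log10(required_compute):.0f} (arbitrary units)")
```

This makes Lukas's point vivid: the answer is exponentially sensitive to the assumed exponent `b` and to whether the scaling law holds past the crossover point, so small disagreements about the fit translate into many orders of magnitude in the conclusion.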

Fun with +12 OOMs of Compute

[5] One might worry that the original paper had a biased sample of tasks. I do in fact worry about this. However, this paper tests GPT-3 on a sample of actual standardized tests used for admission to colleges, grad schools, etc. and GPT-3 exhibits similar performance (around 50% correct), and also shows radical improvement over smaller versions of GPT.
