All of Eliezer Yudkowsky's Comments + Replies

Biology-Inspired AGI Timelines: The Trick That Never Works

It does fit well there, but I think it was more inspired by the person I met who thought I was being way too arrogant by not updating in the direction of OpenPhil's timeline estimates to the extent I was uncertain.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Maybe another way of phrasing this - how much warning do you expect to get, how far out does your Nope Vision extend?  Do you expect to be able to say "We're now in the 'for all I know the IMO challenge could be won in 4 years' regime" more than 4 years before it happens, in general?  Would it be fair to ask you again at the end of 2022 and every year thereafter if we've entered the 'for all I know, within 4 years' regime?

Added:  This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics wou... (read more)

I think I'll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I'll get much more confident if we scale up from $1M -> $1B and pick the low hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict

There's not really a constant time horizon for my pessimism, it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon, because theorem-proving has not had much investment so compute can be scaled up several orde... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

I also think human brains are better than elephant brains at most things - what did I say that sounded otherwise?

2Paul Christiano6dOops, this was in reference to the later part of the discussion where you disagreed with "a human in a big animal body, with brain adapted to operate that body instead of our own, would beat a big animal [without using tools]".
Yudkowsky and Christiano discuss "Takeoff Speeds"

Okay, then we've got at least one Eliezerverse item, because I've said below that I think I'm at least 16% for IMO theorem-proving by end of 2025.  The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that's irrelevant because it's first-order estimates we should be comparing here.

So we've got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse?  Pretty bare port... (read more)

I think IMO gold medal could be well before massive economic impact, I'm just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%, it's hard reasoning about the tails. 

I'd say <4% on end of 2025.

I think this is the flipside of me having an intuition where I say things like "AlphaGo and GPT-3 aren't that surprising"---I have a sense for what things are and aren't surprising, and not many things happen that are... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I expect it to be hella difficult to pick anything where I'm at 75% that it happens in the next 5 years and Paul is at 25%.  Heck, it's not easy to find things where I'm at over 75% that aren't just obvious slam dunks; the Future isn't that easy to predict.  Let's get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.

I frame no prediction about whether Paul is under 16%.  That's a separate matter.  I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I'm like "lol wut" and double their probability, even if it turns out that Paul agrees with me about it.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Ha!  Okay then.  My probability is at least 16%, though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more.  Paul?

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probabilit... (read more)

I don't care about whether the AI is open-sourced (I don't expect anyone to publish the weights even if they describe their method) and I'm not that worried about our ability to arbitrate overfitting.

Ajeya suggested that I clarify: I'm significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I'd guess about 5-10% could be done with no deep learning and m... (read more)

2Matthew Barnett6dIf this task is bad for operationalization reasons, there are other theorem proving benchmarks [https://paperswithcode.com/task/automated-theorem-proving]. Unfortunately it looks like there aren't a lot of people that are currently trying to improve on the known benchmarks, as far as I'm aware. The code generation benchmarks [https://paperswithcode.com/task/code-generation] are slightly more active. I'm personally partial to Hendrycks et al.'s APPS benchmark [https://arxiv.org/pdf/2105.09938v3.pdf], which includes problems that "range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability." (Github link [https://github.com/hendrycks/apps]).
4Matthew Barnett6dIt feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul's camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal). The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul's model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul's. Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021 [https://www.imo-official.org/year_individual_r.aspx?year=2021]
Christiano, Cotra, and Yudkowsky on AI progress

Mostly, I think the Future is not very predictable in some ways, and this extends to, for example, it being the possible that 2022 is the year where we start Final Descent and by 2024 it's over, because it so happened that although all the warning signs were Very Obvious In Retrospect they were not obvious in antecedent and so stuff just started happening one day.  The places where I dare to extend out small tendrils of prediction are the rare exception to this rule; other times, people go about saying, "Oh, no, it definitely couldn't start in 2022" a... (read more)

I'm mostly not looking for virtue points, I'm looking for: (i) if your view is right then I get some kind of indication of that so that I can take it more seriously, (ii) if your view is wrong then you get some indication feedback to help snap you out of it.

I don't think it's surprising if a GPT-3 sized model can do relatively good translation. If talking about this prediction, and if you aren't happy just predicting numbers for overall value added from machine translation, I'd kind of like to get some concrete examples of mediocre translations or concrete problems with existing NMT that you are predicting can be improved.

Christiano, Cotra, and Yudkowsky on AI progress

If they've found some way to put a lot more compute into GPT-4 without making the model bigger, that's a very different - and unnerving - development.

Yudkowsky and Christiano discuss "Takeoff Speeds"

(I'm currently slightly hopeful about the theorem-proving thread, elsewhere and upthread.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I have a sense that there's a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me - that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven't read.  I feel like I "would not be surprised at all" if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO chal... (read more)

Yes, IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though could also go down if I thought about it a lot or if commenters brought up relevant considerations, e.g. I'd look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I'm also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low hanging fruit and scaling up spending by a few orders of magnitude, but I still don't expect i... (read more)

I feel like I "would not be surprised at all" if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024

Possibly helpful: Metaculus currently puts the chances of the IMO grand challenge falling by 2025 at about 8%. Their median is 2039.

I think this would make a great bet, as it would definitely show that your model can strongly outperform a lot of people (and potentially Paul too). And the operationalization for the bet is already there -- so little work will be needed to do that part.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I kind of want to see you fight this out with Gwern (not least for social reasons, so that people would perhaps see that it wasn't just me, if it wasn't just me).

But it seems to me that the very obvious GPT-5 continuation of Gwern would say, "Gradualists can predict meaningless benchmarks, but they can't predict the jumpy surface phenomena we see in real life."  We want to know when humans land on the moon, not whether their brain sizes continued on a smooth trend extrapolated over the last million years.

I think there's a very real sense in which, yes... (read more)

But it seems to me that the very obvious GPT-5 continuation of Gwern would say, "Gradualists can predict meaningless benchmarks, but they can't predict the jumpy surface phenomena we see in real life."

Don't you think you're making a falsifiable prediction here?

Name something that you consider part of the "jumpy surface phenomena" that will show up substantially before the world ends (that you think Paul doesn't expect). Predict a discontinuity. Operationalize everything and then propose the bet.

Christiano, Cotra, and Yudkowsky on AI progress

I don't necessarily expect GPT-4 to do better on perplexity than would be predicted by a linear model fit to neuron count plus algorithmic progress over time; my guess for why they're not scaling it bigger would be that Stack More Layers just basically stopped scaling in real output quality at the GPT-3 level.  They can afford to scale up an OOM to 1.75 trillion weights, easily, given their funding, so if they're not doing that, an obvious guess is that it's because they're not getting a big win from that.  As for their ability to then make algor... (read more)

5Lukas Finnveden6dWhile GPT-4 wouldn't be a lot bigger than GPT-3, Sam Altman did indicate that it'd use a lot more compute. That's consistent with Stack More Layers still working; they might just have found an even better use for compute. (The increased compute-usage also makes me think that a Paul-esque view would allow for GPT-4 to be a lot more impressive than GPT-3, beyond just modest algorithmic improvements.)
Christiano, Cotra, and Yudkowsky on AI progress

My memory of the past is not great in general, but considering that I bet sums of my own money and advised others to do so, I am surprised that my memory here would be that bad, if it was.

Neither GJO nor Metaculus are restricted to only past superforecasters, as I understand it; and my recollection is that superforecasters in particular, not all participants at GJO or Metaculus, were saying in the range of 20%.  Here's an example of one such, which I have a potentially false memory of having maybe read at the time: https://www.gjopen.com/comments/118530

1Matthew Barnett6dThanks for clarifying. That makes sense that you may have been referring to a specific subset of forecasters. I do think that some forecasters tend to be much more reliable than others (and maybe there was/is a way to restrict to "superforecasters" in the UI). I will add the following piece of evidence, which I don't think counts much for or against your memory, but which still seems relevant. Metaculus shows a histogram of predictions. On the relevant question [https://www.metaculus.com/questions/112/will-googles-alphago-beat-go-player-lee-sedol-in-march-2016/] , a relatively high fraction of people put a 20% chance, but it also looks like over 80% of forecasters put higher credences.
Christiano, Cotra, and Yudkowsky on AI progress

I feel like the biggest subjective thing is that I don't feel like there is a "core of generality" that GPT-3 is missing

I just expect it to gracefully glide up to a human-level foom-ing intelligence

This is a place where I suspect we have a large difference of underlying models.  What sort of surface-level capabilities do you, Paul, predict that we might get (or should not get) in the next 5 years from Stack More Layers?  Particularly if you have an answer to anything that sounds like it's in the style of Gwern's questions, because I think those a... (read more)

5Paul Christiano6dI agree we seem to have some kind of deeper disagreement here. I think stack more layers + known training strategies (nothing clever) + simple strategies for using test-time compute (nothing clever, nothing that doesn't use the ML as a black box) can get continuous improvements in tasks like reasoning (e.g. theorem-proving), meta-learning (e.g. learning to learn new motor skills), automating R&D (including automating executing ML experiments, or proposing new ML experiments), or basically whatever. I think these won't get to human level in the next 5 years. We'll have crappy versions of all of them. So it seems like we basically have to get quantitative. If you want to talk about something we aren't currently measuring, then that probably takes effort, and so it would probably be good if you picked some capability where you won't just say "the Future is hard to predict." (Though separately I expect to make somewhat better predictions than you in most of these domains.) A plausible example is that I think it's pretty likely that in 5 years, with mere stack more layers + known techniques (nothing clever), you can have a system which is clearly (by your+my judgment) "on track" to improve itself and eventually foom, e.g. that can propose and evaluate improvements to itself, whose ability to evaluate proposals is good enough that it will actually move in the right direction and eventually get better at the process, etc., but that it will just take a long time for it to make progress. I'd guess that it looks a lot like a dumb kid in terms of the kind of stuff it proposes and its bad judgment (but radically more focused on the task and conscientious and wise than any kid would be). Maybe I think that's 10% unconditionally, but much higher given a serious effort. My impression is that you think this is unlikely without adding in some missing secret sauce to GPT, and that my picture is generally quite different from your criticallity-flavored model of takeoff.

If you give me 1 or 10 examples of surface capabilities I'm happy to opine. If you want me to name industries or benchmarks, I'm happy to opine on rates of progress. I don't like the game where you say "Hey, say some stuff. I'm not going to predict anything and I probably won't engage quantitatively with it since I don't think much about benchmarks or economic impacts or anything else that we can even talk about precisely in hindsight for GPT-3."

I don't even know which of Gwern's questions you think are interesting/meaningful. "Good meta-learning"--I don't... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

The crazy part is someone spending $1B and then generating $100B/year in revenue (much less $100M and then taking over the world).

Would you say that this is a good description of Suddenly Hominids but you don't expect that to happen again, or that this is a bad description of hominids?

4Paul Christiano7dIt's not a description of hominids at all, no one spent any money on R&D. I think there are analogies where this would be analogous to hominids (which I think are silly, as we discuss in the next part of this transcript). And there are analogies where this is a bad description of hominids (which I prefer).
Yudkowsky and Christiano discuss "Takeoff Speeds"

Thanks for continuing to try on this!  Without having spent a lot of labor myself on looking into self-driving cars, I think my sheer impression would be that we'll get $1B/yr waifutech before we get AI freedom-of-the-road; though I do note again that current self-driving tech would be more than sufficient for $10B/yr revenue if people built new cities around the AI tech level, so I worry a bit about some restricted use-case of self-driving tech that is basically possible with current tech finding some less regulated niche worth a trivial $10B/yr. &nb... (read more)

3Paul Christiano6dYes, I think that value added by automated translation will follow a similar pattern. Number of words translated is more sensitive to how you count and random nonsense, as is number of "users" which has even more definitional issues. You can state a prediction about self-driving cars in any way you want. The obvious thing is to talk about programs similar to the existing self-driving taxi pilots (e.g. Waymo One) and ask when they do $X of revenue per year, or when $X of self-driving trucking is done per year. (I don't know what AI freedom-of-the-road means, do you mean something significantly more ambitious than self-driving trucks or taxis?)
Yudkowsky and Christiano discuss "Takeoff Speeds"

I think you are underconfident about the fact that almost all AI profits will come from areas that had almost-as-much profit in recent years. So we could bet about where AI profits are in the near term, or try to generalize this.

I wouldn't be especially surprised by waifutechnology or machine translation jumping to newly accessible domains (the thing I care about and you shrug about (until the world ends)), but is that likely to exhibit a visible economic discontinuity in profits (which you care about and I shrug about (until the world ends))?  There'... (read more)

4Paul Christiano7dMan, the problem is that you say the "jump to newly accessible domains" will be the thing that lets you take over the world. So what's up for dispute is the prototype being enough to take over the world rather than years of progress by a giant lab on top of the prototype. It doesn't help if you say "I expect new things to sometimes become possible" if you don't further say something about the impact of the very early versions of the product. If e.g. people were spending $1B/year developing a technology, and then after a while it jumps from 0/year to $1B/year of profit, I'm not that surprised. (Note that machine translation is radically smaller than this, I don't know the numbers.) I do suspect they could have rolled out a crappy version earlier, perhaps by significantly changing their project. But why would they necessarily bother doing that? For me this isn't violating any of the principles that make your stories sound so crazy. The crazy part is someone spending $1B and then generating $100B/year in revenue (much less $100M and then taking over the world). (Note: it is surprising if an industry is spending $10T/year on R&D and then jumps from $1T --> $10T of revenue in one year in a world that isn't yet growing crazily. The surprising depends a lot on the numbers involved, and in particular on how valuable it would have been to deploy a worse version earlier and how hard it is to raise money at different scales.)

I'd be happy to disagree about romantic chatbots or machine translation. I'd have to look into it more to get a detailed sense in either, but I can guess. I'm not sure what "wouldn't be especially surprised" means, I think to actually get disagreements we need way more resolution than that so one question is whether you are willing to play ball (since presumably you'd also have to looking into to get a more detailed sense). Maybe we could save labor if people would point out the empirical facts we're missing and we can revise in light of that, but we'd sti... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

And to say it also explicitly, I think this is part of why I have trouble betting with Paul.  I have a lot of ? marks on the questions that the Gwern voice is asking above, regarding them as potentially important breaks from trend that just get dumped into my generalized inbox one day.  If a gradualist thinks that there ought to be a smooth graph of perplexity with respect to computing power spent, in the future, that's something I don't care very much about except insofar as it relates in any known way whatsoever to questions like those the Gwer... (read more)

This seems totally bogus to me.

It feels to me like you mostly don't have views about the actual impact of AI as measured by jobs that it does or the $s people pay for them, or performance on any benchmarks that we are currently measuring, while I'm saying I'm totally happy to use gradualist metrics to predict any of those things. If you want to say "what does it mean to be a gradualist" I can just give you predictions on them. 

To you this seems reasonable, because e.g. $ and benchmarks are not the right way to measure the kinds of impacts we care abou... (read more)

What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don't relate in known ways to the trend lines that are smooth?

Perplexity is one general “intrinsic” measure of language models, but there are many task-specific measures too. Studying the relationship between perplexity and task-specific measures is an important part of the research process. We shouldn’t speak as if people do not actively try to uncover these relationships.

I would generally be surprised if there were many highly non-li... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.

Well, it sounds like I might be more bullish than you on theorem-proving, possibly.  Not on it being useful or profitable, but in terms of underlying technology making progr... (read more)

I'm going to make predictions by drawing straight-ish lines through metrics like the ones in the gpt-f paper. Big unknowns are then (i) how many orders of magnitude of "low-hanging fruit" are there before theorem-proving even catches up to the rest of NLP? (ii) how hard their benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is "they are pretty easy" / "humans are pretty bad at these tasks," but it's somewhat harder to quantify. If you think your methodology is different from that then we will probably end u... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I feel a bit confused about where you think we meta-disagree here, meta-policy-wise.  If you have a thesis about the sort of things I'm liable to disagree with you about, because you think you're more familiar with the facts on the ground, can't you write up Paul's View of the Next Five Years and then if I disagree with it better yet, but if not, you still get to be right and collect Bayes points for the Next Five Years?

I mean, it feels to me like this should be a case similar to where, for example, I think I know more about macroeconomics than your t... (read more)

I think you think there's a particular thing I said which implies that the ball should be in my court to already know a topic where I make a different prediction from what you do.

I've said I'm happy to bet about anything, and listed some particular questions I'd bet about where I expect you to be wronger. If you had issued the same challenge to me, I would have picked one of the things and we would have already made some bets. So that's why I feel like the ball is in your court to say what things you're willing to make forecasts about.

That said, I don't kn... (read more)

Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then

It seems like you are saying that there is some measure that was continuous all along, but that it's not obvious in advance which measure was continuous. That seems to suggest that there are a bunch of plausible measures you could suggest in advance, and lots of interesting action will be from changes that are discontinuous changes on some of those measures. Is that right?

If so, don't... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I wish to acknowledge this frustration, and state generally that I think Paul Christiano occupies a distinct and more clueful class than a lot of, like, early EAs who mm-hmmmed along with Robin Hanson on AI - I wouldn't put, eg, Dario Amodei in that class either, though we disagree about other things.

But again, Paul, it's not enough to say that you weren't surprised by GPT-2/3 in retrospect, it kinda is important to say it in advance, ideally where other people can see?  Dario picks up some credit for GPT-2/3 because he clearly called it in advance. &... (read more)

Suppose your view is "crazy stuff happens all the time" and my view is "crazy stuff happens rarely." (Of course "crazy" is my word, to you it's just normal stuff.) Then what am I supposed to do, in your game?

More broadly: if you aren't making bold predictions about the future, why do you think that other people will? (My predictions all feel boring to me.) And if you do have bold predictions, can we talk about some of them instead?

It seems to me like I want you to say "well I think 20% chance something crazy happens here" and I say "nah, that's more like 5... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I do wish to note that we spent a fair amount of time on Discord trying to nail down what earlier points we might disagree on, before the world started to end, and these Discord logs should be going up later.

From my perspective, the basic problem is that Eliezer's story looks a lot like "business as usual until the world starts to end sharply", and Paul's story looks like "things continue smoothly until their smooth growth ends the world smoothly", and both of us have ever heard of superforecasting and both of us are liable to predict near-term initial seg... (read more)

From my perspective, the basic problem is that Eliezer's story looks a lot like "business as usual until the world starts to end sharply", and Paul's story looks like "things continue smoothly until their smooth growth ends the world smoothly", and both of have ever heard of superforecasting and both of us are liable to predict near-term initial segments by extrapolating straight lines while those are available.

I agree that it's plausible that we both make the same predictions about the near future. I think we probably don't, and there are plenty of disagr... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

The "weirdly uncharitable" part is saying that it "seemed like" I hadn't read it vs. asking.  Uncertainty is one thing, leaping to the wrong guess another.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I read "Takeoff Speeds" at the time.  I did not liveblog my reaction to it at the time.  I've read the first two other items.

I flag your weirdly uncharitable inference.

FWIW, I did not find this weirdly uncharitable, only mildly uncharitable. I have extremely wide error bars on what you have and have not read, and "Eliezer has not read any of the things on that list" was within those error bars. It is really quite difficult to guess your epistemic state w.r.t. specific work when you haven't been writing about it for a while.

(Though I guess you might have been writing about it on Twitter? I have no idea, I generally do not use Twitter myself, so I might have just completely missed anything there.)

I apologize, I shouldn't have leapt to that conclusion.

Ngo and Yudkowsky on alignment difficulty

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps.  All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game.  Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots... (read more)

1Ramana Kumar7dA couple of direct questions I'm stuck on: * Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps? * Are black holes and fires reasonable examples of outcome pumps? I'm asking these to understand the work better. Currently my answers are: * Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely. * Yes. But maybe not the most informative examples. They're highly non-retargetable.
2Daniel Kokotajlo7dI don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work. Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.
Ngo and Yudkowsky on AI capability gains

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.  You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance p... (read more)

it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the ab... (read more)

4Adam Shimi11dThat's a really helpful comment (at least for me)! I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like: * so I need to alter the bucket for each new idea, or does it instead fit in its current form each time? * does the mental act of finding that an idea fit into the bucket removes some confusion and clarifies, or is it just a mysterious answer [https://www.lesswrong.com/posts/6i3zToomS86oj9bS6/mysterious-answers-to-mysterious-questions] ? * Does the bucket become more simple and more elegant with each new idea that fit in it? Is there some truth in this, or am I completely off the mark? You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field! So I'm left wondering: * Do you disagree with my impression of the value of such a subsequence? * Do you think it would have this value but are spending your time doing something more valuable? * Do you think it would be valuable but really don't want to write it? * Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did? * Something else I'm failing to imagine? Once again, you do what you want, but I feel like this would be super valuable if there was anyway of making that possible. That's also completely relevant to my own focus on the different epistemic strategies [https://www.lesswrong.com/s/LLEJJoaYpCoS5JYSY/p/FQqcejhNWGG8vHDch] used in alignment research, especially because we don't have access to empirical evidence or trial and error at all for AGI-type problems. (I'm also quite curious if you think
Ngo and Yudkowsky on alignment difficulty

To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics?  Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?"  Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.

Discussion with Eliezer Yudkowsky on AGI interventions

are you interested in Redwood's research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?

It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like "What actually went into the nonviolence predicate instead of just nonviolence?"  Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.

A positive case for how we might succeed at prosaic AI alignment

All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary

I'm confused from several directions here.  What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate "an agent that optimizes an objective" are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

No—I'm separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you're going to get it. My step (1) above, which is what I understand that we're talking about, is just about that first piece: understanding what we're going to be shooting for when we set up our training process (and then once we know what we're shooting for we can think about h... (read more)

A positive case for how we might succeed at prosaic AI alignment

Closer, yeah.  In the limit of doing insanely complicated things with Bob you will start to break him even if he is faithfully simulated, you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.

1Edouard Harris13dGotcha. Well, that seems right—certainly in the limit case.
Ngo and Yudkowsky on AI capability gains

I second the kudos to Richard, by the way.  In a lot of ways he's an innocent bystander while I say things that aren't aimed mainly at him.

Not a problem. I share many of your frustrations about modesty epistemology and about most alignment research missing the point, so I sympathise with your wanting to express them.

On consequentialism: I imagine that it's pretty frustrating to keep having people misunderstand such an important concept, so thanks for trying to convey it. I currently feel like I have a reasonable outline of what you mean (e.g. to the level where I could generate an analogy about as good as Nate's laser analogy), but I still don't know whether the reason you find it much more c... (read more)

A positive case for how we might succeed at prosaic AI alignment

Eliezer's counterargument is "You don't get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends.  The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you."

To be clear, I agree with this as a response to what Edouard said—and I think it's a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don't think it's a response to what I'm advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn't be too surprised that it's not fully clear).

In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but no have no other guarantees about it, that's clearly not a safe model to amplify,... (read more)

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying... (read more)

Ngo and Yudkowsky on alignment difficulty

The idea is not that humans are perfect consquentialists, but that they are able to work at all to produce future-steering outputs, insofar as humans actually do work at all, by an inner overlap of the shape of inner parts which has a shape resembling consequentialism, and the resemblance is what does the work.  That is, your objection has the same flavor as "But humans aren't Bayesian!  So how can you say that updating on evidence is what's doing their work of mapmaking?"

3Daniel Kokotajlo13dTo be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.
Ngo and Yudkowsky on alignment difficulty

Various previous proposals for utility indifference have foundered on gotchas like "Well, if we set it up this way, that's actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it'll tend to design the useless button out of itself."  Or, "This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the u... (read more)

2Koen Holtman13dGlad you asked. If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt. TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them. I will now expand on this and add more details, using the the example of an emergency stop button. Say that the compute core has an emergency stop button attached to it. Say that actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform null actions in every future time step. Let's say thatMc is a world model that accurately depicts this situation. I am not going to build an AGI that uses Mc to plan its actions. Instead I build an AGI agent that will plan its next actions by using an incorrect world model Mi. This Mi is different from Mc, but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by Mi, the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action a would maximize utility in this incorrect, imaginary world Mi. I then further construct it to take this same action
A positive case for how we might succeed at prosaic AI alignment

Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Endorsed.

2Evan Hubinger15dTo be clear, I agree with this also, but don't think it's really engaging with what I'm advocating for—I'm not proposing any sort of assemblage of reasoners; I'm not really sure where that misconception came from.
Ngo and Yudkowsky on alignment difficulty

You'd also need to prevent the system from knowing too much about its own source code or the computers it was running on. Anyways, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don't know how to save the world using a theorem-prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.

2Vanessa Kosoy16dIn general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it's not at all clear that this is useful against AI risk, and I wasn't implying otherwise. [EDIT: I amended [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform?commentId=9JoBtvLp8bFcs4vWk] the class system to account for this.]
A positive case for how we might succeed at prosaic AI alignment

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

What pivotal act is this AGI supposed to be executing?  Designing a medium-strong nanosystem?  How would you do that via a myopic system?  That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences.  It doesn't matter whether you claim it's being done by something that in... (read more)

1Donald Hobson14dI think you might be able to design advanced nanosystems without AI doing long term real world optimization. Well a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help. Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier. Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently transfers torque. You do similar searches for the smallest arrangement of atoms needed to make a functioning logic gate. Then you download an existing microprocessor design, and copy it (but smaller) using your nanologic gates. I know that if you start brute forcing over a trillion atoms, you might find a mesaoptimizer. (Although even then I would suspect that visualization inspection shouldn't result in anything brain hacky. It would only be actually synthesizing such a thing that was dangerous. (or maybe possibly simulating it, if the mesaoptimizer realizes it's in a simulation and there are general simulation escape strategies )) So look at the static output of your brute forcing. If you see anything that looks computational, delete it. Don't brute force anything too big. (Obviously you need human engineers here, any long term real world planning is coming from them.)

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

I'm surprised that you think (1) is the hard part—though (1) is what I'm currently working on, since I think it's necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.

What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem?

I left this part purposefully v... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

I'm not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count "the same") had any clear-to-my-model relevance to alignment, or even AGI.  AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we've got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem.  Bigger models generalizing better or worse doesn't say anything obvio... (read more)

1Adam Shimi16dTrying to rephrase it in my own words (which will necessarily lose some details), are you interested in Redwood's research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? Which might tell us for example "what aspect of these predictable problems crop up first, and why?"
Discussion with Eliezer Yudkowsky on AGI interventions

Well, if viewing it on that level, AlphaFold 2 didn't crack the full problem because it doesn't let you put in a chemical function and get out a protein which performs that function while subject to other constraints of a surrounding wet system, which is the protein folding problem you have to solve to get wet nanotech out the other end, which is why we don't already have general wet nanotech today.

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Both DQN and MuZero have sample-efficient configurations which do much better

Say more?  In particular, do you happen to know what the sample efficiency advantage for EfficientNet was, over the sample-efficient version of MuZero - eg, how many frames the more efficient MuZero would require to train to EfficientNet-equivalent performance?  This seems to me like the appropriate figure of merit for how much EfficientNet improved over MuZero (and to potentially refute the current applicability of what I interpret as the OpenPhil view about how AGI development should look at some indefinite point later when there are no big wins left in AGI).

The sample-efficient DQN is covered on pg2, as the tuned Rainbow. More or less, you just train the DQN a lot more frequently, and a lot more, on its experience replay buffer, with no other modifications; this makes it like 10x more sample-efficient than the usual hundreds of millions of frames quoted for best results (but again, because of diminishing returns and the value of training on new fresh data and how lightweight ALE is to run, this costs you a lot more wallclock/total-compute, which is usually what ALE researchers try to economize and why DQN/Rai... (read more)

Redwood Research’s current project

I validate this as a nonfake alignment research direction that seems important.

1Michele Campolo7moThanks, that page is much more informative than anything else I've read on the orthogonality thesis. 1 From Arbital: Also my claim is an existential claim, and I find it valuable because it could be an opportunity to design aligned AI. 2 Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post. 3 I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.
Disentangling Corrigibility: 2015-2021

Thank you very much!  It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 paper MIRI/FHI paper titled Corrigibility.

Eg I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.

1Koen Holtman8moThanks at lot all! I just edited the post above to change the language as suggested. FWIW, Paul's post on corrigibility here [https://ai-alignment.com/corrigibility-3039e668638] was my primary source for the into that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.
3Ben Pace8moYou're welcome. Yeah "invented the concept" and "named the concept" are different (and both important!).
How do we prepare for final crunch time?

Seems rather obvious to me that the sort of person who is like, "Oh, well, we can't possibly work on this until later" will, come Later, be like, "Oh, well, it's too late to start doing basic research now, we'll have to work with whatever basic strategies we came up with already."

Seems true, but also didn't seem to be what this post was about?

Disentangling Corrigibility: 2015-2021

Why do you think the term "corrigibility" was coined by Robert Miles?  My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop).  This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me - the truth about how much of this field I invented does not look good for the field or for humanity's prospects - but outright errors of this sort sho... (read more)

1Koen Holtman8moI wrote that paper [https://arxiv.org/abs/1908.01695] and abstract back in 2019. Just re-read the abstract. I am somewhat puzzled how you can read the abstract and feel that it makes 'very large claims' that would be 'very surprising' when fulfilled. I don't feel that the claims are that large or hard to believe. Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.
2Robert Miles8moYeah I definitely wouldn't say I 'coined' it, I just suggested the name
4Ben Pace8moI'm 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I'll have a look now to find it and report back. Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post [https://ai-alignment.com/corrigibility-3039e668638] on corrigibility in 2017.
My research methodology

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.

My research methodology

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time.  Say, if you imagine somebody at Deepmind coming in without a lot of... (read more)

  • I still feel fine about what I said, but that's two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
  • Clarifying what I mean by way of analogy: suppose I'm worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I'd say that seems like a strange and extreme failure mode that you should be able to robustly avoid if we write our code right, regardless of how the log
... (read more)
6Samuel Dylan Martin8moIs a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished. I'm not saying this is an exact analogy for AGI alignment - there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we're likely to try will break on us suddenly in ways we can't fix as we go - treacherous turns, inner misalignment or reactions to distributional shift. It's just that there are different answers to the question of what's the default outcome depending on if you're asking what to expect abstractly or in the context of how things are in fact done. Instrumental Convergence plus a specific potential failure mode (like e.g. we won't pay sufficient attention to out of distribution robustness), is like saying 'you know the vast majority of physically possible bridge designs fall over straight away and also there's a giant crack in that load-bearing concrete pillar over there' - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn't do much to help the case for expecting catastrophic misalignment and isn't enough to establish that failure is a default outcome. It seems like your reason for saying that catastrophic misalignment can't be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis - that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story. = 'because strongly optimizing for almost
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

I expect there to be a massive and important distinction between "passive transparency" and "active transparency", with the latter being much more shaky and potentially concealing of fatality, and the former being cruder as tech at the present rate which is unfortunate because it has so many fewer ways to go wrong.  I hope any terminology chosen continues to make the distinction clear.

Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating, that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training ... (read more)

Load More