All of Eliezer Yudkowsky's Comments + Replies

Pivotal outcomes and pivotal processes
  • (the "AI immune system") The whole internet — including space satellites and the internet-of-things — becomes way more secure, and includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.

Define "way more secure".  Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?

Can you talk a bit about the world global... (read more)

An attempted paraphrase, to hopefully-disentangle some claims:

Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something"[1].

Critch, preceding post: Strategies involving non-Overton elements are not worth it

Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements

Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements

Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy to... (read more)

Where I agree and disagree with Eliezer

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.

Yup.  You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.


The question is when you get a misaligned mesa-optimizer relative to when you get superhuman behavior.

I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.

I don't think you've made much argument about when the trans... (read more)

Where I agree and disagree with Eliezer

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer

Why privately?!  Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does?  This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too.  Especially if people think they have solutions.  They should talk.

OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.

Here's one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever ... (read more)

Where I agree and disagree with Eliezer

For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list.

I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties.  I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously.  I realize I haven't done enough of this myself, but if you've already written up the comp... (read more)

I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."

If you are currently looking for the list of difficulties: see the long footnote

If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state... (read more)

Let's See You Write That Corrigibility Tag

Best list so far, imo; it's what to beat.

AGI Ruin: A List of Lethalities

Well, I had to think about this for longer than five seconds, so that's already a huge victory.

If I try to compress your idea down to a few sentences:

The humans ask the AI to produce design tools, rather than designs, such that there's a bunch of human cognition that goes into picking out the particular atomic arrangements or synthesis pathways; and we can piecewise verify that the tool is making accurate predictions; and the tool is powerful enough that we can build molecular nanotech and an uploader by using the tool for an amount of time too short for F... (read more)

DaemonicSigil (13d):
Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.
AGI Ruin: A List of Lethalities

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them?  I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years.  If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all?  I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

AGI Ruin: A List of Lethalities

If I know that it was written by aligned people?  I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.

Richard Ngo (14d):
Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you measure the total work in units of years-of-work-done-by-people-as-competent-as-Yudkowsky. And I should specify some safety threshold too - like, in the process of reaching 90% confidence, incurring less than 10% chance of running an experiment which kills you.)
AGI Ruin: A List of Lethalities

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite?  That can't be done?

Richard Ngo (15d):
Hmm, okay, here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?
AGI Ruin: A List of Lethalities

Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.

(I do think there's a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)

AGI Ruin: A List of Lethalities

Nearly empty string of uncommon social inputs.  All sorts of empirical inputs, including empirical inputs in the social form of other people observing things.

It's also fair to say that, though they didn't argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.

AGI Ruin: A List of Lethalities

Well, my disorganized list sure wasn't complete, so why not go ahead and list some of the foreseeable difficulties I left out?  Bonus points if any of them weren't invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.

Sure—that's easy enough. Just off the top of my head, here's five safety concerns that I think are important that I don't think you included:

  • The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.

  • It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates i

... (read more)
AGI Ruin: A List of Lethalities

Well, there's obviously a lot of points missing!  And from the amount this post was upvoted, it's clear that people saw the half-assed current form as valuable.

Why don't you start listing out all the missing further points, then?  (Bonus points for any that don't trace back to my own invention, though I realize a lot of people may not realize how much of this stuff traces back to my own invention.)

Evan Hubinger (16d):
I'm not sure what you mean by missing points? I only included your technical claims, not your sociological ones, if that's what you mean.
AGI Ruin: A List of Lethalities

Humans point to some complicated things, but not via a process that suggests an analogous way to use natural selection or gradient descent to make a mesa-optimizer point to particular externally specifiable complicated things.

Alex Turner (17d):
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world? Likewise, when you wrote, Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, accidentally learn to terminally care about the real world? Because the former implies the existence of a better alignment paradigm (that which occurs within the human brain, to take an empty-slate human and grow them into an intelligence which terminally cares about objects in reality), and the latter is extremely unlikely. Let me know if you meant something else. EDIT: Updated a few confusing words.
AGI Ruin: A List of Lethalities

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively, increasing the stakes, time horizons, and autonomy a little bit each time.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one

... (read more)

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.
 

AGI Ruin: A List of Lethalities

Arbital was meant to support galaxy-brained attempts like this; Arbital failed.

AGI Ruin: A List of Lethalities

This seems to me like a case of the imaginary hypothetical "weak pivotal act" that nobody can ever produce.  If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.

Okay, I will try to name a strong-but-checkable pivotal act.

(Having a strong-but-checkable pivotal act doesn't necessarily translate into having a weak pivotal act. Checkability allows us to tell the difference between a good plan and a trapped plan with high probability, but the AI has no reason to give us a good plan. It will just produce output like "I have insufficient computing power to solve this problem" regardless of whether that's actually true. If we're unusually successful at convincing the AI our checking process is bad when it's actually good,... (read more)

AGI Ruin: A List of Lethalities

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Koen Holtman (19d):
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555 [https://www.lesswrong.com/s/3dCMdafmKmb6dRjMF/p/7EnZgaepSBwaZXA5y], if you ever had AGI technology, and what you can do with that in terms of safety. I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.
handoflixue (18d):
Anecdotally: even if I could write this post, I never would have, because I would assume that Eliezer cares more about writing, has better writing skills, and has a much wider audience. In short, why would I write this when Eliezer could write it? You might want to be a lot louder if you think it's a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.

Announcing the Alignment of Complex Systems Research Group

Just to state the reigning orthodoxy among the Wise, if not among the general population: the interface between "AI developers" and "one AI" appears to be hugely more difficult, hugely more lethal, and vastly qualitatively different, from every other interface.  There's a horrible opsec problem with respect to single defectors in the AI lab selling your code to China which then destroys the world, but this horrible opsec problem has nothing in common with the skills and art needed for the purely technical challenge of building an AGI that doesn't dest... (read more)

The concept of "interfaces of misalignment" does not mainly point to GovAI-style research here (although it also may serve as a framing for GovAI). The concrete domains separated by the interfaces in the figure above are possibly a bit misleading in that sense:

For me, the "interfaces of misalignment" are generating intuitions about what it means to align a complex system that may not even be self-aligned - rather, just one aligning part of it. It is expanding not just the space of solutions, but also the space of meanings of "success". (For example, one ext... (read more)

My guess is an attempt to explain where I think we actually differ in "generative intuitions" will be more useful than a direct response to your conclusions, so here it is. How to read it: roughly, this is attempting to just jump past several steps of double-crux to the area where I suspect actual cruxes lie. 

Continuity

In my view, your ontology of thinking about the problem is fundamentally discrete. For example, you are imaging a sharp boundary between a class of systems "weak, won't kill you, but also won't help you with alignment" and "st... (read more)

Six Dimensions of Operational Adequacy in AGI Projects

And if humans had a utility function and we knew what that utility function was, we would not need CEV.  Unfortunately extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated.  There is ultimately a rescuable truth about what we want, and CEV i... (read more)

Vanessa Kosoy (22d):
I agree that it's a tricky problem, but I think it's probably tractable. The way PreDCA [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=vKw6DB9crncovPxED] tries to deal with these difficulties is:
  • The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
  • Some of those programs are "agentic" programs.
  • Agentic programs have approximately well-defined utility functions.
  • Disassembling the humans doesn't change anything, since it doesn't affect the programs that were already running[1] before the AI was turned on.
  • Since we're looking at agent-programs rather than specific agent-actions, there is much more ground for inference about novel situations.
Obviously, the concepts I'm using here (e.g. which programs are "running" or which programs are "agentic") are non-trivial to define, but infra-Bayesian physicalism [https://www.lesswrong.com/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized] does allow us to define them (not without some caveats, but hopefully at least to a 1st approximation).

1. More precisely, I am looking at agents which could prevent the AI from being turned on; this is what I call "precursors".
The Commitment Races problem

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this ... (read more)
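
(To make the arithmetic concrete, here is a minimal sketch of that probabilistic-acceptance rule, assuming a $10 pie, a $5 fair split, and a small illustrative epsilon margin; the function name and numbers are assumptions for illustration, not anything specified above.)

```python
# Sketch of the probabilistic-acceptance rule described above, assuming a
# $10 pie and a $5 "fair" split. Names and the epsilon margin are illustrative.

def acceptance_probability(demand, fair_share=5.0, epsilon=0.01):
    """Accept a greedy demand with probability just under fair_share/demand,
    so the demander's expected take never beats simply demanding the fair split."""
    if demand <= fair_share:
        return 1.0  # fair (or generous) demands are always accepted
    return max(0.0, fair_share / demand - epsilon)

for demand in [5, 6, 7, 9]:
    p = acceptance_probability(demand)
    print(f"demand ${demand}: accept with p = {p:.3f}, expected take = ${demand * p:.2f}")

# A $6 demand is accepted with probability just under 5/6, so the demander
# expects slightly less than $5 -- no better than demanding the fair $5.
```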

1DaemonicSigil2mo
The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?
Daniel Kokotajlo (2mo):
I agree with all this I think. This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely. However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely.

If today you go around asking experts for an account of rationality, they'll pull off the shelf CDT or EDT or game-theoretic rationality (nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many multi-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic.

(ETA: Insofar as you are saying: "Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races" then I say... I hope so! But I want to think it through carefully first; it doesn't seem obvious to me, for the above reasons.)
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Idea A (for “Alright”): Humanity should develop hardware-destroying capabilities — e.g., broadly and rapidly deployable non-nuclear EMPs — to be used in emergencies to shut down potentially-out-of-control AGI situations, such as an AGI that has leaked onto the internet, or an irresponsible nation developing AGI unsafely.

Sounds obviously impossible in real life, so how about you go do that and then I'll doff my hat in amazement and change how I speak of pivotal acts. Go get gain-of-function banned, even, that should be vastly simpler. Then we can talk ... (read more)

Andrew Critch (22d):
Eliezer, from outside the universe I might take your side of this bet. But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk. A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?): https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB
Late 2021 MIRI Conversations: AMA / Discussion

Yes, it was an intentional part of the goal.

If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV.  I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to preve... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.

From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that... (read more)

More Christiano, Cotra, and Yudkowsky on AI progress

Want to +1 that a vaguer version of this was my own rough sense of RNNs vs. CNNs vs. Transformers.

Paul Christiano (7mo):
I think transformers are a big deal, but I think this comment is a bad guess at the counterfactual and it reaffirms my desire to bet with you about either history or the future. One bet down, handful to go?
Biology-Inspired AGI Timelines: The Trick That Never Works

As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I'd like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.

Paul Christiano (7mo):
I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.) As far as I can tell, the source for your attribution of this "prediction" is: As far as I could tell it sounds from the surrounding text like his "prediction" for transformative impacts from AI was something like "between 2010 and 2030" with broad error bars.
Biology-Inspired AGI Timelines: The Trick That Never Works

It does fit well there, but I think it was more inspired by the person I met who thought I was being way too arrogant by not updating in the direction of OpenPhil's timeline estimates to the extent I was uncertain.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Maybe another way of phrasing this - how much warning do you expect to get, how far out does your Nope Vision extend?  Do you expect to be able to say "We're now in the 'for all I know the IMO challenge could be won in 4 years' regime" more than 4 years before it happens, in general?  Would it be fair to ask you again at the end of 2022 and every year thereafter if we've entered the 'for all I know, within 4 years' regime?

Added:  This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics wou... (read more)

I think I'll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I'll get much more confident if we scale up from $1M -> $1B and pick the low-hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.

There's not really a constant time horizon for my pessimism, it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon, because theorem-proving has not had much investment so compute can be scaled up several orde... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

I also think human brains are better than elephant brains at most things - what did I say that sounded otherwise?

Paul Christiano (7mo):
Oops, this was in reference to the later part of the discussion where you disagreed with "a human in a big animal body, with brain adapted to operate that body instead of our own, would beat a big animal [without using tools]".
Yudkowsky and Christiano discuss "Takeoff Speeds"

Okay, then we've got at least one Eliezerverse item, because I've said below that I think I'm at least 16% for IMO theorem-proving by end of 2025.  The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that's irrelevant because it's first-order estimates we should be comparing here.

So we've got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse?  Pretty bare port... (read more)

I think an IMO gold medal could come well before massive economic impact; I'm just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%; it's hard reasoning about the tails.

I'd say <4% on end of 2025.

I think this is the flipside of me having an intuition where I say things like "AlphaGo and GPT-3 aren't that surprising"---I have a sense for what things are and aren't surprising, and not many things happen that are... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I expect it to be hella difficult to pick anything where I'm at 75% that it happens in the next 5 years and Paul is at 25%.  Heck, it's not easy to find things where I'm at over 75% that aren't just obvious slam dunks; the Future isn't that easy to predict.  Let's get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.

I frame no prediction about whether Paul is under 16%.  That's a separate matter.  I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I'm like "lol wut" and double their probability, even if it turns out that Paul agrees with me about it.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Ha!  Okay then.  My probability is at least 16%, though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more.  Paul?

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probabilit... (read more)

Paul Christiano (7mo):
Based on the other thread I now want to revise this prediction, both because 4% was too low and "IMO gold" has a lot of noise in it based on test difficulty. I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.) Maybe I'll go 8% on "gets gold" instead of "solves hardest problem." Would be good to get your updated view on this so that we can treat it as staked out predictions.
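
(For concreteness, the problem-selection procedure committed to above can be written out mechanically; a small sketch follows, with the topic labels and function name as assumed, illustrative inputs.)

```python
# Sketch restating the selection rule quoted above; topic labels
# ("geometry", "combinatorics", "algebra", ...) are assumed inputs.

def hardest_problem(problem3_topic: str, problem6_topic: str) -> int:
    """Return which problem number counts as the 'hardest problem' under the stated rule."""
    if problem6_topic == "geometry":
        return 3  # (i) problem 6 is geo -> use problem 3 instead
    if problem3_topic == "combinatorics" and problem6_topic == "algebra":
        return 3  # (ii) -> use problem 3 instead
    return 6      # default: problem 6

print(hardest_problem("combinatorics", "algebra"))        # -> 3
print(hardest_problem("number theory", "combinatorics"))  # -> 6
```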

I don't care about whether the AI is open-sourced (I don't expect anyone to publish the weights even if they describe their method) and I'm not that worried about our ability to arbitrate overfitting.

Ajeya suggested that I clarify: I'm significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I'd guess about 5-10% could be done with no deep learning and m... (read more)

Matthew Barnett (7mo):
If this task is bad for operationalization reasons, there are other theorem proving benchmarks [https://paperswithcode.com/task/automated-theorem-proving]. Unfortunately it looks like there aren't a lot of people that are currently trying to improve on the known benchmarks, as far as I'm aware. The code generation benchmarks [https://paperswithcode.com/task/code-generation] are slightly more active. I'm personally partial to Hendrycks et al.'s APPS benchmark [https://arxiv.org/pdf/2105.09938v3.pdf], which includes problems that "range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability." (Github link [https://github.com/hendrycks/apps]).
Matthew Barnett (7mo):
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul's camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal). The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul's model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul's.

Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021
Christiano, Cotra, and Yudkowsky on AI progress

Mostly, I think the Future is not very predictable in some ways, and this extends to, for example, it being possible that 2022 is the year where we start Final Descent and by 2024 it's over, because it so happened that although all the warning signs were Very Obvious In Retrospect they were not obvious in advance and so stuff just started happening one day.  The places where I dare to extend out small tendrils of prediction are the rare exception to this rule; other times, people go about saying, "Oh, no, it definitely couldn't start in 2022" a... (read more)

I'm mostly not looking for virtue points, I'm looking for: (i) if your view is right then I get some kind of indication of that so that I can take it more seriously, (ii) if your view is wrong then you get some feedback to help snap you out of it.

I don't think it's surprising if a GPT-3 sized model can do relatively good translation. If talking about this prediction, and if you aren't happy just predicting numbers for overall value added from machine translation, I'd kind of like to get some concrete examples of mediocre translations or concrete problems with existing NMT that you are predicting can be improved.

Christiano, Cotra, and Yudkowsky on AI progress

If they've found some way to put a lot more compute into GPT-4 without making the model bigger, that's a very different - and unnerving - development.

Yudkowsky and Christiano discuss "Takeoff Speeds"

(I'm currently slightly hopeful about the theorem-proving thread, elsewhere and upthread.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I have a sense that there's a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me - that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven't read.  I feel like I "would not be surprised at all" if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO chal... (read more)

Yes, IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though could also go down if I thought about it a lot or if commenters brought up relevant considerations, e.g. I'd look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I'm also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low hanging fruit and scaling up spending by a few orders of magnitude, but I still don't expect i... (read more)

I feel like I "would not be surprised at all" if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024

Possibly helpful: Metaculus currently puts the chances of the IMO grand challenge falling by 2025 at about 8%. Their median is 2039.

I think this would make a great bet, as it would definitely show that your model can strongly outperform a lot of people (and potentially Paul too). And the operationalization for the bet is already there -- so little work will be needed to do that part.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I kind of want to see you fight this out with Gwern (not least for social reasons, so that people would perhaps see that it wasn't just me, if it wasn't just me).

But it seems to me that the very obvious GPT-5 continuation of Gwern would say, "Gradualists can predict meaningless benchmarks, but they can't predict the jumpy surface phenomena we see in real life."  We want to know when humans land on the moon, not whether their brain sizes continued on a smooth trend extrapolated over the last million years.

I think there's a very real sense in which, yes... (read more)

But it seems to me that the very obvious GPT-5 continuation of Gwern would say, "Gradualists can predict meaningless benchmarks, but they can't predict the jumpy surface phenomena we see in real life."

Don't you think you're making a falsifiable prediction here?

Name something that you consider part of the "jumpy surface phenomena" that will show up substantially before the world ends (that you think Paul doesn't expect). Predict a discontinuity. Operationalize everything and then propose the bet.

Christiano, Cotra, and Yudkowsky on AI progress

I don't necessarily expect GPT-4 to do better on perplexity than would be predicted by a linear model fit to neuron count plus algorithmic progress over time; my guess for why they're not scaling it bigger would be that Stack More Layers just basically stopped scaling in real output quality at the GPT-3 level.  They can afford to scale up an OOM to 1.75 trillion weights, easily, given their funding, so if they're not doing that, an obvious guess is that it's because they're not getting a big win from that.  As for their ability to then make algor... (read more)
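
(A hedged sketch of the kind of trend extrapolation referred to here: fit log-perplexity linearly against log-parameter-count plus a time term standing in for algorithmic progress, then read off a prediction for a larger model. Every number below is a made-up placeholder, not a real benchmark result.)

```python
# Illustrative-only extrapolation: all model sizes, years, and perplexities
# below are made-up placeholders, not real measurements.
import numpy as np

log_params = np.log10([1.5e9, 1.75e10, 1.75e11])  # hypothetical parameter counts
years = np.array([2019.0, 2020.0, 2020.0])        # release year, a crude proxy for algorithmic progress
log_ppl = np.log10([35.0, 24.0, 20.5])            # hypothetical held-out perplexities

# Ordinary least squares: log-perplexity ~ 1 + log(params) + (year - 2019)
X = np.column_stack([np.ones_like(log_params), log_params, years - 2019.0])
coef, *_ = np.linalg.lstsq(X, log_ppl, rcond=None)

# Read off the trend's prediction for a hypothetical 1.75e12-parameter model in 2023.
x_new = np.array([1.0, np.log10(1.75e12), 2023.0 - 2019.0])
print("trend-predicted perplexity:", 10 ** (x_new @ coef))
```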

While GPT-4 wouldn't be a lot bigger than GPT-3, Sam Altman did indicate that it'd use a lot more compute. That's consistent with Stack More Layers still working; they might just have found an even better use for compute.

(The increased compute-usage also makes me think that a Paul-esque view would allow for GPT-4 to be a lot more impressive than GPT-3, beyond just modest algorithmic improvements.)

Christiano, Cotra, and Yudkowsky on AI progress

My memory of the past is not great in general, but considering that I bet sums of my own money and advised others to do so, I am surprised that my memory here would be that bad, if it was.

Neither GJO nor Metaculus are restricted to only past superforecasters, as I understand it; and my recollection is that superforecasters in particular, not all participants at GJO or Metaculus, were saying in the range of 20%.  Here's an example of one such, which I have a potentially false memory of having maybe read at the time: https://www.gjopen.com/comments/118530

Matthew Barnett (7mo):
Thanks for clarifying. That makes sense that you may have been referring to a specific subset of forecasters. I do think that some forecasters tend to be much more reliable than others (and maybe there was/is a way to restrict to "superforecasters" in the UI). I will add the following piece of evidence, which I don't think counts much for or against your memory, but which still seems relevant. Metaculus shows a histogram of predictions. On the relevant question [https://www.metaculus.com/questions/112/will-googles-alphago-beat-go-player-lee-sedol-in-march-2016/], a relatively high fraction of people put a 20% chance, but it also looks like over 80% of forecasters put higher credences.
Christiano, Cotra, and Yudkowsky on AI progress

I feel like the biggest subjective thing is that I don't feel like there is a "core of generality" that GPT-3 is missing

I just expect it to gracefully glide up to a human-level foom-ing intelligence

This is a place where I suspect we have a large difference of underlying models.  What sort of surface-level capabilities do you, Paul, predict that we might get (or should not get) in the next 5 years from Stack More Layers?  Particularly if you have an answer to anything that sounds like it's in the style of Gwern's questions, because I think those a... (read more)

Paul Christiano (7mo):
I agree we seem to have some kind of deeper disagreement here.

I think stack more layers + known training strategies (nothing clever) + simple strategies for using test-time compute (nothing clever, nothing that doesn't use the ML as a black box) can get continuous improvements in tasks like reasoning (e.g. theorem-proving), meta-learning (e.g. learning to learn new motor skills), automating R&D (including automating executing ML experiments, or proposing new ML experiments), or basically whatever.

I think these won't get to human level in the next 5 years. We'll have crappy versions of all of them. So it seems like we basically have to get quantitative. If you want to talk about something we aren't currently measuring, then that probably takes effort, and so it would probably be good if you picked some capability where you won't just say "the Future is hard to predict." (Though separately I expect to make somewhat better predictions than you in most of these domains.)

A plausible example is that I think it's pretty likely that in 5 years, with mere stack more layers + known techniques (nothing clever), you can have a system which is clearly (by your+my judgment) "on track" to improve itself and eventually foom, e.g. that can propose and evaluate improvements to itself, whose ability to evaluate proposals is good enough that it will actually move in the right direction and eventually get better at the process, etc., but that it will just take a long time for it to make progress. I'd guess that it looks a lot like a dumb kid in terms of the kind of stuff it proposes and its bad judgment (but radically more focused on the task and conscientious and wise than any kid would be). Maybe I think that's 10% unconditionally, but much higher given a serious effort.

My impression is that you think this is unlikely without adding in some missing secret sauce to GPT, and that my picture is generally quite different from your criticality-flavored model of takeoff.

If you give me 1 or 10 examples of surface capabilities I'm happy to opine. If you want me to name industries or benchmarks, I'm happy to opine on rates of progress. I don't like the game where you say "Hey, say some stuff. I'm not going to predict anything and I probably won't engage quantitatively with it since I don't think much about benchmarks or economic impacts or anything else that we can even talk about precisely in hindsight for GPT-3."

I don't even know which of Gwern's questions you think are interesting/meaningful. "Good meta-learning"--I don't... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

The crazy part is someone spending $1B and then generating $100B/year in revenue (much less $100M and then taking over the world).

Would you say that this is a good description of Suddenly Hominids but you don't expect that to happen again, or that this is a bad description of hominids?

Paul Christiano (7mo):
It's not a description of hominids at all, no one spent any money on R&D. I think there are analogies where this would be analogous to hominids (which I think are silly, as we discuss in the next part of this transcript). And there are analogies where this is a bad description of hominids (which I prefer).
Yudkowsky and Christiano discuss "Takeoff Speeds"

Thanks for continuing to try on this!  Without having spent a lot of labor myself on looking into self-driving cars, I think my sheer impression would be that we'll get $1B/yr waifutech before we get AI freedom-of-the-road; though I do note again that current self-driving tech would be more than sufficient for $10B/yr revenue if people built new cities around the AI tech level, so I worry a bit about some restricted use-case of self-driving tech that is basically possible with current tech finding some less regulated niche worth a trivial $10B/yr. ... (read more)

Paul Christiano (7mo):
Yes, I think that value added by automated translation will follow a similar pattern. Number of words translated is more sensitive to how you count and random nonsense, as is number of "users" which has even more definitional issues. You can state a prediction about self-driving cars in any way you want. The obvious thing is to talk about programs similar to the existing self-driving taxi pilots (e.g. Waymo One) and ask when they do $X of revenue per year, or when $X of self-driving trucking is done per year. (I don't know what AI freedom-of-the-road means, do you mean something significantly more ambitious than self-driving trucks or taxis?)
Yudkowsky and Christiano discuss "Takeoff Speeds"

I think you are underconfident about the fact that almost all AI profits will come from areas that had almost-as-much profit in recent years. So we could bet about where AI profits are in the near term, or try to generalize this.

I wouldn't be especially surprised by waifutechnology or machine translation jumping to newly accessible domains (the thing I care about and you shrug about (until the world ends)), but is that likely to exhibit a visible economic discontinuity in profits (which you care about and I shrug about (until the world ends))?  There'... (read more)

Paul Christiano (7mo):
Man, the problem is that you say the "jump to newly accessible domains" will be the thing that lets you take over the world. So what's up for dispute is the prototype being enough to take over the world rather than years of progress by a giant lab on top of the prototype. It doesn't help if you say "I expect new things to sometimes become possible" if you don't further say something about the impact of the very early versions of the product.

If e.g. people were spending $1B/year developing a technology, and then after a while it jumps from 0/year to $1B/year of profit, I'm not that surprised. (Note that machine translation is radically smaller than this, I don't know the numbers.) I do suspect they could have rolled out a crappy version earlier, perhaps by significantly changing their project. But why would they necessarily bother doing that? For me this isn't violating any of the principles that make your stories sound so crazy. The crazy part is someone spending $1B and then generating $100B/year in revenue (much less $100M and then taking over the world).

(Note: it is surprising if an industry is spending $10T/year on R&D and then jumps from $1T --> $10T of revenue in one year in a world that isn't yet growing crazily. The surprisingness depends a lot on the numbers involved, and in particular on how valuable it would have been to deploy a worse version earlier and how hard it is to raise money at different scales.)

I'd be happy to disagree about romantic chatbots or machine translation. I'd have to look into it more to get a detailed sense in either, but I can guess. I'm not sure what "wouldn't be especially surprised" means, I think to actually get disagreements we need way more resolution than that so one question is whether you are willing to play ball (since presumably you'd also have to looking into to get a more detailed sense). Maybe we could save labor if people would point out the empirical facts we're missing and we can revise in light of that, but we'd sti... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

And to say it also explicitly, I think this is part of why I have trouble betting with Paul.  I have a lot of ? marks on the questions that the Gwern voice is asking above, regarding them as potentially important breaks from trend that just get dumped into my generalized inbox one day.  If a gradualist thinks that there ought to be a smooth graph of perplexity with respect to computing power spent, in the future, that's something I don't care very much about except insofar as it relates in any known way whatsoever to questions like those the Gwer... (read more)

This seems totally bogus to me.

It feels to me like you mostly don't have views about the actual impact of AI as measured by jobs that it does or the $s people pay for them, or performance on any benchmarks that we are currently measuring, while I'm saying I'm totally happy to use gradualist metrics to predict any of those things. If you want to say "what does it mean to be a gradualist" I can just give you predictions on them. 

To you this seems reasonable, because e.g. $ and benchmarks are not the right way to measure the kinds of impacts we care abou... (read more)

What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don't relate in known ways to the trend lines that are smooth?

Perplexity is one general “intrinsic” measure of language models, but there are many task-specific measures too. Studying the relationship between perplexity and task-specific measures is an important part of the research process. We shouldn’t speak as if people do not actively try to uncover these relationships.

I would generally be surprised if there were many highly non-li... (read more)
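
(A minimal sketch of what "studying the relationship" can look like in practice: collect paired measurements of perplexity and a task-specific score across checkpoints and check how tightly one tracks the other. The numbers below are invented placeholders.)

```python
# Invented placeholder numbers for a handful of model checkpoints.
import numpy as np

perplexity    = np.array([42.0, 31.0, 24.0, 19.5, 16.8])  # hypothetical intrinsic measure
task_accuracy = np.array([0.31, 0.40, 0.48, 0.55, 0.60])  # hypothetical task-specific measure

# How tightly does the task measure track log-perplexity, and what does a
# simple linear fit look like? (Real analyses would use many more points.)
log_ppl = np.log(perplexity)
corr = np.corrcoef(log_ppl, task_accuracy)[0, 1]
slope, intercept = np.polyfit(log_ppl, task_accuracy, 1)

print(f"corr(log perplexity, task accuracy) = {corr:.3f}")
print(f"task accuracy ~= {slope:.3f} * log(perplexity) + {intercept:.3f}")
```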

Yudkowsky and Christiano discuss "Takeoff Speeds"

I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.

Well, it sounds like I might be more bullish than you on theorem-proving, possibly.  Not on it being useful or profitable, but in terms of underlying technology making progr... (read more)

I'm going to make predictions by drawing straight-ish lines through metrics like the ones in the gpt-f paper. Big unknowns are then (i) how many orders of magnitude of "low-hanging fruit" are there before theorem-proving even catches up to the rest of NLP? (ii) how hard their benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is "they are pretty easy" / "humans are pretty bad at these tasks," but it's somewhat harder to quantify. If you think your methodology is different from that then we will probably end u... (read more)
