# All of Eliezer Yudkowsky's Comments + Replies

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

1Oliver Habryka23d
I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago. I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

Okay, that makes much more sense.  I initially read the diagram as saying that just lines 1 and 2 were in the box.

If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Lob's Theorem in order to follow this "simplified" proof of Lob's Theorem.

4Andrew Critch2mo
Yes to both of you on these points:

* Yes to Alex that (I think) you can use an already-in-hand proof of Löb to make the self-referential proof work, and
* Yes to Eliezer that that would be cheating wouldn't actually ground out all of the intuitions, because then the "santa clause"-like sentence is still in use in already-in-hand proof of Löb.

(I'll write a separate comment on Eliezer's original question.)

Forgive me if this is a dumb question, but if you don't use assumption 3: []([]C -> C) inside steps 1-2, wouldn't the hypothetical method prove 2: [][]C for any C?

Thanks for your attention to this!  The happy face is the outer box.  So, line 3 of the cartoon proof is assumption 3.

If you want the full []([]C->C) to be inside a thought bubble, then just take every line of the cartoon and put into a thought bubble, and I think that will do what you want.

LMK if this doesn't make sense; given the time you've spent thinking about this, you're probably my #1 target audience member for making the more intuitive proof (assuming it's possible, which I think it is).

ETA:  You might have been asking if th...

It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.

One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Lob's theorem", and get another valid proof of Lob's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encodi...
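For readers following the notation in this thread, here is a sketch of the statement under discussion, reading `[]C` as "C is provable" (written $\Box C$ below) and assuming the base theory is Peano Arithmetic, which the comments do not name explicitly. The first line is the external form that matches "Suppose you can prove that []C->C"; the second is the internalized form whose antecedent is the "assumption 3" that Eliezer asks about.

```latex
% Löb's theorem, external form: a proof of (Box C -> C) yields a proof of C.
\text{If } \mathrm{PA} \vdash \Box C \to C, \text{ then } \mathrm{PA} \vdash C.

% Löb's theorem, internalized form: the antecedent is assumption 3, [] ( [] C -> C ).
\mathrm{PA} \vdash \Box(\Box C \to C) \to \Box C.
```

On this reading, a method that never used the antecedent $\Box(\Box C \to C)$ would be deriving a provability claim for arbitrary $C$, which appears to be the worry behind the question above.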

So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best...

I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems).  It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without t...

The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

Yes, this is the key question, and I think there’s a clear answer, at least in outline:

What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us d...

Mind space is very wide

Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.

The most AGI-like systems we have today are LLMs, optimized...

Just to restate the standard argument against:

If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect.  They don't want your own preferred outcome.  Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.

(Or, I mean, actually the strategy is "mutually cooperate"?  Simulate a spread of the other possible entities, ...

I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.

To quote Paul again:

I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question

Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart ...

The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function.  You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.

By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..."  That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.

3Richard Ngo3mo
I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

I'm particularly impressed by "The Floating Droid".  This can be seen as early-manifesting the foreseeable difficulty where:

At kiddie levels, a nascent AGI is not smart enough to model humans and compress its human feedback by the hypothesis "It's what a human rates", and so has object-level hypotheses about environmental features that directly cause good or bad ratings;

When smarter, an AGI forms the psychological hypothesis over its ratings, because that more sophisticated hypothesis is now available to its smarter self as a better way to compress th...

In this particular experiment, the small models did not have object-level hypotheses. They just had no clue and answered randomly.

I think the experiment shows that sometimes smaller models are too dumb to pick up the misleading correlation, which can throw off bigger models.

Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.

Alignment doesn't hit back, the loss function hits back and the loss function doesn't capture what you really want (eg because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc).  If what we wanted was exactly captured in a loss function, alignment would be easier.  Not easy because outer optimization doesn't create good inner alignment, but easier than the present case.

2Alex Turner5mo
This seems like a type error to me [https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed?commentId=idADigChuWJbooy9j] . What does it mean for a reward function to "capture what I really want"? Can anyone give even a handwavy operationalization of such a scenario, so I can try to imagine something concrete?

• (the "AI immune system") The whole internet — including space satellites and the internet-of-things — becomes way more secure, and includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.

Define "way more secure".  Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?

Can you talk a bit about the world global...

An attempted paraphrase, to hopefully-disentangle some claims:

Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something"[1].

Critch, preceding post: Strategies involving non-Overton elements are not worth it

Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements

Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements

Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy to...

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.

Yup.  You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.

The question is when you get a misaligned mesaoptimizer relative to when you get superhuman behavior.

I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.

Agreed explicitly for the record.

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer

Why privately?!  Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does?  This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too.  Especially if people think they have solutions.  They should talk.

OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked in the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.

Here's one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever ...

For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list.

I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties.  I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously.  I realize I haven't done enough of this myself, but if you've already written up the comp...

I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."

If you are currently looking for the list of difficulties: see the long footnote

If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state...

Best list so far, imo; it's what to beat.

If I try to compress your idea down to a few sentences:

The humans ask the AI to produce design tools, rather than designs, such that there's a bunch of human cognition that goes into picking out the particular atomic arrangements or synthesis pathways; and we can piecewise verify that the tool is making accurate predictions; and the tool is powerful enough that we can build molecular nanotech and an uploader by using the tool for an amount of time too short for F...

2DaemonicSigil8mo
Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them?  I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years.  If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all?  I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

If I know that it was written by aligned people?  I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.

3Richard Ngo8mo
Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you measure the total work in units of years-of-work-done-by-people-as-competent-as-Yudkowsky. And I should specify some safety threshold too - like, in the process of reaching 90% confidence, incurring less than 10% chance of running an experiment which kills you.)

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite?  That can't be done?

4Richard Ngo8mo
Hmm, okay, here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?

Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.

(I do think there's a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)

Nearly empty string of uncommon social inputs.  All sorts of empirical inputs, including empirical inputs in the social form of other people observing things.

It's also fair to say that, though they didn't argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.

Well, my disorganized list sure wasn't complete, so why not go ahead and list some of the foreseeable difficulties I left out?  Bonus points if any of them weren't invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.

Sure—that's easy enough. Just off the top of my head, here's five safety concerns that I think are important that I don't think you included:

• The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.

• It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates i

...

Well, there's obviously a lot of points missing!  And from the amount this post was upvoted, it's clear that people saw the half-assed current form as valuable.

Why don't you start listing out all the missing further points, then?  (Bonus points for any that don't trace back to my own invention, though I realize a lot of people may not realize how much of this stuff traces back to my own invention.)

4Evan Hubinger8mo
I'm not sure what you mean by missing points? I only included your technical claims, not your sociological ones, if that's what you mean.

Humans point to some complicated things, but not via a process that suggests an analogous way to use natural selection or gradient descent to make a mesa-optimizer point to particular externally specifiable complicated things.

6Alex Turner8mo
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world? Likewise, when you wrote, Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, accidentally learn to terminally care about the real world? Because the former implies the existence of a better alignment paradigm (that which occurs within the human brain, to take an empty-slate human and grow them into an intelligence which terminally cares about objects in reality), and the latter is extremely unlikely. Let me know if you meant something else. EDIT: Updated a few confusing words.

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one

...

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

Arbital was meant to support galaxy-brained attempts like this; Arbital failed.

This seems to me like a case of the imaginary hypothetical "weak pivotal act" that nobody can ever produce.  If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.

Okay, I will try to name a strong-but-checkable pivotal act.

(Having a strong-but-checkable pivotal act doesn't necessarily translate into having a weak pivotal act. Checkability allows us to tell the difference between a good plan and a trapped plan with high probability, but the AI has no reason to give us a good plan. It will just produce output like "I have insufficient computing power to solve this problem" regardless of whether that's actually true. If we're unusually successful at convincing the AI our checking process is bad when it's actually good,...

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

3Koen Holtman8mo
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555 [https://www.lesswrong.com/s/3dCMdafmKmb6dRjMF/p/7EnZgaepSBwaZXA5y], if you ever had AGI technology, and what you can do with that in terms of safety. I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.

0handoflixue8mo
Anecdotally: even if I could write this post, I never would have, because I would assume that Eliezer cares more about writing, has better writing skills, and has a much wider audience. In short, why would I write this when Eliezer could write it? You might want to be a lot louder if you think it's a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.

Just to state the reigning orthodoxy among the Wise, if not among the general population: the interface between "AI developers" and "one AI" appears to be hugely more difficult, hugely more lethal, and vastly qualitatively different, from every other interface.  There's a horrible opsec problem with respect to single defectors in the AI lab selling your code to China which then destroys the world, but this horrible opsec problem has nothing in common with the skills and art needed for the purely technical challenge of building an AGI that doesn't dest...

The concept of "interfaces of misalignment" does not mainly point to GovAI-style research here (although it also may serve as a framing for GovAI). The concrete domains separated by the interfaces in the figure above are possibly a bit misleading in that sense:

For me, the "interfaces of misalignment" are generating intuitions about what it means to align a complex system that may not even be self-aligned - rather just one aligning part of it. It is expanding not just the space of solutions, but also the space of meanings of "success". (For example, one ext...

My guess is an attempt to explain where I think we actually differ in "generative intuitions" will be more useful than a direct response to your conclusions, so here it is. How to read it: roughly, this is attempting to just jump past several steps of double-crux to the area where I suspect actual cruxes lie.

Continuity

In my view, your ontology of thinking about the problem is fundamentally discrete. For example, you are imagining a sharp boundary between a class of systems "weak, won't kill you, but also won't help you with alignment" and "st...

And if humans had a utility function and we knew what that utility function was, we would not need CEV.  Unfortunately extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated.  There is ultimately a rescuable truth about what we want, and CEV i...

4Vanessa Kosoy8mo
I agree that it's a tricky problem, but I think it's probably tractable. The way PreDCA [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=vKw6DB9crncovPxED] tries to deal with these difficulties is:

* The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
* Some of those programs are "agentic" programs.
* Agentic programs have approximately well-defined utility functions.
* Disassembling the humans doesn't change anything, since it doesn't affect the programs that were already running[1] before the AI was turned on.
* Since we're looking at agent-programs rather than specific agent-actions, there is much more ground for inference about novel situations.

Obviously, the concepts I'm using here (e.g. which programs are "running" or which programs are "agentic") are non-trivial to define, but infra-Bayesian physicalism [https://www.lesswrong.com/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized] does allow us to define them (not without some caveats, but hopefully at least to a 1st approximation).

[1] More precisely, I am looking at agents which could prevent the AI from becoming turned on, this is what I call "precursors".

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands \$6 from me in the Ultimatum game, threatening to leave us both with \$0 unless I offer at least \$6 to them... then I offer \$6 with slightly less than 5/6 probability, so they do no better than if they demanded \$5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this ...

1DaemonicSigil9mo
The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?
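The arithmetic behind the Ultimatum game comment above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the margin `epsilon = 1/100` are hypothetical choices, not anything from the thread, and "slightly less than 5/6" is modeled as "fair share divided by demand, minus epsilon". It is not a claim about what any particular decision theory outputs.

```python
from fractions import Fraction

# Hypothetical margin standing in for "slightly less than"; any small positive value works.
DEFAULT_EPSILON = Fraction(1, 100)


def acceptance_probability(demand, fair_share, epsilon=DEFAULT_EPSILON):
    """Probability of giving in to a demand.

    Fair or sub-fair demands are always accepted. A greedy demand is accepted
    with probability slightly below fair_share / demand, so the demander's
    expected take ends up slightly below the fair share.
    """
    if demand <= fair_share:
        return Fraction(1)
    return Fraction(fair_share, demand) - epsilon


def expected_payoff_to_demander(demand, fair_share, epsilon=DEFAULT_EPSILON):
    """Expected amount the demander receives under the rule above."""
    return demand * acceptance_probability(demand, fair_share, epsilon)


if __name__ == "__main__":
    # The comment's numbers: $5 is considered fair, the other player demands $6.
    for demand in (5, 6, 7, 10):
        p = acceptance_probability(demand, 5)
        ev = expected_payoff_to_demander(demand, 5)
        print(f"demand ${demand}: accept with p={p}, demander expects ${float(ev):.2f}")
```

With these assumed numbers, demanding \$6 is accepted with probability 5/6 - 1/100 and yields an expectation just under \$5, so the greedy demand does no better than asking for the fair amount, which is the point the comment makes.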

I agree with all this I think.

This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely.

However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely.

If toda...

Idea A (for “Alright”): Humanity should develop hardware-destroying capabilities — e.g., broadly and rapidly deployable non-nuclear EMPs — to be used in emergencies to shut down potentially-out-of-control AGI situations, such as an AGI that has leaked onto the internet, or an irresponsible nation developing AGI unsafely.

Sounds obviously impossible in real life, so how about you go do that and then I'll doff my hat in amazement and change how I speak of pivotal acts. Go get gain-of-function banned, even, that should be vastly simpler. Then we can talk ...

3Andrew Critch8mo
Eliezer, from outside the universe I might take your side of this bet. But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk. A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?): https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB

Yes, it was an intentional part of the goal.

If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV.  I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to preve...

I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.

From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that...

"my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me

This may not be what evolution had "in mind" when it created us. But couldn't we copy something like this into a machine so that it "thinks" of us (and our descendants) as its "fellow humans" who should "get nice stuff"? I understand that we don't know how to do that yet. But the fact that Eliezer has some kind of "don't destroy the world from a fellow human perspec...

Want to +1 that a vaguer version of this was my own rough sense of RNNs vs. CNNs vs. Transformers.

4Paul Christiano1y
I think transformers are a big deal, but I think this comment is a bad guess at the counterfactual and it reaffirms my desire to bet with you about either history or the future. One bet down, handful to go?

As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I'd like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.

6Paul Christiano1y
I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.) As far as I can tell, the source for your attribution of this "prediction" is: As far as I could tell it sounds from the surrounding text like his "prediction" for transformative impacts from AI was something like "between 2010 and 2030" with broad error bars.

It does fit well there, but I think it was more inspired by the person I met who thought I was being way too arrogant by not updating in the direction of OpenPhil's timeline estimates to the extent I was uncertain.

Maybe another way of phrasing this - how much warning do you expect to get, how far out does your Nope Vision extend?  Do you expect to be able to say "We're now in the 'for all I know the IMO challenge could be won in 4 years' regime" more than 4 years before it happens, in general?  Would it be fair to ask you again at the end of 2022 and every year thereafter if we've entered the 'for all I know, within 4 years' regime?

Added:  This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics wou...

I think I'll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I'll get much more confident if we scale up from \$1M -> \$1B and pick the low hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.

There's not really a constant time horizon for my pessimism, it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon, because theorem-proving has not had much investment so compute can be scaled up several orde...

I also think human brains are better than elephant brains at most things - what did I say that sounded otherwise?

2Paul Christiano1y
Oops, this was in reference to the later part of the discussion where you disagreed with "a human in a big animal body, with brain adapted to operate that body instead of our own, would beat a big animal [without using tools]".

Okay, then we've got at least one Eliezerverse item, because I've said below that I think I'm at least 16% for IMO theorem-proving by end of 2025.  The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that's irrelevant because it's first-order estimates we should be comparing here.

So we've got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse?  Pretty bare port...

I think IMO gold medal could be well before massive economic impact, I'm just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%, it's hard reasoning about the tails.

I'd say <4% on end of 2025.

I think this is the flipside of me having an intuition where I say things like "AlphaGo and GPT-3 aren't that surprising"---I have a sense for what things are and aren't surprising, and not many things happen that are...

I expect it to be hella difficult to pick anything where I'm at 75% that it happens in the next 5 years and Paul is at 25%.  Heck, it's not easy to find things where I'm at over 75% that aren't just obvious slam dunks; the Future isn't that easy to predict.  Let's get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.

I frame no prediction about whether Paul is under 16%.  That's a separate matter.  I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I'm like "lol wut" and double their probability, even if it turns out that Paul agrees with me about it.

Ha!  Okay then.  My probability is at least 16%, though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more.  Paul?

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probabilit...

5Paul Christiano1y
Based on the other thread I now want to revise this prediction, both because 4% was too low and "IMO gold" has a lot of noise in it based on test difficulty. I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer to just pick the hardest problem after seeing the test, but it seems better to commit to a procedure.) Maybe I'll go 8% on "gets gold" instead of "solves hardest problem." Would be good to get your updated view on this so that we can treat it as staked out predictions.
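
For readers who want the "hardest problem" procedure above spelled out, here is a minimal sketch of the stated rule. The function name and the topic labels ("geometry", "combinatorics", "algebra") are assumptions made for illustration; the comment only fixes the rule itself, not how problems get classified by topic.

```python
def hardest_problem(topic_p3: str, topic_p6: str) -> int:
    """Return the problem number (3 or 6) treated as 'hardest'.

    Sketch of the rule in the comment above. Topic strings are a
    hypothetical labelling; classifying a real IMO problem by topic
    is a judgment call this function does not make.
    """
    # (i) Problem 6 is geometry: fall back to problem 3.
    if topic_p6 == "geometry":
        return 3
    # (ii) Problem 3 is combinatorics and problem 6 is algebra: use problem 3.
    if topic_p3 == "combinatorics" and topic_p6 == "algebra":
        return 3
    # Default case: problem 6 is the hardest.
    return 6
```

Committing to a rule like this before the test is published removes the judgment call about which problem was "really" hardest, at the cost of occasionally picking the wrong one.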

I don't care about whether the AI is open-sourced (I don't expect anyone to publish the weights even if they describe their method) and I'm not that worried about our ability to arbitrate overfitting.

Ajeya suggested that I clarify: I'm significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I'd guess about 5-10% could be done with no deep learning and m...

2Matthew Barnett1y
If this task is bad for operationalization reasons, there are other theorem proving benchmarks [https://paperswithcode.com/task/automated-theorem-proving]. Unfortunately it looks like there aren't a lot of people that are currently trying to improve on the known benchmarks, as far as I'm aware. The code generation benchmarks [https://paperswithcode.com/task/code-generation] are slightly more active. I'm personally partial to Hendrycks et al.'s APPS benchmark [https://arxiv.org/pdf/2105.09938v3.pdf], which includes problems that "range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability." (GitHub link [https://github.com/hendrycks/apps]).
4Matthew Barnett1y
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul's camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal). The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul's model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul's. Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021 [https://www.imo-official.org/year_individual_r.aspx?year=2021]

Mostly, I think the Future is not very predictable in some ways, and this extends to, for example, it being possible that 2022 is the year where we start Final Descent and by 2024 it's over, because it so happened that although all the warning signs were Very Obvious In Retrospect they were not obvious in antecedent and so stuff just started happening one day.  The places where I dare to extend out small tendrils of prediction are the rare exception to this rule; other times, people go about saying, "Oh, no, it definitely couldn't start in 2022" a...

I'm mostly not looking for virtue points, I'm looking for: (i) if your view is right then I get some kind of indication of that so that I can take it more seriously, (ii) if your view is wrong then you get some feedback to help snap you out of it.

I don't think it's surprising if a GPT-3 sized model can do relatively good translation. If talking about this prediction, and if you aren't happy just predicting numbers for overall value added from machine translation, I'd kind of like to get some concrete examples of mediocre translations or concrete problems with existing NMT that you are predicting can be improved.

If they've found some way to put a lot more compute into GPT-4 without making the model bigger, that's a very different - and unnerving - development.