Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, now works at Center on Long-Term Risk. Research interests include acausal trade, timelines, takeoff speeds & scenarios, decision theory, history, and a bunch of other stuff. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html


AI Timelines
Takeoff and Takeover in the Past and Future

Wiki Contributions


Ngo and Yudkowsky on alignment difficulty
Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work.

Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Hot damn, where can I see these preliminary results?

Yudkowsky and Christiano discuss "Takeoff Speeds"

Fair enough! I too dislike premature meta, and feel bad that I engaged in it. However... I do still feel like my comment probably did more to prevent polarization than cause it? That's my independent impression at any rate. (For the reasons you mention).

I certainly don't want to give up! In light of your pushback I'll edit to add something at the top.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Yes, though I'm much more comfortable explaining and arguing for my own position than EY's. It's just that my position turns out to be pretty similar. (Partly this is independent convergence, but of course partly this is causal influence since I've read a lot of his stuff.)

There's a lot to talk about, I'm not sure where to begin, and also a proper response would be a whole research project in itself. Fortunately I've already written a bunch of it; see these two sequences.

Here are some quick high-level thoughts:

1. Begin with timelines. The best way to forecast timelines IMO is Ajeya's model; it should be the starting point and everything else should be adjustments from it. The core part of Ajeya's model is a probability distribution over how many OOMs of compute we'd need with today's ideas to get to TAI / AGI / APS-AI / AI-PONR / etc. [Unfamiliar with these acronyms? See Robbo's helpful comment below] For reasons which I've explained in my sequence (and summarized in a gdoc) my distribution has significantly more mass on the 0-6 OOM range than Paul does, and less on the 13+ range. The single post that conveys this intuition most is Fun with +12 OOMs.

Now consider how takeoff speed views interact with timelines views. Paul-slow takeoff and <10 year timelines are in tension with each other. If <7 OOMs of compute would be enough to get something crazy powerful with today's ideas, then the AI industry is not an efficient market right now. If we get human-level AGI in 2030, then on Paul's view that means the world economy should be doubling in 2029 and should have doubled over the course of 2025 - 2028 and should already be accelerating now probably. It doesn't look like that's happening or about to happen. I think Paul agrees with this; in various conversations he's said things like "If AGI happens in 10 years or less then probably we get fast takeoff." [Paul please correct me if I'm mischaracterizing your view!]

Ajeya (and Paul) mostly update against <10 year timelines for this reason. I, by contrast, mostly update against slow takeoff. (Obviously with both do a bit of both, like good Bayesians.)

2. I feel like the debate between EY and Paul (and the broader debate about fast vs. slow takeoff) has been frustratingly much reference class tennis and frustratingly little gears-level modelling. This includes my own writing on the subject -- lots of historical analogies and whatnot. I've tentatively attempted some things sorta like gears-level modelling (arguably What 2026 Looks Like is an example of this) and so far it seems to be pushing my intuitions more towards "Yep, fast takeoff is more likely." But I feel like my thinking on this is super inadequate and I think we all should be doing better. Shame! Shame on all of us!

3. I think the focus on GDP (especially GWP) is really off, for reasons mentioned here. I think AI-PONR will probably come before GWP accelerates, and at any rate what we care about for timelines and takeoff speeds is AI-PONR and so our arguments should be about e.g. whether there will be warning shots and powerful AI tools of the sort that are relevant to solving alignment for APS-AI systems.

(Got to go now)

Yudkowsky and Christiano discuss "Takeoff Speeds"

[ETA: In light of pushback from Rob: I really don't want this to become a self-fulfilling prophecy. My hope in making this post was to make the prediction less likely to come true, not more! I'm glad that MIRI & Eliezer are publicly engaging with the rest of the community more again, I want that to continue, and I want to do my part to help everybody to understand each other.]

And I know, before anyone bothers to say, that all of this reply is not written in the calm way that is right and proper for such arguments. I am tired. I have lost a lot of hope. There are not obvious things I can do, let alone arguments I can make, which I expect to be actually useful in the sense that the world will not end once I do them. I don't have the energy left for calm arguments. What's left is despair that can be given voice.

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

FWIW I continue to think that Yudkowskys model of how the future will go is basically right, at least more right than Christiano's. This is a big source of sadness and stress for me too, because (for example) my beloved daughter probably won't live to adulthood.

The best part IMO was the mini-essay at the end about Thielian secrets and different kinds of tech progress -- a progression of scenarios adding up to Yudkowsky's understanding of Paul's model:

But we can imagine that doesn't happen either, because instead of needing to build a whole huge manufacturing plant, there's just lots and lots of little innovations adding up to every key AGI threshold, which lots of actors are investing $10 million in at a time, and everybody knows which direction to move in to get to more serious AGI and they're right in this shared forecast.

It does seem to me that the AI industry will move more in this direction than it currently is, over the next decade or so. However I still do expect that we won't get all the way there. I would love to hear from Paul whether he endorses the view Yudkowsky attributes to him in this final essay.

Ngo and Yudkowsky on alignment difficulty

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.

If you are thinking about a piece of code that describes a bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out the code you add would be many orders of magnitude longer than the code you started with.

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.

Discussion with Eliezer Yudkowsky on AGI interventions

EY knows more neuroscience than me (I know very little) but here's a 5-min brainstorm of ideas:

--For a fixed compute budget, spend more of it on neurons associated with higher-level thought (the neocortex?) and less of it on neurons associated with e.g. motor control or vision.

--Assuming we are an upload of some sort rather than a physical brain, tinker with the rules a bit so that e.g. neuron waste products get magically deleted instead of having to be pumped out, neurons never run out of energy/oxygen and need to rest, etc. Study situations where you are in "peak performance" or "flow" and then explore ways to make your brain enter those states at will.

--Use ML pruning techniques to cut away neurons that aren't being useful, to get slightly crappier mini-Eliezers that cost 10% the compute. These can then automate away 90% of your cognition, saving you enough compute that you can either think a few times faster or have a few copies running in parallel.

--Build automated tools that search through your brain for circuits that are doing something pretty simple, like a giant OR gate or an oscillator, and then replace those circuits with small bits of code, thereby saving significant compute. If anything goes wrong, no worries, just revert to backup.

This was a fun exercise!

LCDT, A Myopic Decision Theory
Myopia is the property of a system to not plan ahead, to not think too far about the consequences of its actions, and to do the obvious best thing in the moment instead of biding its time.

This seems inconsistent with how you later use the term. Don't you nowadays say that we could have a myopic imitator of HCH, or even a myopic Evan-imitator? But such a system would need to think about the long-term consequences of its actions in order to imitate HCH or Evan, since HCH / Evan would be thinking about those things.

Ngo and Yudkowsky on alignment difficulty

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

Load More