Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)



Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Wiki Contributions


Do you have any sense of whether or not the models thought they were in a simulation?

Yeah I wasn't disagreeing with you to be clear. Just adding.

Current AIs suck at agency skills. Put a bunch of them in AutoGPT scaffolds and give them each their own computer and access to the internet and contact info for each other and let them run autonomously for weeks and... well I'm curious to find out what will happen, I expect it to be entertaining but not impressive or useful. Whereas, as you say, randomly sampled humans would form societies and fnd jobs etc.

This is the common thread behind all your examples Hjalmar. Once we teach our AIs agency (i.e. once they have lots of training-experience operating autonomously in pursuit of goals in sufficiently diverse/challenging environments that they generalize rather than overfit to their environment) then they'll be AGI imo. And also takeoff will begin, takeover will become a real possibility, etc. Off to the races.

FWIW I'm potentially intrested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.

I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming." 

Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?

  • a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it); and that 

Curious to hear whether I was one of the people who contributed to this.

To put it in terms of the analogy you chose: I agree (in a sense) that the routes you take home from work are strongly biased towards being short, otherwise you wouldn't have taken them home from work. But if you tell me that today you are going to try out a new route, and you describe it to me and it seems to me that it's probably going to be super long, and I object and say it seems like it'll be super long for reasons XYZ, it's not a valid reply for you to say "don't worry, the routes I take home from work are strongly biased towards being short, otherwise I wouldn't take them." At least, it seems like a pretty confusing and maybe misleading thing to say. I would accept "Trust me on this, I know what I'm doing, I've got lots of experience finding short routes" I guess, though only half credit for that since it still wouldn't be an object level reply to the reasons XYZ and in the absence of such a substantive reply I'd start to doubt your expertise and/or doubt that you were applying it correctly here (especially if I had an error theory for why you might be motivated to think that this route would be short even if it wasn't.)

Thanks. The routes-home example checks out IMO. Here's another one that also seems to check out, which perhaps illustrates why I feel like the original claim is misleading/unhelpful/etc.: "The laws of ballistics strongly bias aerial projectiles towards landing on targets humans wanted to hit; otherwise, ranged weaponry wouldn't be militarily useful."

There's a non-misleading version of this which I'd recommend saying instead, which is something like "Look we understand the laws of physics well enough and have played around with projectiles enough in practice, that we can reasonably well predict where they'll land in a variety of situations, and design+aim weapons accordingly; if this wasn't true then ranged weaponry wouldn't be militarily useful."

And I would endorse the corresponding claim for deep learning: "We understand how deep learning networks generalize well enough, and have played around with them enough in practice, that we can reasonably well predict how they'll behave in a variety of situations, and design training environments accordingly; if this wasn't true then deep learning wouldn't be economically useful."

(To which I'd reply "Yep and my current understanding of how they'll behave in certain future scenarios is that they'll powerseek, for reasons which others have explained... I have some ideas for other, different training environments that probably wouldn't result in undesired behavior, but all of this is still pretty up in the air tbh I don't think anyone really understands what they are doing here nearly as well as e.g. cannoneers in 1850 understood what they were doing.")

I said "Either that, or it's straightup magical thinking" which was referring to the causal arrow hypothesis. I agree it's unlikely that they would endorse the causal arrow / magical thinking hypothesis, especially once it's spelled out like that. 

What do you think they meant by "Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful?"

Load More