Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

Sequences

Agency: What it is and why it matters

AI Timelines

Takeoff and Takeover in the Past and Future

Posts


3 · Daniel Kokotajlo's Shortform

30 · Self-Awareness: Taxonomy and eval suite proposal

5mo

75 · AI Timelines

9mo

44 · Paper: On measuring situational awareness in LLMs

19 · AGI is easier than robotaxis

32 · Linkpost: Github Copilot productivity experiment

24 · Replacement for PONR concept

15 · Why agents are powerful

23 · Gradations of Agency

37 · Deepmind's Gato: Generalist Agent

15 · Interlude: Agents as Automobiles


Comments

A simple case for extreme inner misalignment

Daniel Kokotajlo · 13d · 41

This doesn't sound like an argument Yudkowsky would make, though it seems to have some similar concepts. And it's interesting food for thought regardless -- thanks! Looking forward to the rest of the series.

Daniel Kokotajlo's Shortform

Daniel Kokotajlo · 17d · 60

Rereading this classic by Ajeya Cotra: https://www.planned-obsolescence.org/july-2022-training-game-report/

I feel like this is an example of a piece that is clear, well-argued, important, etc. but which doesn't seem to have been widely read and responded to. I'd appreciate pointers to articles/posts/papers that explicitly (or, failing that, implicitly) respond to Ajeya's training game report. Maybe the 'AI Optimists'?

Daniel Kokotajlo's Shortform

Daniel Kokotajlo · 17d · 2215

I found this article helpful and depressing. Kudos to TracingWoodgrains for the detailed, thorough investigation.

Matthew Barnett's Shortform

Daniel Kokotajlo · 17d · 30

Thanks for this, Matthew, it was an update for me -- according to the quote you pulled, Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that Bostrom probably didn't have much of an opinion about this.)

A dilemma for prosaic AI alignment

Daniel Kokotajlo · 1mo · 20

I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.

Update: Seems to probably be true enough in practice! Maybe in the limit pretrained LLMs would have dangerous levels of agency, and some model-whisperers think they might be situationally aware already iirc, but for the most part the answer is no, things are fine, pretrained models probably aren't situationally aware or agentic. In retrospect I think doubt was warranted, but not as much doubt as I had -- I should have agreed that probably things would be fine in practice.

What 2026 looks like

Daniel Kokotajlo · 1mo · 40

Agreed. Though I don't feel like I have good visibility into which actors are using AI-driven propaganda and censorship, and how extensively.

Matthew Barnett's Shortform

Daniel Kokotajlo · 1mo · 35

Good question. I want to think about this more, I don't have a ready answer. I have a lot of uncertainty about how long it'll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I'm skeptical. The longer it takes, the more likely it is that we'll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!

Matthew Barnett's Shortform

Daniel Kokotajlo · 1mo · 30

It's not about timelines, it's about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is 'agency skills.' So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we'll face the problem of corrigibility breakdowns only really happening right around the time when it's too late or almost too late.

Matthew Barnett's Shortform

Daniel Kokotajlo · 1mo · 54

Thanks for this detailed reply!

> We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this

Depending on what you mean by "on their way towards being solved" I'd agree. The way I'd put it is: "We didn't know what the path to AGI would look like; in particular we didn't know whether we'd have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that's good in some ways and bad in other ways, it's probably overall good. Huzzah! However, our core problems remain, and we don't have much time left to solve them."

(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul's stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)

> I agree that current frontier models are only a "tiny bit agentic". I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we've seen enough to know that corrigibility probably won't be that hard to train into a system that's only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?

Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I'm interested in seeing if we can make some bets on this though; if we can, great; if we can't, then at least we can avoid future disagreements about who should update.

> There's a bit of a trivial definitional problem here. If it's easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say "those aren't the type of AIs we were worried about". But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it's not clear why we should care? Just create the corrigible AIs. We don't need to create the things that you were worried about!

I don't think that we know how to "just create the corrigible AIs." The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won't work on much more agentic AIs. To be clear I think they might work, there's a lot of uncertainty, but I think they probably won't. I think it might be easier to see why I think this if you try to prove the opposite in detail -- like, write a mini-scenario in which we have something like AutoGPT but much better, and it's being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design its own successor. (I'm trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)

> I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the "world isn't as grim as it could have been". For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I'm glad you spelled it out more clearly.

Yay, thanks!

Matthew Barnett's Shortform

Daniel Kokotajlo · 1mo · 1920

I thought you would say that, bwahaha. Here is my reply:

(1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI's final goal is to 'make the project's sponsor happy.' Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner... until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor's brain..." My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren't plotting against us yet, but their 'values' aren't exactly what we want, and so if somehow their 'intelligence' was amplified dramatically whilst their 'values' stayed the same, they would eventually realize this and start plotting against us. (Realistically this won't be how it happens, since it'll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed.)
I'm not confident in this tbc--it's possible that the 'values' so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I'm curious what you think about this sub-question.

(2) This passage deserves a more direct response:

> I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.

Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven't been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren't the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.

(3) Here's my positive proposal for what I think is happening. There was an old vision of how we'd get to AGI, in which we'd get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting it out into the real world. If that's how it went, then plausibly the first time it started to actually seem nice to us would be because it was already plotting against us, playing along to gain power, etc. We clearly aren't in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn't as grim as it could have been, from a technical alignment perspective. However, I don't think I or Yudkowsky or Bostrom or whoever strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don't think you can say "Yudkowsky's still super doomy despite this piece of good news, he must be epistemically vicious." At any rate, speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that'll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI).
These updates have left me at p(doom) still north of 50%.