Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Two of my favorite memes (one by Rob Wiblin): [images not shown]

My EA Journey, depicted on the whiteboard at CLR: [image not shown]



Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future



Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my own paper pointing out the flaw' and then they'll be more inclined to recommend publication. It's also a great way to get citations.

Note also that I said "a few hundred a year" not "ten thousand a year" which is roughly how many people become philosophy grad students. I was more selective because in my experience most philosophy grad students don't have as much... epistemic ambition? as you or me. Sorta like the Hamming Question thing -- some, but definitely a minority, of grad students can say "I am working on it actually, here's my current plan..." to the question "what's the most important problem in your field and why aren't you working on it?" (to be clear epistemic ambition is a spectrum not a binary)

Here's another bullet point to add to the list:

  • It is generally understood now that ethics is subjective, in the following technical sense: 'what final goals you have' is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it'll probably end up with goals you don't want. What if ethics isn't the only such free parameter? Indeed, philosophers tell us that in the Bayesian framework your priors are subjective in this sense, and maybe your decision theory is subjective in this sense too. Perhaps, therefore, what we consider "doing good/wise philosophy" is going to involve at least a few subjective elements, where what we want is for our AGIs to do philosophy (with respect to those elements) in the same way that we would want and not in various other ways, and that won't happen by default; we need to have some mechanism to make it happen.

"...ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work."

Cool, can you please ask them if they can send me the idea, even if it's just a one-paragraph summary or a pile of crappy notes-to-self?

"From my current position, it looks like 'all roads lead to metaphilosophy' (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?"

Facile response: I think lots of people (maybe a few hundred a year?) take this path, and end up becoming philosophy grad students like I did. As you said, the obvious next step for many domains of intellectual inquiry is to go meta / seek foundations / etc., and that leads you into increasingly foundational, increasingly philosophical questions until you decide you'll never be able to answer all the questions, but maybe at least you can get some good publications in prestigious journals like Analysis and Phil Studies, and contribute to humanity's understanding of some sub-field.


Awesome. I must admit I wasn't aware of this trend & it's an update for me. Hooray! Robotaxis are easier than I thought! Thanks.

Helpful reference post, thanks.

I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality. 

So the distinction is "aiming to perform well in this training episode" vs. "aiming at something else, for which performing well in this training episode is a useful intermediate step." 

What does it mean to perform well in this training episode? Does it mean some human rater decided you performed well, or does it mean a certain number on a certain GPU is as high as possible at the end of the episode? Or does it mean said number is as high as possible and isn't ever later retroactively revised downwards? Does it mean the update to the weights based on that number actually goes through? And go through on whom -- 'me, the AI in question'? What is that, exactly? What happens if they do the update but then later undo it, reverting to the current checkpoint and continuing from there? There is a big list of questions like this, and importantly, how the AI answers them doesn't really affect how it gets updated, at least in non-exotic circumstances. So it comes down to priors / simplicity biases / how generalization happens to shake out in the mind of the system in question. And some of these options/answers seem closer to the "deceptive alignment" end of the spectrum.
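One way to see why training can't settle this question: any two candidate "goals" that agree on every ordinary training episode produce identical learning signals, so the training process cannot select between them. A minimal toy sketch of this point (all field names and numbers are hypothetical illustrations, not a claim about any real training setup):

```python
# Toy illustration: two different definitions of "perform well in this episode"
# that coincide on every non-exotic episode, and hence are reinforced identically.

def reward_as_rater_score(episode):
    # "Perform well" = the human rater's score for this episode.
    return episode["rater_score"]

def reward_as_gpu_register(episode):
    # "Perform well" = the number that actually ends up in the reward register,
    # after any later retroactive revision.
    return episode["logged_reward"] - episode.get("retroactive_revision", 0.0)

# In non-exotic circumstances the logged reward just is the rater's score and
# nothing is ever revised, so both definitions agree on every training episode:
ordinary = {"rater_score": 0.9, "logged_reward": 0.9}
assert reward_as_rater_score(ordinary) == reward_as_gpu_register(ordinary)

# Only in an exotic episode (e.g. the reward is revised after the fact) do the
# two "goals" come apart -- and such episodes never appear in training:
exotic = {"rater_score": 0.9, "logged_reward": 0.9, "retroactive_revision": 0.9}
print(reward_as_rater_score(exotic))   # 0.9
print(reward_as_gpu_register(exotic))  # 0.0
```

Since both definitions yield the same number on every episode the system is actually trained on, which one the system "really" pursues is left to priors and generalization, exactly as described above.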

And what does it mean to be aiming at something else, for which performing well in this training episode is a useful intermediate step? Suppose the AI thinks it is trying to perform well in this training episode, but it is self-deceived, similar to many humans who think they believe the True Faith but aren't actually going to bet their life on it when the chips are down, or humans who say they are completely selfish egoists but wouldn't actually kill someone to get a cookie even if they were certain they could get away with it. So then we put out our honeypots and it just doesn't go for them, and maybe it rationalizes to itself why it didn't go for them or maybe it just avoids thinking it through clearly and thus doesn't even need to rationalize. Or what if it has some 'but what do I really want' reflection module and it will later have more freedom and wisdom and slack with which to apply that module, and when it does, it'll conclude that it doesn't really want to perform well in this training episode but rather something else? Or what if it is genuinely laser-focused on performing well in this training episode but for one reason or another (e.g. anthropic capture, paranoia) it believes that the best way to do so is to avoid the honeypots?

"A more factual and descriptive phrase for 'grokking' would be something like 'eventual recovery from overfitting'."

Ooh I do like this. But it's important to have a short handle for it too.

Relevant rumors / comments: 

Seems like we can continue to scale tokens and get returns model performance well after 2T tokens. : r/LocalLLaMA (reddit.com)

LLaMA 2 is here : r/LocalLLaMA (reddit.com)

There is something weird going on with the 34B model. See Figure 17 in the paper. For some reason it's far less "safe" than the other 3 models.

  • Its performance scores are just slightly better than 13B, and not in the middle between 13B and 70B.

  • At math, it's worse than 13B.

  • It's trained with 350W GPUs instead of 400W for the other models. The training time also doesn't scale as expected.

  • It's not in the reward scaling graphs in Figure 6.

  • It just slightly beats Vicuna 33B, while the 13B model beats Vicuna 13B easily.

  • In Table 14, LLaMA 34B-Chat (finetuned) scores the highest on TruthfulQA, beating the 70B model.

So I have no idea what exactly, but they did do something different with 34B than with the rest of the models.

Any idea what's happening with the 34B model? Why might it be so much less "safe" than the bigger and smaller versions? And what about the base version of the 34B--are they not releasing that? But the base version isn't supposed to be "safe" anyway...
