CEO at Redwood Research.
AI safety is a highly collaborative field: almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we're ever arguing on LessWrong and it feels heated, and you think it would go better if we talked about it verbally, please feel free to contact me; I'll probably be willing to hop on a brief call to discuss.
Fwiw I’m not sure this is right; I think that a lot of questions about decision theory become pretty obvious once you start thinking about digital minds and simulations. And my guess is that a lot of FDT-like ideas would have become popular among people like the ones I work with, once people were thinking about those questions.
I do judge comments more harshly when they're phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I'd probably have written something like:
I don't think this is an important source of risk. I think that basically all the AI x-risk comes from AIs that are smart enough that they'd notice their own overconfidence (maybe after some small number of experiences being overconfident) and then work out how to correct for it.
There are other epistemic problems that I think might affect the smart AIs that pose x-risk, but I don't think this is one of them.
In general, this seems to me like a minor capability problem that is very unlikely to affect dangerous AIs. I'm very skeptical that trying to address such problems is helpful for mitigating x-risk.
What changed? I think it's only slightly more hedged. I personally like using "I think" everywhere, for the reason I say here and the reason Ben gives in response. To me, my version also more clearly describes the structure of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying "basically all the AI x-risk comes from" instead of "The kind of intelligent agent that is scary", I think I'm stating the claim in a way that you'd agree with, but in a way that makes it slightly more obvious what I mean and how to dispute it; it's a lot easier to argue about where x-risk comes from than about whether something is "scary").
I also think that the word "stupid" parses as harsh, even though you're using it to describe something on the object level and it's not directed at any humans. It feels like the kind of word you'd use if you were angry while writing your comment and didn't care whether your interlocutors thought you might be angry.
I think my comment reads as friendlier and less like I want the person I'm responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I'm more open to discussing the topic.
(Tbc, sometimes I do want the person I'm responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that's what you wanted here. But I think that would be a mistake in this case.)
It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren't capable of taking over.
As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don't think that that consideration is overwhelmingly strong. So I think it's totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I'm overall positive on research on making AIs better at conceptual research.
Overall, I think your comment is quite unreasonable and overly rude.
I'm not aware of anything written before this post about what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AI behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds have come to be used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but isn't the design used by agents that you run on your own computer, like Claude Code or Gemini CLI. I think agents will move in the direction I described, especially as people want to be able to work with more of them, give them longer tasks, and let them use their own virtual machines for programming so they don't step on each other's toes all the time.
The terminology I introduced here is widely used by the people I know who think about insider threat from AI agents, but as far as I know it hasn't penetrated very far outside my cluster.
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
Re that last point, you might be interested to read about "the Constitution is not a suicide pact": many prominent American political figures have said that survival of the nation is more important than constitutionality (and this has been reasonably well received by other actors, rather than reviled).
Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
(I haven't spelled out what I think about decision theory publicly, so presumably this won't be completely informative to you. A quick summary is that I think that questions related to decision theory and anthropics are very important for how the future goes, and relevant to some aspects of the earliest risks from misaligned power-seeking AI.)
My impression is that people at AI companies whose opinions about risk from misalignment are similar to mine tend to also have pretty similar opinions on decision theory etc., while AI company staff with more different opinions tend not to have thought much about decision theory.