Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."
This is the second of a two-post series on foom (previous post) and doom (this post).
The last post talked about how I expect future AI to be different from present AI. This post will argue that this future AI will be egregiously misaligned and scheming, not even ‘slightly nice’, absent some future conceptual breakthrough.
I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]
In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs...
This is a two-post series on AI “foom” (this post) and “doom” (next post).
A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...
In this comment, I'll try to respond at the object level, arguing for why I expect a slower takeoff than "brain in a box in a basement". I'd also be down to try a dialogue/discussion at some point.
...1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.”
I think of this as a classic @paulfchristiano-style rebuttal (see e.g. Yudkowsky and Christiano discuss "Takeoff Speeds", 2021).
In terms of reference class forecasting, I concede that it’s rather rare for technologies with extreme profit...
Consider the following vignette:
...It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the humans they’ve replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research turns out to be disappointing), and control evaluations become increasingly unreliable (because of worries
Relatedly, a key practical consideration for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned, which would substantially reduce the value of such a deal (since the evidence wouldn't be convincing to skeptics).
(We should presumably pay the AI something for admitting it is misaligned, and pay more if it can provide compelling evidence of this.)
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, which I'll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...
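To make the “outsiders” end of the spectrum concrete, here is a toy sketch of my own (the `find_user` helper and the SQL setting are illustrative, not from the post): the invariant “user-supplied text is treated only as data, never as query structure” is enforced by a static, automated mechanism, no matter what the user types.

```python
# Toy illustration (mine, not the post's): a security invariant maintained by a
# static, automated restriction, regardless of what text an outsider submits.
import sqlite3

def find_user(conn: sqlite3.Connection, requested_name: str):
    # The "?" placeholder binds requested_name as data only, so even adversarial
    # input like "x' OR '1'='1" cannot change the structure of the query.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (requested_name,))
    return cur.fetchall()
```

The point of the sketch is that the restriction is fully mechanical: it does not depend on trusting whoever supplies the input, which is what distinguishes this end of the spectrum from “security from insiders”.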
Current “unlearning” methods only suppress capabilities instead of truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.
Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.
Read our paper on ArXiv and enjoy an interactive demo.
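For readers who want the core mechanic, here is a minimal sketch (mine, not the paper's code) of the “distill the unlearned teacher into a randomly initialized student” step, assuming HuggingFace-style causal LMs whose forward pass returns `.logits` and a standard temperature-scaled KL distillation loss; the names `distill_step`, `student`, and `teacher` are illustrative.

```python
# Minimal sketch of distilling an "unlearned" teacher into a freshly initialized
# student (illustrative; assumes models whose forward pass returns .logits).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, temperature=2.0):
    """One distillation step: push the student's next-token distribution toward
    the (frozen, unlearned) teacher's distribution via temperature-scaled KL."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # teacher is not updated
    student_logits = student(input_ids).logits
    t = temperature
    loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return loss
```

Intuitively, the student starts from random initialization and only ever learns from the teacher's already-suppressed outputs, so it plausibly never acquires the weights that encoded the unwanted capability, which is one way to see why relearning becomes harder.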
Maybe some future AI has long-term goals and humanity is in its...
Wasn't it the case that, for some reason, full distillation had a compute requirement comparable to data filtering? I was surprised by that. My impression is that distillation should cost more like 10% of pretraining (which is roughly what data filtering costs), which would make the computational UNDO results much stronger. Not sure what happened here.
Join us at the fifth Human-Aligned AI Summer School in Prague from 22nd to 25th July 2025!
Update: We have now confirmed our speaker list with excellent speakers -- see below!
Update: We still have capacity for more excellent participants as of late June. Please help us spread the word to people who would be a good fit for the school. We also have some additional funding to financially support some of the participants.
We will meet for four intensive days of talks, workshops, and discussions covering the latest trends in AI alignment research and broader framings of AI risk.
Apply now; applications are evaluated on a rolling basis.
The intended audience of the school includes people interested in learning more about AI alignment, PhD students, researchers working in ML/AI, and talented...
Our speaker lineup is now confirmed and features an impressive roster of researchers; see the updated list above.
As of late June, we still have capacity for additional participants, and we'd appreciate your help with outreach! We have accepted a very strong cohort so far, and the summer school will be a great place to learn from our speakers, engage in insightful discussions, and connect with other researchers in AI alignment. In addition to people already familiar with AI alignment, we're particularly interested in reaching technical students and research...
I don't think this is an accurate paraphrase of my perspective.
My view ...