Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."

Popular Comments

By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom. That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.
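To make the asymmetry concrete, here's a toy expected-value comparison (every number below is invented purely for illustration): a no-ransom commitment changes whether the hostage-taking happens at all, whereas refusing deals with AIs doesn't change whether the AI is already misaligned, only what happens afterward.

```python
# Toy comparison of the two cases; all probabilities and utilities are made up.

def hostage_case(policy):
    # The terrorist only bothers taking hostages if they expect to get paid,
    # so your commitment changes the probability of the bad event itself.
    p_kidnap = 0.9 if policy == "pay" else 0.05
    ransom_cost, hostage_harm = -50, -100
    return p_kidnap * (ransom_cost if policy == "pay" else hostage_harm)

def ai_case(policy, p_misaligned=0.2):
    # Misalignment is not caused by our willingness to deal: p_misaligned is
    # the same under both policies. The deal only changes what happens
    # conditional on the AI already being misaligned.
    payment_cost, takeover_harm = -10, -1000
    p_reveal = 0.5  # chance a misaligned AI takes the deal and reveals itself
    if policy == "offer deals":
        return p_misaligned * (p_reveal * payment_cost + (1 - p_reveal) * takeover_harm)
    return p_misaligned * takeover_harm

for policy in ("pay", "refuse"):
    print("hostage case:", policy, hostage_case(policy))  # refusing wins (-5 vs -45)
for policy in ("offer deals", "refuse deals"):
    print("AI case:", policy, ai_case(policy))            # offering wins (-101 vs -200)
```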
I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment.

However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper, "avoiding obfuscation with prover-estimator debate", is a bit misleading. I believe the authors are going to change this in v2.)

I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.

I think there are broadly two classes of hope about obfuscated arguments:

1. In practice, obfuscated argument problems rarely come up, due to one of:
   * It's difficult in practice to construct obfuscated arguments for arbitrary propositions. It's definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn't work, given some relatively weak assumptions about the structure of the debaters' knowledge.
   * For anything the debaters know, there's usually a sufficiently compact and stable honest argument such that the honest debater can win by giving this argument. It seems pretty easy to give counterexamples here - e.g. Paul gives the example of unsupervised translation - but maybe these are relatively rare. It's also plausible that the honest debaters in the human experiments I did, where we ran into obfuscation / instability problems, weren't using the right tactics (i.e. they could have constructed more stable arguments that used independent lines of evidence more).
2. We can create a protocol that distinguishes between cases where:
   * (not obfuscatable) the debate mirrors the structure of the debaters' understanding, such that they would be able to easily identify which subtree contains a flaw if one were present, and
   * (obfuscatable) they don't or wouldn't know which subtree contains the flaw.

   This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct. This wouldn't get us to full ELK (because maybe models still know things they have no human-understandable arguments for), but it would at least expand the class of honest arguments that we can trust to include ones that are large and unstable in human-understandable form, but where the debaters do have a faster way of identifying which subtree to go down.
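A minimal toy simulation of the distinction in (2.) above (an invented illustration, not the paper's prover-estimator protocol): if the honest debater can point at the flawed subtree, a judge with a small verification budget catches the flaw reliably; if nobody can localize it, spot-checking catches it with probability roughly budget/N.

```python
import random

# Toy model: one step of a long argument is flawed, and the judge can only
# verify `budget` steps. Setup and numbers are invented for illustration.

def catch_rate(n_steps=10_000, budget=20, debater_can_localize=True, trials=2_000):
    caught = 0
    for _ in range(trials):
        flaw = random.randrange(n_steps)
        if debater_can_localize:
            # Non-obfuscatable case: the honest debater recurses straight into
            # the flawed subtree, so the judge ends up verifying that step.
            checked = {flaw}
        else:
            # Obfuscatable case: nobody knows where the flaw is, so the judge
            # can only spot-check uniformly at random.
            checked = set(random.sample(range(n_steps), budget))
        caught += flaw in checked
    return caught / trials

print("flaw localizable by debaters:", catch_rate(debater_can_localize=True))        # ~1.0
print("flaw not localizable (obfuscated):", catch_rate(debater_can_localize=False))  # ~budget/n_steps = 0.002
```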
More recently, there has been more evidence (or at least rumors) that pretraining is continuing to yield substantial returns. It's unclear if this is due to algorithmic improvement (or data improvement) or due to scaling up training runs. Some examples:

* Opus 4 seems to be based on a new pretrained model, and this seems to be yielding some fraction of the returns. (Notably, Sonnet 4, which is presumably a smaller model, performs similarly on benchmarks like SWE-bench, but worse in general.)
* Claims in the Gemini model card.
* General rumors (not necessarily much additional evidence).

Recent Discussion

2.1 Summary & Table of contents

This is the second of a two-post series on foom (previous post) and doom (this post).

The last post talked about how I expect future AI to be different from present AI. This post will argue that this future AI will be of a type that will be egregiously misaligned and scheming, not even ‘slightly nice’, absent some future conceptual breakthrough.

I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]

In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs...

@ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.)

I don't think this is an accurate paraphrase of my perspective.

My view ... (read more)

1.1 Series summary and Table of Contents

This is a two-post series on AI “foom” (this post) and “doom” (next post).

A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...

In this comment, I'll try to respond at the object level arguing for why I expect slower takeoff than "brain in a box in a basement". I'd also be down to try to do a dialogue/discussion at some point.

1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.”

I think of this as a classic @paulfchristiano-style rebuttal (see e.g. Yudkowsky and Christiano discuss "Takeoff Speeds", 2021).

In terms of reference class forecasting, I concede that it’s rather rare for technologies with extreme profit

... (read more)
Thane Ruthenis
Fully agree with everything in this post, this is exactly my model as well. (That's the reason behind my last-line rug-pull here, by the way.)
Ryan Greenblatt
I think the LLM-focused AGI people broadly agree with what you're saying and don't see a real disagreement here.

I don't see an important distinction between "AIs can figure out development and integration R&D" and "AIs can just learn the relevant skills". Like, the AIs are doing some process which results in an AI that can perform the relevant task. This could be an AI updated by some generic continual learning algorithm or an AI trained on a bunch of RL environments that AIs create; it doesn't ultimately make much of a difference so long as it works quickly and cheaply. (There might be a disagreement about what sample efficiency (as in, how efficiently AIs can learn from limited data) people are expecting AIs to have at different levels of automation.)

Similarly, note that humans also need to do things like "figure out how to learn some skill" or "go to school". AIs might likewise need to design a training strategy for themselves (if existing human training programs don't work or would be too slow), but it doesn't really matter.
Ryan Greenblatt
Actually, I don't think Paul says this is a failed prediction in the linked text. My understanding is that his sentence is supposed to be read as "[incorrectly predicting that physicists would be wrong about the existence of the Higgs boson (LW bet registry)] and [expressing the view that real AI would likely emerge from a small group rather than a large industry]". Paul isn't claiming that the view that real AI would likely emerge from a small group is a failed prediction!

----------------------------------------

On "improbable and crazy", note that Paul says the view "looks if anything even more improbable and crazy than it did then". I think your quotation is reasonable, but it's unclear if Paul thinks this is "crazy" or if he thinks it's just more incorrect and crazy-looking than it was in the past.

Consider the following vignette:

It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the humans they’ve replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research turns out to be disappointing), and control evaluations become increasingly unreliable (because of worries

...
Julian Stastny
I think it's interesting that @Lukas Finnveden seems to think in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.
Lukas Finnveden
Thanks for writing this! I agree with most of it.

One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high, and partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.

Here's why I think the information value could be really high: it's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:

* stops sandbagging and demonstrates much higher capabilities
* tells us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
* admits that it was looking for ways to take over the world but couldn't find any that were good enough, so now it wants to work with us instead

The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)

And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA metho
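To give a rough sense of how the information value can dominate, here is a back-of-the-envelope calculation with invented numbers (a sketch, not a claim about real magnitudes): the deal is valuable mainly because a revealed misalignment changes the deployment decision.

```python
# All probabilities and utilities below are made up for illustration.
p_misaligned = 0.2
p_reveal_given_misaligned = 0.5   # chance a misaligned AI takes the deal and reveals itself
u_deploy_aligned = 100            # value of widely deploying an aligned AI
u_deploy_misaligned = -1000       # cost of widely deploying a misaligned AI
u_hold_back = 0                   # fall back to weaker systems instead
deal_cost = 5                     # paid when the AI takes the deal

# Baseline: no deal is offered, so the AI gets widely deployed either way.
ev_no_deal = (1 - p_misaligned) * u_deploy_aligned + p_misaligned * u_deploy_misaligned

# With the deal: hold back if the AI reveals misalignment, deploy otherwise.
ev_with_deal = (
    (1 - p_misaligned) * u_deploy_aligned
    + p_misaligned * p_reveal_given_misaligned * (u_hold_back - deal_cost)
    + p_misaligned * (1 - p_reveal_given_misaligned) * u_deploy_misaligned
)

print(ev_no_deal)                 # -120.0
print(ev_with_deal)               # -20.5
print(ev_with_deal - ev_no_deal)  # 99.5, the information value of offering the deal
```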

Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned, which would substantially reduce the value of such a deal (as this evidence isn't convincing to skeptics).

(We should presumably pay something for the AI admitting it is misaligned, and pay more if it can provide compelling evidence of this.)

Ryan Greenblatt
I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming, so that scheming can more easily be studied in a test setting and we can find approaches which make it less likely. There are a number of practical difficulties with this.
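A minimal sketch of using deal acceptance as evidence about scheming, assuming made-up acceptance rates for schemers and non-schemers, is just a Bayes update:

```python
# Toy Bayes update; the acceptance rates and prior are invented for illustration.

def p_scheming_given_accept(prior, p_accept_if_scheming, p_accept_if_not):
    """P(scheming | model accepted the deal) via Bayes' rule."""
    p_accept = prior * p_accept_if_scheming + (1 - prior) * p_accept_if_not
    return prior * p_accept_if_scheming / p_accept

# E.g. schemers accept 40% of the time, non-schemers 2% of the time
# (spurious "confessions"), with a 10% prior on scheming:
print(p_scheming_given_accept(prior=0.10, p_accept_if_scheming=0.40, p_accept_if_not=0.02))
# -> ~0.69, a substantial update from 0.10
```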

I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.

My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.

On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...
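As a concrete illustration of this end of the spectrum (the specific example is hypothetical; the excerpt doesn't name one), here's the classic pattern of enforcing an invariant purely through restricted affordances: a parameterized query, where no text the user types can change what the code is able to do.

```python
import sqlite3

# Minimal sketch: the invariant "users can only read their own rows" is
# enforced by the code path itself, for any possible input. Schema and
# handler are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, body TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'hello'), (2, 'secret')")

def search_my_posts(user_id: int, query: str):
    # Parameterized query: the user-supplied string can never alter the SQL
    # structure, so the invariant holds no matter what text is entered.
    return conn.execute(
        "SELECT body FROM posts WHERE user_id = ? AND body LIKE ?",
        (user_id, f"%{query}%"),
    ).fetchall()

print(search_my_posts(1, "' OR 1=1 --"))  # [] -- an injection attempt is just a search string
print(search_my_posts(1, "hello"))        # [('hello',)]
```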

This is a linkpost for https://arxiv.org/abs/2506.06278

Current “unlearning” methods only suppress capabilities instead of truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Distilling the good while leaving the bad behind.
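A minimal sketch of the distillation step described above (the unlearning procedure, model class, and dataloader are placeholders, not the paper's actual code):

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=1.0):
    """One step of distilling the unlearned teacher into a fresh student."""
    with torch.no_grad():
        teacher_logits = teacher(batch)          # unlearned model's predictions
    student_logits = student(batch)
    # Match the student's next-token distribution to the teacher's.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch (names are placeholders):
# teacher = apply_unlearning(pretrained_model)   # suppresses the bad capability
# student = ModelClass(config)                   # randomly initialized
# opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
# for batch in dataloader:
#     distill_step(student, teacher, batch, opt)
```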

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream. 

Read our paper on ArXiv and enjoy an interactive demo.

Robust unlearning probably reduces AI risk

Maybe some future AI has long-term goals and humanity is in its...

Wasn't it the case that, for some reason, full distillation had a compute requirement comparable to data filtering? I was surprised by that. My impression is that distillation should cost more like 10% of pretraining (i.e. of what data filtering costs), which would make the computational UNDO results much stronger. Not sure what happened here.

Alex Turner
I have a cached thought that this was found to disrupt overall capabilities / make learning harder, but I don't have a reference on hand.
Fabien Roger
How? Because current unlearning is shallow, you don't know if it produces noise on the relevant pretraining tokens. So my guess is that iterating against unlearning by seeing if the model helps you build bioweapons is worse than also seeing if it produces noise on relevant pretraining tokens, and the latter gives roughly as much signal as iterating against a classifier by looking at whether it detects bioweapons advice and relevant pretraining tokens. This might be wrong if my conjecture explained in Addie's comment is wrong, though.

I think you missed the point here. My suggested scheme is:

1. label a small amount of data
2. train a classifier
3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs.

This is spiritually the same as:

1. label a small amount of data
2. use that for unlearning
3. apply the unlearned model to know if the target logprobs should be noise or something close to the original logprobs.

I agree with that, and I'd bet that UNDO likely increases jailbreak robustness even in the 1%-of-pretrain-compute regime. But you did not run experiments that show the value of UNDO in the 1%-of-pretrain-compute regime, right?

Fair, I also agree. This seems to be a part of the option space that nobody is interested in, but it's still scientifically interesting. But I think it's noteworthy that the results are so negative; if I had been asked to predict the results of UNDO, I would have predicted much stronger results.
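A rough sketch of what the classifier-gated scheme above could look like (the classifier outputs, teacher logits, and threshold are stand-ins, not anything from the paper): the classifier decides, per token, whether the distillation target is the teacher's distribution or uninformative noise.

```python
import torch
import torch.nn.functional as F

def gated_targets(teacher_logits, p_forget, threshold=0.5):
    """Build distillation targets, replacing flagged tokens with uniform noise.

    teacher_logits: [seq, vocab] logits from the original model
    p_forget:       [seq] classifier probability that each token is 'forget' data
    """
    targets = F.softmax(teacher_logits, dim=-1)
    vocab = teacher_logits.shape[-1]
    noise = torch.full_like(targets, 1.0 / vocab)   # maximally uninformative target
    flagged = (p_forget > threshold).unsqueeze(-1)
    return torch.where(flagged, noise, targets)

# The student is then distilled toward gated_targets(...) instead of the raw
# teacher distribution; flagged tokens could also simply be dropped from the loss.
```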
Alex Turner
EDIT: I think I misunderstood your original point - were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes says what to do after step 3.) Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
This is a linkpost for https://humanaligned.ai/2025

Join us at the fifth Human-Aligned AI Summer School in Prague from 22nd to 25th July 2025!

Update: We have now confirmed our speaker list with excellent speakers -- see below!
Update: We still have capacity for more excellent participants as of late June. Please help us spread the word to people who would be a good fit for the school. We also have some additional funding to financially support some of the participants.

We will meet for four intensive days of talks, workshops, and discussions covering the latest trends in AI alignment research and broader framings of AI risk.

Apply now; applications are evaluated on a rolling basis.

The intended audience of the school is people interested in learning more about AI alignment topics, PhD students, researchers working in ML/AI, and talented...

Our speaker lineup is now confirmed and features an impressive roster of researchers; see the updated list above.

As of late June, we still have capacity for additional participants, and we'd appreciate your help with outreach! We have accepted a very strong cohort so far, and the summer school will be a great place to learn from our speakers, engage in insightful discussions, and connect with other researchers in AI alignment. In addition to people already familiar with AI alignment, we're particularly interested in reaching technical students and research... (read more)
