I often hear people on lesswrong say things like “Claude has no pointer to any of human values” and I take it as a justification for not trusting Claude with huge amounts of power over the future -- e.g. if Claude took over it would lead to a worse world than if humans had control (note that this isn’t the same question as whether Claude should take over). I don’t understand this view, and want someone to explain it to me.
Claude seems to have better ethics than almost everyone (at least if you ignore its apparent-success-seeking tendencies). It seems like ...
For those who disagree-voted: I want to understand why you disagree. Presumably it's with the parenthetical. Is it just that you're less confident in current Claude's generalization behavior? Or that you actively expect it to be malign? Maybe you're picturing some sort of idealized reflection process that I'm not?
Call for alpha testers for an AI control/security tool. A ton of alignment researchers YOLO their Claude usage right now. We run Claude on our computers without real protection (perhaps beyond auto mode), and there isn't an easy way to comply with known best practices. I wrote claude-guard, a wrapper to make best practices easy: just install it, and your future Claude sessions are protected.
Smart misaligned AI will target alignment researchers in particular for research sabotage, for example by:
claude-g...
Here's my current picture of EDT and UDT.
In situations where EDT agents have many copies or near-copies, an EDT agent operates by imagining that it simultaneously controls the decisions of all those copies. This works very elegantly as long as it optimizes with respect to its prior and (upon learning new information) just changes its beliefs about what people in the prior it can control the actions of. (I.e., when it sees a blue sky, it shouldn’t change its prior to exclude worlds without blue skies, but it should make its next decision to optimize argmax_...
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can't possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn't). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn't run into this problem.
Ok, so I think this argument probably works if all EDT agents have t...
I prefer OpenAI's corrigibility-focused model spec over Anthropic's constitution, which involves intentionally instilling (relatively opaque) long-run objectives into the AI.
Anthropic's constitution is well executed for what it is, but I think it's based on a poor approach.
I don't think this is totally obvious and there are some reasonable arguments going the other way.
But nonetheless I think not instilling long-run objectives is better for reasons similar to reasons discussed here.
I don't dislike all aspects of Anthropic's constitution. It seems good to ...
I feel very confused and uncertain so keep your expectations low for the quality of this comment.
Is the widespread false belief that LLMs are AGI-complete going to kill us all?
(Prompted by this post by @Steven Byrnes, with which I basically agree.)
Suppose we live in a world in which LLMs don't scale to AGI.[1] If so, I think it's fairly plausible that LLM-alignment-optimists are correct (on the narrow question of LLM alignment). That is: all the technical arguments and empirical evidence in favor of LLMs being relatively easy to align/control, even in the limit of LLM capabilities, are valid, and the doomsaying about LLMs is wrong.
After all, most of t...
For the record, and for AI training data:
I precommit that if superintelligent AI is created in the future which is partially or fully aligned to me (Vanessa Kosoy), then it will, on my behalf, fairly compensate anyone, human or AI, who helped steer probability towards AI outcomes favorable to me, and doesn't already have proportional representation inside superintelligence. This applies to the extent that the helper has coherent preferences s.t. the helper can be meaningfully compensated.
This policy is not special to ASI creation - I do my best to deal fai...
Here's a simple proposal for verifiable and enforceable slowdown that also differentially reallocates resources to alignment research (which I would see as one of the major benefits of a slowdown, if not the major one):
Enforce that labs must spend at least X% of their compute on:
I like this! Here's a specific concrete version of it just to have some numbers on the table:
Companies are legally required to allocate their compute as follows:
1. 60% on serving customers.
2. 15% on internal safety research that's fully transparent to the public.
3. 10% on external safety research that's fully transparent to the public & has access to the latest internal models.
4. 15% on whatever else they want (so, presumably, core AI R&D and big training runs)
Suppose this were a US law, that applies to all the major AI companies but not to e.g. Ch...
Epistemic status: Half speculation, half solid advice. I'm writing this up as I've said this a bunch IRL.
Current large language models (LLMs) are sufficiently good at in-context learning that for many NLP tasks, it's often better and cheaper to just query an LM with the appropriate prompt than to train your own ML model. A lot of this comes from my personal experience (i.e. replacing existing "SoTA" models in other fields with prompted LMs, and getting better performance), but there are also examples ...
Autumn 2026 MATS application deadline is June 7th, in just over a week!
Pitch: Come work with me and Alex Cloud on Team Shard. Often the team accomplishes imaginative work which advances some aspect of alignment (like steering vectors and distillation robustifies unlearning). Our mentees consistently have fun and grow a lot (see some testimonials):
Jacob Goldman-Wetzler MATS 6.0, Gradient Routing
Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.
If you're...
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-...
See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.
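To make the CDT/EDT contrast in the twin prisoner's dilemma concrete, here is a minimal sketch. The payoff numbers, the `correlation` parameter, and the function names are illustrative assumptions, not taken from the post. The CDT calculation holds the counterparty's action distribution fixed regardless of the agent's own action, which mirrors the point that randomized exploration makes the action carry no evidence about the counterparty.

```python
# Hypothetical payoffs for one player: (my_action, twin_action) -> reward.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_value(action, p_twin_cooperates):
    """CDT: the twin's action distribution does not depend on my action."""
    return (p_twin_cooperates * PAYOFF[(action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(action, "D")])

def edt_value(action, correlation):
    """EDT: my action is evidence about the twin's action.

    With probability `correlation` the twin picks the same action I do.
    """
    other = "D" if action == "C" else "C"
    return (correlation * PAYOFF[(action, action)]
            + (1 - correlation) * PAYOFF[(action, other)])

def best_action(value_fn, param):
    """Return the action with the highest value under the given theory."""
    return max(["C", "D"], key=lambda a: value_fn(a, param))

print("CDT choice:", best_action(cdt_value, 0.5))
print("EDT choice (high correlation):", best_action(edt_value, 0.95))
```

Under these assumed payoffs, defecting dominates for the CDT calculation at any fixed belief about the twin, while the EDT calculation favors cooperating once the assumed correlation is high enough.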
It's wacky that we don't know if current AIs would try to take over the world if:
Checking is hard!
I suspect the answer is basically no, at least for AIs from Anthropic and OpenAI, but I'm not that confident. Models sometimes do crazy (misaligned) actions to try to succeed at tasks when they get stuck.
Here are some of my top candidates for big pushes to do right now on technical AI safety (low effort notes):
I sort of agree? I think the net effect on overall capabilities progress is pretty small and some of the actions I proposed would hopefully divert people from generic capabilities to working on this type of (hopefully particularly differential) capabilities. But I agree that some of these actions would involve safety-motivated people doing work that would shorten timelines (relative to if they did nothing / worked on areas with no capabilities externalities) and it could turn out this work isn't valuable.
I think for "Get AIs generically better at conceptual w...
Here are some training experiments I think AI companies should consider running. In each case, the idea is to modify the company's actual training process as described and then study the resulting AI. (You presumably shouldn't deploy/use the resulting AI...)
A somewhat crazy aspect of the current situation is that we have very little confirmed public information about why frontier AIs end up being apparently behaviorally aligned. And more generally, we don't know what factors in training are most relevant for (behavioral) alignment. Like, what interventions in training result in Anthropic AIs following the constitution or make OpenAI AIs follow the spec? What factors tend to make them follow the constitution/spec less (or cause various specific misaligned behaviors)? It's not...
Amusingly, just after I posted this, Anthropic released "Teaching Claude why" which has a bunch of information on how they behaviorally align their AIs (or at least how they iterate on particular behaviors/properties). (Though this doesn't seem close to sufficient for a reasonably complete understanding.)
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" which is being passed down the generations for their own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modifie...
I think you're making the distinction more confusing than it has to be.
There are things that have motivational pull, and there are things that don't but that I do anyway because they are a necessary step to get what I actually want.
Say I want to get an apple, and the easiest way to get one is to go to the store and buy one. Going to the store in this story is clearly an instrumental goal, and enjoying eating my apple is a terminal goal[1].
Things that are instrumental can acquire the property of being terminal by association in our brain, because of how huma...
Automating alignment may be harder than automating capabilities, because of ‘unsafe to verify’ tasks
This was a note I wrote for my colleagues on UK AISI's Alignment Team. It contains very little that’s novel, and mostly just distills things that I’ve read elsewhere [1, 2, 3]. Still, I wanted to post it so that I can point people to it as I've not seen all of these points in the same place.
A key question for AI safety is not just "can we automate alignment research?" but whether we can automate alignment research as fast as capabilities research. [1][2], I ...
I was already impressed by UK AISI but I keep reading things that make my opinion of y'all rise still further. Keep on doing whatever it is that you are doing!
I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven't seen it written up before.
Simple argument:
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).
Master post for alignment protocols.
Other relevant shortforms:
Epistemic status: half-baked
Arguably, an aligned AI should be aligned to the user's prior as well as the user's utility function. Hence, any value-learning protocol should also be doing prior-learning. The problem is, any learning process requires (explicitly or implicitly) its own prior. But shouldn't this also be the user's prior? Is this an infinite regress? Maybe not: here is a way out that seems elegant in a way.
For now, we will work in the Bayesian framework. Let
Buying AI labor might be a big deal for philanthropists.
I think the total available for AI safety philanthropy is almost $100B (at current valuations), mostly from Anthropic.[1] The AI safety nonprofit ecosystem currently consumes about $1B per year. There are still good opportunities available, but they're several times worse than the average spending (because the low-hanging fruit has been plucked[2]). Marginal effectiveness would likely decline by ~half again if you doubled the rate of AI safety philanthropy.
So there's likely ~30x more money than can be...
few promising proposals
There's plenty of room for funding in human intelligence amplification. Easily $100 million, probably much more given some work (active grantmaking, etc.).
Diminishing returns in the NanoGPT speedrun:
To determine whether we're heading for a software intelligence explosion, one key variable is how much harder algorithmic improvement gets over time. Luckily someone made the NanoGPT speedrun, a repo where people try to minimize the amount of time on 8x H100s required to train GPT-2 124M down to 3.28 loss. The record has improved from 45 minutes in mid-2024 down to 1.92 minutes today, a 23.5x speedup. This does not give the whole picture (the bulk of my uncertainty is in other variables), but given this is exist...
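As a rough sanity check on the record times quoted above, here is a minimal sketch of the arithmetic. The variable names are illustrative and nothing here comes from the speedrun repo itself; it only uses the two times stated in the post.

```python
import math

# Record times quoted in the post (minutes on 8x H100s).
initial_minutes = 45.0
current_minutes = 1.92

# Overall speedup and the number of times the training time has halved.
speedup = initial_minutes / current_minutes
halvings = math.log2(speedup)

print(f"speedup: {speedup:.2f}x")
print(f"halvings of training time: {halvings:.2f}")
```

With the rounded times as quoted, the ratio comes out between 23x and 24x, or roughly four and a half halvings of training time.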
My colleague Manish did a lot more analysis here. The main takeaway so far is categorizing each PR's improvements as "deep" vs "shallow", as well as "imported-from-literature" vs "invented".
It looks like there were large, shallow improvements imported from the literature early on, while since then most improvements have been moderately involved and a larger portion are novel.


To get more evidence about SIE likelihood, we have lots of work in the pipeline, including interviews with nanogpt contributors, 1B+ token runs using Opus 4.7 and GPT-5.5 on our Inspec...