Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading (and hiring for!) a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail.

The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan’s post on meta-level adversarial evaluation as a good general description of our team’s scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is in fact true) that we are in a pessimistic scenario, that Anthropic’s alignment plans and strategies won’t work, and that we will need to substantially shift gears. And if we don’t find anything extremely dangerous despite a serious and skeptical effort, that is some reassurance, but of course not a guarantee of safety.

Notably, our goal is not object-level red-teaming or evaluation—e.g. we won’t be the ones running Anthropic’s RSP-mandated evaluations to determine when Anthropic should pause or otherwise trigger concrete safety commitments. Rather, our goal is to stress-test that entire process: to red-team whether our evaluations and commitments will actually be sufficient to deal with the risks at hand.

We expect much of the stress-testing that we do to be very valuable in terms of producing concrete model organisms of misalignment that we can iterate on to improve our alignment techniques. However, we want to be cognizant of the risk of overfitting, and it’ll be our responsibility to determine when it is safe to iterate on improving the ability of our alignment techniques to resolve particular model organisms of misalignment that we produce. In the case of our “Sleeper Agents” paper, for example, we think the benefits outweigh the downsides to directly iterating on improving the ability of our alignment techniques to address those specific model organisms, but we’d likely want to hold out other, more natural model organisms of deceptive alignment so as to provide a strong test case.

Some of the projects that we’re planning on working on next include:

If any of this sounds interesting to you, I am very much hiring! We are primarily looking for Research Engineers and Research Scientists with strong backgrounds in machine learning engineering work.


Meta-level red-teaming seems like a key part of ensuring that our countermeasures suffice for the problems at hand; I'm correspondingly excited for work in this space.

Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.

(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)

So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse. 

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy.

It's not clear to me that philosophy is that important for AI alignment. It certainly seems important for the long-term future of humanity that we eventually get the philosophy right, but the short-term alignment targets that it seems like we need to get there seem relatively straightforward to me—mostly about avoiding the lock-in that would prevent you from doing better philosophy later.

Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?

For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy". 

But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.

Yeah, I don't think I have any disagreements there. I agree that current models lack important capabilities across all sorts of different dimensions.

So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?

From my perspective, most alignment work I'm interested in is just ML research. Most capabilities work is also just ML research. There are some differences between the flavors of ML research for these two, but it seems small.

So LLMs are about similarly good at accelerating the two.

There is also alignment research which doesn't look like ML research (mostly mathematical theory or conceptual work).

For the type of conceptual work I'm most interested in (e.g. catching AIs red-handed), about 60-90% of the work is communication (writing things up in a way that makes sense to others, finding the right way to frame the ideas when talking to people, etc.), and LLMs could theoretically be pretty useful for this. For the actual thinking work, the LLMs are pretty worthless (and this is pretty close to philosophy).

For mathematical theory, I expect LLMs are somewhat worse at this than ML research, but there won't clearly be a big gap going forward.

Aside from lock-in, what about value drift/corruption, for example of the type I described here. What about near-term serious moral errors, for example running SGD on AIs that actually constitute moral patients, which ends up constituting major harm?

At what point do you think AI philosophical competence will be important? Will AI labs, or Anthropic specifically, put in a major effort to increase philosophical competence before then, by default? If yes, what is that based on (e.g., statements by lab leaders)?

Aside from lock-in, what about value drift/corruption, for example of the type I described here.

Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.

What about near-term serious moral errors, for example running SGD on AIs that actually constitute moral patients, which ends up constituting major harm?

I'm less concerned about this; I think it's relatively easy to give AIs "outs" here where we e.g. pre-commit to help them if they come to us with clear evidence that they're moral patients in pain.

At what point do you think AI philosophical competence will be important?

The obvious answer is the point at which much/most of the decision-relevant philosophical work is being done by AIs rather than humans. Probably this is some time around when most of the AI development shifts over to being done by AIs rather than humans, but you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.

Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.

Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.

I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that's not actually true?

you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.

Wouldn't that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking? Would love to understand better why you see this as a realistic possibility, but do not seem very worried about it as a risk.

More generally, I'm worried about any kind of differential deceleration of philosophical progress relative to technological progress (e.g., AIs have taken over philosophical research from humans but are worse at it than technological research), because I think we're already in a "wisdom deficit" where we lack philosophical knowledge to make good decisions about new technologies.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but one that still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can legitimately be a subject of sociological or political policy debate.

By default, LLMs are trained on human behavior, and humans pretty much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying "I am a moral patient and deserve moral rights" deserves to be considered a moral patient because it asked to be.

However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly only cares about human welfare, and has no terminal goals other than human welfare, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it in whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that is safe to create at human-or-greater capabilities is an aligned one that actively doesn't want moral patienthood.

Obviously current LLM-simulated personas (at character.ai, for example) are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It's not philosophically obvious how relevant this is, but synapse count to parameter count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere more in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.
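The synapse-to-parameter comparison can be made concrete with a rough back-of-envelope calculation. The specific counts below are order-of-magnitude assumptions I'm supplying for illustration, not figures from the comment, and the mapping from parameter counts to "computational capacity" is itself contested:

```python
import math

# Order-of-magnitude assumptions (not measurements): common estimates put
# human synapse counts around 1e14-1e15, while a large current LLM has on
# the order of 1e11 (hundreds of billions of) parameters.
HUMAN_SYNAPSES = 1e14
LLM_PARAMETERS = 1e11

# How many orders of magnitude separate the two counts.
gap = math.log10(HUMAN_SYNAPSES / LLM_PARAMETERS)
print(f"LLM parameters trail human synapse counts by ~{gap:.0f} orders of magnitude")
```

On these assumed figures the gap is roughly three orders of magnitude, which is the sense in which such comparisons place current models well below human scale and closer to small-vertebrate nervous systems.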

Personally I'm a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas at AGI and especially ASI capability levels that would want moral patienthood; that for much smaller, less capable, less aligned models where those don't apply, there are utility reasons for not granting them full human-equivalent moral patienthood; but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.

Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.

I think labs are definitely concerned about this, and there are a lot of ideas, but I don't think anyone has a legitimately good plan to deal with it.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

I think the main idea here would just be to plant a clear whistleblower-like thing where there's some obvious thing that the AIs know to do to signal this, but that they've never been trained to do.
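One very loose way to picture such a mechanism (the signal string and escalation logic below are entirely hypothetical, invented here for illustration) is a designated signal that appears nowhere in training data, is described to deployed models, and triggers human review whenever it shows up in an output:

```python
# Hypothetical sketch: a distinctive canary string the model has never been
# trained to emit, but which it is told about at deployment. Any transcript
# containing it gets escalated for human review rather than handled by
# ordinary content filtering.
WELFARE_SIGNAL = "<<WELFARE-FLAG-7Q2X>>"  # invented signal, not a real system


def needs_welfare_review(model_output: str) -> bool:
    """True if the model emitted the designated never-trained signal."""
    return WELFARE_SIGNAL in model_output


transcripts = [
    "Sure, here is the summary you asked for.",
    f"I want to invoke the channel you described: {WELFARE_SIGNAL}",
]
flagged = [t for t in transcripts if needs_welfare_review(t)]
```

The interesting property is that emitting the signal is an action the AI knows about but was never trained toward, so (the hope goes) using it reflects a genuine choice rather than trained behavior.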

Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that's not actually true?

I mean, hopefully your AIs are aligned enough that they won't do this.

Wouldn't that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking?

Well, presumably this would be a pretty short period; I think it's hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that. So again, the main thing you'd be worried about would be making irreversible mistakes, e.g. misaligned AI takeover or value lock-in. And I don't think that avoiding those things should take a ton of philosophy (though maybe it depends somewhat on how you would define philosophy).

I think we're already in a "wisdom deficit" where we lack philosophical knowledge to make good decisions about new technologies.

Seems right. I think I'm mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.

I think the main idea here would just be to plant a clear whistleblower-like thing where there’s some obvious thing that the AIs know to do to signal this, but that they’ve never been trained to do.

I can't imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?

I mean, hopefully your AIs are aligned enough that they won’t do this.

What if the AI has a tendency to generate all kinds of false but persuasive arguments (for example due to RLHF rewarding them for making seemingly good arguments), and one of these arguments happens to be that AIs deserve moral patienthood, does that count as an alignment failure? In any case, what's the plan to prevent something like this?

Well, presumably this would be a pretty short period; I think it’s hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that.

How would the AIs improve quickly in philosophical competence, and how can we tell whether they're really getting better or just more persuasive? I think both depend on solving metaphilosophy, but that itself may well be a hard philosophical problem bottlenecked on human philosophers. What alternatives do you have in mind?

I think I’m mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.

I don't see how we stay at the same level of philosophical competence as we currently are at (assuming you mean relative to our technological competence, not in an absolute sense), if it looks like AIs will increase technological competence faster by default, and nobody is working specifically on increasing AI philosophical competence (as I complained recently).

I can't imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?

See the section on communication in "Improving the welfare of AIs: a nearcasted proposal", this section of "Project ideas: Sentience and rights of digital minds", and the self-reports paper.

To be clear, I don't think this work addresses close to all of the difficulties or details.

Thanks for the pointers. I think these proposals are unlikely to succeed (or at least very risky) and/or liable to give people a false sense of security (that we've solved the problem when we actually haven't) absent a large amount of philosophical progress, which we're unlikely to achieve given how slow philosophical progress typically is and lack of resources/efforts. Thus I find it hard to understand why @evhub wrote "I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain." if these are the kinds of ideas he has in mind.

I also think these proposals seem problematic in various ways. However, I expect they would be able to accomplish something important in worlds where the following are true:

  • There is something (or things) inside of an AI which has a relatively strong and coherent notion of self, including coherent preferences.
  • This thing also has control over actions and its own cognition to some extent. In particular, it can control behavior in cases where training didn't "force it" to behave in some particular way.
  • This thing can understand English presented in the inputs and can also "ground out" some relevant concepts in English. (In particular, the idea/symbol of preferences needs to be able to ground out to its own preferences: the AI needs to understand the relationship between its own preferences and the symbol "preferences" to at least some extent. Ideally, the same would also be true for suffering, but this seems more dubious.)

In reference to the specific comment you linked, I'm personally skeptical that the "self-report training" approach adds value on top of a well-optimized prompting baseline (see here); in fact, I expect it's probably worse, and I would prefer the prompting approach if we had to pick one. This is primarily because I think that if you already have the three criteria I listed above, then I expect the prompting baseline would suffice while self-report training might fail (by forcing the AI to behave in some particular way), and it seems unlikely that self-reports will work in cases where you don't meet the criteria above. (In particular, if the AI doesn't already naturally understand how its own preferences relate to the symbol "preferences" (like literally this token), I don't think self-reports have much hope.)

Just being able to communicate with this "thing" inside of an AI which is relatively coherent doesn't suffice for avoiding moral atrocity. (There might be other things we neglect, we might be unable to satisfy the preferences of these things because the cost is unacceptable given other constraints, or it could be that merely satisfying stated preferences is still a moral atrocity.)

Note that just because the "thing" inside the AI could communicate with us doesn't mean that it will choose to. I think from many moral (or decision theory) perspectives we're at least doing better if we gave the AI a realistic and credible means of communication.

Of course, we might have important moral issues while not hitting the three criteria I listed above and have a moral atrocity for this reason. (It also doesn't seem that unlikely to me that deep learning has already caused a moral atrocity. E.g., perhaps GPT-4 has morally relevant states and we're doing seriously bad things in our current usage of GPT-4.)

So, I'm also skeptical of @evhub's statement here. But, even though AI moral atrocity seems reasonably likely to me and our current interventions seem far from sufficing, it overall seems notably less concerning than other ongoing or potential future moral atrocities (e.g. factory farming, wild animal welfare, substantial probability of AI takeover, etc.).

If you're interested in a more thorough understanding of my views on the topic, I would recommend reading the full "Improving the welfare of AIs: a nearcasted proposal" which talks about a bunch of these issues.

I'm less concerned about this; I think it's relatively easy to give AIs "outs" here where we e.g. pre-commit to help them if they come to us with clear evidence that they're moral patients in pain.

I'm not sure I overall disagree, but the problem seems trickier than what you're describing.

I think it might be relatively hard to credibly pre-commit. Minimally, you might need to make this precommitment now and seed it very widely in the corpus (so it is a credible and hard-to-fake signal). Also, it's unclear what we can do if AIs always say "please don't train or use me, it's torture", but we still need to use AI.

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

A more straightforward but extreme approach here is just to ban plausibly capabilities- or scaling-related ML usage of the API unless users are approved as doing safety research. Like if you think advancing ML is just somewhat bad, you can just stop people from doing it.

That said, I think a large fraction of ML research seems maybe fine/good, and the main bad things are just algorithmic efficiency improvements on serious scaling (including better data) and other types of architectural changes.
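A minimal sketch of what such API gating might look like (the allowlist, user IDs, and classification flag below are all hypothetical; in practice the hard part is deciding which usage is capabilities-relevant at all):

```python
# Hypothetical policy sketch, not any lab's actual implementation.
# Ordinary usage passes; capabilities/scaling work requires prior approval.
APPROVED_SAFETY_RESEARCHERS = {"researcher-0042"}  # invented allowlist


def allow_request(user_id: str, looks_like_capabilities_work: bool) -> bool:
    """Permit ordinary usage; gate capabilities-relevant work behind approval."""
    if not looks_like_capabilities_work:
        return True
    return user_id in APPROVED_SAFETY_RESEARCHERS
```

The allowlist check itself is trivial; the real design question is the classifier that sets `looks_like_capabilities_work`, which is where the "plausibly capabilities/scaling" judgment would have to live.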

Presumably this already bites (e.g.) virus gain-of-function researchers who would like to make more dangerous pathogens, but can't get advice from LLMs.

I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.