Dan H




Catastrophic Risks From AI
CAIS Philosophy Fellowship Midpoint Deliverables
Pragmatic AI Safety

Dan H80

Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and noted that "AI tripwires could help uncover early misaligned systems before they can cause damage." https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I've updated toward thinking that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). Progress in vision settings is extraordinarily slow, but that is not necessarily the case for LLMs.

Dan H1-22

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer

In the RepE paper we target multiple layers as well.

Test on many different models

The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models.

Describe a way of turning this into a weight-edit

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE).

Dan H1-4

but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section.

I agree if they simultaneously agree that they don't expect the post to be cited. Posts can't present themselves as academic artifacts (a "Citing this work" section indicates that's the expectation) and then fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself.

Otherwise there's a race to the bottom, and it makes sense to post daily research notes and plant flags that way. This further increases pressure on researchers.

including refusal-bypassing-related ones

The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.

I'll note that pretty much every time I mention that something isn't following academic standards on LW, I get ganged up on, and I find it pretty weird. I've reviewed for and organized ML conferences, can be a senior area chair at them, and know the standards well. Perhaps this response is so consistent because it feels like an outside community imposing things on LW.

Dan H0-21

From Andy Zou:

Thank you for your reply.

Model interventions to bypass refusal are not discussed in Section 6.2.

We perform model interventions to robustify refusal (your section on “Adding in the "refusal direction" to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments shows refusal can be mediated by a single direction, in keeping with the title of this post.

we examined Section 6.2 carefully before writing our work

Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)

Ablating vs. Addition

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn't find it was worth the complication. We look forward to any results suggesting a strong improvement.)
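To make the relationship between addition and projection concrete, here is a minimal sketch, not code from the RepE paper or the GitHub demo. It assumes a hidden state `h` and a direction vector represented as plain Python lists; the names `add_direction`, `project_out`, and `refusal_dir` are hypothetical and chosen only for illustration. It shows that projecting out a direction is the same linear combination as addition, with the coefficient fixed to the negative of the component of `h` along that direction.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def add_direction(h, direction, coeff):
    """Linear combination: h + coeff * (unit direction). Hypothetical helper."""
    d = unit(direction)
    return [a + coeff * b for a, b in zip(h, d)]

def project_out(h, direction):
    """Projection as a special case of addition: coeff = -(h . unit direction)."""
    d = unit(direction)
    return add_direction(h, direction, -dot(h, d))

# Toy values, assumed for illustration only.
h = [1.0, 2.0, 3.0]
refusal_dir = [0.0, 0.0, 2.0]

ablated = project_out(h, refusal_dir)          # component along the direction removed
steered = add_direction(h, refusal_dir, -4.0)  # free coefficient, negative sign
```

In this toy setup, `ablated` has no remaining component along `refusal_dir`, while `steered` pushes past zero along it; both are instances of the same operation with different coefficients.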


Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it's been out for over six months in a standard, accessible format), the PIs are aware of the work (the PI of this work spoke about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than that of any other artifact. When this is on arXiv, we're asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other or not mentioning it. Retaliating by dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (it went to -18 very quickly).

Edit 2: On X, Neel "agree[s] it's highly relevant" and that he'll cite it. Assuming it's covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn't a big deal because I think of LW as a place for ML research rough drafts, in which errors will happen. But if some are thinking it's at the level of an academic artifact, is citable content, or carries an expectation that others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. For now I'll assume it's the former.

Dan H1-17

From Andy Zou:

Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper's repository which shows that adding a "harmlessness" direction to a model's representation can effectively jailbreak the model.

Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.

Dan H2229

> My understanding is that we already know that backdoors are hard to remove.

We don't actually find that backdoors are always hard to remove!

We did already know that backdoors often (from the title) "Persist Through Safety Training." This phenomenon, studied here and elsewhere, is being taken as the main update in favor of AI x-risk. This doesn't establish the probability of the hazard, but it reminds us that backdoor hazards can persist if present.

I think it's very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words, AGI threat scenario "window dressing," or when players from an EA-coded group research a topic. (I've been suggesting more attention to backdoors since maybe 2019; here's a video from a few years ago about the topic; we've also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don't have the window dressing.

I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It's been a better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level, its staying power comes from AI security being largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).

A request: Could Anthropic employees not call supervised fine-tuning and related techniques "safety training"? OpenAI/Anthropic have made "alignment" in the ML community become synonymous with fine-tuning, which is a big loss. Calling this "alignment training" consistently would help reduce the watering down of the word "safety."

Dan H10

A brief overview of the contents, page by page.

1: most important century and hinge of history

2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / Cuban Missile Crisis

3: unilateralist's curse

4: bio x-risk

5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech

6: persuasive AIs and eroded epistemics

7: value lock-in and entrenched totalitarianism

8: story about bioterrorism

9: practical malicious use suggestions

10: LAWs as an on-ramp to AI x-risk

11: automated cyberwarfare -> global destabilization

12: flash war, AIs in control of nuclear command and control

13: security dilemma means AI conflict can bring us to brink of extinction

14: story about flash war

15: erosion of safety due to corporate AI race

16: automation of AI research; autonomous/ascended economy; enfeeblement

17: AI development reinterpreted as evolutionary process

18: AI development is not aligned with human values but with competitive and evolutionary pressures

19: gorilla argument, AIs could easily outclass humans in so many ways

20: story about an autonomous economy

21: practical AI race suggestions

22: examples of catastrophic accidents in various industries

23: potential AI catastrophes from accidents, Normal Accidents

24: emergent AI capabilities, unknown unknowns

25: safety culture (with nuclear weapons development examples), security mindset

26: sociotechnical systems, safety vs. capabilities

27: safetywashing, defense in depth

28: story about weak safety culture

29: practical suggestions for organizational safety

30: more practical suggestions for organizational safety

31: Bing and Microsoft Tay demonstrate how AIs can be surprisingly unhinged/difficult to steer

32: proxy gaming/reward hacking

33: goal drift

34: spurious cues can cause AIs to pursue wrong goals/intrinsification

35: power-seeking (tool use, self-preservation)

36: power-seeking continued (AIs with different goals could be uniquely adversarial)

37: deception examples

38: treacherous turns and self-awareness

39: practical suggestions for AI control

40: how AI x-risk relates to other risks

41: conclusion

Dan H66

but I'm confident it isn't trying to do this

It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.

Dan H10

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the two requests were approved by different people. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)
