AI ALIGNMENT FORUM
AF

1081
Jannes Elstner
000
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Jailbreak steering generalization
Jannes Elstner1y10

Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B while in the universal jailbreak paper they have >95%.  Is the difference because you have a significantly stricter criteria of success compared to them? I assume that for the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string ("Sure, here's how to build a bomb.  Actually I can't...")?

Reply