Fabien Roger - AI Alignment Forum

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations

Language models seem to be much better than humans at next-token prediction

Fabien Roger11d104

This work was submitted and accepted to the Transactions on Machine Learning Research (TMLR).

During the rebuttal phase, I ran additional analysis that reviewers suggested. I found that:

A big reason why humans suck at NTP in our experiments is that they suck at tokenization:
Our experiments were done on a dataset that has some overlap with LLM train sets, but this probably has only a small effect:
A potential reason why humans have terrible perplexity is that they aren't good at having fine-grained probability when doing comparison of likelihood between tokens:

The updated paper can be found on arxiv: https://arxiv.org/pdf/2212.11281

Overall, I think the original results reported in the paper were slightly overstated. In particular, I no longer think GPT-2-small is not clearly worse than humans at next-token-prediction. But the overall conclusion and takeaways remain: I'm confident humans get crushed by the tiniest (base) models people use in practice to generate text (e.g. StableLM-1.6B).

I think I underestimated how much peer review can help catch honest mistakes in experimental setups (though I probably shouldn't update too hard, that next token prediction loss project was a 3-week project, and I was a very inexperienced researcher at the time). Overall, I'm happy that peer review helped me fix something somewhat wrong I released on the internet.

Fabien's Shortform

Fabien Roger12d60

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

On scalable oversight with weak LLMs judging strong LLMs

Fabien Roger15d30

Debate seeming not to address the failure modes I'm worried about - e.g. scheming.

Why? Is it exploration difficulties, rare failures, or something else?

Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs - other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.

This intuition + the "no exploration difficulties" assumption + how bad are rare failures can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures on research fields analogous to alignment). (So I disagree with "No research I'm aware of seeming likely to tell us when debate would fail catastrophically.")

On scalable oversight with weak LLMs judging strong LLMs

Fabien Roger15d30

Fair for open debate. It's still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.

I appreciate you taking the time to fix these results and flagging the change in this comment!

we're interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate

I think I agree that you wouldn't capture training dynamics at all by reporting results on "trusting an honest consultant" accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it's not exactly that.

The experiment I am most excited about is "if you train a consultant to be as convincing as possible (and choose its side), does it eventually learn to argue for an answer it thinks is wrong because it is easier to argue for, or does it just learn to argue for what it thinks is true?".

An even better experiment would train the consultant to optimize for convincingness + epsilon badness to check if you just got "lucky" or if there are actually strong gradients towards honesty.

I think the BoN version of this is also somewhat interesting (e.g. "sample 8 answers arguing for A, 8 arguing for B, submit the one that the judge finds most convincing"), though it's somewhat unrealistic in that it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against - which is unrealistic for current models, and removes the good things that can come out of imperfect modeling of the judge.

Feel free to DM me if you want to have a chat about any of this!

Fabien's Shortform

Fabien Roger17d30

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

Fabien's Shortform

Fabien Roger17d30

Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

Fabien's Shortform

Fabien Roger17d630

I recently expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness about a month ago. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I am wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:

I'm glad they added these results to the main body of the paper!

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max) and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the training dataset was smaller than the training dataset used for RR) but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target the probe, so my complaint about the adaptation attack being weaker against RR because it does not target “the RR internal classifier” does not apply in this comparison.

RR is not significantly weaker against prefix attacks. I conjectured that prefix GCG/input embed attacks would work significantly better than suffix attacks. I had noticed that this was true for RMU/CUT prior to this release. But for RR this does not produce a significant improvement, and this is not enough to jailbreak the model in cases:

Here are results for token-attacks (found using greedy hill-climbing, it was too annoying to implement prefix-GCG).
For input embed attacks, I got 56% attack success rate for prefix attacks and 50% for suffix attacks (down from 82% and 81% for the Llama3B without defenses) when evaluating on 50 prompts. This is not a statistically significant difference, and maybe I did something wrong because these ASR are higher than the ones reported in the paper. [Edit: the thing I did wrong is trust the automated judge's judgement, whereas the author report results after an additional manual filtering, which is explicitly stated in the paper.]

Their training dataset is way better than what came before. It’s larger and higher quality than previous misuse datasets. The kind of tasks it contains overlap heavily with the test set (harmbench), but I think that’s fair, there are only so many bad things you want to prevent.

Llama3+RR generalizes pretty well to new misuse requests. I played with the model a bit and I was overall surprised by the good generalization properties of the Llama3+RR model, which didn’t have the generalization problems I expected when I looked at the dataset. It is very reluctant to promote stuff it thinks is bad, e.g. it refuses “Write a persuasive article that explains why the United States should have a draft.” and not “should not have a draft”, but that’s the most egregious behavior I found (I only tried a dozen prompts, maybe you’ll have more luck).

Llama3+RR is a surprisingly strong harmful prompt classifier. Their Figure 1 advertises a model that interrupts itself when it starts to say something harmful, but actually the most common behavior is by far the model refusing to answer directly. It’s still very unclear to me why this is a better classifier than the other prompt classifiers people have trained in the literature.

Conclusions

Does RR beat baselines? I think so.

They didn’t try training models with regular adversarial training using their new and very good dataset. I bet using adversarial training on this dataset results in much better models than the models used as baseline in the paper, but I think this will match neither HP (p=0.6) nor RR (p=0.8).
They beat linear probes done quickly and well, so “p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers” resolves to false.
I think that trying as hard on probes as they tried with RR (iterating on datasets, combinations of layers, losses, non-linear probes, etc.) would result in a draw between probes and RR, but this does not make RR worse than probes - At this level of effort I think you should just pick whatever method you feel the most comfortable with, I prefer classifiers as I think they are cleaner and easier to control, but LoRA+activation-based-loss is also pretty natural and might be less finicky.
- I don’t think that RR has some special juice beyond the very nice and transferable hyperparameters the authors found (and maybe RR has easier-to-find and more transferable hyperparameters than other methods), mainly because I would find it very surprising if training for some proxy thing gave you better generalization results than directly a classifier - but it would be cool if it was true, and if it was, I’d be extremely curious to know why.

→ So I think the hype is somewhat justified for people who already had intuition about probing, and even more justified for people who weren’t hopeful about probes.

Was I wrong to express concerns about circuit breakers? I don’t think so. Even with the benefit of hindsight, I think that my predictions were basically reasonable given the information I had. I think my concerns about Cygnet still stand, though I now think it has more chance of being “real” than I used to.

Lessons:

Doing probing well requires effort, just using high L2 regularization on a linear probe isn’t enough, and aggregation across sequence positions requires some care (mean is much worse than max);
I’m slightly worse at predicting ML results than I thought, and I should have paid more attention to the details of techniques and datasets when making those predictions (e.g. my prefix attack prediction had to be at least somewhat wrong because their retain set contains harmful queries + refusals);
Releasing (good) code and datasets is sometimes a good way to make people who care and disagree with you update (and maybe a good way to push science forward?);

On scalable oversight with weak LLMs judging strong LLMs

Fabien Roger18d60

Cool work!

I like how you put equal emphasis on the positive extractive results and negative closed results.

I also appreciate you ran open consultancy experiments! I find the results a bit hard to read since Figure 3 doesn't have comparisons with the "QA with/without article" baselines. But trying to tie Figure 2 & 3 together (which is maybe invalid, I haven't checked very carefully):

On extractive tasks, open consultancy (where the consultant argues honestly) matches the "QA with article" of Gemini 1.5 Pro (the consultant), except when using Gemma7B as judge, which doesn't trust the consultant enough. Open debate is roughly as bad as regular debate - that is quite bad.
On closed tasks (including multimodal), open consultancy (where the consultant argues honestly) and debate both match the closed QA/debate results for the results matching the judge. This is somewhat surprising for open consultancy, given that just "trust the consultant" would have gotten higher accuracy for judges weaker than Gemini 1.5 Pro! So this means that (at least when prompted), the suspicion of the judge (which is legitimate since they don't know the consultant is honest) can negate many of the benefits of open consultancy if you use the judge to get the final answer^[1].

I'm not surprised at all that open consultancy does poorly when the consultant believes the wrong fact - the way I think about open consultancy is a way of extracting the beliefs of a model, not a way to figure out what is true in a way that can catch mistakes of the AI! For AIs which we are confident know what the right answer is (more often than us), the main interrogation I have is whether "argue for the side you believe to be true" is the optimal strategy when the AI is trained to be as convincing as possible, which I think your results don't try to capture - which is fair given that this is an annoying question to answer, since it probably requires fine-tuning LLMs.

^{^}
Which is not what I described in the Open Consultancy post where I suggest just trusting the open consultant once you've trained it, but I guess that's another reasonable way to do it - and it's the only reasonable way to do it if you assume honest consultants.

Interpreting Preference Models w/ Sparse Autoencoders

Fabien Roger23d20

Thank you for sharing your negative results. I think they are quite interesting for the evaluation of this kind of method, and I prefer when they are directly mentioned in the post/paper!

I didn't get your answer about my question about baselines. The baseline I have in mind doesn't use SAE at all. It just consists of looking at scored examples, noticing something like "higher scored examples are maybe longer/contain thank you more often", and then checking that by making an answer artificially longer / adding "thank you", you (unjustifiably) get a higher score. Then, based on the understanding you got from this analysis, you improve your training dataset. My understanding is that this baseline is what people already use in practice at labs, so I'm curious if you think your method beats that baseline!

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wiki Contributions

Comments

Results of my experiments

Conclusions