# All of Arthur Conmy's Comments + Replies

It's very impressive that this technique could be used alongside existing finetuning tools.

> According to our data, this technique stacks additively with both finetuning

To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?

(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)

2 · Alex Turner · 2mo
We had results on decreasing sycophancy, as you say, but both methods zero it out in generalization. We'd need to test on a harder sycophancy dataset for that.

I previously thought that L1 penalties were exactly what you wanted for sparse reconstruction.

Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the norm of your guess, i.e. $(x-2)^2 + |x|$. Then, for guesses x>0, the loss is minimized at x=3/2, not 2.
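As a quick check of the toy example, here is a minimal sketch that evaluates the loss on a grid of guesses. The function names, the unit L1 coefficient, and the grid range are assumptions made for illustration; only the target value 2 and the "squared error plus norm" form come from the comment. Setting the derivative 2(x-2)+1 to zero gives x=3/2 analytically, and the grid search agrees.

```python
def toy_sae_loss(x: float, target: float = 2.0, l1_coeff: float = 1.0) -> float:
    """Squared reconstruction error plus an L1 penalty on the guess itself."""
    return (x - target) ** 2 + l1_coeff * abs(x)


def argmin_on_grid(lo: float = 0.0, hi: float = 3.0, steps: int = 3001):
    """Brute-force search for the guess with the lowest loss on an even grid."""
    best_x, best_loss = lo, toy_sae_loss(lo)
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        loss = toy_sae_loss(x)
        if loss < best_loss:
            best_x, best_loss = x, loss
    return best_x, best_loss


best_x, best_loss = argmin_on_grid()
print(f"best guess: {best_x:.3f}, loss: {best_loss:.3f}")
print(f"loss at the true value 2: {toy_sae_loss(2.0):.3f}")
```

The penalized optimum undershoots the true value: guessing 2 reconstructs perfectly but pays the full L1 cost, so a smaller guess achieves a lower total loss.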

1 · Lee Sharkey · 2mo
Makes sense! Thanks!

I really appreciated this retrospective; it changed my mind about the sparsity penalty. Thanks!

1 · Lee Sharkey · 2mo

Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

This is turned on by default in TL, so I think there must be something else weird about models, rather than just a naive bias, that causes you to need the paired (difference) approach.

3 · Alex Turner · 7mo
I still don't follow. Apparently, TL's center_writing_weights is adapting the writing weights in a pre-LN-invariant fashion (and also in a way which doesn't affect the softmax probabilities after unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So that means that our results in particular should not be affected by TL vs HF.
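The invariance described here can be illustrated with a toy calculation. This is a sketch under stated assumptions, not TransformerLens code: it assumes that centering means subtracting each weight row's mean over the output axis (the d_model axis for a matrix writing to the residual stream, the vocabulary axis for the unembed), and every name and dimension below is hypothetical. Centering a writing matrix shifts its output by a constant along the all-ones direction, which LayerNorm's mean subtraction removes. Centering the unembed shifts all logits by the same constant, which softmax ignores.

```python
import math
import random

random.seed(0)
D_MODEL, D_HIDDEN, D_VOCAB = 8, 12, 5  # toy sizes, chosen arbitrarily


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def center_rows(matrix):
    """Subtract each row's mean, so every row sums to zero over the output axis."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def vec_mat(vec, matrix):
    """Row vector times matrix: out[j] = sum_i vec[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or bias: subtract mean, divide by std."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / len(x)
    return [c / math.sqrt(var + eps) for c in centered]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


resid = [random.gauss(0, 1) for _ in range(D_MODEL)]    # residual stream
hidden = [random.gauss(0, 1) for _ in range(D_HIDDEN)]  # e.g. MLP activations
W_out = rand_matrix(D_HIDDEN, D_MODEL)                  # writes to the residual
W_U = rand_matrix(D_MODEL, D_VOCAB)                     # unembed

# Pre-LN output after a write, with and without centered writing weights.
ln_plain = layer_norm(
    [r + w for r, w in zip(resid, vec_mat(hidden, W_out))])
ln_centered = layer_norm(
    [r + w for r, w in zip(resid, vec_mat(hidden, center_rows(W_out)))])

# Output probabilities, with and without a centered unembed.
probs_plain = softmax(vec_mat(ln_plain, W_U))
probs_centered = softmax(vec_mat(ln_plain, center_rows(W_U)))

print("LayerNorm difference:", max_abs_diff(ln_plain, ln_centered))
print("softmax difference:  ", max_abs_diff(probs_plain, probs_centered))
```

Both differences should be at the level of floating-point error, which matches the "up to precision limitations" caveat in the comment.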

> Can we just add in  times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.

Do you have evidence for this?

It's totally unsurprising to me that you need to do this on HuggingFace models as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project and TL removes the mean from all additions to the residual stream ...

3 · Alex Turner · 7mo
We used TL to cache activations for all experiments, but are considering moving away to improve memory efficiency. Oh, somehow I'm not familiar with this. Is this center_unembed? Or are you talking about something else? Yes, but I think the evidence didn't actually come from the "Love" - "Hate" prompt pair. Early in testing we found paired activation additions worked better. I don't have a citeable experiment off-the-cuff, though.

I think this point was really overstated. I get the impression the rejected papers were basically turned into the arXiv format as fast as possible, and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, and apparently were not rejected.

2 · David Manheim · 1y
There have also been plenty of other adaptations, ones which were not low-effort. I worked on two: the Goodhart's law paper and a paper with Issa Rice on HRAD. Both were very significantly rewritten and expanded into "real" preprints, but I think it was clearly worthwhile.

Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what’s the relevance of the transformer, which imperfectly implements the label function?

2 · Stephen Casper · 1y
Yes, it does show the ground truth. The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.

I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because i) this post highlights some concrete things that change during training in these models and ii) some believe that a lot of the risk from language models comes from these further training steps.

If anyone is interested, I think surveying the various fine-tuned and base models here seems the best open-source resource, at least before CarperAI release some RLHF models.

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, for me $\mathbb{B}$ for {True, False} was not standard; I think it should be defined.

2 · Evan Hubinger · 1y
They're unconditional, not conditional probabilities. The comma is just for the exists quantifier. Sure—edited.

Thanks for the comment!

I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.

In depth, when GPT-Neo is fed a sequence of tokens  where  are uniformly random and  for , there are four heads in Layer 6 that have the induction attention pattern (i.e attend from ...

Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".

(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

1 · David Scott Krueger · 1y
It could work if you can use interpretability to effectively prohibit this from happening before it is too late. Otherwise, it doesn't seem like it would work.

What happened to the unrestricted adversarial examples challenge? The GitHub repository [1] has not been updated since 2020, and that update was only to the warmup challenge. Additionally, were there any takeaways from the contest?