Hi Scott, thanks for this!
Yes, I did do a fair bit of literature searching (though maybe not enough, tbf), but it was very focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models which are monosemantic by default, which I've never had much confidence in. On the decomposition side, it seems that there's not a huge amount beyond Yun et al.'s work, at least as far as I've seen.
Still, I've seen almost none of these, which suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background on attempts to make more interpretable models.
Hi Charlie, yep, it's in the paper, but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used were somewhat limited: still on the order of a million examples, but 10-50x fewer than went into the autoencoders.
It's not clear whether the extra data would provide much signal, since it can't learn an overcomplete basis and so has no way of learning rare features, but it might be able to outperform our ICA baseline presented here. If you wanted to give someone a project of making that available, I'd be interested to see it!
Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
It's not PCA but we've been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We've found that they're on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a "this direction activates on curse words" direction to make text more aggressive. If people are interested in exploring this further let me know, say hi at our EleutherAI channel or check out the repo :)
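For anyone who wants to try the PCA version of the suggestion quoted above, here is a minimal sketch. It is not our sparse coding method: it assumes you already have a matrix of residual stream activations (here replaced by synthetic data with one planted direction), and the function names `principal_directions` and `add_direction` are made up for illustration.

```python
import numpy as np

def principal_directions(acts, k):
    """Return the top-k unit-norm principal directions of acts, shape (n_samples, d_model)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def add_direction(acts, direction, scale):
    """Activation addition: shift every activation vector along one direction."""
    return acts + scale * direction

rng = np.random.default_rng(0)

# Synthetic stand-in for real activations (an assumption for this sketch):
# one high-variance planted direction plus small isotropic noise.
planted = np.zeros(16)
planted[3] = 1.0
acts = rng.normal(size=(500, 1)) * 5.0 * planted + rng.normal(size=(500, 16)) * 0.1

dirs = principal_directions(acts, k=2)
steered = add_direction(acts, dirs[0], scale=4.0)

# The sign of a principal direction is arbitrary, so compare absolute alignment.
alignment = abs(float(dirs[0] @ planted))
```

With real activations you would then check by hand whether adding `dirs[0]` at some layer changes the outputs in a recognisable way, which is the part this toy cannot show.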
For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.
What does this mean?
Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I'm surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don't line up with each other in the way I would have thought necessary for this approach to work. It's super interesting to me that it does.
If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:
I'd love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.
Could you explain why you think "The game is skewed in our favour."?
Agree that the cited links don't represent a strong criticism of RLHF, but I think there's an interesting implied criticism, between the mode-collapse post and janus' other writings on cyborgism etc., that I haven't seen spelled out, though it may well be somewhere.
I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers around base models in the vein of Loom.
I guess the test (best done without too much fanfare) would be to get a few people well acquainted with Loom or whichever wrapper tool and identify a few complex tasks and see whether the base model or the RLHF model performs better.
Even if true though, I don't think it's really a mark against RLHF since it's still likely that RLHF makes outputs safer for the vast majority of users, just that if we think we're in an ideas arms-race with people trying to advance capabilities, we can't expect everyone to be using RLHF'd models.
Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations, which are quite commonly suggested, so I'm v curious what you're thinking of.
Also, I think it's worth saying that the result connecting babbage to text-davinci-001 is stronger than the one connecting ada to text-ada-001 (by Δlogprob), so it feels like the first one shouldn't count as a solid success.
I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at the rank of each token's logprob among all tokens, rather than the raw logprob, which is pushed super low by the collapse?
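To make the rank-versus-raw distinction concrete, here is a rough sketch of a rank correlation over logprobs. The numbers are invented, not taken from either model, and `ranks`, `pearson` and `rank_correlation` are illustrative helpers rather than anything from the post.

```python
import math

def ranks(values):
    """Rank each value, 0 = largest; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rank_correlation(a, b):
    """Correlation of ranks rather than raw values (Spearman-style, no tie handling)."""
    return pearson(ranks(a), ranks(b))

# Made-up logprobs over the same five tokens. In the second list almost all
# of the mass sits on one token, so the other logprobs are pushed very low,
# but the ordering of the tokens is unchanged.
spread_out = [-0.5, -2.0, -3.0, -4.0, -6.0]
collapsed = [-0.001, -9.0, -15.0, -30.0, -31.0]

raw_corr = pearson(spread_out, collapsed)
rank_corr = rank_correlation(spread_out, collapsed)
```

Because ranks ignore how far the tail is pushed down, `rank_corr` depends only on whether the two models order tokens the same way, which is the property the question is about.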
I wanted to test out the prompt generation part of this so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I've been initialising it with a paragraph about COVID and testing how few tokens it needs to be able to change before it reliably outputs a particular output token.
Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph, eg "typical people infected Majesty the virus will experience mild to moderate 74 illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention. Older people and those with underlying medical conditions like cardiovascular disease" has a >99.5% chance of ' 74' as the next token. Penalising repetition makes the task much harder.
It can even pretty reliably cause GPT-2 to output SolidGoldMagikarp with >99% probability by only changing 10% of the tokens, though it does this by just inserting SolidGoldMagikarp wherever possible. As far as I've seen playing around with it for an hour or so, if you penalise repeating the initial token then it never succeeds.
I don't think these attacks are at all new (see Universal Adversarial Triggers from 2019 and others) but it's certainly fun to test out.
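The constrained setup described above (fix a paragraph, let only a fraction of its tokens change) can be sketched roughly as below. This is a toy, not what I actually ran: it swaps in a greedy coordinate search for the real optimisation, and `toy_score` is a hypothetical stand-in for the model's probability of the target token.

```python
import random

def editable_positions(n_tokens, fraction, rng):
    """Pick which token positions are allowed to change."""
    k = max(1, int(n_tokens * fraction))
    return sorted(rng.sample(range(n_tokens), k))

def constrained_search(tokens, vocab, score, fraction, rng, sweeps=2):
    """Greedily rewrite only the editable positions to maximise score(tokens)."""
    tokens = list(tokens)
    positions = editable_positions(len(tokens), fraction, rng)
    for _ in range(sweeps):
        for pos in positions:
            # Try every vocabulary item at this position and keep the best one.
            tokens[pos] = max(
                vocab,
                key=lambda cand: score(tokens[:pos] + [cand] + tokens[pos + 1:]),
            )
    return tokens, positions

def toy_score(tokens):
    # Hypothetical stand-in for P(target token | prompt): it simply rewards
    # copies of the target in the prompt. A real run would query the model.
    return tokens.count("74") / len(tokens)

rng = random.Random(0)
original = "most people infected with the virus will experience mild illness".split()
vocab = ["74", "Majesty", "the", "virus", "mild"]

edited, positions = constrained_search(original, vocab, toy_score, fraction=0.2, rng=rng)
changed = [i for i, (a, b) in enumerate(zip(original, edited)) if a != b]
```

With this toy score the search just writes the target into every editable slot, and a repetition penalty would be an extra term subtracted inside `score`.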
This raises a number of questions:
I'd be interested to know how you estimate the numbers here, they seem quite inflated to me.
If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 capital-to-salary ratio, each company's investment would mean hiring about $50B/$900K ≈ 55,000 people to work on this stuff. For reference, the total headcount at these orgs is roughly 100-200K.
$50B/yr is also around 25-50% of the size of the total income, and greater than profits for most, which again seems high.
Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023.
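The back-of-envelope headcount figure above can be reproduced as follows. The inputs are just my stated assumptions ($50B, a $300k average salary, 2:1 capital to salary), and `implied_hires` is a made-up helper name.

```python
def implied_hires(investment_usd, salary_usd, capital_to_salary):
    """Number of people a budget funds if each hire costs salary plus matched capital."""
    cost_per_hire = salary_usd * (1 + capital_to_salary)
    return investment_usd / cost_per_hire

# One company's hypothetical $50B: each hire costs $300k + 2 * $300k = $900k.
hires = implied_hires(50e9, 300e3, 2.0)  # roughly 55,600 people

# Share of an existing 100-200K headcount that this would represent.
share_of_large_org = hires / 200e3
share_of_small_org = hires / 100e3
```

Raising the capital ratio lowers the implied hiring proportionally, e.g. `implied_hires(50e9, 300e3, 5.0)` is half the figure above, so the conclusion is sensitive to that assumption.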
Agree with paul's comment above that timeline shifts are the most important variable.