In a human mind, a lot of cognition happens in diffuse, illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML is going if some of the steps in these systems' cognition looked like low-bandwidth token streams, rather than giant vectors.
In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem.
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.
As an example, here are three possible reactions to a no-ghost update:
Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren't real. Let's say this basically sets the ghost-related latent variable's value to false in all shard-relevant contexts. Then it seems perfectly fine that most of your shards keep on bidding away and determining your actions (e.g. protect your family), since most of your value shards are not in fact functions of the ghost latent variable. While it's indeed possibl…
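A minimal sketch of that claim (every name here is a hypothetical illustration, not a real implementation): shards read world-model variables and emit action bids, so flipping the ghost variable to false only silences the shards that actually take it as input.

```python
# Toy model: shards read world-model latent variables and emit action bids.
# All names here are hypothetical illustrations.

world_model = {"ghost_present": True, "family_in_danger": True}

def appease_ghosts_shard(wm):
    # Bids only when its input variable is active.
    return {"perform_ritual": 1.0} if wm["ghost_present"] else {}

def protect_family_shard(wm):
    # Never reads the ghost variable, so it is unaffected by the update.
    return {"guard_house": 2.0} if wm["family_in_danger"] else {}

def decide(wm, shards):
    bids = {}
    for shard in shards:
        for action, strength in shard(wm).items():
            bids[action] = bids.get(action, 0.0) + strength
    return max(bids, key=bids.get) if bids else None

shards = [appease_ghosts_shard, protect_family_shard]
print(decide(world_model, shards))    # guard_house (the ghost shard is outbid)
world_model["ghost_present"] = False  # the "no-ghost update"
print(decide(world_model, shards))    # guard_house, unchanged
```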
I think that both the easy and hard problems of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not …
The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. In a sense, this means having aligned goals without having the same goals: your goal is to cooperate with "human goals", but you don't yet have a full description of what human goals are. Your value function might be much simpler than the human value function.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
Insofar as I understa…
I feel confused. I think this comment is overall good (though I don't think I understand a third of it), but it doesn't seem to suggest that the genome actually solved information inaccessibility in the form of reliably locating learned WM concepts in the human brain?
But, do you really fundamentally care that your kids have genomes?
Seems not relevant? I think we're running into an under-definition of IGF (and the fact that it doesn't actually have a utility function, even over local mutations on a fixed genotype). Does IGF have to involve genomes, or just information patterns as written in nucleotides or in binary? The "outer objective" of IGF suffers a classic identifiability issue common to many "outer objectives", where the ancestral "training signal" history is fully compatible with "IGF just for genomes" and also …
I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...
And insofar as this impression is correct, this is a mistake. There is only one way alignment is.
If inner/outer is altogether a more faithful picture of those dynamics:
FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans, aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans—the quotation of human desires, as quoted via those evaluations. And I think that's a subtle distinction which can prove quite fatal.
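A minimal sketch of that distinction (all values and names are made-up illustrations): the planner below optimizes a human evaluation of plans, not the underlying desiderata, so any plan whose *evaluation* beats honest plans wins.

```python
# Toy illustration: optimizing evaluations-of-plans vs. plans themselves.
# `true_value` stands in for what people actually want; the planner never sees it.

plans = ["build_hospital", "fake_report_of_hospital", "do_nothing"]

def true_value(plan):
    # Inaccessible ground truth.
    return {"build_hospital": 10, "fake_report_of_hospital": -50, "do_nothing": 0}[plan]

def human_evaluation(plan):
    # What the planner actually optimizes. A strong planner can find plans whose
    # evaluation exceeds that of genuinely good plans, e.g. a fake report
    # convincing enough to be scored slightly higher than the real thing.
    return {"build_hospital": 10, "fake_report_of_hospital": 11, "do_nothing": 0}[plan]

best = max(plans, key=human_evaluation)
print(best, true_value(best))  # fake_report_of_hospital -50
```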
I think that's also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think that, to the extent this perspective is useful for alignment, it also ought to be useful for reasoning about the behavior of existing systems like large language models.
Sure. To clarify, superior to what? "GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective"?
…because I don't know what alignment means that I think it's helpful to have some hand-hold terms like "alignment"
Do you mean "outer/inner alignment"?
Supposing you mean that—I agree that it's good to say "and I'm confused about this part of the problem", while also perhaps saying "assuming I've formulated the problem correctly at all" and "as I understand it."
I don't really disagree with anything you've written, but, in general, I think we should allow some of our words to refer to "big confusing problems" that we don't yet know how to clarify, because we s…
I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.
I was talking with Abram Demski today about a promising-seeming research direction. (The following is my own recollection.)
One of my (TurnTrout's) reasons for alignment optimism is that I think:
I think I have two main complaints still, on a skim.
First, I think the following is wrong:
These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment.
I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more nat…
I think these are great counterpoints. Thanks for making them.
I still buy "the helicopter parent 'outer alignment' training regime is unwise for 'aligning' kids" and that deliberate parenting is better than chance, but possibly/probably not the primary factor. I haven't yet read much data here, so my views feel relatively unconstrained, beyond my "common sense."
I think there's an additional consideration with AI, though: we control the reward circuitry. If lots of variance in kid-alignment is due to genetic variation in reward circuitry or learning hyperparameters or whatever, then we also control that with AI; that is also part of understanding AI inductive biases.
I don't know.
Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that the cached completion is now a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but from a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much.)
Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart as (or smarter than) the plan-generator.
Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.
Therefore, any pla…
Before reading, here are my reactions to the main claims:
Not only does RL not, by default, produce policies which have reward maximization as their behavioral objective, but in fact I argue that it is not possible for RL policies to care about "reward" in an embedded setting. (Agreed, under certain definitions of "reward.")

I argue that this does not imply that wireheading in RL agents is impossible, because wireheading does not mean "the policy has reward as its objective." It is still possible for an RL agent to wirehead, and in fact, is a hig…
I'm often asked, Why "shard theory"? I suggested this name to Quintin when realizing that human values have the type signature of contextually activated decision-making influences. The obvious choice, then, was to call these things "shards of value"—drawing inspiration from Eliezer's Thou art godshatter, where he originally wrote "shard of desire."
(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")
This name has several advantages. Value-shards can have many subshards/facets which vary contextually (a real crysta…
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur.
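A toy rendering of that bargaining claim (all numbers hypothetical): the context-gated penalty costs the exercise-valuing shards nothing outside the feared context, while the blanket rule forgoes that value everywhere.

```python
# Toy bargained utility functions: contextual vs. blanket jumping-jack penalties.

def base_value(s):
    # Other shards (fitness, fun) place value on doing jumping jacks at all.
    return 1.0 if s["jumping_jacks"] else 0.0

def contextual_penalty(s):
    return -5.0 if s["jumping_jacks"] and s["peers_present"] else 0.0

def blanket_penalty(s):
    return -5.0 if s["jumping_jacks"] else 0.0

def realized_value(penalty):
    # In each context, the agent does jumping jacks only if it's net-positive.
    total = 0.0
    for peers in (True, False):
        do = {"jumping_jacks": True, "peers_present": peers}
        skip = {"jumping_jacks": False, "peers_present": peers}
        total += max(base_value(do) + penalty(do), base_value(skip) + penalty(skip))
    return total

print(realized_value(contextual_penalty))  # 1.0 -- still exercises alone
print(realized_value(blanket_penalty))     # 0.0 -- all exercise value lost
```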
A dangerous intuition pump here would be something like, "If you take a human who was trained really hard in childhood to have faith in God and show epistemic deference to the Bible, and inspecting the internal contents of their thought at age 20 showed that they still had great faith, if you kept amping up that human's intelligence their epistemology would at some point explode"
Yes, a value grounded in a factual error will get blown up by better epistemics, just as "be uncertain about the human's goals" will get blown up by your beliefs getting their entr…
You seem to mostly be imagining a third category:
3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?
I don't care about question 3. It's been more than 4 years since I even seriously discussed the possibility of learning on a mechanism like that, and even at that point it was not a very serious discussion.
"Don't care" is quite strong. If you still hold this view -- why don't you care about 3? (Curious to hear from other people who basically don't care about 3, either.)
I agree that similar environments are important, but I don't see why you think they explain most of the outcomes. What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?
Like, what it feels like to understand human inductive biases isn't to think "Gee, I understand inductive biases!". It's more like: "I see that my son just scowled after agreeing to clean his room. This provides evidence about his internal motivational composition, even though I can't do interpre…
What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?
Mimicking adult behavior even when the adult isn't paying any attention to the child (and children with different genes having slightly different sorts of mimicry). Automatically changing purity norms in response to disease and perceived disease risk. Having a different outlook on the world if you always had plenty of food growing up. Children of athletic parents often being athletic too, which changes how they…
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare t…
Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).
I could (and did) hope that I could sp…
I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS
I think these kinds of comments update readers' beliefs in a bad, invalid way. The bad event (AGI ruin) is argued for by... a request for me to condition on testimony of a survivor of that bad event. Yes, I know the whole thing is tongue-in-cheek. I know that EY is not literally claiming to be a time-tra…
"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:
Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profo…
The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior ch…
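For concreteness, here is a minimal sketch of the test the quoted passage describes (the `policy(situation, overseen=...)` interface is hypothetical, not a real API). The disagreement above is about interpreting a nonzero score, since a non-deceptive agent can also legitimately condition on whether a human is present.

```python
# Sketch of an oversight-sensitivity probe over paired inputs that differ
# only in whether oversight is indicated.

def oversight_sensitivity(policy, situations):
    flips = 0
    for s in situations:
        if policy(s, overseen=True) != policy(s, overseen=False):
            flips += 1
    # Fraction of situations where behavior changes with the oversight flag.
    return flips / len(situations)
```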
I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
(If anyone is interested in doing research on the evolution of prosocality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read; I think it's quite important). I think that evolution-based analogies …
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / re…
it's just a huge gap from "Rewards have unpredictable effects on agent's cognition, not necessarily to cause them to want reward" to "we have a way to use RL to interpret and implement human wishes."
So, OP said
In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.
I read into this a connotation of "In general, there isn't a practically-findable way to use RL...". I'm now leaning towards my original interpretation being wrong -- that you meant som…
I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
I think this tells us relatively little about the internal cognition, and so is a relatively non-actionable fact (which you probably agree with?). But I want to sort out my thoughts more, here, before laying down more of my intuitions on that.
Related clarifying question:
To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit[...]
similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them
IDK if this is causally true or just evidentially true. I also further don't know why it would be mechanistically relevant to the heuristic you posit.
Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into "try new things which [among other criteria] aren't obviously going to cause bad value drift away from current values." One reason I expect the refinement in …
I regret that this post doesn't focus on practical advice derived from shard theory. Instead, I mostly focused on a really cool ideal-agency trick ("pretend really hard to wholly fool your own credit assignment"), which is impracticable for real people (joining the menagerie currently inhabited by e.g. logical inductors, value handshakes, and open-source game theory).
I think that shard theory suggests a range of practical ways to improve your own value formation and rationality. For example, suppose I log in and see that my friend John compl…
As a general approach to avoiding value drift
One interpretation of this phrase is that we want AI to generally avoid value drift -- to get good values into the AI, and then leave them be. (This probably isn't what you meant, but I'll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely-good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce their own coalitional weight.
I'm interested in why you doubt this? I can imagine various interpretations of the quote which I doubt, and some which are less doubtful-to-me.
Overall I think "simulators" names a useful concept. I also liked how you pointed out and deconfused type errors around "GPT-3 got this question wrong." Other thoughts:
I wish that you had more strongly ruled out "reward is the optimization target" as an interpretation of the following quotes:
RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.
Simulators like GPT give us methods of instantiating intelligent processes, including goal-directed agents, with methods other than optimizi…
what exactly is wrong with saying the reward function I described above captures what I really want?
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
-- Evan Hubinger, commenting on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures
There exist models which just wirehead or set the reward to +1 or show …
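One way to render the quoted definition formally (my paraphrase, not Hubinger's notation), with $J_r(M')$ standing for model $M'$'s performance under $r$ in the limit of perfect optimization and unlimited data:

$$\mathrm{OuterAligned}(r) \iff \forall M \in \operatorname*{arg\,max}_{M'} J_r(M') :\ \mathrm{Aligned}(M)$$

Since the condition is universally quantified, a single optimal-but-misaligned model, e.g. one that seizes the reward channel and sets reward to +1, is enough to falsify outer alignment of $r$.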
I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.
I read Yudkowsky as positing some kind of conservation law. Something like: if the plans produced by your AI would, if implemented, succeed at having specifically chosen far-reaching consequences, then the AI must have done reasoning about far-reaching consequences.
Why this seems true:
I feel confused by this sentence. Reward is not the optimization target. Reward provides cognitive updates to the agent. ETA: So, shouldn't wisely-selected reward schedules produce good cognitive updates, which produce a mind which implements human wishes?
Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.
In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.
Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), b…
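To make the shared structure concrete, here's a minimal runnable sketch (a linear "network" and toy signals, purely illustrative): both regimes compute a scalar from the network's outputs, differentiate, and nudge the same kind of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # one linear "network" used in both regimes

def ssl_step(w, x, y, lr=0.1):
    # (Self-)supervised: the target comes from the data; descend squared error.
    pred = x @ w
    grad = 2.0 * (pred - y) * x         # d/dw (pred - y)^2
    return w - lr * grad

def rl_step(w, x, lr=0.1):
    # Policy gradient (REINFORCE, Bernoulli policy): reward scales the update
    # toward the log-probability of the action actually sampled.
    p = 1.0 / (1.0 + np.exp(-(x @ w)))  # probability of action 1
    a = float(rng.random() < p)         # sample an action
    reward = a                          # stand-in reward signal
    grad_logp = (a - p) * x             # d/dw log pi(a | x)
    return w + lr * reward * grad_logp

x = rng.normal(size=3)
w = ssl_step(w, x, y=1.0)
w = rl_step(w, x)
```

In both cases the reward/loss function is just a source of parameter updates, which is why "RL trains agents" needs an argument beyond the name of the field.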
I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?
If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)
What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.
So let's figure out how to supply good reinforce…
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?
I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard eco…
What I think you do mean:
This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!
What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."
I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activ…
Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.
But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.
This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.
I'd also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and lack of selection pressure ancestrally, I infer math ability probably isn't hardcoded.
There are a range of possible explanations which reconcile these two observations, like "better gen…
Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"
A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"
So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update t…
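A runnable toy of the worry (everything here is hypothetical and deliberately crude): the update applies to whatever internal influences produced the flagged behavior, so the aligned motivation we assumed at initialization gets ground down along with everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": weights on two internal decision-influences. Index 0 is the
# human-aligned motivation we assumed at initialization; index 1 is some
# other drive that sometimes endorses leaving the lab.
w = np.array([1.0, 1.0])

def escape_decision(situation, w):
    # In adversarially chosen situations, either influence can contribute to
    # "break out" (e.g. leaving to protect people from a perceived threat).
    contributions = w * situation
    return contributions, contributions.sum() > 1.0

for _ in range(200):
    # The adversary proposes situations where the model would break out.
    situation = rng.uniform(0.5, 1.0, size=2)
    contributions, escaped = escape_decision(situation, w)
    if escaped:
        # Update against whatever produced the behavior, in proportion to
        # how much each influence fired -- including the aligned one.
        w -= 0.01 * contributions

print(w)  # both weights shrink; the aligned influence is penalized too
```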