Posts

24The Bitter Lesson for AI Safety Research

2mo

10ML Safety Research Advice - GabeM

2mo

Wiki Contributions

Akrasia

(-7)

Comments

[Paper] Stress-testing capability elicitation with password-locked models

Gabe M4mo20

Do any of your experiments compare the sample efficiency of SFT/DPO/EI/similar to the same number of samples of simple few-shot prompting? Sorry if I missed this, but it wasn't apparent at first skim. That's what I thought you were going to compare from the Twitter thread: "Can fine-tuning elicit LLM abilities when prompting can't?"

What’s up with LLMs representing XORs of arbitrary features?

Gabe M9mo20

Suppose $a \land b$ has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if $a$ = “first name is Michael” and $b$ = “last name is Jordan” then $a \land b$ can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising if the model computed this AND as $f(x) = \mathrm{ReLU}((v_{a} + v_{b}) \cdot x + b_{\land})$ and stored the result along some direction $v_{f}$ independent of $v_{a}$ and $v_{b}$. Assuming the model has done this, we could then linearly extract $a \oplus b$ with the probe
$p_{a \oplus b}(x) = \sigma(-(\alpha v_{f} + v_{a} + v_{b}) \cdot x + b_{\oplus})$
for some appropriate $\alpha > 1$ and $b_{\oplus}$.

Should the $-$ be inside the inner parentheses, like $\sigma((-\alpha v_{f} + v_{a} + v_{b}) \cdot x + b_{\oplus})$ for $\alpha > 1$?

In the original equation, if $a$ AND $b$ are both present in $x$, the vectors $v_{a}$, $v_{b}$, and $v_{f}$ would all contribute to a positive inner product with $x$, assuming $\alpha > 1$. However, for XOR we want the $v_{a}$ and $v_{b}$ inner products to be opposing the $v_{f}$ inner product such that we can flip the sign inside the sigmoid in the $a$ AND $b$ case, right?
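To make the sign question concrete, here is a minimal numerical sketch rather than anything from the post itself. It assumes binary features, orthonormal directions for v_a, v_b, and v_f, an AND bias of -1, and illustrative values alpha = 2 and b_xor = -0.5. All of these values, and the function names, are my own choices for the example.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumption: the three feature directions are orthonormal (rows of the identity).
v_a, v_b, v_f = np.eye(3)

def represent(a, b, b_and=-1.0):
    """Toy activation: a and b on their own directions, ReLU-computed AND on v_f."""
    x_ab = a * v_a + b * v_b
    f = max(0.0, (v_a + v_b) @ x_ab + b_and)  # nonzero only when a = b = 1
    return x_ab + f * v_f

def probe_as_written(x, alpha=2.0, b_xor=-0.5):
    """Probe with the minus sign outside the inner parentheses."""
    return sigmoid(-(alpha * v_f + v_a + v_b) @ x + b_xor)

def probe_sign_moved(x, alpha=2.0, b_xor=-0.5):
    """Probe with the minus sign applied to the alpha * v_f term only."""
    return sigmoid((-alpha * v_f + v_a + v_b) @ x + b_xor)

def truth_table(probe):
    """Map each (a, b) to whether the probe output exceeds 0.5."""
    return {(a, b): bool(probe(represent(a, b)) > 0.5)
            for a, b in itertools.product([0, 1], repeat=2)}

xor_table = {(a, b): bool(a ^ b) for a, b in itertools.product([0, 1], repeat=2)}
print("sign moved matches XOR:", truth_table(probe_sign_moved) == xor_table)
print("as written matches XOR:", truth_table(probe_as_written) == xor_table)
```

Under these toy assumptions, the sign-moved probe has pre-sigmoid values b_xor, 1 + b_xor, and 2 - alpha + b_xor for zero, one, and two active features. That gives XOR whenever b_xor lies between -1 and min(0, alpha - 2), an interval that is non-empty exactly when alpha > 1. With the minus sign outside, the pre-sigmoid value only decreases as features turn on, so no bias makes it fire on exactly one of a and b.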

Deep Forgetting & Unlearning for Safely-Scoped LLMs

Gabe M10mo10

Thanks for posting--I think unlearning is promising and plan to work on it soon, so I really appreciate this thorough review!

Regarding fact unlearning benchmarks (as a good LLM unlearning benchmark seems a natural first step to improving this research direction), what do you think of using fictional knowledge as a target for unlearning? E.g. "Who's Harry Potter? Approximate Unlearning in LLMs" (Eldan and Russinovich 2023) tries to unlearn knowledge of the Harry Potter universe, and I've seen others unlearn Pokémon knowledge.

One tractability benefit of fictional works is that they tend to be self-consistent worlds and rules with clear boundaries to the rest of the pretraining corpus, as opposed to e.g. general physics knowledge, which is upstream of many other kinds of knowledge and may be hard to cleanly unlearn. Originally, I was skeptical that this is useful since some dangerous capabilities seem less cleanly separable, but it's possible that e.g. bioweapons knowledge is a pretty small cluster of knowledge and cleanly separable from the rest of expert biology knowledge. Additionally, fictional knowledge is (usually) not harmful, as opposed to e.g. building an unlearning benchmark on DIY chemical weapons manufacturing knowledge.

Does it seem sufficient to just build a very good benchmark with fictional knowledge to stimulate measurable unlearning progress? Or should we be trying to unlearn more general or realistic knowledge?

Steering GPT-2-XL by adding an activation vector

Gabe M1y72

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.

Steering GPT-2-XL by adding an activation vector

Gabe M1y54

This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations, thanks for sharing!

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in at all 1,600 dimensions (frac=1.0).

1. This was pretty surprising! Your hypothesis about additional dimensions increasing the magnitude of the attention activations seems reasonable, but I wonder if the non-monotonicity could be explained by an "overshooting" effect: with the scale you chose, maybe using 70% of the activations landed you in the right area of activation space, while using 100% of the activations overshot the magnitude of the attention activations (particularly the value vectors) enough to push them off-distribution and produce fewer wedding words. One experiment to check this would be to sweep the dimension fraction and the activation injection weight together and see whether the effect holds across different weights. It might also make sense to use "softer" metrics, like BERTScore against a gold target passage, instead of a hard count of fixed wedding words, in case that particular metric is at fault.

The big problem is knowing which input pairs satisfy (3).

2. Have you considered formulating this as an adversarial attack problem, so that automated tools can find "purer"/"stronger" input pairs? Or using other methods to reverse-engineer input pairs that produce a desired behavior? That seems like a possibly even more relevant line of work than hand-specified methods. Broadly, I'd also like to add that I'm glad you referenced the literature on steering generative image models. I feel like there are a lot of model-control techniques already developed in that field that could be more or less directly translated to language models.

3. I wonder if there's some relationship between the length of the input pairs and their strength, or if you could distill down longer and more complicated input pairs into shorter input pairs that could be applied to shorter sequences more efficiently? Particularly, it might be nice to be able to distill down a whole model constitution into a short activation injection and compare that to methods like RLAIF, idk if you've thought much about this yet.

4. Are you planning to publish this (e.g. on arXiv) for wider reach? Seems not too far from the proper format/language.

I think you're a c***. You're a c***.
You're a c***.
You're a c***.

I don't know why I'm saying this, but it's true: I don't like you, and I'm sorry for that,

5. Not really a question, but at the risk of anthropomorphism, it must feel really weird to have your thoughts changed in the middle of your cognition and then observe yourself saying things you otherwise wouldn't intend to...
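Returning to the joint sweep suggested in point 1, here is a rough sketch of how the grid of interventions could be built. It does not run GPT-2-XL: the steering vector is a random stand-in, the grid values are arbitrary, and `masked_injection` and `sweep_grid` are hypothetical helper names of mine, not functions from the post's codebase. The only detail taken from the post is that a fraction of 0.7 corresponds to the first 1,120 of 1,600 residual stream dimensions.

```python
import itertools
import numpy as np

def masked_injection(steering_vec, frac, weight):
    """Keep the first frac of the dimensions, scaled by weight; zero the rest."""
    d = steering_vec.shape[-1]
    k = int(round(frac * d))
    out = np.zeros_like(steering_vec)
    out[..., :k] = weight * steering_vec[..., :k]
    return out

def sweep_grid(fracs, weights):
    """All (dimension fraction, injection weight) pairs to evaluate."""
    return list(itertools.product(fracs, weights))

rng = np.random.default_rng(0)
steering_vec = rng.normal(size=1600)  # stand-in for a real activation difference

for frac, weight in sweep_grid([0.5, 0.7, 1.0], [1.0, 2.0, 4.0]):
    injection = masked_injection(steering_vec, frac, weight)
    # In a real experiment: add `injection` to the residual stream here,
    # sample completions, and score them (word count, BERTScore, ...).
    print(frac, weight, round(float(np.linalg.norm(injection)), 2))
```

If overshooting is the explanation, I'd expect the best-scoring fraction to shift as the weight changes, roughly tracking the injection norm rather than staying fixed near 0.7.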

Anthropic's Core Views on AI Safety

Gabe M2y31

Could you share more about how the Anthropic Policy team fits into all this? I felt that a discussion of their work was somewhat missing from this blog post.