I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable.
Here are four random features from layer 3, at a range of frequencies outside of the spike.
Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents).
Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is".
Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names.
Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.
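As a concrete check on the claim above that the spike is driving the L0s, here's a rough sketch of the computation I have in mind (the `feature_acts` tensor and the exact cutoff are stand-ins, not the actual eval code):

```python
import torch

def l0_breakdown(feature_acts: torch.Tensor, freq_cutoff: float = 0.038):
    """feature_acts: (n_tokens, n_features) SAE feature activations on a sample of tokens.

    freq_cutoff is a stand-in for "wherever the high-frequency spike starts."
    """
    active = feature_acts > 0                  # which features fire on which tokens
    freqs = active.float().mean(dim=0)         # per-feature firing frequency
    in_spike = freqs > freq_cutoff             # features in the high-frequency cluster

    l0_total = active.float().sum(dim=1).mean().item()               # avg features active per token
    l0_spike = active[:, in_spike].float().sum(dim=1).mean().item()  # portion coming from the spike
    print(f"{int(in_spike.sum())} features above frequency {freq_cutoff}")
    print(f"L0 = {l0_total:.1f}, of which {l0_spike:.1f} is from the spike")
    return freqs
```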
I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.
I'm guessing this is my biggest crux with the Quintin/Nora worldview, so if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design," I'm bidding for that argument to make it into the forthcoming document.
What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)
Places I think AI Optimists and I agree:
Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like “points I would like to see addressed in this document.”
A specific critique about the article:
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected.
Instead, we enter the realm of tool AI which basically does what you say.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and tamper with sensors.
I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)
Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as
where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them.
In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)
I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.
Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it as likely as A after finetuning.
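One way to make this intuition precise, under an idealized picture in which finetuning acts like a Bayesian update over policies (that picture is my assumption here, not something the experiment itself establishes): for finetuning data D,

$$\frac{P(B \mid D)}{P(A \mid D)} = \frac{P(D \mid B)}{P(D \mid A)} \cdot \frac{P(B)}{P(A)},$$

so bringing B up to even posterior odds with A requires the data to favor B by a likelihood ratio of P(A)/P(B), i.e. by about log P(A)/P(B) nats of evidence. On that picture, the prior odds are exactly the quantity measuring how hard you need to optimize for B.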
(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)
(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)
Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.
I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"
For example, consider a training set-up in which we prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough that "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the finetuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:
It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.
Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.
By "oversight scheme" I mean a specification of things like:
* How smart are our overseers?
* What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
* Do we do red teaming to produce episodes in which the model believes it's in deployment?
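Purely as an illustration (the field names below are hypothetical, not anything from the post), that kind of specification could be pinned down as a small config object:

```python
# Purely illustrative: one hypothetical way to write an "oversight scheme" down
# as an explicit config, so that experiments can vary one knob at a time.
from dataclasses import dataclass


@dataclass
class OversightScheme:
    overseer: str                      # how smart the overseers are (e.g. a weaker model, a human rater)
    confident_labels_only: bool        # restrict finetuning data to labels the overseers are sure about?
    finetuning_domains: list[str]      # how broad the finetuning distribution is
    red_team_deployment_beliefs: bool  # include red-teaming episodes where the model believes it's deployed?


example = OversightScheme(
    overseer="7th-grade-level grader",
    confident_labels_only=True,
    finetuning_domains=["very easy math problems"],
    red_team_deployment_beliefs=False,
)
```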
Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.
Somewhat related to the SolidGoldMagikarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:
| v_1 | v_2 | v_3 |
| --- | --- | --- |
| characterized | Columb | determines |
| Stra | 1900 | conserv |
| Ire | sher | distinguishes |
| sent | paed | emphasizes |
| Shelter | 000 | consists |
| Pil | mx | operates |
| stro | female | independent |
| wired | alt | operate |
| Kor | GW | encompasses |
| Maul | lvl | consisted |
Here v_1, v_2, v_3 are random vectors in embedding space (drawn from N(0, I_d)), and the columns give the 10 tokens whose embeddings are most cosine-similar to v_i. I used GPT-2-large.
Perhaps 20% of the time, we get something like v_3, where many of the nearest neighbors have something semantically similar among them (in this case, being present tense verbs in the 3rd person singular).
But most of the time, we get things that look like v_1 or v_2: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the v_2 column, the cosine similarity between the embeddings of these tokens is 0.06.
[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector v, we should typically find that v is ~orthogonal to all of the ~50000 token embeddings. Moreover, asking whether the nearest neighbors to v should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters C_1, ..., C_k; then compare the distribution of intra-cluster variances Var_{t in C_i}(cos(v, e_t)) to the distribution of cosine similarities of the cluster means cos(v, μ_i). From the perspective of cosine similarity to v, we should expect these clusters to look basically randomly drawn from the full dataset of token embeddings, so that each variance in the former set should be roughly the variance of cos(v, e_t) over the full vocabulary. This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to v to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.
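As a rough sense of scale (my own back-of-envelope, assuming v ~ N(0, I_d) as above and writing ê_t for the normalized embedding of token t):

$$\cos(v, e_t) = \frac{\langle v, \hat{e}_t \rangle}{\lVert v \rVert}, \qquad \langle v, \hat{e}_t \rangle \sim \mathcal{N}(0, 1), \qquad \lVert v \rVert \approx \sqrt{d},$$

so each cosine similarity is roughly N(0, 1/d); with d = 1280 for GPT-2-large, that's a standard deviation of about 0.028. Every token really is within a few hundredths of orthogonal to v, and which tokens land in the top 10 is plausibly dominated by this noise rather than by any single semantic direction.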
There's a little bit of code for playing around with this here.
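For reference, here's a minimal sketch of the kind of computation behind the table (my own sketch, not the linked notebook; it assumes the HuggingFace transformers GPT-2-large checkpoint):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2Model.from_pretrained("gpt2-large")

E = model.wte.weight.detach()        # token embedding matrix, shape (50257, 1280)
v = torch.randn(E.shape[1])          # a random direction v ~ N(0, I_d)

# cosine similarity between v and every token embedding, then take the top 10
sims = torch.nn.functional.cosine_similarity(E, v.unsqueeze(0), dim=1)
top = sims.topk(10)
for idx, sim in zip(top.indices.tolist(), top.values.tolist()):
    print(f"{tokenizer.decode([idx])!r}\t{sim:.3f}")
```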
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down.
(This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)
Here's an experiment I'm about to do: remove the high-frequency features (the spike in the histogram) from the 0_8192 dictionary and recompute its stats.
I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means a % loss recovered of around 72%.
Results:
I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in an L0 just below 40. The stats:
MSE Loss: 0.27 (worse than 1_32768)
Percent loss recovered: 77.9% (a little bit better than 1_32768)
I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually matter to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.)
It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.
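For anyone who wants to try this, here's a hedged sketch of the pruning operation (it assumes a hypothetical `sae.encode`/`sae.decode` interface and isn't the code that produced the numbers above):

```python
import torch

@torch.no_grad()
def prune_and_eval(sae, acts, freq_threshold=0.038):
    """Zero out dictionary features that fire on more than `freq_threshold` of tokens,
    then report how many features were pruned, plus L0 and reconstruction MSE.

    Assumes sae.encode(acts) -> (n_tokens, n_features) feature activations and
    sae.decode(features) -> reconstructed activations. Computing % loss recovered
    would additionally require running the underlying model on patched activations.
    """
    feats = sae.encode(acts)
    freqs = (feats > 0).float().mean(dim=0)    # per-feature firing frequency
    keep = freqs <= freq_threshold             # features outside the high-frequency cluster
    pruned = feats * keep                      # kill the high-frequency features

    l0 = (pruned > 0).float().sum(dim=1).mean().item()
    mse = ((sae.decode(pruned) - acts) ** 2).mean().item()
    print(f"pruned {int((~keep).sum())} features; L0 = {l0:.1f}, MSE = {mse:.3f}")
    return keep
```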