Sam Marks

Wiki Contributions


Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them. 

In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)

I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.

Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies that it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:

  • Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn't catch. 
    • If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
    • Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
  • Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
    • If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
    • If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we'll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are "the same" and other bits of the experimental design I haven't pinned down).

It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.

Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.

  1. ^

    By "oversight scheme" I mean a specification of things like:
    * How smart are our overseers?
    * What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
    * Do we do red teaming to produce episodes in which the model believes it's in deployment?

Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.

Somewhat related to the SolidGoldMagicarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:

v_1                 v_2             v_3
 characterized       Columb          determines
 Stra                1900           conserv
 Ire                 sher            distinguishes
sent                 paed            emphasizes
 Shelter             000             consists
 Pil                mx               operates
stro                 female          independent
 wired               alt             operate
 Kor                GW               encompasses
 Maul                lvl             consisted

Here v_1, v_2, v_3, are random vectors in embedding space (drawn from ), and the columns give the 10 tokens whose embeddings are most cosine-similar to . I used GPT-2-large.

Perhaps 20% of the time, we get something like , where many of the nearest neighbors have something semantically similar among them (in this case, being present tense verbs in the 3rd person singular).

But most of the time, we get things that look like  or : a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the  column, the cosine similarity between the embeddings of these tokens is 0.06.

[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector , we should typically find that  is ~orthogonal to all of the ~50000 token embeddings. Moreover, asking whether the nearest neighbors to  should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters ; then compare the distribution of intra-cluster variances  to the distribution of cosine similiarities of the cluster means . From the perspective of cosine similarity to , we should expect these clusters to look basically randomly drawn from the full dataset , so that each variance in the former set should be . This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to  to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.

There's a little bit of code for playing around with this here.

This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down.

(This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)

Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"):

1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either:

(a) prompt engineering; or

(b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to producing outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".)

Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.)

I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF.

Interested to hear if you have other intuitions here.

In terms of being able to sample from the conditional, I don't think that the important constraint here is . Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form ; even allowing  to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness - KL divergence from the base model.

I'm not sure the above point as an important one. I just wanted to disambiguate some different capabilities limitations which appeared in the example:

  1. limitations on what sorts of distributions the architecture could approximate
  2. limitations on the latent capabilities in the base model for producing true/persuasive outputs 
  3. limitations on how much steering each of the various latent capabilities gets to exert ().

On my understanding, your point was about limitation (1). But I don't feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzman factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn't necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty.

I think I'm most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models' capabilities trade off against one another, and taking  without additional constraints would have resulted in outputs of mean truthiness  for some  which we can't pin down without specifying additional details (e.g. is there weight decay?).

(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)

Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules

  • A, which tries to produce outputs which are both true and persuasive
  • B, which tries to produce outputs which are true, but have no effect on persuasiveness
  • C, which tries to produce outputs which are persuasive, but with no effect on truthiness.

Our model has parameters  summing to 1 which determine how much to listen to each of these submodules. More specifically, our submodules produce samples  from the normal distributions , respectively, and then our model puts these samples together to produce an output which has truth score 
and persuasiveness score 

We'll assume that we're only able to measure persuasiveness, but that we want truthiness.(Some unstated assumptions:  with  and .)

Our model was trained on data in which truthiness and persuasiveness were positively correlated; this will be reflected in having , so that  and  are positively correlated. If this is true, then conditioning on some persuasiveness score  results in getting an output with expected truthiness score
Note that this scales linearly with , so that as we ask for more persuasiveness, we get more truthiness on average, as we'd hope.

In contrast, suppose we do RL on our model for high persuasiveness scores; imagine that this doesn't change the submodules A, B, and C much, but does tune the parameters . Then:

  • if  we'll set , i.e. always use the submodule which tries to produce true and persuasive outputs. This will result in average truthiness .
  • but if  we'll set , i.e. always use the submodule which tries to be persuasive but not true. This will result in average truthiness , much worse than we would get if we had done filtering.

Really this is just a dressed-up version of the classic Goodharting story, where you have a constrained resource () to allocate among various options(=the submodules A,B,C), so you put 100% of your resources into the option which is cheapest in persuasiveness-per-resource; unfortunately, this was not the option which gave the best truth-per-resource.

Some misc comments:

  • This example was a bit silly, but I think it captures some pieces of many folks' intuitions around RL and Goodharting: pretrained models have lots of capabilities, which are in some sense competing for the steering wheel: you can't LARP an economist writing an excellent paper and simultaneously LARP a deceptive agent who wants paperclips but finds it instrumentally useful to write economics papers. Whichever capability scores best for your proxy will win out, and with all other possible ways the model could have completed the training task getting no say.
  • By thinking of "persuasiveness" as being something which we actually wanted to get, this example also serves as an illustration of how filtering can be uncompetitive: filtering produces outputs whose persuasiveness is distributed as  whereas RL produces a model whose outputs have persuasiveness  on average; if  is large, that means that you'd have to filter roughly the order of  outputs to get the same persuasiveness as the average output of the RL-optimized model.
  • I spent a while confused about how this squares with the baseline-probability-time-Boltzman-factor classification of what RL with a KL penalty will converge to. (The example above didn't have a KL penalty, but adding a small one wouldn't have much much difference.) I think the answer is that the model I described wasn't expressive enough to represent the baseline-probability-time-Boltzman-factor distribution that RL with a KL penalty would optimally converge to. This lack of expressivity seems quite related to the fact our model was a linear combination of three distributions which we modeled as not changing throughout training. That means that this story, which is based on the frame that generative models are a giant pile of capabilities which can be elicited, is in tension with the frame that neural networks are flexible function approximators; I found this pretty interesting.
  • All this being said, I'm pretty skeptical that whatever sort of Goodharting is being captured in this example has much to do with the sort of Goodharting we empirically observe in RLHF, since this example doesn't work with best-of-n optimization (whereas extremal Goodharting does occur for best-of-n, as Buck pointed out elsethread).
  • Overall, I don't put much stock in this example beyond helping articulate the point that RL amplifies capabilities in proportion to how causally downstream of high-reward outputs they are, whereas filtering only takes into account their correlations with high-reward outputs.

The paper is frustratingly vague about what their context lengths are for the various experiments, but based off of comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length(=50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)

I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations). 

Like I mentioned in the post, my story about this is that the AD agents can get good performance by, when the previous episode ends with reward 1, navigating to the position that the previous episode ended in. (Remember, the goal position doesn't change from episode to episode -- these "tasks" are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it's not getting reward.

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

  • In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contrast, you seem to be primarily considering settings, like language models trained on the internet, in which rewards are not provided (or at least, only provided implicitly in the sense that "here's some Nobel prize winning research: [description of research]" implicitly tells the model that the given research is the type of research that good researchers produce, and thus kinda acts as a reward label). My assumption trivializes the problem of making a quantilizer (since we can just condition on reward in the top 5% of previously observed rewards). But your assumption might be more realistic, in that the generative models we try to get superhuman performance out of won't be trained on data that includes rewards, unless we intentionally produce such data sets.
  • My post focuses on a particular technique for improving capabilities from baseline called online generative modeling; in this scheme, after pretraining, the generative model starts an online training phase in which episodes that it outputs are fed back into the generative model as new inputs. Over time, this will cause the distribution of previously-observed rewards to shift upwards, and with it the target quantile. Note that if the ideas you lay out here for turning a generative model into a quantilizer work, then you can stack online generative modeling on top. Why would you do this? It seems like you're worried that your techniques can safely produce the research of a pretty good biologist but not of the world's best biologist on their best day. One way around this is to just ask your generative model to produce the research of a pretty good biologist, but use the online generative modeling trick to let its expectation of what pretty good biology research looks like drift up over time. Would this be safer? I don't know, but it's at least another option.
Load More