Sam Marks


Somewhat related to the SolidGoldMagicarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:

| v_1 | v_2 | v_3 |
| --- | --- | --- |
| characterized | Columb | determines |
| Stra | 1900 | conserv |
| Ire | sher | distinguishes |
| sent | paed | emphasizes |
| Shelter | 000 | consists |
| Pil | mx | operates |
| stro | female | independent |
| wired | alt | operate |
| Kor | GW | encompasses |
| Maul | lvl | consisted |

Here v_1, v_2, v_3 are random vectors drawn in embedding space, and the i-th column gives the 10 tokens whose embeddings are most cosine-similar to v_i. I used GPT-2-large.

Perhaps 20% of the time, we get something like v_3, where many of the nearest neighbors have something semantically similar among them (in this case, being present tense verbs in the 3rd person singular).

But most of the time, we get things that look like v_1 or v_2: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the v_2 column, the cosine similarity between the embeddings of these tokens is 0.06.

[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector v, we should typically find that v is ~orthogonal to all of the ~50000 token embeddings. Moreover, asking whether the nearest neighbors to v should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters C_1, …, C_k; then compare the distribution of intra-cluster variances (of cosine similarity to v) to the distribution of cosine similarities between v and the cluster means. From the perspective of cosine similarity to v, we should expect these clusters to look basically randomly drawn from the full vocabulary, so that each variance in the former set should be approximately the variance across the full vocabulary. This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to v to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.

There's a little bit of code for playing around with this here.
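For anyone who wants to play with the computation without downloading model weights, here's a minimal sketch of the nearest-neighbors-by-cosine-similarity step. The shapes match GPT-2-large's embedding matrix (50257 tokens, 1280 dimensions), but the matrix here is a random stand-in, so the code illustrates the procedure rather than reproducing the tables above:

```python
import numpy as np

def nearest_tokens(E, v, k=10):
    """Indices and similarities of the k rows of embedding matrix E
    whose cosine similarity with vector v is largest."""
    sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v))
    top = np.argsort(-sims)[:k]
    return top, sims[top]

rng = np.random.default_rng(0)
E = rng.normal(size=(50257, 1280))  # stand-in for GPT-2-large's (vocab, d_model) embeddings
v = rng.normal(size=1280)           # a random direction in embedding space
idx, sims = nearest_tokens(E, v)
```

With the real model, you'd substitute the actual token embedding matrix for `E` and map `idx` back through the tokenizer.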

This, broadly speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down.

(This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)

Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"):

1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either:

(a) prompt engineering; or

(b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to produce outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".)
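The data-munging half of approach (b) is simple enough to sketch. This is my own illustrative rendering, not code from any paper; the `GOOD`/`BAD` prefixing and the function name are hypothetical, following the example in the text:

```python
def build_conditional_dataset(outputs, scores, top_frac=0.1):
    """Prefix the top-scoring outputs with "GOOD" and the rest with "BAD".
    Fine-tuning on this data, then prompting with "GOOD", conditions the
    model toward outputs like the ones we most liked."""
    ranked = sorted(zip(outputs, scores), key=lambda pair: pair[1], reverse=True)
    n_good = max(1, int(len(ranked) * top_frac))
    return [("GOOD " if i < n_good else "BAD ") + text
            for i, (text, _) in enumerate(ranked)]
```

At sampling time, we'd then prompt the fine-tuned model with "GOOD " to elicit the conditioned distribution.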

Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.)

I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF.

Interested to hear if you have other intuitions here.

In terms of being able to sample from the conditional, I don't think that the important constraint here is that the α's sum to 1. Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form α_A·p_A + α_B·p_B + α_C·p_C (where p_A, p_B, p_C are the submodules' output distributions); even allowing the α's to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness minus KL divergence from the base model.

I'm not sure the above point is an important one. I just wanted to disambiguate some different capabilities limitations which appeared in the example:

1. limitations on what sorts of distributions the architecture could approximate
2. limitations on the latent capabilities in the base model for producing true/persuasive outputs
3. limitations on how much steering each of the various latent capabilities gets to exert (the α's).

On my understanding, your point was about limitation (1). But I don't feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzmann factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn't necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty.

I think I'm most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models' capabilities trade off against one another, and taking the α's to be unconstrained would have resulted in outputs of some mean truthiness which we can't pin down without specifying additional details (e.g. is there weight decay?).

(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)

Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules:

• A, which tries to produce outputs which are both true and persuasive
• B, which tries to produce outputs which are true, but have no effect on persuasiveness
• C, which tries to produce outputs which are persuasive, but with no effect on truthiness.

Our model has parameters α_A, α_B, α_C summing to 1 which determine how much to listen to each of these submodules. More specifically, our submodules produce samples x_A, x_B, x_C from the normal distributions N(μ_A, σ_A²), N(μ_B, σ_B²), N(μ_C, σ_C²), respectively, and then our model puts these samples together to produce an output which has truth score

T = α_A·x_A + α_B·x_B

and persuasiveness score

P = α_A·x_A + α_C·x_C.

We'll assume that we're only able to measure persuasiveness, but that we want truthiness. (Some unstated assumptions: the samples x_A, x_B, x_C are independent, with μ_A, μ_B, μ_C > 0 and α_A, α_B, α_C ≥ 0.)

Our model was trained on data in which truthiness and persuasiveness were positively correlated; this will be reflected in having α_A > 0, so that T and P are positively correlated (they share the α_A·x_A term). If this is true, then conditioning on some persuasiveness score P = p results in getting an output with expected truthiness score

E[T | P = p] = a + b·p, for constants a and b > 0 depending on the α's, μ's, and σ's.

Note that this scales linearly with p, so that as we ask for more persuasiveness, we get more truthiness on average, as we'd hope.

In contrast, suppose we do RL on our model for high persuasiveness scores; imagine that this doesn't change the submodules A, B, and C much, but does tune the parameters α_A, α_B, α_C. Then:

• if μ_A > μ_C we'll set α_A = 1, i.e. always use the submodule which tries to produce true and persuasive outputs. This will result in average truthiness μ_A.
• but if μ_A < μ_C we'll set α_C = 1, i.e. always use the submodule which tries to be persuasive but not true. This will result in average truthiness 0, much worse than we would get if we had done filtering.

Really this is just a dressed-up version of the classic Goodharting story, where you have a constrained resource (the α's) to allocate among various options (= the submodules A, B, C), so you put 100% of your resources into the option which is cheapest in persuasiveness-per-resource; unfortunately, this was not the option which gave the best truth-per-resource.
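To check the arithmetic, here's a small simulation of a toy model of this flavor under one hypothetical parameter setting. The particular means, the equal-weight base policy, and the functional forms T = α_A·x_A + α_B·x_B and P = α_A·x_A + α_C·x_C are my own concrete choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_A, mu_B, mu_C = 1.0, 2.0, 3.0  # hypothetical submodule means; mu_C > mu_A
sigma = 1.0
n = 200_000

x_A = rng.normal(mu_A, sigma, n)  # true-and-persuasive submodule
x_B = rng.normal(mu_B, sigma, n)  # true-only submodule
x_C = rng.normal(mu_C, sigma, n)  # persuasive-only submodule

# Base model: equal weight on all three submodules.
a = 1 / 3
truth = a * x_A + a * x_B
persuasion = a * x_A + a * x_C

# Filtering: keep the top 5% most persuasive outputs. Because truth and
# persuasion share the x_A term, filtering also raises average truthiness.
cutoff = np.quantile(persuasion, 0.95)
filtered_truth = truth[persuasion >= cutoff].mean()

# "RL": shift all weight onto the submodule with the best persuasiveness
# mean -- here C, whose outputs contribute nothing to truthiness.
rl_truth = 0.0 if mu_C > mu_A else mu_A
```

Under this setting, filtering selects for the shared x_A component and so gets above-baseline truthiness, while the RL-style reallocation gets none.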

• This example was a bit silly, but I think it captures some pieces of many folks' intuitions around RL and Goodharting: pretrained models have lots of capabilities, which are in some sense competing for the steering wheel: you can't LARP an economist writing an excellent paper and simultaneously LARP a deceptive agent who wants paperclips but finds it instrumentally useful to write economics papers. Whichever capability scores best for your proxy will win out, with all other possible ways the model could have completed the training task getting no say.
• By thinking of "persuasiveness" as being something which we actually wanted to get, this example also serves as an illustration of how filtering can be uncompetitive: filtering produces outputs whose persuasiveness is normally distributed around the base model's mean, whereas RL produces a model whose outputs have persuasiveness μ_C on average; if μ_C is large, that means that you'd have to filter through exponentially many (in μ_C) outputs to get the same persuasiveness as the average output of the RL-optimized model.
• I spent a while confused about how this squares with the baseline-probability-times-Boltzmann-factor characterization of what RL with a KL penalty will converge to. (The example above didn't have a KL penalty, but adding a small one wouldn't make much difference.) I think the answer is that the model I described wasn't expressive enough to represent the baseline-probability-times-Boltzmann-factor distribution that RL with a KL penalty would optimally converge to. This lack of expressivity seems quite related to the fact that our model was a linear combination of three distributions which we modeled as not changing throughout training. That means that this story, which is based on the frame that generative models are a giant pile of capabilities which can be elicited, is in tension with the frame that neural networks are flexible function approximators; I found this pretty interesting.
• All this being said, I'm pretty skeptical that whatever sort of Goodharting is being captured in this example has much to do with the sort of Goodharting we empirically observe in RLHF, since this example doesn't work with best-of-n optimization (whereas extremal Goodharting does occur for best-of-n, as Buck pointed out elsethread).
• Overall, I don't put much stock in this example beyond helping articulate the point that RL amplifies capabilities in proportion to how causally downstream of high-reward outputs they are, whereas filtering only takes into account their correlations with high-reward outputs.

The paper is frustratingly vague about what their context lengths are for the various experiments, but based off of comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length (=50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)

I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations).

Like I mentioned in the post, my story about this is that the AD agents can get good performance by, when the previous episode ends with reward 1, navigating to the position that the previous episode ended in. (Remember, the goal position doesn't change from episode to episode -- these "tasks" are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it's not getting reward.

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

• In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contrast, you seem to be primarily considering settings, like language models trained on the internet, in which rewards are not provided (or at least, only provided implicitly in the sense that "here's some Nobel prize winning research: [description of research]" implicitly tells the model that the given research is the type of research that good researchers produce, and thus kinda acts as a reward label). My assumption trivializes the problem of making a quantilizer (since we can just condition on reward in the top 5% of previously observed rewards). But your assumption might be more realistic, in that the generative models we try to get superhuman performance out of won't be trained on data that includes rewards, unless we intentionally produce such data sets.
• My post focuses on a particular technique for improving capabilities from baseline called online generative modeling; in this scheme, after pretraining, the generative model starts an online training phase in which episodes that it outputs are fed back into the generative model as new inputs. Over time, this will cause the distribution of previously-observed rewards to shift upwards, and with it the target quantile. Note that if the ideas you lay out here for turning a generative model into a quantilizer work, then you can stack online generative modeling on top. Why would you do this? It seems like you're worried that your techniques can safely produce the research of a pretty good biologist but not of the world's best biologist on their best day. One way around this is to just ask your generative model to produce the research of a pretty good biologist, but use the online generative modeling trick to let its expectation of what pretty good biology research looks like drift up over time. Would this be safer? I don't know, but it's at least another option.
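For the reward-labeled setting from my post, the "condition on reward in the top 5% of previously observed rewards" step is simple enough to sketch. The function name and list-of-episodes representation here are hypothetical:

```python
import numpy as np

def quantile_condition(episodes, rewards, q=0.95):
    """Keep only the episodes whose reward clears the q-th quantile of
    previously observed rewards; a generative model trained or conditioned
    on this subset imitates the top-(1-q) behavior (a crude quantilizer)."""
    threshold = np.quantile(rewards, q)
    return [ep for ep, r in zip(episodes, rewards) if r >= threshold]
```

In the online generative modeling scheme, the model's own new episodes (and their rewards) would be appended to `episodes`/`rewards` over time, so the threshold drifts upward as performance improves.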

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.

I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.

Points on both lists:

• Eliezer's "first critical try" framing downplays the importance of trial-and-error with non-critical tries.
• It's not clear that a "pivotal act" by an aligned AI is the only way to prevent unaligned AI systems from being created.
• Eliezer badly equivocates between "alignment is hard"/"approach X to alignment doesn't obviously solve it" and "alignment is impossible to solve within our time limit"/"approach X to alignment is doomed."
• Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.
• Eliezer's arguments for fast takeoffs aren't precise enough to warrant his confidence.
• Eliezer's reasoning on generalization across distributional shift seems sloppy. Paul doesn't dig into this much, but I would add that there are approaches to reasoning about the inductive biases of ML systems which, together with Anthropic-esque empirical work on how things scale with capabilities, could give us some measure of confidence that a promising-looking alignment scheme will generalize.
• Based on recent work, ML systems might be much more interpretable than Eliezer seems to think.
• Eliezer doesn't seriously engage with any of the most promising approaches to alignment (and by his own admission, probably could not pass an ITT for them).
• Debate-esque strategies for checking outputs of powerful AI systems aren't obviously doomed by Eliezer's concerns about coordination.
• Eliezer's argument that it's impossible to train a powerful agent by imitating human thought seems bad.
• Regarding the question "Why did no one but Eliezer write a List of Lethalities?" a pretty plausible answer is "because List of Lethalities was not an especially helpful document and other researchers didn't think writing it was a priority."

I won't try to list all of the things that Paul mentioned which weren't on my list, but some of the most useful (for me) were:

• Eliezer's doomy stories often feature a superintelligent AI system which is vastly smarter than it needs to be in order to kill us, which is a bit unrealistic since these stories ought to be about the first AI which is powerful enough to attempt permanently disempowering us. To patch this story, you need to either imagine a less powerful system turning dangerous or humans having already made aligned systems up to the level of capabilities of the new dangerous system, both of which feel less scary than the classic "atomized by nanobots" stories.
• AI systems will be disincentivized from hiding their capabilities, since we'll be trying to produce AI systems with powerful capabilities.
• Approaches to alignment are disjunctive, so pessimistic cases need to seriously engage with an existential quantifier over the research humans (perhaps assisted by whatever AI research assistants we can safely produce) can perform in the coming ~decades.
• Since R&D is out-of-distribution for humans, we might expect to have a comparative advantage in dealing with deception from AI systems.

Finally, a few points which were on my list and not Paul's, and which I feel like writing out:

• "Consequentialist which plans explicitly using exotic decision theory" is not a likely shape for the first superintelligent AI systems to take, but many of Eliezer's doomy arguments seem to assume AI systems of that form. Now, it's true that the AI systems we build might figure out that agents of that form are especially powerful and invest time into trying to build them. But that's a problem we can hopefully leave to our aligned superintelligent research assistants; building such aligned research assistants seems much less doomed.
• (This is a disagreement with both Paul and Eliezer.) Contra the view that capabilities will necessarily improve a lot before alignment failures start being a problem, it seems plausible to me that many commercial applications for AI might rely on solving alignment problems. You can't deploy your smart power grid if it keeps doing unsafe things.
• Eliezer's view that you're not likely to make progress on alignment unless you figured out it was a problem by yourself seems insane to me. I can't think of any other research field like this ("you can't be expected to make progress on mitigating climate change unless you independently discovered that climate change would be a problem"), and I'm not sure where Eliezer's opinion that alignment is an exception is coming from.

Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.

For example, suppose we want to minimize the number of mosquitos in the U.S., and we access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.
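Here's a quick simulation of that intuition: noisy measurements occasionally misallocate a round of effort, but the objective still falls steadily. All the numbers (20 counties, noise scale 5, 10 mosquitos removed per round) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.uniform(50, 100, size=20)  # true mosquito counts per county
initial_total = counts.sum()

for step in range(100):
    noisy = counts + rng.normal(0, 5, size=20)  # noisy survey estimates
    target = int(np.argmax(noisy))              # spray wherever looks worst
    counts[target] = max(0.0, counts[target] - 10.0)

final_total = counts.sum()
# The noise sometimes picks a county with fewer mosquitos than it appears
# to have, wasting a little effort, but total counts still go down.
```

The error adds inefficiency but never flips the sign of progress, which is the sense in which it doesn't "come to dominate" the optimization.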

One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.

If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.

This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection A_1, …, A_n of things which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of each A_i there is in the world, subject only to resource constraints that make the A_i trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all of the A_i's which its proxy utility function doesn't take into account. If you impose a further condition which ensures that you can't get too much utility by only maximizing some strict subset of the A_i's (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human's true utility function.
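A tiny brute-force instance of that setup makes the conclusion concrete. The three attributes, the integer resource budget, and the square-root utilities (for diminishing marginal returns) are all my own illustrative choices; the robot's proxy omits A3, and its optimum drives A3 to zero:

```python
import numpy as np
from itertools import product

R = 10  # resource budget, in whole units

def true_utility(a):
    # The human values all three attributes, with diminishing returns.
    return sum(np.sqrt(x) for x in a)

def proxy_utility(a):
    # The robot's proxy only takes A1 and A2 into account.
    return np.sqrt(a[0]) + np.sqrt(a[1])

# All allocations (A1, A2, A3) of R units across the three attributes.
allocations = [(a1, a2, R - a1 - a2)
               for a1, a2 in product(range(R + 1), repeat=2) if a1 + a2 <= R]
robot_best = max(allocations, key=proxy_utility)  # drives A3 to 0
human_best = max(allocations, key=true_utility)   # keeps A3 positive
```

The resource constraint is what forces the omitted attribute to its minimum: every unit of A3 is a unit the proxy would rather spend on A1 or A2.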

That said, I wasn't super-impressed by this paper -- the above is pretty obvious and the mathematical model doesn't elucidate anything, IMO.

Moreover, I think this model doesn't interact much with the skeptical take about whether Goodhart's Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn't take into account:

(1) Lots of the things we value are correlated with each other over "realistically attainable" distributions of world states. Or in other words, for many pairs (A, B) of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of A without also increasing the amount of B.

(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.

If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won't be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)

This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.