[AN #157]: Measuring misalignment in the technology underlying Copilot

by Rohin Shah (Alignment Newsletter), 23rd Jul 2021

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Evaluating Large Language Models Trained on Code (Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan et al) (summarized by Rohin): You’ve probably heard of GitHub Copilot, the programming assistant tool that can provide suggestions while you are writing code. This paper evaluates Codex, a precursor to the model underlying Copilot. There’s a lot of content here; I’m only summarizing what I see as the highlights.

The core ingredient for Codex was the many, many public repositories on GitHub, which provided hundreds of millions of lines of training data. With such a large dataset, the authors were able to get good performance by training a model completely from scratch, though in practice they finetuned an existing pretrained GPT model as it converged faster while providing similar performance.

Their primary tool for evaluation is HumanEval, a collection of 164 hand-constructed Python programming problems where the model is provided with a docstring explaining what the program should do along with some unit tests, and the model must produce a correct implementation of the resulting function. Problems are not all equally difficult; an easier problem asks Codex to “increment all numbers in a list by 1” while a harder one provides a function that encodes a string of text using a transposition cipher and asks Codex to write the corresponding decryption function.

To improve performance even further, they collect a sanitized finetuning dataset of problems formatted similarly to those in HumanEval and train Codex to perform well on such problems. These models are called Codex-S. With this, we see the following results:

1. Pretrained GPT models get roughly 0%.

2. The largest 12B Codex-S model succeeds on the first try 29% of the time. (A Codex model of the same size only gets roughly 22%.)

3. There is a consistent scaling law for reduction in loss. This translates into a less consistent graph for performance on the HumanEval dataset, where once the model starts to solve at least (say) 5% of the tasks, there is a roughly linear increase in the probability of success when doubling the size of the model.

4. If instead we generate 100 samples and check whether they pass the unit tests to select the best one, then Codex-S gets 78%. If we still generate 100 samples but select the sample that has the highest mean log probability (perhaps because we don’t have an exhaustive suite of unit tests), then we get 45%.
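
As a rough illustration of the two selection strategies in item 4, here is a minimal Python sketch. It is not the paper's evaluation harness: the candidate programs, the function name `incr_list`, and the per-token log probabilities are all made up for illustration, and real model output should only ever be executed inside a sandbox.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A candidate is (source code, per-token log probabilities).
# Both are hypothetical stand-ins for what a code model would return.
Candidate = Tuple[str, Sequence[float]]


def passes_tests(source: str, tests: Callable[[dict], None]) -> bool:
    """Define the candidate in a fresh namespace and run the unit tests on it."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted toy input; sandbox real samples
        tests(namespace)
        return True
    except Exception:
        return False


def select_by_tests(candidates: List[Candidate],
                    tests: Callable[[dict], None]) -> Optional[str]:
    """Return the first sample that passes the unit tests, if any."""
    for source, _ in candidates:
        if passes_tests(source, tests):
            return source
    return None


def select_by_mean_logprob(candidates: List[Candidate]) -> str:
    """Return the sample with the highest mean per-token log probability."""
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]


def incr_tests(namespace: dict) -> None:
    # Unit tests for the "increment all numbers in a list by 1" problem.
    assert namespace["incr_list"]([1, 2, 3]) == [2, 3, 4]
    assert namespace["incr_list"]([]) == []


candidates: List[Candidate] = [
    # Wrong, but given a higher (made-up) mean log probability.
    ("def incr_list(xs):\n    return [x + 2 for x in xs]\n", [-0.1, -0.2, -0.1]),
    # Correct, but given a lower (made-up) mean log probability.
    ("def incr_list(xs):\n    return [x + 1 for x in xs]\n", [-0.3, -0.4, -0.2]),
]

print(select_by_tests(candidates, incr_tests))
print(select_by_mean_logprob(candidates))
```

In this toy setup the two strategies disagree: filtering by unit tests finds the correct sample, while ranking by mean log probability picks the incorrect one. That is one way the gap between the 78% and 45% figures could arise, though the sketch says nothing about how often it happens in practice.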

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

Since Codex is trained primarily to predict the next token, it has likely learned that buggy code should be followed by more buggy code, that insecure code should be followed by more insecure code, and so on. This suggests that if the user accidentally provides examples with subtle bugs, then the model will continue to create buggy code, even though the user would want correct code. They find that exactly this effect occurs, and that the divergence between good and bad performance increases as the model size increases (presumably because larger models are better able to pick up on the correlation between previous buggy code and future buggy code).

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

TECHNICAL AI ALIGNMENT


TECHNICAL AGENDAS AND PRIORITIZATION

Measurement, Optimization, and Take-off Speed (Jacob Steinhardt) (summarized by Sudhanshu): In this blogpost, the author argues that "trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning". He motivates the value of measurement and additional metrics by (i) citing evidence from the history of science, policy-making, and engineering (e.g. x-ray crystallography contributed to rapid progress in molecular biology), (ii) describing how, conceptually, "measurement has several valuable properties" (one of which is to act as interlocking constraints that help to error-check theories), and (iii) providing anecdotes from his own research endeavours where such approaches have been productive and useful (see, e.g. Rethinking Bias-Variance Trade-off (AN #129)).

He demonstrates his proposal by applying it to the notion of optimization power -- an important idea that has not been measured or even framed in terms of metrics. Two metrics are offered: (a) Outer Optimization, the change (typically deterioration) in performance, measured with respect to the original objective function, when a system is trained with a perturbed objective function, and (b) Inner Adaptation, the change in performance of agents during their own lifetime (but without any further parameter updates), such as the log-loss on the next sentence for a language model after it sees X number of sequences at test time. Inspired by these, the article includes research questions and possible challenges.

He concludes with the insight that take-off would depend on these two continuous processes, Outer Optimization and Inner Adaptation, that work on very different time-scales, with the former being, at this time, much quicker than the latter. However, drawing an analogy from evolution, where it took billions of years of optimization to generate creatures like humans that were exceptional at rapid adaptation, we might yet see a fast take-off where Inner Adaptation turns out to be an exponential process that dominates capabilities progress. He advocates for early, sensitive measurement of this quantity as it might be an early warning sign of imminent risks.

Sudhanshu's opinion: Early on, this post reminded me of Twenty Billion Questions; even though they are concretely different, these two pieces share a conceptual thread. They both consider the measurement of multiple quantities essential for solving their problems: 20BQ for encouraging AIs to be low-impact, and this post for productive framings of ill-defined concepts and as a heads-up about potential catastrophes.

Measurement is important, and this article poignantly argues why and illustrates how. It volunteers potential ideas that can be worked on today by mainstream ML researchers, and offers up a powerful toolkit to improve one's own quality of analysis. It would be great to see more examples of this technique applied to other contentious, fuzzy concepts in ML and beyond. I'll quickly note that while there seems to be minimal interest in this from academia, measurement of optimization power has been discussed earlier in several ways, e.g. Measuring Optimization Power, or the ground of optimization (AN #105).

Rohin's opinion: I broadly agree with the perspective in this post. I feel especially optimistic about the prospects of measurement for (a) checking whether our theoretical arguments hold in practice and (b) convincing others of our positions (assuming that the arguments do hold in practice).

FORECASTING

Fractional progress estimates for AI timelines and implied resource requirements (Mark Xu et al) (summarized by Rohin): One methodology for forecasting AI timelines is to ask experts how much progress they have made toward human-level AI within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being 5% of a problem being solved in the twenty-year period between 1992 and 2012. Overall these estimates imply a timeline of 372 years.

This post provides a reductio argument against this pair of methodology and estimate. The core argument is that if you linearly extrapolate, then you are effectively saying “assume that business continues as usual: then how long does it take?” But “business as usual” in the case of the last 20 years involves an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we’ll get to human-level AI after a 1000^{372/20} ≈ 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in price and growth of GDP, and get 10^53.)
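
The back-of-the-envelope arithmetic can be checked in a few lines of Python. This only reproduces the simple 1000^{372/20} figure from the summary, using the numbers quoted above; it does not attempt the authors' more careful 10^53 calculation, and the variable names are my own.

```python
import math

years_per_window = 20             # the 1992-2012 period the experts were asked about
compute_growth_per_window = 1000  # rough growth in compute used over one such period
implied_timeline_years = 372      # timeline implied by linearly extrapolating the estimates


def linear_timeline(fraction_solved: float, years_elapsed: float) -> float:
    """Years needed to solve 100% if progress continues at the same linear rate."""
    return years_elapsed / fraction_solved


# A "typical" estimate of 5% solved in 20 years extrapolates to about 400 years.
typical_timeline = linear_timeline(0.05, years_per_window)

# Number of 20-year windows in the implied timeline, each multiplying compute by ~1000.
windows = implied_timeline_years / years_per_window
log10_compute_increase = windows * math.log10(compute_growth_per_window)

print(f"typical linear timeline: about {typical_timeline:.0f} years")
print(f"windows of business as usual: {windows:.1f}")
print(f"implied compute increase: about 10^{log10_compute_increase:.0f}")
```

The exponent works out to 18.6 windows times 3 orders of magnitude each, roughly 55.8, which rounds to the 10^56 quoted above.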

This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from Landauer’s principle).

Given that evolution did produce intelligence (us), we should reject the argument. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.
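
To make the "more like an exponential" interpretation concrete, here is a small Python sketch under an assumed functional form that does not come from the post: the fraction solved after t input-doubling periods is p(t) = (2^t - 1) / (2^D - 1), where D is the (hypothetical) total number of doublings needed to reach 100%.

```python
import math


def time_to_fraction(fraction: float, total_doublings: float) -> float:
    """Time, in input-doubling periods, to reach `fraction` of the problem.

    Assumes p(t) = (2**t - 1) / (2**D - 1), an illustrative form only.
    """
    return math.log2(1 + fraction * (2 ** total_doublings - 1))


D = 10  # hypothetical number of doublings needed to solve the whole problem

t_first_10 = time_to_fraction(0.10, D)  # time spent on the first 10%
t_last_90 = D - t_first_10              # time spent on the remaining 90%
ratio = t_last_90 / t_first_10

print(f"first 10%: {t_first_10:.2f} doubling periods")
print(f"last 90%:  {t_last_90:.2f} doubling periods")
print(f"ratio (linear extrapolation would predict 9): {ratio:.2f}")
```

Under this assumed curve the last 90% takes less time than the first 10%, far from the 9x that linear extrapolation predicts. The specific numbers depend entirely on the chosen form and on D, so they should be read as an illustration of the qualitative point rather than as a forecast.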

Rohin's opinion: This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also last week’s highlight on the scaling hypothesis (AN #156).)

MISCELLANEOUS (ALIGNMENT)

Progress on Causal Influence Diagrams (Tom Everitt et al) (summarized by Rohin): Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the wrong incentives. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are graphical models (AN #49) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.

2. We can avoid reward tampering (AN #71) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its current reward function.

3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.
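
As a rough sketch of what "talking about incentives" on such a diagram can look like, the Python below represents a diagram as a directed graph and checks whether a node lies on a directed path from an action node to a utility node. Treating that path check as a test for an incentive to control the node is my reading of a graphical criterion from the CID literature, not something stated in the post, and the toy graph and node names are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node -> list of its children


def descendants(graph: Graph, start: str) -> Set[str]:
    """All nodes reachable from `start` by following directed edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def on_path_to_utility(graph: Graph, action: str, node: str, utility: str) -> bool:
    """True if there is a directed path action -> ... -> node -> ... -> utility."""
    return node in descendants(graph, action) and utility in descendants(graph, node)


# Toy diagram: the action can affect both the world state and the reward function,
# and both feed into the utility node.
tampering = {
    "action": ["state", "reward_function"],
    "state": ["utility"],
    "reward_function": ["utility"],
}

# Same diagram with the edge reward_function -> utility removed, loosely mirroring
# item 2: outcomes are evaluated with the current reward function instead.
no_tampering = {
    "action": ["state", "reward_function"],
    "state": ["utility"],
    "reward_function": [],
}

print(on_path_to_utility(tampering, "action", "reward_function", "utility"))
print(on_path_to_utility(no_tampering, "action", "reward_function", "utility"))
print(on_path_to_utility(no_tampering, "action", "state", "utility"))
```

In the first graph the reward function sits on a path from action to utility, and in the second it does not, while the path through the world state is preserved in both.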

AI GOVERNANCE

A personal take on longtermist AI governance (Luke Muehlhauser) (summarized by Rohin): We’ve previously seen (AN #130) that Open Philanthropy struggles to find intermediate goals in AI governance that seem robustly good to pursue from a longtermist perspective. (If you aren’t familiar with longtermism, you probably want to skip to the next summary.) In this personal post, the author suggests that there are three key bottlenecks driving this:

1. There are very few longtermists in the world; those that do exist often don’t have the specific interests, skills, and experience needed for AI governance work. We could try to get others to work on relevant problems, but:

2. We don’t have the strategic clarity and forecasting ability to know which intermediate goals are important (or even net positive). Maybe we could get people to help us figure out the strategic picture? Unfortunately:

3. It's difficult to define and scope research projects that can help clarify which intermediate goals are worth pursuing when done by people who are not themselves thinking about the issues from a longtermist perspective.

Given these bottlenecks, the author offers the following career advice for those who hope to do work from a longtermist perspective in AI governance:

1. Career decisions should be especially influenced by the value of experimentation, learning, aptitude development, and career capital.

2. Prioritize future impact, for example by building credentials to influence a 1-20 year “crunch time” period. (But make sure to keep studying and thinking about how to create that future impact.)

3. Work on building the field, especially with an eye to reducing bottleneck #1. (See e.g. here.)

4. Try to reduce bottleneck #2 by doing research that increases strategic clarity, though note that many people have tried this and it doesn’t seem like the situation has improved very much.

NEWS

Open Philanthropy Technology Policy Fellowship (Luke Muehlhauser) (summarized by Rohin): Open Philanthropy is seeking applicants for a US policy fellowship program focused on high-priority emerging technologies, especially AI and biotechnology. Application deadline is September 15.

Read more: EA Forum post

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.

9 comments

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with EleutherAI, I'm pushing them to post on the AF) that GPT-3 seems to work more like a simulator of language-producing processes (for lack of a better word), than as an agent trying to predict the next token.

Like what you write here:

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem. In that sense, GPT-3 fails the "chatbot task": for a lot of the things it's great at doing, you have to handcraft (or constrain) the prompts to make it do them -- it won't find out precisely what you mean.

Or to put it differently: people who are good at making GPT-3 do what they want have learned to not use it like a smart agent figuring out what you really mean, but more like a "prompt continuation engine". You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture, and I have the gut feeling that being agenty makes it harder to do that task (as you need a very un-goal-like goal).

(I think this points to what you mention in that comment, about approval-directedness being significantly less goal-directed: if GPT-3 is agenty, it looks quite a lot like a sort of approval-directed agent.)

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture,

Agreed, which is why I didn't say anything like that?

Sorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyone, but I'm sort of implying that it's the right framing to talk about it. In this case, the "motivated" part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don't think is right (and apparently you agree).

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency for different people.)

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

Agreed with you there.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

True, but I don't feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

First, I think I interpreted "misalignment" here to mean "inner misalignment", hence my answer. I also agree that all examples in Victoria's doc are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff. 

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They're things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex looks for the most probable next token based on all the code.

But there doesn't seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the "Chatbot task" I described above, which is what I gather you mean by "solving my problem".

(By the way, I have an old post about formulating this task that we want GPT-3 to solve. It was written before I actually studied GPT-3, but it holds up decently well, I think. I also did some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)

Sure, but don't you agree that it's a very confusing use of the term?

Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency for different people.)

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ 

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

 Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do all the things we don't want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ 

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Agreed that there's not much difference when predicting GPT-3. But that's because we're at the place in the scaling where Gwern (AFAIK) describes the LM as an agent that is very good at predicting. By definition it will not do anything different from a simulator, since its "goal" literally encodes all of its behavior.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems.

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Fair enough.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you're probably right.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

My message was really about Rohin's phrasing, since I usually don't read the papers in detail if I think the summary is good enough.

Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" make it quite unambiguous.

I also like this paragraph in the appendix:

However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.

Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term.

One thought I just had: this looks more like a form of proxy alignment to what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying a cost of proxy alignment to avoid deceptive alignment?