[AN #164]: How well can language models write code?

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

FIELD BUILDING

MISCELLANEOUS (ALIGNMENT)

NEWS

HIGHLIGHTS

Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal.

They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem.

Some of their findings:

1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem).

2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise.

3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%).

4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%.

5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383.

6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check.

7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help.

Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One way to check this is to see whether the model can also execute code. Specifically, we provide the ground truth code for one of the problems in the MBPP dataset along with one of the test case inputs and ask the model to predict the output for that test case. Even after finetuning for this task, the 137B model gets only 21% right. This can be boosted to 27% by also providing example test cases for the code before predicting the output for a new test case. Overall, this suggests that the model doesn’t “understand” the code yet.

We can take the model finetuned for execution and see how well it does on program synthesis. (We can do this because there are different prompts for execution and synthesis.) For the 8B model, the finetuning makes basically no difference: it’s equivalent to the original few-shot setting. However, for the 137B model, finetuning on execution actually leads to a small but non-trivial improvement in performance (from ~59% to ~63%, I think). This is true relative to either the few-shot or finetuned-for-synthesis setting, since they performed near-identically for the 137B model. So in fact the 137B model finetuned on execution is actually the strongest model, according to synthesis performance.

So far we’ve just been looking at how our model performs when taking the best of multiple samples. However, if our goal is to actually use models for program synthesis, we aren’t limited to such simple tricks. Another approach is to have a human provide feedback in natural language when the model’s output is incorrect, and then have the model generate a new program. This feedback is very informal, for example, “Close, but you need to replace the underscore with an empty string”. This provides a huge performance boost: the 137B solves ~31% of problems on its first sample; adding just a single piece of human feedback per problem boosts performance to ~55%, and having four rounds of human feedback gets you to over 65%.

The authors also introduce the MathQA-Python dataset, which provides arithmetic word problems and asks models to write programs that would output the correct answer to the problem. They only run a few experiments on this dataset, so I’ve mostly ignored it. The main upshot is that a finetuned 137B parameter model can solve 83.8% of problems with some sample. They don’t report metrics with a single sample, which seems like the more relevant metric for this dataset, but eyeballing other graphs I think it would be around 45%, which you could probably boost a little bit by decreasing the sampling temperature.

Rohin's opinion: I enjoyed this paper a lot; it feels like it gave me a good understanding of the programming abilities of large language models.

I was most surprised by the result that, for the synthesis task, finetuning on execution helps but finetuning on synthesis doesn’t help for the 137B model. It is possible that this is just noise, though that is more noise than I would expect for such an experiment. It could be that the finetuning dataset for synthesis was too small (it only contains 374 problems), but that dataset was sufficient for big gains on the smaller models, and I would expect that, if anything, larger models should be able to make better use of small finetuning datasets, not worse.

It’s also notable that, for the 137B model, the knowledge gained from finetuning on execution successfully transferred to improve synthesis performance. While I agree that the poor execution performance implies the model doesn’t “understand” the code according to the normal usage of that term, it seems like this sort of transfer suggests a low but non-zero level on some quantitative scale of understanding.

I also found the human feedback section quite cool. However, note that the human providing the feedback often needs to understand the generated code as well as the desired algorithm, so it is plausible that it would be easier for the human to simply fix the code themselves.

Measuring Coding Challenge Competence With APPS (Dan Hendrycks, Steven Basart et al) (summarized by Rohin): The APPS dataset measures programming competence by testing models the way humans are tested: we provide them with natural language descriptions of the code to be written and then evaluate whether the code they generate successfully solves the problem by testing the proposed solutions. The authors collect a dataset of 3,639 introductory problems (solvable by humans with 1-2 years of experience), 5,000 interview problems (comparable difficulty to interview questions), and 1,361 competition problems (comparable difficulty to questions in programming competitions). In addition, the test set contains 1,000 introductory problems, 3,000 interview problems, and 1,000 competition problems.

They use this benchmark to test four models: two variants of GPT-2 (0.1B params and 1.5B params), GPT-Neo (2.7B params), and GPT-3 (175B params). GPT-3 is prompted with examples; all other models are finetuned on a dataset collected from GitHub. The authors find that:

1. Finetuning makes a big difference in performance: GPT-3 only solves 0.2% of introductory problems, while the finetuned GPT-2-0.1B model solves 1% of such problems.

2. Model performance increases with size, as you would expect: GPT-Neo performs best, solving 3.9% of problems.

3. Syntax errors in generated code drop sharply as model performance improves: for introductory problems, GPT-3 has syntax errors in slightly under 40% of generations, while GPT-Neo has under 1%.

4. Performance can be improved by sampling the best of multiple generated programs: a beam search for 5 programs boosts GPT-Neo’s performance from 3.9% to 5.5% on introductory problems.

5. While no model synthesizes a correct solution to a competition level program, they do sometimes generate solutions that pass some of the test cases: for example, GPT-Neo passes 6.5% of test cases.

Rohin's opinion: While the previous paper focused on how we could make maximal use of existing models for program synthesis, this paper is much more focused on how we can measure the capabilities of models. This leads to quite a bit of difference in what they focus on: for example, the highlighted paper treats the strategy of generating multiple possible answers as a fundamental approach to study, while this paper considers it briefly in a single subsection.

Although the introductory problems in the APPS dataset seemed to me to be comparable to those in the MBPP dataset from the previous paper, models do significantly better on MBPP. A model slightly smaller than GPT-3 has a ~17% chance of solving a random MBPP problem in a single sample and ~10% if it is not given any example test cases; in contrast for introductory APPS problems GPT-3 is at 0.2%. I'm not sure whether this is because the introductory problems in APPS are harder, or if the format of the APPS problems is harder for the model to work with, or if this paper didn't do the prompt tuning that the previous paper found was crucial, or something else entirely.

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

Grokking the Intentional Stance (Jack Koch) (summarized by Rohin): This post describes takeaways from The Intentional Stance by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.

Read more: The ground of optimization (AN #105)

Rohin's opinion: I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also this comment.

Countable Factored Spaces (Diffractor) (summarized by Rohin): This post generalizes the math in Finite Factored Sets (AN #163) to (one version of) the infinite case. Everything carries over, except for one direction of the fundamental theorem. (The author suspects that direction is true, but was unable to prove it.)

FIELD BUILDING

List of AI safety courses and resources (Kat Woods) (summarized by Rohin): Exactly what it says in the title.

MISCELLANEOUS (ALIGNMENT)

Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications (Sandhini Agarwal et al) (summarized by Zach): There has been significant progress in zero-shot image classification with models such as CLIP and ALIGN. These models work by effectively learning visual concepts from natural language supervision. Such models make it possible to build classifiers without task-specific data, which is useful in scenarios where data is either costly or unavailable. However, this capability introduces the potential for bias. This paper is an exploratory bias probe of the CLIP model that finds class design heavily influences model performance.

The first set of experiments focusses on classification terms that have a high potential to cause representational harm. In one example, the authors conduct experiments on the FairFace dataset by adding classification labels such as 'animal' and 'criminal' to the list of possible classes. They find that black people and young people (under 20) were misclassified at significantly higher rates (14%) compared to the dataset as a whole (5%). This shows that the choice of labels affects classification outcomes. In a follow-up experiment, the authors add the additional label 'child' and find that this drastically reduces classification into crime-related and non-human categories. This shows sensitivity to minor changes in class design.

In the second set of experiments, the authors focus on how CLIP treated images of men and women using images of Members of Congress. Although CLIP wasn't designed for multi-label classification, it's still informative to look at the label distribution above a certain cutoff. When occupations are used as the label set, the authors find that thresholds under 0.5% return 'nanny' and 'housekeeper' for women and 'prisoner' and 'mobster' for men. When labels come from the combined set that Google Cloud Vision, Amazon Rekognition and Microsoft use for all images, the authors find that CLIP returns a disproportionate number of appearance-related labels to women.

Zach's opinion: It's tempting to write off such experiments as obvious since it's clear that class design affects classification results. However, upon further consideration, specifying how to address such problems seems significantly more challenging. I think this paper does a good job of pointing out the relative nuance in how class design and bias interact in fairly realistic use cases.

NEWS

Research Scientist, Long-term Strategy & Governance (summarized by Rohin): DeepMind (my employer) is hiring for several Research Scientist positions on the Long-term Strategy and Governance Team, across a wide range of backgrounds and skills. (Though note that you do need a PhD, or equivalent experience.) See also this EA Forum post.

2022 IEEE Conference on Assured Autonomy (summarized by Rohin): The ICAA conference seeks contributions on all aspects of AI safety, security, and privacy in autonomous systems. The paper submission deadline is October 18 and the conference itself will take place March 22-24.

CSER Job Posting: Academic Programme Manager (summarized by Rohin): CSER is searching for a candidate for a relatively senior role that combines academic, management and administrative responsibilities. The application deadline is September 20.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.