[AN #152]: How we’ve overestimated few-shot learning capabilities

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

True Few-Shot Learning with Language Models (Ethan Perez et al) (summarized by Rohin): We can get GPT-3 (AN #102) to perform useful tasks using “prompt programming”, in which we design an input sentence such that the most likely continuation of that sentence would involve GPT-3 performing the task of interest. For example, to have GPT-3 answer questions well, we might say something like “The following is a transcript of a dialogue with a helpful, superintelligent, question-answering system:”, followed by a few example question-answer pairs, after which we ask our questions.

Since the prompts only contain a few examples, this would seem to be an example of strong few-shot learning, in which an AI system can learn how to do a task after seeing a small number of examples of that task. This paper contends that while GPT-3 is capable of such few-shot learning, the results reported in various papers exaggerate this ability. Specifically, while it is true that the prompt only contains a few examples, researchers often tune their choice of prompt by looking at how well it performs on a relatively large validation set -- which of course contains many examples of performing the task, something we wouldn’t expect to have in a true few-shot learning context.

To illustrate the point, the authors conduct several experiments where we start with around 12 possible prompts and must choose which to use based only on the examples given (typically 5). They test two methods for doing so:

1. Cross-validation: Given a prompt without examples, we attach 4 of the examples to the prompt and evaluate it on the last example, and average this over all possible ways of splitting up the examples.

2. Minimum description length: While cross-validation evaluates the final generalization loss on the last example after updating on previous examples, MDL samples an ordering of the examples and then evaluates the average generalization loss as you feed the examples in one-by-one (so more like an online learning setup).

On the LAMA-UHN task, the difference between a random prompt and the best prompt looks to be roughly 5-6 percentage points, regardless of model size. Using MDL or cross-validation usually gives 20-40% of the gain, so 1-2 percentage points. This suggests that on LAMA-UHN, typical prompt-based “few-shot” learning results are likely 3-5 percentage points higher than what you would expect if you were in a true few-shot setting where there is no validation set to tune on.

But it may actually be worse than that. We’ve talked just about the prompt so far, but the validation set can also be used to improve hyperparameters, network architecture, the design of the learning algorithm etc. This could also lead to inflated results. The authors conduct one experiment with ADAPET on SuperGLUE which suggests that using the validation set to select hyperparameters can also lead to multiple percentage points of inflation.

Rohin's opinion: The phenomenon in this paper is pretty broadly applicable to any setting in which a real-world problem is studied in a toy domain where there is extra information available. For example, one of my projects at Berkeley involves using imitation learning on tasks where there really isn’t any reward function available, and it’s quite informative to see just how much it slows you down when you can’t just look at how much reward your final learned policy gets; research becomes much more challenging to do. This suggests that performance on existing imitation learning benchmarks is probably overstating how good we are at imitation learning, because the best models in these situations were probably validated based on the final reward obtained by the policy, which we wouldn’t normally have access to.

TECHNICAL AI ALIGNMENT

TECHNICAL AGENDAS AND PRIORITIZATION

High Impact Careers in Formal Verification: Artificial Intelligence (Quinn Dougherty) (summarized by Rohin): This post considers the applicability of formal verification techniques to AI alignment. Now in order to “verify” a property, you need a specification of that property against which to verify. The author considers three possibilities:

1. Formally specifiable safety: we can write down a specification for safe AI, and we’ll be able to find a computational description or implementation

2. Informally specifiable safety: we can write down a specification for safe AI mathematically or philosophically, but we will not be able to produce a computational version

3. Nonspecifiable safety: we will never write down a specification for safe AI.

Formal verification techniques are applicable only to the first case. Unfortunately, it seems that no one expects the first case to hold in practice: even CHAI, with its mission of building provably beneficial AI systems, is talking about proofs in the informal specification case (which still includes math), on the basis of comments like these in Human Compatible. In addition, it currently seems particularly hard for experts in formal verification to impact actual practice, and there doesn’t seem to be much reason to expect that to change. As a result, the author is relatively pessimistic about formal verification as a route to reducing existential risk from failures of AI alignment.

LEARNING HUMAN INTENT

Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation (Sam Devlin, Raluca Georgescu, Ida Momennejad, Jaroslaw Rzepecki, Evelyn Zuniga et al) (summarized by Rohin): Since rewards are hard to specify, we are likely going to have to train AI agents using human feedback. However, human feedback is particularly expensive to collect, so we would like to at least partially automate this using reward models. This paper looks at one way of building such a reward model: training a classifier to distinguish between human behavior and agent behavior (i.e. to be the judge of a Turing Test). This is similar to the implicit or explicit reward model used in adversarial imitation learning algorithms such as GAIL (AN #17) or AIRL (AN #17).

Should we expect these classifiers to generalize, predicting human judgments of how human-like a trajectory is on all possible trajectories? This paper conducts a user study in order to answer the question: specifically, they have humans judge several of these Turing Tests, and see whether the classifiers agree with the human judgments. They find that while the classifiers do agree with human judgments when comparing a human to an agent (i.e. the setting on which the classifiers were trained), they do not agree with human judgments when comparing two different kinds of artificial agents. In fact, it seems like they are anti-correlated with human judgments, rather than simply having no correlation at all -- only one of the six classifiers tested does better than chance (at 52.5%), the median is 45%, and the worst classifier gets 22.5%. (Note however that the sample size is small, I believe n = 40 though I’m not sure.)

Rohin's opinion: Ultimately my guess is that if you want to predict human judgments well, you need to train against human judgments, rather than the proxy task of distinguishing between human and agent behavior. That being said, I do think these proxy tasks can serve as valuable pretraining objectives, or as auxiliary objectives that help to improve sample efficiency.

FORECASTING

AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors (Daniel Filan and Ajeya Cotra) (summarized by Rohin): This podcast goes over the biological anchors framework (AN #121), as well as three other (AN #105) approaches (AN #145) to forecasting AI timelines and a post on aligning narrowly superhuman models (AN #141). I recommend reading my summaries of those works individually to find out what they are. This podcast can help contextualize all of the work, adding in details that you wouldn’t naturally see if you just read the reports or my summaries of them.

For example, I learned that there is a distinction between noise and effective horizon length. To the extent that your gradients are noisy, you can simply fix the problem by increasing your batch size (which can be done in parallel). However, the effective horizon length is measuring how many sequential steps you have to take before you get feedback on how well you’re doing. The two are separated in the bio anchors work because the author wanted to impose specific beliefs on the effective horizon length, but was happy to continue extrapolating from current examples for noise.

FIELD BUILDING

AI Safety Career Bottlenecks Survey Responses Responses (Linda Linsefors) (summarized by Rohin): A past survey asked for respondents’ wish list of things that would be helpful and/or make them more efficient (with respect to careers in AI safety). This post provides advice for some of these wishes. If you’re trying to break into AI safety work, this seems like a good source to get ideas on what to try or resources that you hadn’t previously seen.

MISCELLANEOUS (ALIGNMENT)

"Existential risk from AI" survey results (Rob Bensinger) (summarized by Rohin): This post reports on the results of a survey sent to about 117 people working on long-term AI risk (of which 44 responded), asking about the magnitude of the risk from AI systems. I’d recommend reading the exact questions asked, since the results could be quite sensitive to the exact wording, and as an added bonus you can see the visualization of the responses. In addition, respondents expressed a lot of uncertainty in their qualitative comments. And of course, there are all sorts of selection effects that make the results hard to interpret.

Keeping those caveats in mind, the headline numbers are that respondents assigned a median probability of 20% to x-risk caused due to a lack of enough technical research, and 30% to x-risk caused due to a failure of AI systems to do what the people deploying them intended, with huge variation (for example, there are data points at both ~1% and ~99%).

Rohin's opinion: I know I already harped on this in the summary, but these numbers are ridiculously non-robust and involve tons of selection biases. You probably shouldn’t conclude much from them about how much risk from AI there really is. Don’t be the person who links to this survey with the quote “experts predict 30% chance of doom from AI”.

Survey on AI existential risk scenarios (Sam Clarke et al) (summarized by Rohin): While the previous survey asked respondents about the overall probability of existential catastrophe, this survey seeks to find which particular risk scenarios respondents find more likely. The survey was sent to 135 researchers, of which 75 responded. The survey presented five scenarios along with an “other”, and asked people to allocate probabilities across them (effectively, conditioning on an AI-caused existential catastrophe, and then asking which scenario happened).

The headline result is that all of the scenarios were roughly equally likely, even though individual researchers were opinionated (i.e. they didn’t just give uniform probabilities over all scenarios). Thus, there is quite a lot of disagreement over which risk scenarios are most likely (which is yet another reason not to take the results of the previous survey too seriously).

AI GOVERNANCE

Some AI Governance Research Ideas (Alexis Carlier et al) (summarized by Rohin): Exactly what it says.

OTHER PROGRESS IN AI

DEEP LEARNING

The Power of Scale for Parameter-Efficient Prompt Tuning (Brian Lester et al) (summarized by Rohin): The highlighted paper showed that prompt programming as currently practiced depends on having a dataset on which prompts can be tested. If we have to use a large dataset anyway, then could we do better by using ML techniques like gradient descent to choose the prompt? Now, since prompts are discrete English sentences, you can’t calculate gradients for them, but we know how to deal with this -- the first step of a language model is to embed English words (or syllables, or bytes) into a real-valued vector, after which everything is continuous. So instead of using gradient descent to optimize the English words in the prompt, we instead optimize the embeddings directly. Another way of thinking about this is that we have our “prompt” be a sentence of (say) 50 completely new words, and then we optimize the “meaning” of those words such that the resulting sequence of 50 newly defined words becomes a good prompt for the task of interest.

The authors show that this approach significantly outperforms the method of designing prompts by hand. While it does not do as well as finetuning the full model on the task of interest, the gap between the two decreases as the size of the model increases. At ~10 billion parameters, the maximum size tested, prompt tuning and model tuning are approximately equivalent.

In addition, using a prompt is as simple as prepending the new prompt embedding to your input and running it through your model. This makes it particularly easy to do ensembling: if you have N prompts in your ensemble, then given a new input, you create a batch of size N where the ith element consists of the ith prompt followed by the input, and run that batch through your model to get your answer. (In contrast, if you had an ensemble of finetuned models, you would have to run N different large language models for each input, which can be significantly more challenging.)

NEWS

AI Safety Research Project Ideas (Owain Evans et al) (summarized by Rohin): In addition to a list of research project ideas, this post also contains an offer of mentorship and/or funding. The deadline to apply is June 20.

Research Fellow- AI TEV&V (summarized by Rohin): CSET is currently seeking a Research Fellow to focus on the safety and risk of deployed AI systems.

Deputy Director (CSER) (summarized by Rohin): The Centre for the Study of Existential Risk (CSER) is looking to hire a Deputy Director.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

[-]Zac Hatfield-Dodds3y00

High Impact Careers in Formal Verification: Artificial Intelligence

My research focuses on advanced testing and fuzzing tools, which are so much easier to use that people actually use them - eg in Pytorch, and I understand in Deepmind. If people seem interested I could write up a post on relevance to AI safety in a few weeks.

Core idea: even without proofs, writing out safety properties or other system invariants in code is valuable both (a) for deconfusion, and (b) because we can have a computer search for counterexamples using a variety of heuristics and feedbacks. At the current margin this tends to improve team productivity and shift ML culture towards valuing specifications, which may be a good thing for AI x-risk.

[-]Rohin Shah3y20

My general take is that anything relying on safety properties / system invariants being written out in formal languages / interpretable code seems not that helpful, because we just don't get such clean safety properties / system invariants.

I am pretty enthusiastic about relying on learned models of safety properties / system invariants, that are assumed to be good but not perfect specifications; and fuzzing / testing with respect to those learned models seems great.