[AN #136]: How well will GPT-N perform on downstream tasks?



Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Extrapolating GPT-N performance (Lukas Finnveden) (summarized by Asya): This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the GPT-3 paper (AN #102). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized to the difference between random performance and the maximum possible performance. Since previous work (AN #87) has shown that cross-entropy loss scales smoothly with model size, data, and FLOP requirements, we can then look at the overall relationship between those inputs and benchmark performance.

The author finds that most of the benchmarks scale smoothly and similarly with respect to cross-entropy loss. Three exceptions are arithmetic, scramble (shuffling letters around the right way), and ANLI (a benchmark generated adversarially against transformer-based language models), which don't improve until the very end of the cross-entropy loss range. The author fits linear and s-shaped curves to these relationships, and guesses that:

- Performance improvements are likely to slow down closer to maximum performance, making s-curves a better progress estimate than linear curves.

- Machine learning models may use very different reasoning from humans to get good performance on a given benchmark, so human-level performance on any single benchmark would likely not be impressive, but human-level performance on almost all of them with few examples might be.

- We might care about the point where we can achieve human-level performance on all tasks with a 1 token "horizon length"-- i.e., all tasks where just 1 token is enough of a training signal to understand how a change in the model affects its performance. (See this AI timelines report draft (AN #121) for more on horizon length.) Achieving this milestone is likely to be more difficult than getting to human-level performance on these benchmarks, but since scaling up GPT is just one way to do these tasks, the raw number of parameters required for this milestone could just as well be less than the number of parameters that GPT needs to beat the benchmarks.

- Human-level performance on these benchmarks would likely be enough to automate lots of particular tasks with short horizon length, such as customer service, PA and RA work, and writing routine sections of code.

The author augments his s-curves graph with references to certain data, FLOP, and parameter levels, including the number of words in common crawl, the number of FLOPs that could currently be bought for $1B, the point where reading or writing one word would cost 1 cent, and the number of parameters in a transformative model according to this AI timelines report draft (AN #121). (I recommend looking at the graph of these references to see their relationship to the benchmark trends.)

Overall, the author concludes that:

- GPT-3 is in line with smooth performance on benchmarks predicted by smaller models. It sharply increases performance on arithmetic and scramble tasks, which the author thinks is because the tasks are 'narrow' in that they are easy once you understand their one trick. The author now finds it less likely that a small amount of scaling will suddenly lead to a large jump in performance on a wide range of tasks.

- Close to optimal performance on these benchmarks seems like it's at least ~3 orders of magnitude away ($1B at current prices). The author thinks more likely than not, we'd get there after increasing the training FLOP by ~5-6 orders of magnitude ($100B -$1T at current prices, $1B - $10B given estimated hardware and software improvements over the next 5 - 10 years). The author thinks this would probably not be enough to be transformative, but thinks we should prepare for that eventuality anyway.

- The number of parameters estimated for human-equivalent performance on these benchmarks (~1e15) is close to the median number of parameters given in this AI timelines report draft (AN #121), which is generated via comparison to the computation done in the human brain.

Asya's opinion: Ask and ye shall receive! In my last summary (AN #125), I mentioned that I was uncertain about how cross-entropy loss translates to transformative progress that we care about, and here is an excellent post exploring just that question. I'm sure I'll end up referencing this many times in the future.

The post discusses both what benchmarks might suggest for forecasting "human equivalence" and how benchmarks might relate to economic value via concrete task automation. I agree with the tasks the author suggests for the latter, and continuing my "opinions as calls for more work" trend, I'd be interested in seeing even more work on this-- i.e. attempts to decompose tasks into a set of concrete benchmark performances which would be sufficient for economically valuable automation. This comment thread discusses whether current benchmarks are likely to capture a substantial portion of what is necessary for economic value, given that many jobs end up requiring a diverse portfolio of skills and reasoning ability. It seems plausible to me that AI-powered automation will be "discontinuous" in that a lot of it will be unlocked only when we have a system that's fairly general.

It seems quite noteworthy that the parameter estimates here and in the AI timelines report draft are close together, even though one is anchored to human-level benchmark performance, and the other is anchored to brain computation. That updates me in the direction of those numbers being in the right range for human-like abilities.

People interested in this post maybe also be interested in BIG-bench, a project to crowdsource the mother of all benchmarks for language models.



Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning (Valerie Chen et al) (summarized by Rohin): It is particularly challenging for RL agents to perform hierarchical tasks when there is only a sparse reward. One natural piece of feedback in this setting is instructions in natural language specifying the different subtasks needed to solve the task. In particular, this paper assumes we have access to a dataset of human demonstrations paired with natural language instructions for each subtask that they complete.

We then have an architecture that first generates the language instruction for the current subtask given the final task and the current state, and then takes a low-level action computed from the current state and the language instruction. This is trained via imitation learning on the human demonstrations.

Using a small Minecraft-inspired gridworld, the authors show that the language generation is crucial for good generalization: if the agent is trained on “cobblestone block” and “iron ingot”, then it is able to generalize to cobblestone ingot, as long as it was trained to generate the language instruction as well. Intuitively, the combinatorial structure of language leads to better generalization than direct imitation on low-level actions.

A Narration-based Reward Shaping Approach using Grounded Natural Language Commands (Nicholas Waytowich et al) (summarized by Rohin): One way to specify what an AI system should do is to simply specify it in natural language. If we have some way to map natural language instructions to states, then we could turn natural language into a reward function and use RL to optimize it.

This paper proposes specifying a task by breaking it down into a sequence of steps to be completed. Given a mapping from natural language to states, they define a reward function that gives a positive reward every time the mapping detects that the agent has completed the next stage in the sequence of steps. They show that this outperforms vanilla reinforcement learning on a win/loss reward function in a StarCraft minigame.

For the mapping of language to states, the authors use a mutual embedding model (MEM) they developed in previous work. The core idea is to write down programs that identify states matching a particular natural language instruction, use this to generate a dataset of states and the corresponding natural language instruction, and then training a model to map the natural language instructions to be “close to” the mappings of the states (which are produced by a CNN).

Rohin's opinion: My understanding is the MEM only handles the six natural language instructions used in the StarCraft minigame, and so is roughly equivalent to training six classifiers using the hardcoded programs to generate datasets. Thus, these two papers ultimately boil down to “decompose the task into six steps, train classifiers for these six steps, and then do RL where the reward function gives positive reward every time a particular step is marked as complete”.

However, this is primarily because the authors had to ground the natural language instructions themselves. If we could instead leverage a pretrained model which already grounds natural language, such as CLIP, then it seems like this approach could in fact save a lot of human effort in specifying what the AI system should do.

Learning Rewards from Linguistic Feedback (Theodore R. Sumers et al) (summarized by Rohin): This paper proposes another approach to reinforcement learning using natural language. After the agent plays an episode, we can ask a human for feedback in natural language. We then take their response, figure out what features of the environment the response mentions, and then use sentiment analysis to determine how to update the weights on the features. For sentiment analysis we can use an off-the-shelf classifier; the hard part is in determining the relevant environment feature vectors:

1. Evaluative feedback is feedback about the trajectory the agent produced, for example “good job”, so we can just use the features of this trajectory.

2. Imperative feedback specifies what the agent should have done, e.g. “you should have gone to the top right corner”. In this case, we must find the features consistent with the given instruction.

3. Descriptive feedback provides feedback directly about the reward, for example “yellow objects are bad”. In this case, we use a feature vector that has a 1 for every feature mentioned (in this case, the feature for yellow objects) and 0 everywhere else.

Types 2 and 3 require some domain knowledge in order to write down programs that map language to the relevant features. The environment the authors used was simple enough that they were able to do this.

Once we have the feature vector f and the sentiment s, we perform a Bayesian update on our weight distribution. This is similar to the way we perform Bayesian updates on the reward distribution upon seeing a human action as evidence, as in Bayesian IRL (AN #132) or reward-rational implicit choice (AN #89).

This model so far performs reasonably well. By adding a couple of heuristics inspired by pragmatics (e.g. assuming that features that aren’t mentioned aren’t decision-relevant), they reach approximately human-level performance.


Avoiding Side Effects in Complex Environments (Alex Turner et al) (summarized by Zach): One proposal for impact regularization is attainable utility preservation (AUP) (AN #91), in which we view side effects as changes in the ability of an agent to optimize a variety of reward functions. By incentivizing the agent not to change the optimal value for a wide range of auxiliary reward functions, the agent may avoid decreasing the optimal value for the true reward.

To test the claim that AUP is a suitable way to avoid side-effects the authors experiment in SafeLife (AN #91), an environment suite based on Conway's "Game of Life". In the Game of Life, depending on how many live neighbors surround a cell, the cell either comes to life, dies, or retains its state. In SafeLife the eight cells surrounding the agent cells are frozen and can be modified by the agent. Thus, the agent can disturb, or modify, dynamic patterns by merely approaching them.

To measure side-effects the authors compare the evolution as it would've evolved without agent interference vs. the evolution with the agent present. The tasks are simple: either add or remove cells from a specified location. However, there are obstacles in the way that the agent could disturb. To implement AUP, the authors use a single randomly sampled reward function based on downsampling from the observation space. As a baseline, the authors compare AUP against PPO.

Generally, AUP is able to achieve fewer side-effects than PPO while still obtaining reasonable performance. However, AUP does take longer to train than PPO. Additionally, the side-effects incurred during the training of AUP increase to a peak before settling below the side-effect score of PPO. It's also important to note that sampling multiple rewards for AUP has the counter-intuitive effect of increasing the side-effect score.

Zach's opinion: This paper presents a clear approach to handling side-effects and provides a fairly thorough analysis via experimentation. Having said that, I find the experimental findings to be mixed. Intuitively, adding more random rewards would decrease task performance and the number of side-effects. However, this isn't shown out in the data which raises interesting questions about how to best sample random reward functions. Related to this, the phenomena of side-effects increasing at the start of training for AUP is worth further investigation.


Adversarial examples for the OpenAI CLIP in its zero-shot classification regime and their semantic generalization (Stanislav Fort) (summarized by Rohin): CLIP is a model that was trained on a vast soup of image-caption data, and as a result can perform zero-shot image classification (for example, it gets 87% accuracy on CIFAR-10 out of the box). Does it also have adversarial examples within the image classification regime? This post shows that the answer is yes, and in fact these adversarial examples are easy to find.

More interestingly though, these adversarial examples persist if you change the labels in a semantically meaningful way. For example, if you take an image X that is correctly classified as a cat and imperceptibly modify it to Y which is now classified as a dog, if you change the class names to “kitty” and “hound”, then the same X will now be classified as a kitty while the same Y will be classified as a hound. This even works (though not as well) for labels like “domesticated animal which barks and is best friend”. The author takes this as evidence that the adversarial image actually looks like the adversarial class to the neural net, rather than being a peculiar consequence of the specific label.

Rohin's opinion: This seems like further validation of the broad view put forth in Adversarial Examples Are Not Bugs, They Are Features (AN #62).



Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design (Michael Dennis, Natasha Jaques et al) (summarized by Rohin): One argument for AI risk is that we have to specify some aspects of the training procedure, and if these are poorly specified, then bad outcomes may result. Typically we think of bad specification of the reward function as the risk, but this can also apply to environments: if we train a system in a simulated environment, then it may fail if the simulation is insufficiently similar to the real environment.

A typical approach would be domain randomization: we randomly vary some parameters that control the behavior of the environment. Unfortunately, this can often create environments that are too easy: in a maze environment, this approach often doesn’t have enough walls. Another approach could be to choose the environment adversarially, so that the agent learns the skills needed for hard environments. Unfortunately, this can often make the environment unsolvable: in the maze environment, the goal may be unreachable from the initial position.

The key idea of this paper is a method to create environments that are just on the edge of the agent’s abilities, by finding an environment that maximizes the agent’s regret: how poorly the agent performs, relative to how well it could have done. To operationalize how well the agent “could have done”, we also train an antagonist agent, and we then choose an environment that the antagonist performs well on but the protagonist performs poorly on. This results in environments that are solvable but challenging for the protagonist.


AI Safety Career Bottlenecks Survey (AI Safety Support) (summarized by Rohin): AI Safety Support have released a career bottlenecks survey that they will use to guide their work. You can take the survey here.

AISU 2021 (summarized by Rohin): The third AI safety unconference will take place online from April 23rd to April 28th, 2021. The registration deadline is April 13th.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.


New Comment