[AN #99]: Doubling times for the efficiency of AI algorithms

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

AI and Efficiency (Danny Hernandez et al) (summarized by Flo): Given the exponential increase (AN #7) in compute used for state-of-the-art results in ML, one might come to think that there has been little algorithmic progress. This paper presents strong evidence against that hypothesis. We can roughly measure algorithmic progress by tracking the compute needed to achieve a concrete performance benchmark over time. Doing so yields doubling times in efficiency (time until only half of the initial compute was needed for the same performance) of around 16 months for ImageNet, which is faster than Moore's law. Other tasks like translation as well as playing Go and Dota 2 exhibit even faster doubling times over short periods. As making a task feasible for the first time arguably presents more algorithmic progress than improving the efficiency of solving an already feasible task, actual progress might be even faster than these numbers suggest. However, the amount of data points is quite limited and it is unclear if these trends will persist and whether they will generalize to other domains. Still, the authors conjecture that similar trends could be observed for tasks that received large amounts of investment and have seen substantial gains in performance.

Combining these results with the increased available compute over time, the authors estimate that the effective training compute available to the largest AI experiments has increased by a factor of 7.5 million (!) in 2018 relative to 2012.

A focus on efficiency instead of top performance allows actors with limited amounts of compute to contribute. Furthermore, models that reach a particular benchmark quickly seem like strong candidates for scaling up. This way, more efficient algorithms might act as a catalyst for further progress. There is a public git repository to keep better track of algorithmic efficiency.

Flo's opinion: Even though access to compute has surely helped with increased efficiency in ways that I would not really label as algorithmic progress (for example by enabling researchers to try more different hyperparameters), the aggregated numbers seem surprisingly high. This suggests that I either had not correctly internalized what problems AI is able to solve these days, or underestimated the difficulty of solving these problems. It would be quite interesting to see whether there are similar improvements in the sample efficiency of deep reinforcement learning, as I expect this to be a major bottleneck for the application of agentic AIs in the absence of accurate simulators for real-world decision making.

TECHNICAL AI ALIGNMENT

ROBUSTNESS

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment (Di Jin, Zhijing Jin et al) (summarized by Asya): This paper presents TextFooler, an algorithm for generating adversarial text for natural language tasks with only black-box access to models. TextFooler tries to generate sentences that are grammatical and semantically similar to original input sentences but produce incorrect labels. It does this by identifying a small set of most important words in the original sentence, generating candidate synonyms for those words, and gradually replacing the important words in the sentence by testing which synonyms cause the model to mispredict or report the least confidence score.

TextFooler is tested on three state-of-the-art NLP models-- WordCNN, WordLSTM, and BERT, all trained to ~80 - 90% test accuracy. On a variety of text classification datasets, TextFooler reduces accuracy to below ~15% with less than ~20% of the words perturbed. Humans evaluating the generated sentences say they are approximately as grammatical as the original, have the same label as the original in ~90% of cases, and have a sentence similarity score to the original sentence of 0.9 on a 0 to 1 scale. The paper finds that generally, models with higher original accuracy have higher after-attack acuracy.

The authors retrain BERT from scratch using data produced by TextFooler and then attack it using TextFooler again. They find that the after-attack accuracy is higher and that attacks require more perturbed words.

Asya's opinion: I was surprised that the accuracies the paper presented after adversarially training on TextFooler-produced sentences still weren't very high-- BERT's after-attack accuracy on one dataset went from 11.5% to 18.7%, and on another went from 4.0% to 8.3%. The paper didn't give a detailed description of its retraining procedure, so this may just be because they didn't adversarially train as much as they could have.

Rohin's opinion: This is an instance of the general trend across domains where if you search in a black-box way around training or test inputs, you can relatively easily uncover examples where your model performs poorly. We've seen this with adversarial examples in image classification, and with adversarial (AN #73) policies (AN #70) in deep reinforcement learning.

Pretrained Transformers Improve Out-of-Distribution Robustness (Dan Hendrycks et al) (summarized by Asya): One important metric for the performance of deep learning models is the extent to which they generalize to examples that are out-of-distribution (OOD) from the original distribution on which they were trained. This ability is sometimes called out-of-distribution robustness. This paper examines the OOD robustness of several NLP models: a bag-of-words model, word embedding models that use word averages, LSTMs, or ConvNets, and several models that use pretrained bidirectional transformers (BERT).

The paper finds that:

- Pretrained transformers (BERT) are significantly more OOD robust.

- Pretrained transformers (BERT) are significantly better at detecting when they've encountered an OOD example. Previous models do worse than random chance at detection.

- Larger models don't increase OOD robustness in NLP the way they seem to in computer vision.

- Model distillation (using a larger trained neural network to train a smaller neural network) reduces OOD robustness, suggesting that naive in-distribution tests for model distillation methods may mask later failures.

- More diverse data improves OOD robustness.

The paper hypothesizes that these pretrained models may perform better because they were pretrained on particularly diverse data, were trained on a large amount of data, and were trained with self-supervised objectives, which previous work has suggested improves OOD robustness and detection.

Asya's opinion: I think this is an awesome paper which, among other things, points at potential research directions for increasing OOD robustness: training more, training more diversely, and training in self-supervised ways. I think it's pretty noteworthy that larger models don't increase OOD robustness in NLP (all else equal), because it implies that certain guarantees may be constrained entirely by training procedures.

MISCELLANEOUS (ALIGNMENT)

Corrigibility as outside view (Alex Turner) (summarized by Rohin): This post proposes thinking of the outside view as an aspect of corrigible (AN #35) reasoning. In particular, before an agent takes an action that it believes is right, it can simulate possible overseers with different values, and see whether the reasoning that led to this action would do the right thing in those situations as well. The agent should then only take the action if the action usually turns out well.

This is similar to how we might reason that it wouldn't be good for us to impose the rules we think would be best for everyone, even if we had the power to do so, because historically every instance of this happening has actually been bad.

Rohin's opinion: I agree that this sort of "outside-view" reasoning seems good to have. In cases where we want our agent to be deferential even in a new situation where there isn't an outside view to defer to, the agent would have to construct this outside view via simulation, which would probably be infeasibly computationally expensive. Nonetheless, this seems like a cool perspective and I'd like to see a more in-depth take on the idea.

AI STRATEGY AND POLICY

AI Governance in 2019 - A Year in Review: Observations from 50 Global Experts (Shi Qian, Li Hui, Brian Tse et al) (summarized by Nicholas): This report contains short essays from 50 experts reviewing progress in AI governance. I’ll describe a few themes here rather than try to summarize each essay.

The first is a strong emphasis on issues of bias, privacy, deception, and safety. Bias can occur both due to biases of programmers designing algorithms as well as bias that exists in the data. Deception includes deepfakes as well as online accounts that impersonate humans, a subset of which were made illegal in California this year.

The benefit of international collaborations and conferences and getting broad agreement from many stakeholders both in government and companies was frequently highlighted throughout. One example is the OECD Principles on AI, which were later adopted by the G20 including both the US and China, but there were many working groups and committees organized as well, both within industry and governments.

The other shift in 2019 was moving from broad principles towards more specific sets of requirements and policy decisions. The principles agreed to have been quite similar, but the specific implementations vary significantly by country. There were individual essays describing the regional challenges in Europe, the UK, Japan, Singapore, India, and East Asia. Many essays also highlighted the debate around publication norms (AN #73), which garnered a lot of attention in 2019 following OpenAI’s staged release of GPT-2.

Nicholas's opinion: I am very impressed by the number and diversity of experts that contributed to this report. I think it is quite valuable to get people with such different backgrounds and areas of expertise to collaborate on how we should be using AI ahead of time. I was also pleasantly surprised to hear that there was broad international agreement on principles so far, particularly given an overall political trend against global institutions that has occurred recently. I’m definitely interested to know what the key factors were in managing that and how we can make sure these things continue.

Another piece that jumped out at me is the overlap between longer-term issues of safety and shorter-term issues of bias and privacy. For technical safety work, I think the problems are largely distinct and it is important for safety researchers to remain focused on solving problems with major long-term consequences. However, in the governance context, the problems seem to have much more in common and require many similar institutions / processes to address. So I hope that these communities continue to work together and learn from each other.

OTHER PROGRESS IN AI

UNSUPERVISED LEARNING

A Simple Framework for Contrastive Learning of Visual Representations (Ting Chen et al) (summarized by Rohin): Contrastive learning is a major recent development, in which we train a neural net to learn representations by giving it the task of maximizing "agreement" between similar images, while minimizing it across dissimilar images. It has been used to achieve excellent results with semi-supervised learning on ImageNet.

The authors performed a large empirical study of contrastive learning. Their framework consists of three components. First, the data augmentation method specifies how to get examples of "similar images": we simply take an (unlabeled) training image, and apply data augmentations to it to create two images that both represent the same underlying image. They consider random crops, color distortion, and Gaussian blur. Second is the neural network architecture, which is split into the first several layers f() which compute the representation from the input, and the last few layers g() which compute the similarity from the representation. Finally, the contrastive loss function defines the problem of maximizing agreement between similar images, while minimizing agreement between dissimilar images. They primarily use the same InfoNCE loss used in CPC (AN #92).

They then show many empirical results, including:

1. Having a simple linear layer in g() is not as good as introducing one hidden layer, or in other words, the representations in the penultimate layer are more useful than those in the final layer.

2. Larger batch sizes, longer training, and larger networks matter even more for unsupervised contrastive learning than they do for supervised learning.

Momentum Contrast for Unsupervised Visual Representation Learning (Kaiming He et al) (summarized by Rohin): In most deep learning settings, the batch size primarily controls the variance of the gradient, with higher batch sizes decreasing variance. However, with typical contrastive learning, batch size also determines the task: typically, the task is to maximize agreement between two examples in the batch, and minimize agreement with all the other examples in the batch. Put another way, given one input, you have to correctly classify which of the remaining examples in the minibatch is a differently transformed version of that input. So, the batch size determines the number of negative examples.

So, besides decreasing variance, large batch sizes also increase the difficulty of the task to be solved. However, such large batch sizes are hard to fit into memory and are computationally expensive. This paper proposes momentum contrast (MoCo), in which we get large numbers of negative examples for contrastive learning, while allowing for small batch sizes.

Think of contrastive learning as a dictionary lookup task -- given one transformed image (the query), you want to find the same image transformed in a different way out of a large list of images (the keys). The key idea of this paper is to have the minibatch contain queries, while using all of the previous N minibatches as the keys (for some N > 1), allowing for many negative examples with a relatively small minibatch.

Of course, this wouldn't help us if we had to encode the keys again each time we trained on a new minibatch. So, instead of storing the images directly as keys, we store their encoded representations in the dictionary, ensuring that we don't have to rerun the encoder every iteration on all of the keys. This is where the computational savings come from.

However, the encoder is being updated over time, which means that different keys are being encoded differently, and there isn't a consistent kind of representation against which similarity can be computed. To solve this, the authors use a momentum-based version of the encoder to encode keys, which ensures that the key encodings change slowly and smoothly, while allowing the query encoder to change rapidly. This means that the query representation and the key representations will be different, but the layers on top of the representations can learn to deal with that. What's important is that within the key representations, the representations are approximately consistent.

Improved Baselines with Momentum Contrastive Learning (Xinlei Chen et al) (summarized by Rohin): This paper applies the insights from the SimCLR paper to the MoCo framework: it adds an extra hidden layer on top of the representations while training on the contrastive loss, and adds the blur data augmentation. This results in a new SOTA on self-supervised representation learning for images.

REINFORCEMENT LEARNING

CURL: Contrastive Unsupervised Representations for Reinforcement Learning (Aravind Srinivas, Michael Laskin et al) (summarized by Rohin): This paper applies contrastive learning (discussed above) to reinforcement learning. In RL, rather than training in an initial unsupervised phase, the contrastive learning happens alongside the RL training, and so serves as an auxiliary objective to speed up learning. They use random crops for their data augmentation.

Reinforcement Learning with Augmented Data (Michael Laskin, Kimin Lee et al) (summarized by Rohin): While CURL (summarized above) applies contrastive learning in order to ensure the network is invariant to specific data augmentations, we can try something even simpler: what if we just run a regular RL algorithm on augmented observations (e.g. observations that have been randomly cropped)? The authors term this approach RAD (RL with Augmented Data), and find that this actually outperforms CURL, despite not using the contrastive learning objective. The authors speculate that CURL is handicapped by using the contrastive loss as an auxiliary objective, and so its representations are forced to be good both for the true task and for the contrastive prediction task, whereas RAD only trains on the true task.

Read more: RAD Website

Rohin's opinion: I'd be interested in seeing a variant on CURL where the weight for the contrastive loss decays over time: if the author's speculation is correct, this should mitigate the problem with CURL, and one would hope that it would then be better than RAD.

Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels (Ilya Kostrikov et al) (summarized by Rohin): This paper applies data augmentation to Q-learning algorithms, again without a contrastive loss. Specifically, they suggest that the Q-values of states should be invariant to data augmentations (e.g. random translations, which is what they use), and so any time we need to estimate a Q-value, we can reduce the variance of this estimate by sampling multiple data augmentations of the state, and averaging the predicted Q-values for each of them. They apply this to Soft Actor-Critic (SAC) and find that it significantly improves results.

A Reinforcement Learning Potpourri (Alex Irpan) (summarized by Rohin): This blog post summarizes several recent papers in RL (including the data augmentation papers I summarized above, as well as First Return Then Explore, the successor to Go-Explore (AN #35).

Rohin's opinion: The whole blog post is worth reading, but I particularly agree with his point that data augmentation generally seems like a no-brainer, since you can think of it either as increasing the size of your dataset by some constant factor, or as a way of eliminating spurious correlations that your model might otherwise learn.

NEWS

BERI seeking new university collaborators (Sawyer Bernath) (summarized by Rohin): BERI is expanding its offerings to provide free services to a wider set of university-affiliated groups and projects, and they’re now accepting applications from groups and individuals interested in receiving their support. If you’re a member of a research group, or an individual researcher, working on long-termist projects, you can apply here.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

14