[AN #129]: Explaining double descent by measuring bias and variance

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (Zitong Yang et al) (summarized by Nicholas): A fundamental result in ML theory shows that the squared error loss function can be decomposed into two components: bias and variance. Suppose that we train a model f to predict some ground truth function y. The bias measures how incorrect the model will be in expectation over the training process, while the variance measures how different the model’s output can be over different runs of the training process. More concretely, imagine that we run a training process N times, each with a different training set drawn iid from the same underlying training distribution, to get N different models. Bias is like taking the average of these N models, and asking how far away it is from the truth. Meanwhile, variance is like the average distance from each of the N models to the average of all of the N models.
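The "train N models and compare them" picture above can be sketched in a few lines. This is only an illustration, not the paper's experimental setup: the sine-wave ground truth, the cubic polynomial model, the noise level, and the helper names `bias_variance` and `train_and_predict` are all assumptions made for the example. Strictly speaking it is the squared bias that adds up with the variance to give the expected squared error against the ground truth.

```python
import numpy as np

def bias_variance(predictions, truth):
    """predictions: (N, M) outputs of N independently trained models on M test points.
    truth: (M,) ground-truth values at the same test points."""
    mean_prediction = predictions.mean(axis=0)              # the "average model"
    bias_sq = np.mean((mean_prediction - truth) ** 2)       # distance of the average model from the truth
    variance = np.mean((predictions - mean_prediction) ** 2)  # spread of the models around their average
    return bias_sq, variance

def train_and_predict(rng, x_test, degree=3, n_train=30, noise=0.3):
    # Hypothetical training run: draw a fresh iid training set, fit a polynomial.
    x = rng.uniform(-1.0, 1.0, n_train)
    y = np.sin(3.0 * x) + noise * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

rng = np.random.default_rng(0)
x_test = np.linspace(-1.0, 1.0, 200)
truth = np.sin(3.0 * x_test)

# N = 50 training runs, each on its own training set.
predictions = np.stack([train_and_predict(rng, x_test) for _ in range(50)])
bias_sq, variance = bias_variance(predictions, truth)
expected_error = np.mean((predictions - truth) ** 2)

print(f"bias^2={bias_sq:.4f} variance={variance:.4f} error={expected_error:.4f}")
```

Because the variance here is computed around the empirical average of the N models, the squared bias and variance sum exactly to the average squared error of the N models.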

Classical ML predicts that larger models have lower bias but higher variance. This paper shows that instead, the variance of deep NNs first increases but then decreases at larger model sizes. If the bias tends to be much larger than variance, then we see monotonically decreasing total error. If the variance tends to be much larger than the bias, then the total error follows the variance and looks bell-shaped, initially increasing as models get bigger and then decreasing. Finally, if the bias starts high, but over time is overshadowed by the variance, we get double descent (AN #77) curves; this explains why previous work needed to add label noise to get double descent curves (as higher label noise should lead to higher variance).
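To make the three regimes concrete, here is a toy sketch that adds a decreasing bias curve to a unimodal variance curve and classifies the shape of the sum. The functional forms, the scale values, and the names `total_error` and `curve_shape` are invented for illustration; they are not taken from the paper and are not fit to any data.

```python
import numpy as np

def curve_shape(total):
    """Describe a sampled curve by the sequence of directions it moves in."""
    signs = np.sign(np.diff(total))
    signs = signs[signs != 0]
    turns = [int(signs[0])]
    for s in signs[1:]:
        if int(s) != turns[-1]:
            turns.append(int(s))  # record each change of direction once
    names = {
        (-1,): "monotonically decreasing",
        (1, -1): "bell-shaped",
        (-1, 1, -1): "double descent",
    }
    return names.get(tuple(turns), "other")

def total_error(size, bias_scale, variance_scale):
    # Assumed toy curves: bias decays towards a floor, variance rises then falls.
    bias = bias_scale * (0.2 + np.exp(-3.0 * size))
    variance = variance_scale * size * np.exp(-size / 2.0)
    return bias + variance

sizes = np.linspace(0.0, 10.0, 401)  # stand-in for model size

bias_dominated = curve_shape(total_error(sizes, bias_scale=10.0, variance_scale=0.1))
variance_dominated = curve_shape(total_error(sizes, bias_scale=0.01, variance_scale=1.0))
mixed = curve_shape(total_error(sizes, bias_scale=1.0, variance_scale=1.0))

print(bias_dominated, variance_dominated, mixed, sep=" | ")
```

Only the relative scale of the two curves changes between the three calls, which is the point of the argument: the same unimodal variance yields different total-error shapes depending on how it compares to the bias.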

In order to estimate the variance, the authors split their data into two subsets and use these to create an unbiased estimator of the variance (effectively following a similar procedure to the one described in the first paragraph). The bias estimate can then be determined from the test loss and the estimated variance. They then test how various factors contribute to test loss. As expected, label noise increases variance. Out-of-distribution samples have higher test loss, which is driven by both bias and variance, but most of the increase comes from bias. Deeper networks sharing the same architecture have lower bias but higher variance.
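A minimal sketch of a two-split estimator in this spirit, for squared error: with two models trained on disjoint halves of the data, half the squared difference of their predictions is an unbiased estimate of the variance (it is the usual sample variance with N = 2), and subtracting it from the test loss gives an estimate of the squared bias. The function name and the synthetic predictions below are assumptions for the example; the paper's exact estimator may differ in its details.

```python
import numpy as np

def two_split_estimates(pred_a, pred_b, truth):
    """pred_a, pred_b: predictions of two models trained on disjoint data halves.
    truth: ground-truth targets on the same test points."""
    # Sample variance of two values a, b (with the N-1 correction) is (a - b)^2 / 2.
    variance = np.mean((pred_a - pred_b) ** 2) / 2.0
    # Average test loss of the two models.
    test_loss = 0.5 * (np.mean((pred_a - truth) ** 2) + np.mean((pred_b - truth) ** 2))
    # Squared-bias estimate: whatever part of the test loss the variance does not explain.
    bias_sq = test_loss - variance
    return bias_sq, variance, test_loss

# Synthetic stand-in for real model outputs: a shared offset plus independent noise.
rng = np.random.default_rng(0)
truth = np.zeros(10_000)
pred_a = truth + 0.5 + rng.normal(size=truth.shape)
pred_b = truth + 0.5 + rng.normal(size=truth.shape)

bias_sq, variance, test_loss = two_split_estimates(pred_a, pred_b, truth)
print(f"bias^2~{bias_sq:.3f} variance~{variance:.3f} test loss~{test_loss:.3f}")
```

Note that the squared-bias estimate is a difference of two noisy quantities, so on small test sets it can come out slightly negative even though the true squared bias cannot.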

Nicholas' opinion: I really appreciated how this paper gives a clear explanation for empirical results that I had found quite surprising. While there is still the mystery of why variance is unimodal, this helps isolate the confusing phenomenon much more precisely. The explanation of why label noise is necessary seemed quite clear to me and makes much more sense to me than the description in Deep Double Descent (AN #77). I was encouraged to see that most of the out-of-distribution generalization gap comes from an increase in bias, paired with the finding that deeper networks exhibit lower bias. This seems to indicate that larger models should actually get more robust, which I think will make powerful AI systems safer.

Rohin’s opinion: I really liked this paper for the same reasons as Nicholas above. Note that while the classical theory and this paper both talk about the bias and variance of “models” or “hypothesis classes”, what’s actually being measured is the bias and variance of a full training procedure on a specific task and dataset. This includes the model architecture, choice of training algorithm, choice of hyperparameters, etc. To me this is a positive: we care about what happens with actual training runs, and I wouldn’t be surprised if this ends up being significantly different from whatever happens in theory when you assume that you can find the model that best fits your training data.

I’d be excited to see more work testing the unimodal variance explanation for deep double descent: for example, can we also explain epochal deep double descent (AN #77) via unimodal variance?

In addition, I’d be interested in speculation on possible reasons for why variance tends to be unimodal. For example, perhaps real datasets tend to have some simple patterns that every model can quickly learn, but then have a wide variety of small patterns (call them “facts”) that they could learn in order to slightly boost performance, where different facts generalize differently. If a model with capacity C can learn N such facts, and there are F facts in total, then perhaps you could get F choose N different possible models depending on the random seed. The more possible models with different facts, the higher the variance, and so variance might then be maximized when C is such that N = F / 2, and decrease on either side of that point, giving the unimodal pattern.
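The counting step in this speculation is easy to check numerically. The sketch below just tabulates F choose N for an arbitrary, made-up F = 20; it says nothing about whether real variance behaves this way.

```python
import math

F = 20  # hypothetical total number of learnable "facts"

# Number of distinct fact subsets a model could end up with if it learns exactly n facts.
counts = [math.comb(F, n) for n in range(F + 1)]

# The subset count rises up to n = F / 2 and falls afterwards.
best_n = max(range(F + 1), key=counts.__getitem__)

for n in (0, 5, 10, 15, 20):
    print(n, counts[n])
```

The count is symmetric around F / 2 and strictly increasing before it, which is the unimodal shape the argument relies on.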

(To be clear, this explanation is too high-level -- I don’t really like having to depend on the idea of “facts” when talking about machine learning -- the hope is that with speculation like this we might figure out a more mechanistic explanation.)

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

Robust Imitation Learning from Noisy Demonstrations (Voot Tangkaratt et al) (summarized by Zach): One weakness of vanilla imitation learning is that it struggles to handle demonstrations from sub-optimal experts. Let’s consider a simplified setting where the sub-optimal experts are modeled as optimal policies with injected Gaussian noise. Ideally, the agent would learn to separate the noise from the true optimal policy.

This paper proposes an algorithm that is able to do this separation. The authors assume that the sub-optimal demonstrations and the learned agent policy can both be decomposed into a mixture of expert-policy and noise distributions. The main insight is that we can then learn a single classifier to distinguish noisy data from expert data; this classifier can then be used to define a reward function for an RL agent.

One issue is that since there is no ground truth for what is expert vs. noise, the classifier has to be trained on its own predictions, which can lead to overconfidence via positive feedback loops. To stabilize training, the authors train two models concurrently (co-training); each model is used to create training data for the other model. The authors call this approach RIL-Co. The experimental results show their algorithm RIL-Co is able to perform better than GAIL and other algorithms in the noisy regime.

Prerequisites: GAIL (AN #17)

Read more: DART

Zach's opinion: The derivation of the method is interesting in its own right since it also offers an alternative derivation of GAIL. However, the main assumption of this paper is that the agent policy can be decomposed into an expert and non-expert mixture. This seems like a strong assumption and makes me skeptical of the theoretical analysis. Nevertheless, the experimental results do indicate that RIL-Co is able to outperform other naive approaches. I'm also surprised this paper has no reference to DART, which directly analyzes imitation learning in the noise-injection setting. Specifically, given that these experts are deterministic and we have a significant number of demonstrations, I'd expect that a reasonable amount of noise should be directly removable with regular behavioral cloning. Yet, there is no comparison with behavioral cloning.
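The intuition about behavioral cloning can be illustrated with a toy example: if a deterministic expert's actions are corrupted by zero-mean Gaussian noise, then regressing actions on states averages the noise away as the number of demonstrations grows. The linear expert, the dimensions, and the noise level below are all assumptions chosen for the sketch; this is not RIL-Co, DART, or an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic expert: action = true_gain @ state.
true_gain = np.array([[1.0, -0.5, 0.2],
                      [0.3, 0.8, -1.0]])

# Demonstrations: expert actions with injected zero-mean Gaussian noise.
states = rng.normal(size=(5000, 3))
clean_actions = states @ true_gain.T
noisy_actions = clean_actions + 0.5 * rng.normal(size=clean_actions.shape)

# Behavioral cloning with a linear policy reduces to least-squares regression.
solution, *_ = np.linalg.lstsq(states, noisy_actions, rcond=None)
learned_gain = solution.T

max_error = np.abs(learned_gain - true_gain).max()
print(f"largest error in the recovered policy parameters: {max_error:.4f}")
```

This only covers the easy case where the noise is zero-mean and independent of the state; whether the same holds in the paper's benchmarks is exactly what a behavioral cloning baseline would test.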

ROBUSTNESS

Building Trust through Testing (Michèle A. Flournoy et al) (summarized by Nicholas): In order to deploy AI systems, especially in the military, we will need to have good reason to trust them. However, under the current machine learning paradigm, there are several challenges to building such trust. These challenges are broken down into technological and bureaucratic challenges. Technological challenges include lack of robustness, lack of representative test sets, lack of interpretability, and complexity once ML is integrated into larger systems. Bureaucratic challenges include a lack of coordination, difficulty recruiting talent, and poor coordination between DoD, the private sector, and academia.

To address these challenges, the authors suggest that the DoD update its testing, evaluation, verification, and validation (TEVV) process to handle AI systems. They make 11 recommendations. A few that I found particularly interesting are:

- Create an OSD coordinating body to lead on AI/ML TEVV and incentivize strong cooperation with the services.

- Develop industry / U.S. government TEVV standards and promote them internationally.

- Test, train, and certify human-machine teams through wargaming, simulation, and experimentation.

- Increase resources for and attention on adversarial testing and red-teaming.

Nicholas' opinion: I think it is plausible that the final stages of AGI deployment are carried out by the US military, similar to how the Manhattan Project took on the final stages of nuclear bomb development. If this happens, it will be important for the DoD to recognize and address issues of AI safety very carefully. If the TEVV process does get updated in a meaningful way soon, this could be a great opportunity for the AI safety community to try to translate some of its research into practice and to build credibility and trust with the DoD.

Separately from long-term AGI deployment, I think that the DoD could be a good partner for AI safety research, particularly regarding areas like adversarial training and robustness. My main concern is that confidentiality might lead to safety research being hidden and thus less useful than if it was carried out in the open. I don’t know enough about the details of working with the DoD to know how likely that is to be an issue, however.

FORECASTING

Automating reasoning about the future at Ought (Ought) (summarized by Rohin): Roughly speaking, we can think of an axis of reasoning, spanning from high-confidence statistical reasoning with lots of data to general quantitative reasoning to very qualitative reasoning. Ought is now building Elicit to help with judgmental forecasting, which is near, but not at, the qualitative end of the spectrum. In judgmental forecasting, we might take a complicated question such as “Will Roe v. Wade be overturned if Trump nominates a new justice”, decompose it into subquestions, estimate probabilities for those subquestions, and then combine them to get to a final forecast. Crucially, this requires relying on people’s judgment: we cannot just look at the historical rate at which landmark Supreme Court decisions are overturned, since the situation has rarely arisen before.
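As a rough illustration of the "decompose, estimate, combine" step, one simple way to combine subquestion estimates is the law of total probability over mutually exclusive scenarios. The function name and all the numbers below are hypothetical placeholders, not Elicit's method and not real forecasts.

```python
def combine(scenarios):
    """scenarios: list of (P(scenario), P(outcome | scenario)) pairs.
    The scenarios are assumed to be mutually exclusive and exhaustive."""
    total_weight = sum(p_scenario for p_scenario, _ in scenarios)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    # Law of total probability: weight each conditional estimate by its scenario.
    return sum(p_scenario * p_outcome for p_scenario, p_outcome in scenarios)

# Made-up judgment calls for two scenarios of some subquestion.
forecast = combine([
    (0.6, 0.5),  # scenario A happens with 60%; outcome has 50% chance given A
    (0.4, 0.1),  # scenario B happens with 40%; outcome has 10% chance given B
])
print(f"combined forecast: {forecast:.2f}")
```

The arithmetic is the easy part; the judgment lies in choosing the scenarios and the individual probabilities.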

Currently, Elicit has several features that help with the quantitative aspects of judgmental forecasting, for example by enabling users to input complex distributions and visualizing these distributions. In the long term, however, the hope is to also help with the more qualitative aspects of judgmental forecasting, for example by proposing an important subquestion for answering the current question under consideration, or recommending a source of data that can answer the question at hand.

Ought is now working on adding these sorts of features by using large language models (GPT-3 in particular). They are currently looking for beta testers for these features!

Rohin's opinion: This seems like a pretty interesting direction, though I’m not totally clear on how it relates to AI alignment (assuming it is supposed to). It does seem quite related to iterated amplification, which also relies on this sort of decomposition of questions.

I found the video mockups and demos to be particularly interesting, both in the original post and in the call for beta testers; I think they were much better at showcasing the potential value than the abstract description, and I recommend you all watch them.

MISCELLANEOUS (ALIGNMENT)

When AI Systems Fail: Introducing the AI Incident Database (Sean McGregor) (summarized by Rohin): One obvious way to improve safety is to learn from past mistakes, and not repeat them. This suggests that it would be particularly valuable to have a repository of past incidents that have occurred, so that people can learn from them; indeed both aviation and cybersecurity have their own incident databases. The AI Incident Database aims to fill this gap within AI. The database currently has over a thousand incidents covering a wide range of potential issues, including self-driving car accidents, wrongful arrests due to bad facial recognition or machine translation, and an algorithm-driven “flash crash”.

Read more: AI Incident Database

Paper: Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database

NEWS

MIT Postdoc Role (summarized by Rohin): Neil Thompson, who works on forecasting progress in AI (see for example The Computational Limits of Deep Learning), is looking for a postdoc in economics and computer science to (1) understand the key innovation trends in computing and artificial intelligence, and (2) analyze the economic and policy implications of these trends. The application deadline is Jan 3.

S-Risk Intro Seminar (Stefan Torges) (summarized by Rohin): The first intro seminar to s-risks will take place on the weekend of February 20 & 21, 2021. It is targeted at people who are at least seriously thinking about addressing s-risks as part of their career, and who have not yet spent a lot of time interacting with the Center on Long-Term Risk.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

adamShimi:

As always, thanks to everyone involved for this newsletter! I especially like when the opinions also recontextualize the papers, like Rohin's on the first paper here.

“My main concern is that confidentiality might lead to safety research being hidden and thus less useful than if it was carried out in the open. I don’t know enough about the details of working with the DoD to know how likely that is to be an issue, however.”

We can also see that as positive, in that it limits information hazards (which might be more prevalent closer to AGI).