Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter **resources here**. In particular, you can look through **this spreadsheet** of all summaries that have ever been in the newsletter.

Audio version **here** (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

# HIGHLIGHTS

**Distributional Generalization: A New Kind of Generalization** *(Preetum Nakkiran and Yamini Bansal)* (summarized by Rohin): Suppose you train a classifier to distinguish between CIFAR-10 classes, except each airplane has a 30% chance of being mislabeled as a car. If you then train a model to achieve perfect accuracy on this badly labeled dataset, it will get 100% accuracy on the training set, and 97% of those labels will actually be correct (since 3% are mislabeled airplanes). Under the current paradigm, if we say that the model “generalizes”, that means that it will also get 97% accuracy at test time (according to the actually correct labels). However, this doesn’t tell us anything about what mistakes are made at test time -- is it still the case that 30% of airplanes are mislabeled as cars, or does the model also make mistakes on e.g. deer?

*Distributional generalization* aims to make claims about situations like these. The core idea is to make claims about the full distribution of classifier outputs, rather than just the single metric of test accuracy.

Formally, we assume there is some distribution D, from which we can sample pairs of points (x, y), which generates both our train and test sets. Then, the train (resp. test) distribution of classifier outputs is (x, f(x)), with x coming from the train (resp. test) set. The train and test distributions of classifier outputs are the objects of study in distributional generalization. In particular, given a [0,1]-valued function on distributions (called a *test* T), we say that the classifier *generalizes w.r.t T* if T outputs similar values on the train and test distribution. (W.r.t means “with respect to”.) For example, given a distribution, the *accuracy test* checks how often the classifier’s output is correct in expectation over that distribution. Generalization w.r.t the accuracy test is equivalent to the canonical notion of generalization.

Let’s suppose that the classifier perfectly fits the training set, so that the train distribution of classifier outputs is the same as the original distribution D. Let’s additionally suppose that the classifier generalizes with respect to the accuracy test, so that the classifier has perfect test accuracy. Then, the test distribution of classifier outputs will also be the same as the original distribution D, that is, all distributions are identical and there isn’t much more to say. So, the interesting situations are when one of these two assumptions is false, that is, when either:

1. The classifier does not perfectly fit the training set, or

2. The classifier does not generalize w.r.t accuracy.

This paper primarily focuses on classifiers that *do* perfectly fit the training set, but don’t generalize w.r.t accuracy. One typical way to get this setting is to inject label noise (as in the mislabeled airplanes case), since this prevents the classifier from getting 100% test accuracy.

Speaking of which, let’s return to our original example in which we add label noise by mislabeling 30% of airplanes as cars. Notice that, since the label noise is completely divorced from the classifier’s input x, the best way for the classifier to minimize *test* loss would be to always predict the true CIFAR-10 label, and then 3% of the time the true distribution will say “lol, jk, this airplane is actually a car”. However, in practice, classifiers will also label approximately 30% of airplanes as cars in the test set as well! This incurs higher loss, because the 30% of airplanes that the classifier labels as cars must be independent of the 30% of airplanes that the true distribution labels as cars, which implies that the model disagrees with the true distribution 4.2% of the time; this is worse than the 3% it would get if it consistently labeled airplanes as airplanes. **Classifiers trained to interpolation are not Bayes-optimal in the presence of label noise.**

Okay, let’s get back to distributional generalization. We already know the classifier does not generalize w.r.t accuracy. However, the fact that it still labels about 30% of airplanes as cars suggests a different kind of generalization. Recall that the train and test distributions of classifier outputs have the form (x, f(x)). Consider the feature L(x) that says whether x is an airplane or not. Then, if we replace (x, f(x)) with (L(x), f(x)), then this now looks identical between the train and test distributions! Specifically, this distribution places 7% on (“yes airplane”, “airplane”), 3% on (“yes airplane”, “car”), and 10% on (“no airplane”, c) for every class c other than “airplane”. An alternative way of stating this is that the classifier generalizes w.r.t all tests whose dependence on x factors through the feature L. (In other words, the test can only depend on whether x is an airplane or not, and cannot depend on any other information about x.)

The authors make a more general version of this claim they call *feature calibration*: for every feature L that *could* be learned by the classifier, the classifier generalizes w.r.t all tests whose dependence on x factors through L. Note that they do not assume that the classifier *actually* learns L: just that, if you hypothetically trained the classifier on a dataset of (x, L(x)), then it could learn that function near-perfectly.

They then provide evidence for this through a variety of experiments and one theorem:

- If you plug in the constant feature L(x) = 0 into the conjecture, it implies that classifiers should get the right class balance (i.e. if your distribution contains class 1 twice as often as class 0, then you predict class 1 twice as often as class 0 at test time). They demonstrate this on a rebalanced version of CIFAR-10, even for classifiers that generalize poorly w.r.t accuracy.

- When using a WideResNet (for which the true CIFAR-10 labels are learnable), if you add a bunch of structured label noise into CIFAR-10, the test predictions reflect that same structure.

- The same thing is true for decision trees applied to a molecular biology dataset.

- A ResNet-50 trained to predict attractiveness on the CelebA dataset (which does not generalize w.r.t accuracy) does satisfy feature calibration w.r.t “wearing lipstick”, “heavy makeup”, “blond hair”, “male”, and “eye-glasses”. Note there is no label noise in this case.

- AlexNet predicts that the right fraction of dogs are Terriers, even though it mistakes which exact dogs are Terriers.

- The nearest-neighbor classifier provably satisfies feature calibration under relatively mild regularity conditions.

In an appendix, they provide preliminary experiments suggesting this holds *pointwise*. In our mislabeled airplane example, for a *specific* airplane x from the test set, if you resample a training set (with the 30% mislabeling of airplanes) and retrain a classifier on that set, then there is a roughly 30% chance that that specific x will be misclassified as a car.

The authors then introduce another distributional generalization property: *agreement*. Suppose we have two classifiers f and g trained on independently sampled training sets. The agreement conjecture states that the test accuracy of f is equal to the expected probability that f agrees with g on the test distribution (loosely speaking, this is how often f and g make the same prediction for test inputs). The agreement property can also be framed as an instance of distributional generalization, though I won’t go into the specific test here. The authors perform similar experiments as with feature calibration to demonstrate that the agreement property does seem to hold across a variety of possible classifiers.

Interestingly, these properties are *not* closed under ensembling. In our mislabeled airplane example, every model will label 30% of airplanes as cars, but *which* airplanes are mislabeled is independent across models. As a result, the plurality voting used in ensembles reduces the misclassification rate to 22%, which means that you no longer satisfy feature calibration. Consistent with this, the authors observe that neural network ensembles, random forests, and k-nearest neighbors all did not satisfy feature calibration, and tended to be closer to the Bayes-optimal solution (i.e. getting closer to being robust to label noise, in our example).

Summary of the summary: Let’s look at the specific ways in which classifiers make mistakes on the test distribution. This is called distributional generalization. The paper makes two conjectures within this frame. *Feature calibration* says that for any feature that a classifier could have learned, the distribution of its predictions, conditioned on that feature, will be the same at train and test time, including any mistakes it makes. *Agreement* says that the test accuracy of a classifier is equal to the probability that, on some randomly chosen test example, the classifier’s prediction matches that of another classifier trained on a freshly generated training set. Interestingly, while these properties hold for a variety of ML models, they do not hold for ensembles, because of the plurality voting mechanism.

**Read more:** Section 1.3 of **this version of the paper**

# TECHNICAL AI ALIGNMENT

## AGENT FOUNDATIONS

**The Many Faces of Infra-Beliefs** *(Diffractor)* (summarized by Rohin): When modeling an agent that acts in a world **that contains it** (**AN #31**), there are different ways that we could represent what a “hypothesis about the world” should look like. (We’ll use **infra-Bayesianism** (**AN #143**) to allow us to have hypotheses over environments that are “bigger” than the agent, in the sense of containing the agent.) In particular, hypotheses can vary along two axes:

1. **First-person vs. third-person:** In a first-person perspective, the agent is central. In a third-person perspective, we take a “birds-eye” view of the world, of which the agent is just one part.

2. **Static vs. dynamic:** In a dynamic perspective, the notion of time is explicitly present in the formalism. In a static perspective, we instead have beliefs directly about entire world-histories.

To get a tiny bit more concrete, let the world have states S and the agent have actions A and observations O. The agent can implement policies Π. I will use ΔX to denote a belief over X (this is a bit handwavy, but gets the right intuition, I think). Then the four views are:

1. First-person static: A hypothesis specifies how policies map to beliefs over observation-action sequences, that is, Π → Δ(O × A)*.

2. First-person dynamic: This is the typical POMDP framework, in which a hypothesis is a belief over initial states and transition dynamics, that is, ΔS and S × A → Δ(O × S).

3. Third-person static: A hypothesis specifies a belief over world histories, that is, Δ(S*).

4. Third-person dynamic: A hypothesis specifies a belief over initial states, and over the transition dynamics, that is, we have ΔS and S → ΔS. Notice that despite having “transitions”, actions do not play a role here.

Given a single “reality”, it is possible to move between these different views on reality, though in some cases this requires making assumptions on the starting view. For example, under regular Bayesianism, you can only move from third-person static to third-person dynamic if your belief over world histories Δ(S*) satisfies the Markov condition (future states are conditionally independent of past states given the present state); if you want to make this move even when the Markov condition isn’t satisfied, you have to expand your belief over initial states to be a belief over “initial” world histories.

You can then define various flavors of (a)causal influence by saying which types of states S you allow:

1. If a state s consists of a policy π and a world history (oa)* that is consistent with π, then the environment transitions can depend on your choice of π, leading to acausal influence. This is the sort of thing that would be needed to formalize Newcomb’s problem.

2. In contrast, if a state s consists only of an environment E that responds to actions but *doesn’t* get to see the full policy, then the environment cannot depend on your policy, and there is only causal influence. You’re implicitly claiming that Newcomb’s problem cannot happen.

3. Finally, rather than have an environment E that (when combined with a policy π) generates a world history (oa)*, you could have the state s directly be the world history (oa)*, *without* including the policy π. In normal Bayesianism, using (oa)* as states would be equivalent to using environments E as states (since we could construct a belief over E that implies the given belief over (oa)*), but in the case of infra-Bayesianism it is not. (Roughly speaking, the differences occur when you use a “belief” that isn’t just a claim about reality, but also a claim about which parts of reality you “care about”.) This ends up allowing some but not all flavors of acausal influence, and so the authors call this setup “pseudocausal”.

In all three versions, you can define translations between the four different views, such that following any path of translations will always give you the same final output (that is, translating from A to B to C has the same result as A to D to C). This property can be used to *define* “acausal”, “causal”, and “pseudocausal” as applied to belief functions in infra-Bayesianism. (I’m not going to talk about what a belief function is; see the post for details.)

## FORECASTING

**Three reasons to expect long AI timelines** *(Matthew Barnett)* (summarized by Rohin): This post outlines and argues for three reasons to expect long AI timelines that the author expects are not taken into account in current forecasting efforts:

1. **Technological deployment lag:** Most technologies take decades between when they're first developed and when they become widely impactful.

2. **Overestimating the generality of AI technology:** Just as people in the 1950s and 1960s overestimated the impact of solving chess, it seems likely that current people are overestimating the impact of recent progress, and how far it can scale in the future.

3. **Regulation will slow things down,** as with **nuclear energy**, for example.

You might argue that the first and third points don’t matter, since what we care about is when AGI is *developed*, as opposed to when it becomes widely deployed. However, it seems that we continue to have the opportunity to intervene until the technology becomes widely impactful, and that seems to be the relevant quantity for decision-making. You could have some specific argument like “the AI goes FOOM and very quickly achieves all of its goals” that then implies that the development time is the right thing to forecast, but none of these seem all that obvious.

**Rohin's opinion:** I broadly agree that (1) and (3) don’t seem to be discussed much during forecasting, despite being quite important. (Though see e.g. **value of the long tail**.) I disagree with (2): while it is obviously possible that people are overestimating recent progress, or are overconfident about how useful scaling will be, there has at least been a lot of thought put into that particular question -- it seems like one of the central questions tackled by **bio anchors** (**AN #121**). See more discussion in this **comment thread**.

## FIELD BUILDING

**FAQ: Advice for AI Alignment Researchers** *(Rohin Shah)* (summarized by Rohin): I've written an FAQ answering a broad range of AI alignment questions that people entering the field tend to ask me. Since it's a meta post, i.e. about how to do alignment research rather than about alignment itself, I'm not going to summarize it here.

## MISCELLANEOUS (ALIGNMENT)

**Testing The Natural Abstraction Hypothesis: Project Intro** *(John Wentworth)* (summarized by Rohin): We’ve previously seen some discussion about **abstraction** (**AN #105**), and some **claims** that there are “natural” abstractions, or that AI systems will **tend** (**AN #72**) to **learn** (**AN #80**) increasingly human-like abstractions (at least up to a point). To make this more crisp, given a system, let’s consider the information (abstraction) of the system that is relevant for predicting parts of the world that are “far away”. Then, the **natural abstraction hypothesis** states that:

1. This information is much lower-dimensional than the system itself.

2. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans.

3. These abstractions are “natural”, that is, a wide variety of cognitive architectures will learn to use approximately the same concepts to reason about the world.

For example, to predict the effect of a gas in a larger system, you typically just need to know its temperature, pressure, and volume, rather than the exact positions and velocities of each molecule of the gas. The natural abstraction hypothesis predicts that many cognitive architectures would all converge to using these concepts to reason about gases.

If the natural abstraction hypothesis were true, it could make AI alignment dramatically simpler, as our AI systems would learn to use approximately the same concepts as us, which can help us both to “aim” our AI systems at the right goal, and to peer into our AI systems to figure out what exactly they are doing. So, this new project aims to test whether the natural abstraction hypothesis is true.

The first two claims will likely be tested empirically. We can build low-level simulations of interesting systems, and then compute what summary is useful for predicting its effects on “far away” things. We can then ask how low-dimensional that summary is (to test (1)), and whether it corresponds to human concepts (to test (2)).

A **followup post** illustrates this in the case of a linear-Gaussian Bayesian network with randomly chosen graph structure. In this case, we take two regions of 110 nodes that are far apart from each other, and operationalize the relevant information between the two as the covariance matrix between the two regions. It turns out that this covariance matrix has about 3-10 “dimensions” (depending on exactly how you count), supporting claim (1). (And in fact, if you now compare to another neighborhood, two of the three “dimensions” remain the same!) Unfortunately, this doesn’t give much evidence about (2) since humans don’t have good concepts for parts of linear-Gaussian Bayesian networks with randomly chosen graph structure.

While (3) can also be tested empirically through simulation, we would hope that we can also prove theorems that state that nearly all cognitive architectures from some class of models would learn the same concepts in some appropriate types of environments.

To quote the author, “the holy grail of the project would be a system which provably learns all learnable abstractions in a fairly general class of environments, and represents those abstractions in a legible way. In other words: it would be a standardized tool for measuring abstractions. Stick it in some environment, and it finds the abstractions in that environment and presents a standard representation of them.”

**Rohin's opinion:** The notion of “natural abstractions” seems quite important to me. There are at least some weak versions of the hypothesis that seem obviously true: for example, if you ask GPT-3 some new type of question it has never seen before, you can predict pretty confidently that it is still going to respond with real words rather than a string of random characters. This is effectively because you expect that GPT-3 has learned the “natural abstraction” of the words used in English and that it uses this natural abstraction to drive its output (leaving aside the cases where it must produce output in some other language).

The version of the natural abstraction hypothesis investigated here seems a lot stronger and I’m excited to see how the project turns out. I expect the author will post several short updates over time; I probably won’t cover each of these individually and so if you want to follow it in real time I recommend following it on the Alignment Forum.

**FEEDBACK**

I'm always happy to hear feedback; you can send it to me, **Rohin Shah**, by **replying to this email**.

**PODCAST**

An audio podcast version of the **Alignment Newsletter** is available. This podcast is an audio version of the newsletter, recorded by **Robert Miles**.

I asked the authors for feedback on my summary of the distributional generalization paper, and Preetum responded with the following (copied with his permission):

I agree with everything you've said in this summary, so my feedback below is mostly commentary / minor points.

- One intuitive way to think about Feature Calibration is that f(x) is "close to" a sample from p(y|x). Where the quality of the "closeness" is depends on the power of the classifier.

- Re. "classifiers which do not fit their train set": As you say, our paper mostly focuses on Distributional Generalization (DG) for interpolating models. But I am hopeful that DG actually holds much more generally, and we should really be thinking of generalization as saying "test and train behaviors are close *as distributions*".

Though we don't formalize this yet for non-interpolating models, there are some suggestive experiments in Section 7 (eg: the confusion matrix of a model on the test set remains close to its confusion matrix on the train set, throughout the training process. As you start to fit noise on the train set, you see exactly this noise start to appear on the test set. Regularization which prevents fitting noise on the train set, also prevents this noise from appearing at test time).

- For me, one of the most interesting implications of DG/feature-calibration is that it gives a separation between overparameterized and underparameterized regimes (in the scaling limits of large models/data). With enough data, large enough underparameterized models will converge to Bayes-optimal classifiers, whereas overparameterized models will not (assuming DG). That is, interpolation is not always "benign", it can actually hurt.

- You may like the discussion we added on these issues in the short version of our paper: Section 1.3 ("Related Work and Significance") here: https://mltheory.org/dg_short.pdf

(there is no new material in this pdf, outside the Related Work).

- Also, we have a number of supporting experiments for Feature Calibration in the appendix that didn't make it into the body (eg: more tasks for decision trees, and experiments with "bad" image classifiers like MLPs and RBF kernels).

- Sidenote: The "agreement property" has been bugging me for a while since it seems kind of magical. My current view is that "agreement" may be be a special case of a stronger (but less magical) property: The joint distribution (f(x), y) is statistically close to (f(x), f'(x))

on the test set, where f' is an independently-trained classifier.

This can also be seen as an instance of DG, and it implies the agreement property. I sketched this conjecture in this tweet: https://twitter.com/PreetumNakkiran/status/1385741115211530241

(But this is speculative -- not in the paper and hasn't been rigorously tested).

- I included this figure in a talk on DG recently -- point being that DG is a general definition, which includes both classical generalization and our new conjectures as special cases (and could include other yet-undiscovered behaviors).

- As mentioned at the end of our paper, there are *many* open questions remaining (and I would be very happy to see more work in this area).

I think these are two instances of a general heuristic of treating what have traditionally been seen as philosophical positions (e.g. here cognitive and behavioral views and A and B theories of time) instead as representations one can run various kinds of checks on to achieve more sample complexity reduction than using a single representation.

Thanks for fixing the formatting!