Causal confusion as an argument against the scaling hypothesis

RobertKirk; David Scott Krueger

Abstract

We discuss the possibility that causal confusion will be a significant alignment and/or capabilities limitation for current approaches based on "the scaling paradigm": unsupervised offline training of increasingly large neural nets with empirical risk minimization on a large diverse dataset. In particular, this approach may produce a model which uses unreliable (“spurious”) correlations to make predictions, and so fails on “out-of-distribution” data taken from situations where these correlations don’t exist or are reversed. We argue that such failures are particularly likely to be problematic for alignment and/or safety in the case when a system trained to do prediction is then applied in a control or decision-making setting.

We discuss:

We believe this topic is important because many researchers seem to view scaling as a path toward AI systems that 1) are highly competent (e.g. human-level or superhuman), 2) understand human concepts, and 3) reason with human concepts.
We believe the issues we present here are likely to prevent (3), somewhat less likely to prevent (2), and even less likely to prevent (1) (but still likely enough to be worth considering). Note that (1) and (2) have to do with systems’ capabilities, and (3) with their alignment; thus this issue seems likely to be differentially bad from an alignment point of view.

Our goal in writing this document is to clearly elaborate our thoughts, attempt to correct what we believe may be common misunderstandings, and surface disagreements and topics for further discussion and research.

A DALL-E 2 generation for "a green stop sign in a field of red flowers". In 8/10 of the pictures the stop sign is red, not green. — *A DALL-E 2 generation for "a green stop sign in a field of red flowers".*
Current foundation models still fail on examples that seem simple for humans, and causal confusion and spurious correlations may be among the culprits causing such failures. Examples like these show that DALL-E 2 makes systematic deviations from the way humans interpret text, possibly (in this case) because stop signs are almost always red in the training dataset, especially if the word *red* appears in the caption.

Introduction and Framing

GPT-3 and Scaling Laws (among other works) have made the case that scaling will be a key part of future transformative AI systems or AGI. Many people now believe that there’s a possibility of AGI happening in the next 5-10 years from simply scaling up current approaches dramatically (inevitably with a few tweaks, and probably added modalities, but importantly still performing the large bulk of training offline using ERM). If this is the case, then it’s more likely we can do useful empirical work on AI safety and alignment by focusing on these systems, and much current research effort (Anthropic, Redwood, Scalable Alignment@DeepMind, Safety@OpenAI) is focused on aligning systems primarily based on large language models (which are the current bleeding edge of scaled-up systems).

However, we think there is a potential flaw or limitation in this scaling approach. While this flaw perhaps isn’t apparent in current systems, it will likely become more apparent as these systems are deployed in wider and more autonomous settings. To sketch the argument (which will be made more concrete in the rest of this post): Current scaling systems are based on offline Empirical Risk Minimisation (ERM): using SGD to make a simple loss go as low as possible on a static dataset. ERM often leads to learning spurious correlations or causally confused models, which would result in bad performance Out-Of-Distribution (OOD). There are theoretical reasons for believing that this problem won’t be solved by using more data or more diverse data, making this a fundamental limitation of offline training with ERM. Since it may be practically difficult or infeasible to collect massive amounts of data “online” (i.e. from training AI systems in a deployment context), this may be a major limitation of the scaling paradigm. Finally, the OOD situations which would produce this bad performance are very likely to occur at deployment time through the model’s own actions/interventions in the environment, which the model hasn’t seen during training. It’s an open question (which we also discuss in this post) whether fine-tuning can work sufficiently well to fix the issues arising from the ERM-based pretraining approach.

We think resolving to what extent this is a fundamental flaw or limitation in the scaling approach is important for two main reasons. Firstly, from a forecasting perspective, knowing whether the current paradigm of large-scale but simply-trained models will get us to AGI or not is important for predicting when AGI will be developed, and what that resulting AGI will look like. Resolving the questions around fine-tuning are also important here as if fine-tuning can fix the problem but only with large amounts of hard-to-collect data, then this makes it harder to develop AGI. Secondly, from a technical alignment perspective, we want alignment techniques we develop to work on the type of models that will be used for AGI, and so ensuring that our techniques work for large-scale models despite their deficiencies is important if we expect these large-scale models to still be used. These deficiencies will also likely impact what kind of fine-tuning approaches work.

In this post we describe this problem in more detail, motivating it theoretically, as well as discussing the key cruxes for whether this is a real issue or not. We then discuss what this means for current scaling and alignment research, and what promising research directions this perspective informs.

Preliminaries: ERM, Offline vs. Online Learning, Scaling, Causal Confusion, and Out-of-Distribution (OOD) Generalisation

Here we define ERM and causal confusion and give our definition of Scaling.

Empirical Risk Minimization (ERM) is a nearly-universal principle in machine learning. Given some risk (or loss function), ERM dictates that we should try to minimise this risk over the empirical distribution of data we have access to. This is often easy to do - standard minibatch SGD on the loss function over mini-batches drawn iid from the data will produce an approximate ERM solution.

Offline learning, roughly speaking, refers to training an AI system on a fixed data set. In contrast, in online learning data is collected during the learning process, typically in a manner informed or influenced by the learning system. This potential for interactivity makes online learning more powerful, but it can also be dangerous (e.g. because the system’s behaviour may change unexpectedly) and impractical (e.g. since offline data can be much easier/cheaper to collect). Offline learning is currently much more popular in research and in practice.^[1]

Scaling is a less well-defined term, but in this post, we mean something like: Increasing amounts of (static) data, compute and model size can lead to (a strong foundation for) generally competent AI using only simple (ERM-based) losses and offline training algorithms (e.g. minimising cross-entropy loss of next token with SGD in language modelling on a large static corpus of text). This idea is based on work finding smooth scaling laws between model size, training time, dataset size and loss; the scaling hypothesis, which simply states that these laws will continue to hold even in regimes where we haven’t yet tested them; and the fact that large models trained in this simple way produce impressive capabilities now, and hence an obvious recipe for increasing capabilities is just to scale up on the axes described in the scaling laws.

Of course, even researchers who are fervent believers in scaling don’t think that a very large model, without any further training, will be generally competent and aligned. The standard next step after pretraining a large-scale model is to perform a much smaller amount of fine-tuning based on either task-specific labelled data (in a supervised learning setting), or some form of learning from human preferences over model outputs (often in an RL setting). This is often combined with prompting the model (specifically in the case of language models, but possibly applicable in other scenarios). Methods of prompting and fine-tuning have improved rapidly in the last year or so, but it’s unclear whether such improvements can solve the underlying problems this paradigm may face.

Causal confusion is a possible property of models, whereby the model is confused as to what parts of the environment cause other parts. For example, suppose that whenever the weather is sunny, I wear shorts, and also buy ice cream. If the model doesn’t observe the weather, but just my clothing choice and purchases, it might believe that wearing shorts caused me to buy ice cream, which would mean it would make incorrect predictions if I wore shorts for reasons other than the weather (e.g. I ran out of trousers, or I was planning to do exercise). This can be particularly likely to happen when not all parts of the environment can be observed by the model (i.e. if the parts of the environment which are the causal factors aren’t observed, like the weather in the previous example). It’s also likely to occur if a model ever observes the environment without acting in it; in the previous example, if the model was just trying to predict whether I would buy ice cream, it would do a pretty good job by looking at my clothing choice (although it would be occasionally incorrect). However, if the model was acting in the environment with the goal of making me buy ice cream, suggesting I wear shorts would be entirely ineffective in getting me to buy ice cream.

If a model is causally confused, this can have several consequences.

Capabilities consequences:
1) The model may not make competent predictions out-of-distribution (capabilities misgeneralisation). We discuss this further in ERM leads to causally confused models that are flawed OOD.

Alignment consequences:
2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (goal misgeneralisation).

3) Another issue is incentive mismanagement; Krueger et al. (HI-ADS) show that causal confusion can lead models to optimise over what Farquhar et al. subsequently define as “delicate” parts of the state that it is not meant to optimise over, yielding higher rewards via undesirable means.

Further, if during training fine-tuning a model suddenly becomes deconfused, it’s likely to exhibit a sudden leap in competence and generality, as it can now perform in a much wider range of situations. This is relevant from a forecasting/timelines perspective: if current language models (for example) are partially causally confused and this limitation is addressed (e.g. via online fine-tuning or some other fix), this could lead to a sudden increase in language model capabilities. On the other hand, it could be that fine-tuning is unlikely to solve issues of causal confusion.

Out-of-distribution (OOD) generalisation is a model’s ability to perform well on data that is not drawn from the training distribution. Historically most work in machine learning has focused on IID generalisation, generalising to new examples from the same training distribution. There has been recent interest in tackling OOD generalisation challenges, although the field has struggled to settle on a satisfactory definition of the problem in formal terms, and has had issues ensuring that results are robust. The issue of OOD generalisation is very related to causal confusion, as causal confusion is one possible reason why models fail to generalise OOD (to situations where the causal model is no longer correct), and we can often only demonstrate causal confusion in OOD settings (as otherwise, the spurious correlations the model learned during training will continue to hold).

Stating the Case

The argument for why scaling may be flawed comes in two parts. The first is a more theoretical (and more mathematically rigorous) argument that ERM is flawed in certain OOD settings, even with large amounts of diverse data, as it leads to causally confused models. The second part builds on this point, arguing that it applies to current approaches to scaling (due to models trained with offline prediction being used for online interaction and control, leading to OOD settings).

ERM leads to causally confused models that are flawed OOD

Out-of-distribution (OOD) generalisation is a model’s ability to perform well on data that is not drawn from the training distribution. Historically most work in machine learning has focused on IID generalisation, generalising to new examples from the same training distribution. Different distributions are sometimes called different domains or environments because these differences are assumed to result from the data being collected under different conditions. It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:

The reason data diversity isn’t enough comes down to concept shift (change in ). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream ( $Y$ ) and shorts ( $X$ ), and sun ( $Z$ ) example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise. Putting numbers on this, let’s say $P (Y = 1 | X = 1, Z = 1) = 90 %, P (Y = 1 | X = 1, Z = 0) = 25 %$ . Since the model doesn’t observe $Z$ , there is not a single setting of $P (Y = 1 | X = 1)$ that will work reliably across different environments with different climates (different $P (Z)$ ). Instead

$P (Y = 1 | X = 1) = P (Y = 1 | X = 1, Z = 1) P (Z = 1) + P (Y = 1 | X = 1, Z = 0) P (Z = 0)$

depends on $P (Z)$ , which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location so that $P (Z = 1)$ is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data - including data from other locations changes the overall $P (Z = 1)$ of the training distribution. This means that without domain/environment labels (which would allow you to have different $P (Y = 1 | X = 1)$ for different environments, even if you can’t observe $Z$ ), ERM can never learn a non-causally confused model.

Note that there is, however, still a correct causal model for how wearing shorts affects your desire for ice cream: the effect is probably weak and plausibly even negative (since you might want ice cream less if you are already cooler from wearing shorts). While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts). There are some approaches for learning causally correct models in machine learning, but this is considered a significant unsolved problem and is a focus of research for luminaries such as Yoshua Bengio, who views this as a key limitation of current deep learning approaches, and a necessary step towards AGI.

To summarise, this argument gives us three points:

(a) More data isn’t useful if it’s from the same or similar distributions over domains - we need to distributionally match the deployment domain(s) (match $P (Z)$ ), or have domain labels (make $Z$ observed).

(b) This happens when there are unobserved confounding variables, or more generally partial observability, meaning that we can’t achieve 0 training loss (which could imply a perfect causal model). We assume that this will be the case in the style of large scale offline pretraining used in foundation models.

(c) The above two points combine to imply that ERM-trained models will fail in OOD settings due to being causally confused, in particular under concept shift ( $P (Y | X)$ changes).

Scaling is hence flawed OOD

On top of the argument above, we need several additional claims and points to argue that current approaches to scaling could be unsafe and/or incompetent:

Points

(i) Current scaling approaches use simple ERM losses, on a large diverse data set.

(ii) While scaling produces models trained on static data, these models will be used in interactive and control settings.

Claims

(Note here that (a, ii =>) means points (a) and (ii) from above imply this point, and similarly for the rest of the list).

(a, ii =>) 0 training loss on the training data (and hence possibly a perfect causal model) isn’t possible with the loss functions used, and there are spurious correlations and the potential for causal confusion in this data.
(1 & i, ii =>) scaling will produce models that capture and utilise these spurious correlations for lower loss and are causally confused.
(b =>) At deployment time these models will be used in environments not seen during training, and their actions/interventions can easily lead to OOD situations.
1. These shifts may be incentivised at both training and deployment time and may be difficult to spot due to a misspecification problem (see Hidden Incentives for Auto-Induced Distributional Shift).
(2, 3 & iii =>) These models’ representations will be causally confused and misleading in many (OOD) settings during deployment time, leading to the models failing to generalise OOD. This could be a generalisation failure of capabilities or of objective, depending on the type of shift that occurs. 2. For example, objective misgeneralisation could occur if the internal representation of the goal the agent is optimising for is causally confused (e.g. a proxy of the true goal), and so comes apart from the true goal under distributional shift.

Possible objections to the argument

Here we deal with possible disagreements with this argument as it stands. We cover the implications of the argument and its relevance below. If we imagine the argument above is valid, then any disagreement would come with disagreeing with some of the premises. Some likely places people will disagree:

While in theory, ERM will result in a model utilising spurious correlations to get lower training loss, in practice this won’t be a big issue. This position probably stems from an intuition that the causally correct model is the best model, and so if we expect large-scale models to get almost 0 training error, then it’s likely these large-scale models will have found this causally correct model. This effectively disagrees with claims 1 and 2 (that there are spurious correlations in static training data that ERM-SGD will exploit for lower loss).
1. There’s possibly a less well-argued position that Deep Learning is kind of magical, and hence these issues will probably just disappear with more data. For instance, you might expect something like this to happen for the same sorts of reasons you might expect a model trained offline with SGD+ERM to spawn an inner optimizer that behaves as if it has goals with respect to the outside world: at some level of complexity/intelligence, learning a good causal model of the world might be the best way of quickly/easily/simply explaining the data. We do not dismiss such views but note that they are speculative.
Disagree with claim 3: Some people might think that we’ll have a wide enough data distribution such that the model won’t encounter OOD situations at deployment time. To us, this seems unlikely, especially in the limit if we are to use the AIs to do tasks that we ourselves can’t perform.
We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD
1. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.

As with some other alignment problems, it could be argued that this is more of a capabilities issue than an alignment issue, and hence that mainstream ML is likely to solve this (if it can be solved). We think It’s still important to discuss whether (and how) we think the issue can be solved: Suppose we accept that mainstream ML will solve this issue. To argue we can still do useful empirical alignment research on large language models, we’d need that solution to not change these models so much that the alignment research won’t generalise. Furthermore, many powerful capabilities might be accessible without proper causal understanding, and mainstream ML research might focus on developing those capabilities instead.

How might we fix this?

If the issue described above is real, then it’s likely it will need to be solved if we are to build aligned AGI (or perhaps AGI at all). Here we describe several possible solutions, ranging from obvious but probably not good enough to more speculative:

We could just find the right data distribution. This seems intractable for current practice, especially when considering tasks that humans haven’t demonstrated frequently or ever in the pretraining data.
We could use something like invariant prediction or a domain generalisation approach, which are methods from supervised learning aimed at tackling OOD generalisation. However, this requires knowledge of the “domains” in the training data, which might be hard to come by. Further, it’s also unclear whether these methods really work in practice; and even in principle, these methods are likely insufficient for addressing causal confusion in general, which is a harder problem.
We could use online training - continually updating our model with new data collected as it interacts with its deployment environment to compensate for the distribution shift. For a system with general capabilities, this would likely mean training in open-ended real-world environments. This seems dangerous - if a shift happens quickly (i.e. due to the agent’s own actions), which then causes a catastrophic outcome, we won’t have time to update our model with new data. Further, this approach might require much greater sample efficiency than currently available, because it would be bottlenecked by the speed of the model’s deployment environment.
We could do online fine-tuning, for example, RL from human preferences. This approach has received attention recently, but it’s currently unknown to what extent it can address the causal confusion problem. For this to work, fine-tuning will likely have to override and correct spurious correlations in the pretrained model’s representations. This seems like the biggest open question in this argument: is it possible for fine-tuning to fix a pretrained model’s representations, and if so, then how? Is it possible with small enough data requirements that human feedback is feasible?
We could extract causal understanding from the model via natural language capabilities. That is, perhaps the pretrained model has “read” about causality in its pretraining data, and if correctly prompted can generate causally correct completions or data. This could then be injected back into the model (e.g. via fine-tuning) or used in some hybrid system that combines the pretrained model with a causal inference component, so as to not rely on the model itself to do correct causal reasoning when unprompted in its internal computations (as an analogy example, see this paper for eliciting reasoning through prompting). This approach seems potentially viable but is so far very speculative, and more research is needed.

Key Cruxes

We can extract several key cruxes from the arguments for and against our position here, and for whether this issue merits concern. These partly pertain to specific future training scenarios, and hence can’t be resolved entirely now (e.g. 1). They’re also determined by (currently) unknown facts about how large-scale pretraining and fine-tuning work, which we hope will be resolved through future work on the following questions:

Are there spurious correlations in the training data?
Will spurious correlation be picked up during ERM large-scale pretraining?
Will the deployment of these models in an interaction and control setting lead to changes too sudden to be handled by continual learning on new data/to what extent would continual learning on new data work?
Can fine-tuning correct or override spurious correlations in pretrained representations?

To us, it seems like (4) is the most important (and most uncertain) crux, as we currently feel like (1), (2) and (3) will probably hold true to an extent that makes fine-tuning essential to achieving sufficient competence and alignment when deploying models trained offline with ERM in decision-making contexts.
Of course, this is all a matter of degrees:

Pretraining will likely pick up some spurious correlations, and some of them may be removed by fine-tuning, and deployed models will change the world to some extent.
The key issue is whether the issues arising from the changes in the world from deployed models combined with the spurious correlations from pretraining can be counteracted by fine-tuning sufficiently well to avoid OOD generalisation failure, especially in a way that leads to objective robustness and alignment.

Implications

There are two strands of relevant implications if this flaw turns out to exist in practice. One strand concerns the implications for capabilities of AI systems and the other concerns alignment research.

In terms of capabilities (and forecasting their increase), this flaw would be relevant as follows: If scaling up doesn’t work, then standard estimates for various AI capabilities which are based on scaling up no longer apply - in effect, everything is more uncertain. This also suggests that current large language models won’t be as useful or pervasive - if they start behaving badly in OOD situations, and we can’t find methods to fix this, then they’ll likely be used less.

In terms of alignment research, it mostly just makes everything less certain. If we’re less certain that scaling up will get to AGI, then we’re less certain in prosaic alignment methods generally. If it’s the case that these issues apply differently to different capabilities or properties (i.e. different properties generalise better or worse OOD), then understanding whether properties related to the representation of the model’s goal and human values generalise correctly is important.

Foundation models might be very powerful and transformative and widely deployed even if they do suffer such flaws. Adversarial examples indicate flaws in models’ representations, and remain an outstanding problem but do not seem likely to prevent deployment. A lack of recognition of this problem might lead to undue optimism about alignment methods that rely on the generalisation abilities of deep learning systems, especially given their impressive in-distribution generalisation abilities.

What’s next?

We think this line of reasoning implies two main directions for further research. First, it’s important to clarify whether this argument actually holds in practice: are large-scale pretrained models causally confused, in what way, and what are the consequences of that? Do they become less confused with scale? This is both empirical work, and also theoretical work investigating to what extent the mathematical arguments about ERM apply to current approaches to large-scale pretraining.

Secondly, under the assumption that this issue is real, we need to build solutions to fix it. In particular, we’re excited about work investigating to what extent fine-tuning addresses these issues, and work building methods designed to fix causal confusion in pretrained models, possibly through fine-tuning or causal knowledge extraction.

Appendix: Related Work

Causality, Transformative AI and alignment - part I discusses how causality (i.e. learning causal models of the world) is likely to be an important part of transformative AI (TAI), and discuss relevant considerations from an alignment perspective. Causality is very related to the issues of ERM, as ERM doesn’t necessarily produce a causal model if it can utilise spurious correlations for lower loss. That work mostly makes the argument that causality is under-appreciated in alignment research, but doesn’t make the specific arguments in this post. Some of their suggested research directions do overlap with ours.

Shaking the foundations: delusions in sequence models for interaction and control and Behavior Cloning is Miscalibrated both point out the issue of causally confused models arising from static pretraining when these models are then deployed in an interactive setting. Focusing on the first work, delusions in large language models (LLMs): an incorrect generation early on throws the LLM off the rails later. The LLM assumes its (incorrect) generation was from the expert/human generating the text, as that's the setting it was trained in, and hence deludes itself. In these settings the human generating the text has access to a lot more information than the model, making generation harder for the model, and delusions more likely: an incorrect generation will make it more likely that the model infers the task or context incorrectly.

The work explains this problem using tools from causality and argues that these models should act as if their previous actions are causal interventions rather than observations. However, training a model in this way requires access to a model of the environment and the expert demonstrating trajectories in an online way, and the authors don't describe a way to do this with purely offline data (it may be fundamentally impossible). The phenomenon of delusions is a perfect example of causal confusion arising from static offline pretraining. This serves as additional motivation for the argument we make, combined with the theory demonstrating that even training on all the correct data isn’t sufficient when using ERM.

The second work makes similar points as the first, framed in terms of models being miscalibrated when trained with offline data. It goes on to discuss possible solutions to this such as combining offline pretraining with online RL fine-tuning and possibly using a penalty to ensure the model stays close to human behaviour (to avoid Goodharting) apart from in scenarios where it’s fixing calibration errors.

Performative Prediction is the concept (introduced in the paper of the same name) where a classifier trained on a training distribution, when deployed, induces a change in the distribution at deployment time (due to the effects of its predictions). For example (quoting from the paper), A bank might estimate that a loan applicant has an elevated risk of default, and will act on it by assigning a high interest rate. In a self-fulfilling prophecy, the high interest rate further increases the customer's default risk. This issue is an example of the causal confusion arising from not modelling a model’s outputs as actions or interventions, but rather just as predictions, which is related to the danger of learning purely predictive models on static data and then using these models in settings where their predictions are actually actions/interventions.

Causal Confusion in Imitation Learning introduces the term causal confusion (although similar issues have been described in the past) in the context of imitation learning (which language model pretraining can be viewed as a version of). As described above, this is when a model learns an incorrect causal model of the world (i.e. causal misidentification), which then leads it to act in a confused manner.

They suggest an approach for solving the problem (assuming access to the correct causal graph structure, but not the edges): using (offline) imitation learning first learn a policy conditioned on causal graphs and then train this policy on many different possible causal graphs. Then perform targeted interventions through either consulting an expert or executing the policy in the environment, to learn the correct causal graph, and then condition the policy on it. This method relies on the assumption of being able to have the correct causal graph structure, which seems infeasible in the more general case, but may still provide intuition or inspiration for other approaches to tackling this problem.

A Study of Causal Confusion in Preference-Based Reward Learning investigates whether preference-based reward learning can result in causally confused preference models (or at least, preference models that pick up on spurious correlations). Unsurprisingly, in the regime they investigate (limited preference data, learning reward models for robotic continuous control tasks), the preference models don’t produce good behaviour when optimised for by a policy, which the authors take as evidence that they’re causally confused. Given the setting, it doesn’t make sense to extrapolate these results into other settings; it seems more likely that they can be explained by “learning a hard task (preference modelling) on limited amounts of static data without pretraining leads to models that don’t generalise well out of distribution”, rather than any specific statement about preference modelling. Of course, preference modelling is a task where OOD inputs are almost guaranteed to occur: when we’re using the preference model to train a policy, it’s likely (and even desired) that the policy will produce behaviour not seen by the preference model during training (otherwise we could just to imitation on the preference model training data).

^{^}
...although it is a bit of a spectrum and our impression is that it is common to retrain models regularly on data that has been influenced by previous versions of the model in deployment.

In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the world?" These guesses would be improved by finetuning by RL on actual interaction between M and the world.

(It seems that most of what my ability to make OOD predictions or causal inferences is based on passive/offline learning. I know science from books/papers and not from running my own rigorous control experiments or RCTs.)

I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning is magic" bucket of arguments; it's much better articulated than what we said though, I think...
- I am quite skeptical of these arguments, but still find them plausible. I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question 'when can/do foundation models internalize explicitly stated knowledge')

I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling

GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.

when can/do foundation models internalize explicitly stated knowledge

Some human causal reasoning is explicit. Humans can't do complex and exact calculations using System 1 intuition, and neither can we do causal reasoning of any sophistication using System 1. The prior over causal relations (e.g. that without looking at any data 'smoking causes cancer' is way more likely than the reverse) is more about general world-model building, and maybe there's more uncertainty about how well scaling learns that.

RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly". I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)". This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.

This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.

I think you're moving the goal-posts, since before you mentioned "without external calculators". I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to doing this kind of robust generalization. I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.

The important part of his argument is in the second paragraph, and I agree because by and large, pretty much everything we know about science and casuality, at least in the beginning for AI is on trusting the scientific papers and experts. Virtually no knowledge is given by experimentation, but instead by trusting the papers, experts and books.

I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.

(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:

I. (Safety) Relevant examples of Z

To build intuition on whether unobserved location tags leads to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know if we should think of there being many independent, local Z_i, or dataset-wide Z? The former case seems much less concerning, as that seems less likely to lead to the adoption of a problematically mistaken ontology.

Here are a couple examples I came up with: In the NL case, the URL that the text was drawn from. In the code generation case, hardware constraints, such as RAM limits. I don't see why a priori either of these should cause safety problems rather than merely capabilities problems. Would be curious to hear arguments here, and alternative examples which seem more safety relevant. (Note that both of these examples seem like dataset-wide Z).

II. Causal identifiability, and the testability of confoundedness

As Owain's comment thread mentioned, models may be incentivized instrumentally to do causal analysis e.g. by using human explanations of causality. However, even given an understanding of formal methods in causal inference, the model may not have the relevant data at hand. Intuitively, I'd expect there usually not to be any deconfounding adjustment set observable in the data^[1]. As a weaker assumption, one might hope that causal uncertainty might be modellable from the data. As far as I know, it's generally not possible to rule out the existence of unobserved confounders from observational data, but there might be assumptions relevant to the LM case which allow for estimation of confoundedness.

III. Existence of malign generalizations

The strongest, and most safety relevant implication claimed is "(3) [models] reason with human concepts. We believe the issues we present here are likely to prevent (3)". The arguments in this post increase my uncertainty on this point, but I still think there are good a priori reasons to be skeptical of this implication. In particular, it seems like we should expect various causal confusions to emerge, and it seems likely that these will be orthogonal in some sense such that as models scale they cancel and the model converges to causally-valid generalizations. If we assume models are doing compression, we can put this another way: Causal confusions yield shallow patterns (low compression) and as models scale they do better compression. As compression increases, the number of possible strategies which can do that level of compression decreases, but the true causal structure remains in the set of strategies. Hence, we should expect causal confusion-based shallow patterns to be discarded. To cash this out in terms of a simple example, this argument is roughly saying that even though data regarding the sun's effect mediating the shorts<>ice cream connection is not observed -- more and more data is being compressed regarding shorts, ice cream, and the sun. In the limit the shorts>ice cream pathway incurs a problematic compression cost which causes this hypothesis to be discarded.

^{^}
High uncertainty. One relevant thought experiment is to consider adjustment sets of unobserved var Z=IsReddit. Perhaps there exists some subset of the dataset where Z=IsReddit is observable and the model learns a sub-model which gives calibrated estimates of how likely remaining text is to be derived from Reddit

Relevant: Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations.

They argue that the pre-trained network already learns some non-confused features but doesn't use them. And you just need to fine-tune the last layer to utilize them.

We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.

Couldn't the model just fail at the start of fine-tuning (because it's causally confused), then learn in a decision setting to avoid causal confusion, and then no longer be causally confused?

If no - I'm guessing you expect that the model only unlearns some of its causal confusion. And there's always enough left so that after the next distribution shift the model again performs poorly. If so, I'd be curious why you believe that the model won't unlearn all or most of its causal confusion.

Here we're saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment then that would be better, and whether it's enough is very related to whether online training is enough, which is one of the key open questions we mention.

OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, internet, (future) autonomous AI agents, the civilisation becomes more complex as a whole, and distribution of many variables change. ("Towards a Theory of Evolution as Multilevel Learning" is my primary source of intuition about this.) In the study of complex systems, there is a postulate that each component (subsystem) is ignorant of the behaviour of the system as a whole, and doesn't know the full effect of its actions. This applies to any components, no matter how intelligent. Humans misgeneralise all the time (examples: lead in petrol, creation of addictive apps such as Instagram, etc.) Superintelligence will misgeneralise, too, though perhaps in ways which are very subtle or even incomprehensible to humans.

Then, it's possible that superintelligence will misgeneralise due to casual confusion on some matter which is critical to humans' survival/flourishing, e. g. about something like qualia, human's consciousness and their moral value. And, although I don't feel this is a negligible risk, exactly because superintelligence probably won't have direct experience or access to human consciousness, I feel this exact failure mode is somewhat minor, compared to all other reasons for which superintelligence might kill us. Anyway, I don't see what can we do about this, if the problem is indeed that superintelligence will not have first-hand experience of human consciousness.

Capabilities consequences:
1) The model may not make competent predictions out-of-distribution (capabilities misgeneralisation). We discuss this further in ERM leads to causally confused models that are flawed OOD.
Alignment consequences:
2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (objective misgeneralisation).

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.

Also, I think it's worth noting that this distinction between capabilities and goal misgeneralisation is defined in the RL framework. In other frameworks, such as Active Inference, this is the same thing, because there is no ontological distinction between reward and belief.

It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:
The reason data diversity isn’t enough comes down to concept shift (change in ). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream ( $Y$ ) and shorts ( $X$ ), and sun ( $Z$ ) example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise. Putting numbers on this, let’s say $P (Y = 1 | X = 1, Z = 1) = 90 %, P (Y = 1 | X = 1, Z = 0) = 25 %$ . Since the model doesn’t observe $Z$ , there is not a single setting of $P (Y = 1 | X = 1)$ that will work reliably across different environments with different climates (different $P (Z)$ ). Instead
$P (Y = 1 | X = 1) = P (Y = 1 | X = 1, Z = 1) P (Z = 1) + P (Y = 1 | X = 1, Z = 0) P (Z = 0)$
depends on $P (Z)$ , which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location so that $P (Z = 1)$ is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data - including data from other locations changes the overall $P (Z = 1)$ of the training distribution. This means that without domain/environment labels (which would allow you to have different $P (Y = 1 | X = 1)$ for different environments, even if you can’t observe $Z$ ), ERM can never learn a non-causally confused model.

Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks (but true for random forests). A deep neural network can discover variables that are not present but are probable confounders of several other variables, such as "something that is a confounder of shorts, sunscreen, and ice-cream".

Discovering such hidden confounders doesn't give interventional capacity: Mendel discovered genetic inheritance factors, but without observing them, he couldn't intervene on them. Only the discovery of DNA and later the invention of gene editing technology allowed intervention on genetic factors.

One can say that discovering hidden confounders merely extends what should be considered in-distribution environment. But then, what is OOD generalisation, anyway? And can't we prove that ERM (or any other training method whatsoever) will create models which will fail sometimes simply because there is Gödel's incompleteness in the universe?

While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts).

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

(a, ii =>)

What do these symbols in parens before the claims mean?

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

The main argument of the post isn't "ASI/AGI may be causally confused, what are the consequences of that" but rather "Scaling up static pretraining may result in causally confused models, which hence probably wouldn't be considered ASI/AGI". I think in practice if we get AGI/ASI, then almost by definition I'd think it's not causally confused.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity

In a theoretical sense this may be true (I'm not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We're arguing here that static training, even when scaled up, plausibly doesn't lead to a model that isn't causally confused about a lot of how the world works.

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.

No reason, I'll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term but hasn't been so for very long (see e.g. this tweet: https://twitter.com/DavidSKrueger/status/1540303276800983041).

Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks

Given that the model is trained statically, while it could hypothesise about additional variables of the kinds your listed, it can never know which variables or which values for those variables are correct without domain labels or interventional data. Specifically while "Discovering such hidden confounders doesn't give interventional capacity" is true, to discover these confounders he needed interventional capacity.

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

We're not saying that P(shorts, icecream) is good for decision making, but P(shorts, do(icecream)) is useful in sofar as the goal is to make someone where shorts, and providing icecream is one of the possible actions (as the causal model will demonstrate that providing icecream isn't useful for making someone where shorts).

What do these symbols in parens before the claims mean?

They are meant to be referring to the previous parts of the argument, but I've just realised that this hasn't worked as the labels aren't correct. I'll fix that.

I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically. I think a plausible story for a model learning causality is:

The model learns a lot of correlations, most real (causal) but many spurious.
The model eventually groks that there’s a relatively simple causal model explaining the real correlations but not the spurious ones. This gets favored by whatever inductive bias the training process/architecture encodes.
The model maintains uncertainty as to whether the spurious correlations are real or spurious, the same way humans do.

In this story the model learns both a causal model and the spurious correlations. It doesn’t dismiss the spurious correlations but still models the causal ones. This lets it minimize loss, which I think addresses the counter argument to (1).

Forget OOD for a minute; ERM can't even learn to avoid spurious correlations that have counterexamples in the training data. Datasets like Waterbirds (used in that previously linked paper) are good toy datasets for figuring this out. I think we need to solve this problem first before trying to figure out OOD generalization.

I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven't read the paper). However, the argument in the post is that even if we did scale up, we couldn't solve the OOD generalisation problems.

Assuming online training for causality follows a similar trajectory to the tasks studied here (language modeling for code or natural language), we would expect scaling to reduce the amount of additional signal needed to get good at causality. But obviously that's an empirical question that would be good to get an answer for as you mention in the article!

Are you aware of any examples of the opposite happening? I guess it should for some tasks.

I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling

when can/do foundation models internalize explicitly stated knowledge

I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.

(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:

I. (Safety) Relevant examples of Z

II. Causal identifiability, and the testability of confoundedness

III. Existence of malign generalizations

^{^}
High uncertainty. One relevant thought experiment is to consider adjustment sets of unobserved var Z=IsReddit. Perhaps there exists some subset of the dataset where Z=IsReddit is observable and the model learns a sub-model which gives calibrated estimates of how likely remaining text is to be derived from Reddit

Relevant: Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations.

They argue that the pre-trained network already learns some non-confused features but doesn't use them. And you just need to fine-tune the last layer to utilize them.

We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.

OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence

Capabilities consequences:
1) The model may not make competent predictions out-of-distribution (capabilities misgeneralisation). We discuss this further in ERM leads to causally confused models that are flawed OOD.
Alignment consequences:
2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (objective misgeneralisation).

It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:
The reason data diversity isn’t enough comes down to concept shift (change in ). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream ( $Y$ ) and shorts ( $X$ ), and sun ( $Z$ ) example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise. Putting numbers on this, let’s say $P (Y = 1 | X = 1, Z = 1) = 90 %, P (Y = 1 | X = 1, Z = 0) = 25 %$ . Since the model doesn’t observe $Z$ , there is not a single setting of $P (Y = 1 | X = 1)$ that will work reliably across different environments with different climates (different $P (Z)$ ). Instead
$P (Y = 1 | X = 1) = P (Y = 1 | X = 1, Z = 1) P (Z = 1) + P (Y = 1 | X = 1, Z = 0) P (Z = 0)$
depends on $P (Z)$ , which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location so that $P (Z = 1)$ is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data - including data from other locations changes the overall $P (Z = 1)$ of the training distribution. This means that without domain/environment labels (which would allow you to have different $P (Y = 1 | X = 1)$ for different environments, even if you can’t observe $Z$ ), ERM can never learn a non-causally confused model.

While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts).

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

(a, ii =>)

What do these symbols in parens before the claims mean?

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.

Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

What do these symbols in parens before the claims mean?

They are meant to be referring to the previous parts of the argument, but I've just realised that this hasn't worked as the labels aren't correct. I'll fix that.

I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically. I think a plausible story for a model learning causality is:

The model learns a lot of correlations, most real (causal) but many spurious.
The model eventually groks that there’s a relatively simple causal model explaining the real correlations but not the spurious ones. This gets favored by whatever inductive bias the training process/architecture encodes.
The model maintains uncertainty as to whether the spurious correlations are real or spurious, the same way humans do.

Are you aware of any examples of the opposite happening? I guess it should for some tasks.