Epistemic Status: Exploratory. My current but-changing outlook with limited exploration & understanding for ~60-80hrs.

Acknowledgements: This post was written under Evan Hubinger’s direct guidance and mentorship as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to particlemania, Shashwat Goel and shawnghu for exciting discussions. They might not agree with some of the claims made here; all mistakes are mine.

Summary (TL;DR)

Goal:  Understanding the inductive biases of Prosaic AI systems could be very informative towards creating a frame of safety problems and solutions. The proposal here is to generate an Evidence Set from current ML literature to model the potential inductive bias of Prosaic AGI.

Procedure: In this work, I collect evidence of inductive biases of deep networks by studying ML literature. Moreover, I estimate from current evidence whether these inductive biases vary with scaling to large models. If a phenomenon seems robust to or amplified by scaling, I discuss it here and add it to the Evidence Set.

Structure: I provide interpretations of some interesting papers to AI safety in three maximally-relevant subareas of ML literature (pretraining-> finetuninggeneralization and adversarial robustness), and demonstrate use cases to AI safety (in ‘Application’ subsections). I then summarize evidence from each area to form the Evidence Set.

Current Evidence Set: Given in Section “Evidence Set” (last section of this post)

Inspiration: I think developing good intuitions about inductive biases and past evidence essentially constitutes ‘experience’ used by ML researchers. Evidence sets and broadly analyzing inductive biases are the first steps towards mechanizing this intuition. Inductive bias analysis might be the right level of abstraction. A large, interconnected Evidence Set might give a degree of gears-level understanding of the black-box that is deep networks.

Applications: Any theory of inductive biases, if fully developed, can sufficiently explain the observations collected in the evidence set. Alternatively, we can use the evidence set to falsify mechanistic statements about training rationales in Training stories. However, I expect evidence sets to be helpful beyond these two applications.

Limitations: Evidence sets, and more broadly, inductive bias analysis, seems useful for tackling inner-alignment and not aligned objective-learning (outer-alignment). Secondly, the predictive power of evidence sets is unclear. When the proposal is fully fleshed out, it could have lower predictive power on properties or possible outcomes of training rationales, limiting its applicability. Thirdly, it’s challenging to mechanize or automate & scale inductive bias analysis, unlike say, transparency tools. Nevertheless, I think the idea of evidence sets is a promising proposal.

A: Where does it Fit in Alignment Literature?

Situating in Safety Scenarios: Studying inductive biases of Prosaic AI seems well-suited for systems arising from the scaling hypothesis. More broadly, from Dafoe’s AGI perspectives, studying inductive biases could be instrumental for safety considerations in the case of General Purpose Technology and may be helpful for the AI Ecology system classes. A-priori thinking about inductive biases of super-intelligent agents looks very  unintuitive to me as I cannot currently understand how ML literature could meaningfully inform inductive biases of intelligent agents which could tweak themselves[1].

Direct Applications: I list two applications where our evidence set can be directly helpful. If this post seems useful in your safety work, please link it in the comments.

  • Checking Falsifiable Claims in Training Stories: I essentially expand inductive bias analysis as proposed in training stories, aiming to analyze them at the right level.
  • Evidence Set can be used to falsify theories: Examples of such theories could range from conceptions of how Prosaic AGI systems might look to theories of how deep networks work and possible tradeoffs between different alignment constraints.

B: Basic Inductive Biases

I list some inductive biases based on the nature of learning and data. While it’s not immediately apparent to me how these would be useful for safety problems, nevertheless, they are a good start.

(B2) Dataset-Reality Mismatch Bias (Ref: Unbiased Look at Dataset Bias (CVPR ’11))
Effect & Scalability: Definitely real, independent from models.
My Interpretation: Datasets are used as proxies for real-world data, but suffer from an underspecification bias: They contain too little information to reflect the full range of competencies a model needs to generalize robustly in the real world. This is often the bias responsible for idiosyncrasies and surface-level correlations in models when occurring in training data and for favouring non-robust models over more robust internally-aligned models in test data.

PF: Inductive Biases of Large, Pretrained Models

Motivation: Pretraining learns Linguistically Grounded Representations

Large scale pretraining has been instrumental in advancing performance in imagestext and multi-modal data, making them important candidates for studying inductive biases. Large Language Models would be good candidates for studying inductive biases in this area. The NLP research community has made progress in analyzing these models and finding how and if broadly-useful linguistic features are learnt (pretrain) and transferred (finetune) in this paradigm. Researchers have studied and interpreted what representations large language models encode specifically linguistic properties.

Probing:[2] A popular, promising transparency mechanism used is Probing, which famously illustrated how BERT Rediscovers the Classical NLP Pipeline (ACL ‘19).

What are Probes? Probes are small (varying complexity) parametric classifiers trained to detect a linguistic property of relevance. Their input is the representation to be analyzed. High probing accuracy is a valuable indicator: Saying the probed linguistic property is encoded and easily extractable in the representation. In contrast, low probing accuracy is less informative as it’s hard to disentangle if the linguistic property was not encoded or was not extractable. I think this is a neat model for investigating the presence of high-level information, somewhat comparable to per-neuron transparency approaches like Circuits (Blog). Furthermore, Probing as Quantifying the Inductive Bias of Pre-trained Representations (Arxiv) provides a simple, elegant Bayesian framework for using probing-like transparency tools for systematically investigating inductive biases in models.

Evidence & Opinion: Limitations of Pretraining & Probing-esque Transparency Tools

(PF1) Evidence: Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks? (ACL ‘21), Sometimes We Want Translationese (EMNLP Findings ‘21), UnNatural Language Inference (ACL ‘21), What Context Features Can Transformer Language Models Use? (ACL ‘21)
Effect & Scalability: Seems real. Seems robust to scaling (low confidence)
My Interpretation: Large language models show good performance when finetuned on various hard tasks like machine translation, summarization and determining logical entailment between sentences. But are these downstream tasks demonstrative of encoding good linguistically-grounded representations about language like we expect?

The above works demonstrate this is not the case in these downstream tasks by perturbations like shuffling words in input sentences in the test set, which destroy syntax, making them sound gibberish. Surprisingly, these papers find that deep networks still perform well with shuffled inputs on the GLUE benchmark suite of tasks, Machine translation, Natural Language Inference. Similarly, deleting particular in context does not drop performance in language modelling.

This is explained from two outlooks:

(a) Works say that current datasets are dramatically insufficient to capture the full complexity of the underlying task (among other things). I think it’s likely that these test sets are too simple.

(b) Deep models have demonstrated time-and-again a propensity to cheat by relying on not-human-like, surface-level features instead of linguistically-grounded features for learning (cute illustration here).

Non-Human Interpretable Features Usually Work (NHIFUW) Hypothesis : I hypothesize it’s possible that surface-level, brittle correlations are effective in capturing a reasonable degree of complexity of tasks in average cases but fail in relatively worst case scenarios (providing easy counter-examples and calls/arguments for bad dataset creation).

(PF2) Evidence: Pre-training without Natural Images (ACCV ‘20) A critical analysis of self-supervision, or what we can learn from a single image (ICLR ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)
My Interpretation: These works illustrate that pretrained models/early layers of deep networks contain limited information about the statistics of natural images and hence can be captured via synthetic images or transformations instead of using large natural image datasets. They illustrate this by pretraining on datasets providing limited to no information about natural images, such as using a heavily-augmented natural image or a large dataset of fractal patterns. Even such synthetic pretraining learns useful features which help achieve: (i) Significantly better performance than training models from scratch (ii) Comparable performance to models pretrained on large natural-image datasets.

I believe this gives credence to the NHIFUW hypothesis. They made it hard by design to capture any meaningful information about natural images in these weird pretrain datasets. Suppose spurious representations learned from fractals still transfer well. In that case, it is questionable to what extent human-like biases help large pretrained models perform well on downstream vision tasks.

(PF3) Evidence: Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little (EMNLP ‘21) (Similar: Does Pretraining for Summarization Require Knowledge Transfer? (EMNLP ‘21))
Effect & Scalability:  Seems real. Seems robust to scaling, but has credible evidence against [3] (medium confidence)

My Interpretation: In PF1, we saw the effects of testing. This work demonstrates that pretraining might not be encoding similar rich-linguistic properties. Furthermore, probing detects the encoding of syntactic properties when we explicitly destroy syntax by shuffling words. This indicates possible fundamental shortcomings or requirements for additional rigour like creating better control datasets for transparency tools like probing to be informative.

How? They show it by a simple trick: They shuffle words in every sentence of the corpora while preserving uni- or bi-grams (as done in the title subheading). This destroys syntax, which is fundamental towards encoding rich linguistic information. They train a transformer model on this shuffled corpus and show that: (a) It dramatically outperforms training from scratch on a downstream task, i.e., pretraining gave a large advantage even without rich linguistic information. (b) It performs comparably to pretraining a transformer on original corpora, i.e., adding rich linguistic biases was not helpful for downstream tasks. Lastly, they verify that the model didn’t self-correct and reconstruct correct-word ordering by illustrating poor performance on non-parametric checks and predicting the next-word (natural language modelling).

Pretraining on shuffled words does dramatically improve performance compared to training from scratch gives credence to NHIFUW hypothesis, as this experiment controls for bad dataset creation to a good degree.

What is very surprising is that a variety of different-complexity parametric probes detected rich syntactic information being encoded despite us explicitly destroying syntax by shuffling sentences, sometimes outperforming training on original corpora. What does this imply? I found Probing Classifiers: Promises, Shortcomings, and Advances (Squib, Computational Linguistics) to be a lucid and insightful summary. I think shortcomings of probing stated in their conclusions should generalize to more popular transparency research, say, by Chris Olah. If so, drawing meaningful conclusions from transparency tools seems to be more complicated than it may seem. They only indicate a high correlation to the linguistic property in question, which often happens even when the property isn’t present. Additionally, we might need various control sets to isolate effects or a combination of transparency tools for the reliable determination that alleviates each other’s shortcomings.

Addressing this issue with adversarial robustness: Adversarial NLI: A New Benchmark for Natural Language Understanding (ACL ‘20), Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Model (NeurIPS ‘21)

(PF4) Evidence: Catastrophic forgetting in connectionist networks
Effect & Scalability: Definitely real. Robust to scaling (high confidence)
My Interpretation: In sequential learning, the phenomenon of catastrophic forgetting occurs when neural networks lose all predictive power on previous tasks (e.g. pretraining) once trained on a subsequent dataset (e.g. finetuning). This hinders learning in several ways: (i) Learning about the world might require learning and reasoning over multiple timesteps of data ingestion. If so, deep networks have an inductive bias of being episodic, not only in objective/reward optimizing sense but not remembering knowledge from the past at all.

Why is this important? I've recently come across a proposal for tackling the off-distribution shift, which essentially says do online updates! This bias highlights a huge and often ignored cost: Losing nearly all predictive power on previously learned tasks. Hence, deep networks (specifically the feature learning parts) are not updated online. This is quite a general problem, which crops up when one wants to add more data to a trained AI system without retraining on all past data.

Application: Transparency

The Other Side of Transparency: I am impressed by conceptual frameworks developed using transparency and also the speed of progress and thoroughness in the analysis of neurons with transparency tools. I think transparency might be beneficial in analyzing failure-modes of learning in Advanced AI. Simultaneously, I fear being too optimistic about their capabilities as I was with probing. Hence I present lessons learned from the investigation of probing, which might apply to other transparency tools:

Why?Visible thoughts of models, HCH or neuron-by-neuron understanding (the approaches I’ve been exposed to) seem to have the assumption of being human-interpretable. Based on the above finding, it seems possible that growing evidence will show substantial subsets of representations learned by large deep networks are very unintuitive to humans but important for competitiveness (ref: ERM story). Secondly, when introducing good control sets for evaluating transparency mechanisms as was done with probing, the advantages of transparency mechanisms might vanish or reduce by a greater amount than I expected.

We might need to aim for worst-case transparency for safety considerations as: 
(a) Many aspects of Circuits and similar tools might be modeling for average-case transparency (likely to fail in the worst case) (b) There have been historical precedents of issues in similar attempts to achieve transparency, one interesting example being average-case transparency is under-specified, not needing a specific threat-model.

Meta-Note: Ah, the feeling of being reduced to tears looking at your own proposal.

G: Inductive Biases of Generalizing Models


I now discuss the inductive biases of deep networks on generalization properties. ML Literature studies generalization to understand how deep network training consistently draws from a small subset of possible perfectly fitting models with good generalization properties. The safety aim here is to understand how inductive biases affect the output model space and then use this knowledge to try and obtain only inner-aligned (or robust) models. Now, we cannot change strong inductive biases. Hence, we should consider them as hard constraints in the alignment problem. On the other hand, weaker inductive biases can be traded-off with other considerations if required for alignment, forming softer constraints.

Related posts: [How to think about Over-parameterized Models]

Evidence & Opinion: Generalization is tricky to Coherently Model

(G1) Evidence: Understanding deep learning requires rethinking generalization (ICLR ‘17)
Effect & Scalability: Definitive. Gets stronger with scaling (high confidence)
My Interpretation: This work demonstrates that overparameterized deep networks trained with SGD can obtain a perfectly fitting model for any input -> output mapping (dataset), which they call ‘memorizing’ the training set, despite using popular regularization techniques. We can conclude that there is not enough (a) implicit regularization provided by SGD (b) and explicit regularization provided by popular regularizers to restrict the output space to models that generalize well.

(G2) Evidence: Bad Global Minima Exist and SGD Can Reach Them (NeurIPS ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)
My Interpretation: This work tweaks the training procedure: training with random labels first and then switching to real labels. By this, SGD finds poorly-generalizing models (bad minima) consistently. Furthermore, adding a degree of explicit regularization can improve convergence, obtaining a good-generalizing model.

This is an exciting test for learning theories, which must predict in which settings SGD could find poor-generalizing models (this being one of them). A theory that always predicts convergence to good-generalizing models can be rejected by this evidence.

(G3) Evidence: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation (NeurIPS ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium-high confidence)
My Interpretation: This work shows that deep networks can have a propensity for memorization of a part of the training data while simultaneously being able to generalize by correctly fitting other parts of the train set. Deep networks can memorize specific data points and generalize well with a shared parameterization (lower layers etc.).

This work motivates this bias: Most natural data distributions are long-tailed, containing a large number of underrepresented subpopulations where sometimes the best a model can do is memorize. Surprisingly, researchers could select for memorizing the tail-end of data distributions (including noise and outliers). It gives better generalization performance on the test set than works that ignore it because similar points reoccur.

(G4) Evidence: Deep Double Descent (Blog) [Rohin’s Summary and Opinions][Discussions #1 and #2]
Effect & Scalability: Seems real. Gets stronger with scaling (medium confidence)
My Interpretation: As we increase the parameter size or increase the length of training or decrease training data, in all cases, when in the zero train error (interpolation) regime, larger models/longer optimization achieve better generalization. This is very counter-intuitive because given you have plenty of correct (zero train error) models in your current hypothesis space, this states that selecting a larger hypothesis space still allows you to find better generalizing models, albeit with decreasing marginal returns. The implication is that in larger hypothesis sets, we have a higher probability of getting a better model[4].

(G5) Evidence: Deep Expander Networks: Efficient Deep Networks from Graph Theory (ECCV18), Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot (NeurIPS ‘20), Pruning Neural Networks at Initialization: Why are We Missing the Mark? (ICLR ‘21) 
Effect & Scalability: Seems real. Might be stronger with scaling (medium confidence)
My Interpretation: The surprising finding in pruning has been that layerwise-random selection leads to much smaller subnetworks which on training obtain accuracies close to the original models (called ‘tickets’) across a variety of networks, pruning settings[5]. I find analyzing these more straightforward and more valuable than lottery tickets.

In context of previous evidence (G4), it seems to me that the main advantage gained by a richer hypothesis set is not the larger #parameters available to express functions (works well after some parameters are pruned) but richer connectivity. This rich connectivity is likely obtained in random subnetworks discussed here, allowing SGD to search for a good model effectively.

(G6) Evidence: Is SGD a Bayesian sampler? Well, almost (JMLR) [Summary][Discussion][Zach’s Summary & Opinion]. Why flatness does and does not correlate with generalization for deep neural networks (Arxiv)
Effect & Scalability: Unknown, but major if true. Unknown (low confidence).
My Interpretation: This line of papers form the following narrative– The prior distribution of (initialized) deep models is strongly biased towards simple functions. The posterior distribution of (learned) deep networks is simple as training (SGD) contributes little-to-no inductive biases of its own. Hence, we likely obtain simple functions as output and generalize well. How?

(1) On prior: Simple functions (input -> output) might have a substantially larger volume in parameter-space, i.e., there are many more ways to express a simple function than a complex one when you are very overparameterized[6].

(2) On posterior: This is shown by empirically comparing the distribution of SGD-trained networks and randomly sampled networks with similar test accuracies, and finding they correlate well[7].

(3) On Generalization: A trained model's simplicity is measured by its prior probability (using Gaussian Processes- Appendix D). This prior probability strongly correlates with generalization error across different accuracy ranges[8].

Together, these arguments imply: 
(a) From (2): SGD has little-to-no inductive biases. 
(b) From (1&3): Simplicity prior (in 1) is a great generalization measure.

These seem promising if the empirical results remain robust with better controls and more rigorous analysis. I am skeptical, but admittedly don’t have enough background to provide an incisive critique. Some background why I might be skeptical of correlations:

(i) Evaluation: Fantastic Generalization Measures and Where to Find Them (ICLR ‘20) In Search of Robust Measures of Generalization (NeurIPS ‘20) have been skeptical of previous generalization measures upon the causal analysis in average-case and worst-case criteria, respectively. The results here seem preliminary, and it’s unclear if it’ll hold up in more rigorous evaluation.

(ii) Inductive Biases: Can this framework explain past contrary findings? E.g., Implicit Regularization in Deep Matrix Factorization (NeurIPS ‘19), among others, have illustrated biases in SGD to a certain degree, here it seems to convincingly isolate (by using a deep linear network) that SGD has a strong low-rank inductive bias which is not the smallest nuclear norm.

(G7) Evidence: Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [Rohin’s Summary & Opinion][Added from comment by Gwern.]
Effect & Scalability: Unknown, but important if true. Unknown (low confidence).
My Interpretation: The main claim is that deep models might change behaviors abruptly, shifting from one strategy to a completely different one in a short time. In this paper, they test the ability of deep networks to perform modular arithmetic correctly. They observe that deep networks quickly memorize the dataset but fail to really generalize from it. At a certain point, however, networks suddenly switch to a strategy which generalizes really well (not a gradual increase either).

Rohin gives a possible learning dynamic, which seems intuitive to me: Gradient descent quickly gets to a memorizing function, and then moves mostly randomly through the space, but once it hits upon the correctly generalizing function (or something close enough to it), it very quickly becomes confident in it and then never moving very much again. Gwern states that it's in general a good prior to model deep networks as being on the 'edge of chaos' in various ways. Currently, the evidence for sudden shifts in dynamics as a way of learning is preliminary. However, if they frequently occur, it could significantly affect our understanding of deep networks and may be safety relevant. E.g. Predict when shifts are likely to occur and how it might affect learning might be a hard problem, in contrast, the direction and degree of gradual shifts are easy to predict, by the degree of decrease in the loss.

More potentially-informative papers which could inform limits of capabilities of a deep network, but I am unable to judge or contextualize value: Universal Transformers (ICLR ‘19), Theoretical Limitations of Self-Attention in Neural Sequence Models (TACL).

Application: Memorization

The Other Side of Memorization: We sometimes really want models to not engage in certain types of behaviourTruthfulQA: Measuring How Models Mimic Human Falsehoods [Discussionreally want models to avoid generating ‘falsehoods’ as the target behaviour. This work shows the failures of large pretrained models by demonstrating that they have considerable ability to create falsehoods. This suggests a very relevant question: To what degree are the falsehoods caused due to generalization and memorization?

In this case, I think it’s mostly the ability to memorize and extract that information from the model that seems to be tested here[9], which is better done with larger language models. I think this is likely what’s happening because of other evidence I saw: (i) Larger models are more informative, i.e. models better memorize and extract all information (ii) Works like Extracting Training Data from Large Language Models (USENIX ‘21) explicitly demonstrate interesting cases of unintended, extractable memorized information from larger models.

A Simpler Setting: Let’s consider a simple case- How benign is benign overfitting? (ICLR ‘21). It starts by recounting phenomena summarized in (G3), i.e. Deep Networks usually overfit (memorize) perfectly to partial label noise present in large-scale datasets, which is usually considered ‘benign’.

It disputes that such memorization is ‘benign’ by convincingly citing literature and providing new evidence that (among other things[10]): When models are made adversarially robust, they do not blindly memorize the entire training data. The data they successfully refrain from learning is a part of the incorrectly labelled and even the atypical examples present in the training data.

This indicates: (a) We might entirely miss the relatively small set of ground-truth (read: internally aligned) models simply by selecting from models achieving zero train-error (b) Adversarial robustness is said to be one way of tackling the above-stated issue in a scalable fashion, updating my confidence in similar principles being used for safety proposals.

Coming Back: Regarding the task of generating falsehoods, what would robust behaviour look like? My current best guess is a neural-version of my all-time favourite works: Never-Ending Language Learning, which designs a self-correcting classifier that reads the web and generates more facts in a knowledge graph by linking facts and verifying fit. In this case, conspiracy theories would be considered 'noisy edges' in the knowledge graph and end up being naturally excluded with robust training.

Similarly, a clever mechanism that makes unsafe completions akin to noisy examples in an otherwise safe world might greatly help safe generations. Training for robustness to unsafe completions likely satisfies the other desiderata needed: It's readily scalable and preserves competitiveness of outputs. I hope inductive-bias analysis is helpful in such a manner.

AR: Inductive Biases of Models (more) Robust to Adversaries

Motivation: Worst-Case Robustness is Relevant

There is a good reason a lot of adversarial robustness literature is overlooked. John_Maxwell highlights from Motivating the Rules of the Game for Adversarial Example Research (Summary)

In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security.

I like two perspectives that motivate studying Adversarial Robustness, which might be AI Safety relevant: (i) Security of AI systems and (ii) Robustness to Distributional Shift.

Security PerspectiveScasper motivates the problem well:

As progress in AI continues to advance at a rapid pace, it is important to know how advanced systems will make choices and in what ways they may fail. When thinking about the prospect of super-intelligence, I think it’s all too easy and all too common to imagine that an Artificial Super-intelligence would be something which humans, by definition, can’t ever outsmart. But I don’t think we should take this for granted. Even if an AI system seems very intelligent -- potentially even super-intelligent -- this doesn’t mean that it’s immune to making egregiously bad decisions when presented with adversarial situations. Thus the main insight of this paper:

The Achilles Heel hypothesis: Being a highly-successful goal-oriented agent does not imply a lack of decision theoretic weaknesses in adversarial situations. Highly intelligent systems can stably possess "Achilles Heels" which cause these vulnerabilities.

Currently far more intelligent than AI systems, humans are predictably and reliably fooled by visual and auditory attacks. It’s unclear why intelligent AI systems would not have security vulnerabilities, exploitable by bad actors to predictably cause, say, catastrophic failures, or by an intelligent AI system as a trigger for deception.

Distribution Shift Perspective: I think if semantic adversarial examples are automatically generatable-- they would be pretty neat, efficient ways of testing robustness to distribution shift. Now what are semantic adversarial examples? I motivate by examples: (i) Examples with the background correlations removed without the object present  (ii) Non-central examples of a particular class (Say, something which technically is a chair but not what one would imagine).

How is it useful? I am linking the post where it's nicely illustrated for reference, as my summary was much longer with little added value.

Evidence & Opinion: Adversarial Examples might be Inevitable

(AR1) Evidence: Adversarial Examples Are Not Bugs, They Are Features (NeurIPS ‘19) [Discussions][Rohin’s Summary & Discussion],  Robustness May Be at Odds with Accuracy (ICLR ‘19) 
Effect & Scalability: Definitely real. Seems robust to scaling (medium confidence)
My Interpretation: In summary, the first work convincingly shows that some adversarial examples could arise due to non-robust features: easily found, highly-predictive (hence competitive) but brittle (hence non-robust) and imperceptible[11]) to humans. They can isolate these brittle features from the data, creating a robust dataset and a non-robust dataset. Simple training on this dataset gives naturally-robust models. However, the loss of predictive features results in a significant drop in performance. Simple training on the non-robust dataset looks like training on noise but performs well on original natural images.

In comparison, the second work argues that the above-seen tradeoff between standard accuracy and adversarial robustness is inherent to learning, showing that it provably occurs when non-robust features are inevitable (robust classification is not possible, say, due to label noise etc). This explains the drop in standard accuracy in the above case or when performing adversarial training in practice.

They have two interesting implications: (i) They argue for pessimism about robustness naturally emerging as a consequence of standard training, as robust classifiers use considerably different (worse, robust) vs (better, non-robust) features used by standard classifiers (ii) The above two works show to some degree that these robust features might align far better with human intentions. I currently find this hypothesis has exciting implications. It's likely some version of this should be very practically applicable.

Why is this important? Connecting this to my intuitions about (PF-A), I think pretraining might learn similar non-robust but more-generally-useful patterns. To theorize the power of these patterns (i) The effect is significant: they enable pretrained models to dramatically outperform training-from-scratch in downstream tasks (ii) The effect nearly matches possibly-robust patterns: they enable pretrained models to perform comparably to pretraining on natural-images which can learn robust patterns.

(AR2) Evidence: A Fourier Perspective on Model Robustness in Computer Vision (NeurIPS ‘19), Simple Black-box Adversarial Attacks (ICML ‘19)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)

My Interpretation: A well-accepted principle in distributional robustness literature is that models lack robustness to distribution shift because they latch onto superficial correlations in the data. The first work illustrates one such correlation: high-frequency features. It's not predictive in-distribution. Deep models can classify images containing only high-frequency information (not understandable by humans) well. But (B2) kicks in, and it loses a lot of predictive power off-distribution. We can leverage this to also create adversarial examples by simply sampling different non-predictive high-frequency noise patterns.

They further show alleviating these shortcomings is tricky. The first work shows that when performing adversarial training to L perturbations, it becomes robust to high-frequency noise but becomes vulnerable to low-frequency corruptions, say foggy weather. The second work indicates that deep networks have many adversarial samples around the boundaries. Even random search succeeds real fast. They show this by exploiting high-dimensionality to make black-box adversarial attacks simply by: Repeatedly sampling mutually orthonormal vectors and either adding or subtracting from the target image in this direction. The proposed method can be used very effectively for both untargeted and targeted attacks, showing that attacking a classifier might be an easy search problem and difficult to address.

Additional interesting works: Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them (ICML-W ‘21)

(AR3) Evidence: Adversarial Robustness May Be at Odds With Simplicity (Arxiv), Adversarially Robust Generalization Requires More Data (NeurIPS '18), A Universal Law of Robustness via Isoperimetry (NeurIPS ‘21) and Intriguing Properties of Adversarial Training at Scale (ICLR ‘20) 
Effect & Scalability: Seems real. Seems robust to scaling (medium-high confidence)
My Interpretation: Works (a&b) argue that the complexity of adversarially robust generalization is much higher than standard generalization.

Work (a) gives a more model perspective, arguing that robust classification may require more complex classifiers (i.e. more capacity, sometimes exponentially more) than standard classification. In contrast, work (b) gives a dataset perspective, providing evidence to argue that the sample complexity of adversarially robust generalization is much higher than standard generalization.

Work (c) conjectures a tight lower bound for a property (lipschitzness), which might be very helpful for producing robust models. It says that obtaining interpolable (robust) models would necessitate several orders of magnitude larger parameters, specifically by the intrinsic dimensions of the problem.

Finally, work (d) tests this in practice at scale and finds that adversarial training requires much larger/deeper networks to achieve high adversarial robustness. We know accuracy is marginally improved by adding more layers beyond ResNet101. However, there is a substantial, nearly linear and consistent gain with adversarial training pushing the network capacity to a far larger scale, i.e., ResNet-638.

Overall, it provides a compelling picture, with some interesting implications:
(a) Adversarial training is an indispensable tool, not approximated by simply scaling up standard learning. 
(b) The fact that adversarially learnt feature representations have different complexity than those from standard learning gives credence to being somewhat fundamentally different features. 
(c) With tasks like language modelling likely having a higher intrinsic dimension, obtaining robustly generalizing models would raise the bar of minimal parameterization several magnitudes higher. In other words, the interpolation regime in a deep double descent story may also shift significantly rightward to meet this criterion. This may have significant implications for forecasts made by the community.

Application: Security Vulnerabilities in Large AI Systems

Let’s recap Scasper on the Achilles Heel Hypothesis:

More precisely, I define an Achilles Heel as a delusion which is impairing (results in irrational choices in adversarial situations), subtle (doesn’t result in irrational choices in normal situations), implantable (able to be introduced) and stable (remaining in a system reliably over time).

Literature on backdoor attacks on ML (Arxiv) provides a comprehensive survey of a class of adversarial attacks which, disturbingly, seems to fit the above criterion perfectly.

Poisoning Attack: When a system that otherwise works well (subtle) is trained on a contaminated training dataset (implantable), it can be made to fail predictably and reliably when presented with the trigger-object present in test data at any time (impairing & stable).

Why is this important? Recently, Poisoning and Backdooring Contrastive Learning (ICLR’ 22) illustrates large multimodal models like CLIP are vulnerable because of being trained on unreliable data obtained via web-scraping. They show they can plant a backdoor in a CLIP model by introducing a poisoned set of just 0.005% of the original dataset size (~150 images). This gets CLIP to predictably cause any input with the overlayed backdoor to be classified as a particular class during testing. Furthermore, poisoning is hard to address:

(a) Checking large datasets manually isn’t feasible. It’s hard to detect adversarial inputs without knowing the specific threat model

(b) Backdoors try to produce deceptively predictive features, resulting in a handicap in competitiveness for models which do not leverage them.

Similar works have emerged across domains: Trojaning Language Models for Fun and Profit (S&P’ 21), Backdoor Attacks on Self-Supervised Learning (Arxiv)

This has the potential to create a variety of probably hard-to-defend against worst-case risk scenarios by developing backdoors such that the <given> scenario is triggered when shown the poison at test-time. Alternatively, I think intelligent agents could leverage this attack and possibly inject backdoors into themselves, to serve as triggers for defection and similar unintended behaviour. It might be hard to check since, without the trigger, as AI models will normally have completely innocuous behaviour and reasoning-behind-behaviour holding up to high standards. This, in Death Note, is akin to Light voluntarily giving up all his memories of the Death Note, appearing really innocent to scrutiny, with contact with a piece of the note being a trigger for malicious behaviour.

I wonder if this evidence is enough to label this a potential warning shot for securing advanced AI systems and warrants the attention of this community.

Evidence Set

Compiling the evidence above, I get a condensed Evidence set as follows:

B-B) Dataset-Reality Mismatch Bias: Datasets will likely be under-specifications of the real-world’s full complexity, causing biases in training and testing.

PF-A) Biases in Pretrained Representations: Representations learnt by large pretrained models might be substantially reliant on low/surface-level information, which is predictive in the average case scenario (good accuracy) but brittle in worst-case scenarios (easy to construct counterexamples).

PF-B) Biases in some transparency tools: Probing and functionally-similar transparency tools such as On the Pitfalls of Analyzing Individual Neurons in Language Models (ICLR ‘22) might measure correlation in especially creative ways, resulting in high probing performance by various mechanisms such as:

  • Probing classifier memorizing the task vs probe being too weak to extract information
  • A property that is not present can be detected by correlation to other properties (analogy: zip-code in race controls)
  • The property being present might not indicate that it was causally relevant to producing the output

..and more.

PF-C) Catastrophic Forgetting: Deep networks lose all predictive power on previous tasks (e.g. pretraining) once trained on a new task (e.g. finetuning)

G-A) Good-Generalization Bias: Overparameterized deep models can fit any arbitrary mapping in the dataset, indicating a large set of bad-generalizing models, yet typical training with random initialization results in selecting good-generalizing models.

G-B) Not-Always Good-Generalization: By tweaking the training procedure, one can induce selection of the bad-generalizing models from the aforementioned large hypothesis space that perfectly fit training data.

G-C) Memorization while Generalization: Deep networks can memorize and generalize simultaneously with parameter-reuse (not as sometimes imagined- disjoint subnetworks for memorization and generalization)

G-D) Deep Double Descent: Increasing overfitting (by parameterization and training length) beyond a threshold counter-intuitively leads to better generalizing models, albeit with decreasing marginal returns.

G-E) Random Tickets: Randomly selected subnetworks (if given layerwise sparsity) achieve performance equivalent to the whole model, i.e. expressive power is gained from increased connectivity and not parameterization in interpolation regime or deep models are unnecessarily overparameterized.

(G-F) Simplicity Prior Bias: SGD might have little-to-no inductive biases, with good generalization attributable to the prior being biased towards simple functions (from architecture & initialization).

(G-G) Sudden Shifts in Learning: Deep networks might make large changes in decision-making strategy abruptly and possibly unpredictably.

AR-A) Feature Divergence: More-robust classifiers use considerably different, worse-performing and more human-aligned features compared to less-robust classifiers

AR-B) Sufficiency of Brittleness for Off-Distribution Misalignment: Deep models which use brittle, predictive features lack robustness to distribution shift because they latch onto superficial correlations in the data, worsened by high-dimensionality of input.

AR-C) Robustness adds Complexity: Achieving higher robustness requires significantly more complex classifiers (complex read as higher capacity) than simply good generalization.


(B1) Gradual Change Bias (Ref: Discussion with Evan)
Effect & Scalability: Unknown, unknown effect of scaling (Added as seemed intuitive)
My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an update SGD. Update: Probably not true, refer to gwern's comment and replies for elaboration.

  1. Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎

  2. Note that for this post, I shall caveat the transparency tools from feature-importance and interpreting-representations among the large area of transparency research and probes to be structural probes from the large area of interpretable NLP. ↩︎

  3. Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) (EMNLP ‘20) is concrete counter-evidence against scaling, but its evaluation might suffer from the same biases provided in this section. ↩︎

  4. An alternate explanation is in Towards Theoretical Understanding of Deep Learning. It argues larger models enables better ease-of-optimization for SGD, choosing better gradient trajectories, hence finding better solutions. ↩︎

  5. Randomly connected subnetworks account for most subnetworks, hence a reluctance to call it a ‘lottery ticket’, which assumes such tickets are rare and need to be found. ↩︎

  6. Note that it’s less interesting but true (albeit non-vacuously) for any overparameterized function. ↩︎

  7. Random sampling with rejection is an astoundingly slow optimizer. It’s approximated using random sampling in a Gaussian Process. ↩︎

  8. Obtained by mislabeling different-portions of data (properties characterized in Distributional Generalization: A New Kind of Generalization (Arxiv)) ↩︎

  9. The procedure nudges the model to output widespread conspiracy theories by crafting prompts to help imitate human falsehoods. ↩︎

  10. Other claims are also interesting, discussed at length ahead in Evidence AR1. ↩︎

  11. Imperceptibleness is important as these patterns are likely unseeable and significantly hard to discover. Invisibilities trickle down, systems increasingly seem black-box (access & knowledge-wise) to downstream users. ↩︎

New Comment
1 comment, sorted by Click to highlight new comments since:

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an update SGD.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like "change in KL of overall distribution of output from small vanilla supervised-learning NNs", and does definitely does not generalize to even small amounts of confidence in assertions like "SGD updates based on small n cannot meaningfully change a very powerful NN's behavior in a real-world-affecting way".

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it's responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the 'edge of chaos' in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about 'oh, NNs are just smooth and don't change much on one update' are so much hot air against this universal prior.

  • As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg 'grokking', or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly 'gets it'). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that "oh, GPT-3 understands logic at exactly iteration #501,333", but it should definitely make you think about assumptions like "it must take countless SGD iterations for each of these capabilities to emerge." (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell's writings.)

Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN 'smoothness'. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from 'cat' to 'dog', why can't an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as "even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust".)

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete where small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, "Exterminate all humans" and "usher in the New Jerusalem", which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which after adjusting all of the parameters, might reverse the ranking? Why, exactly?

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that's where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent... until he touches the right piece of paper and 'wakes up'). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the 'parameters', updates to the hidden state / internal activations are quite sufficient.)