Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an update SGD.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like "change in KL of overall distribution of output from small vanilla supervised-learning NNs", and does definitely does not generalize to even small amounts of confidence in assertions like "SGD updates based on small n cannot meaningfully change a very powerful NN's behavior in a real-world-affecting way".

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it's responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the 'edge of chaos' in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about 'oh, NNs are just smooth and don't change much on one update' are so much hot air against this universal prior.

As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg 'grokking', or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly 'gets it'). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that "oh, GPT-3 understands logic at exactly iteration #501,333", but it should definitely make you think about assumptions like "it must take countless SGD iterations for each of these capabilities to emerge." (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell's writings.)

Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN 'smoothness'. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from 'cat' to 'dog', why can't an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as "even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust".)

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete where small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, "Exterminate all humans" and "usher in the New Jerusalem", which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which after adjusting all of the parameters, might reverse the ranking? Why, exactly?

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that's where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent... until he touches the right piece of paper and 'wakes up'). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the 'parameters', updates to the hidden state / internal activations are quite sufficient.)

Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎
Note that for this post, I shall caveat the transparency tools from feature-importance and interpreting-representations among the large area of transparency research and probes to be structural probes from the large area of interpretable NLP. ↩︎
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) (EMNLP ‘20) is concrete counter-evidence against scaling, but its evaluation might suffer from the same biases provided in this section. ↩︎
An alternate explanation is in Towards Theoretical Understanding of Deep Learning. It argues larger models enables better ease-of-optimization for SGD, choosing better gradient trajectories, hence finding better solutions. ↩︎
Randomly connected subnetworks account for most subnetworks, hence a reluctance to call it a ‘lottery ticket’, which assumes such tickets are rare and need to be found. ↩︎
Note that it’s less interesting but true (albeit non-vacuously) for any overparameterized function. ↩︎
Random sampling with rejection is an astoundingly slow optimizer. It’s approximated using random sampling in a Gaussian Process. ↩︎
Obtained by mislabeling different-portions of data (properties characterized in Distributional Generalization: A New Kind of Generalization (Arxiv)) ↩︎
The procedure nudges the model to output widespread conspiracy theories by crafting prompts to help imitate human falsehoods. ↩︎
Other claims are also interesting, discussed at length ahead in Evidence AR1. ↩︎
Imperceptibleness is important as these patterns are likely unseeable and significantly hard to discover. Invisibilities trickle down, systems increasingly seem black-box (access & knowledge-wise) to downstream users. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

17

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

17

Summary (TL;DR)

A: Where does it Fit in Alignment Literature?

B: Basic Inductive Biases

PF: Inductive Biases of Large, Pretrained Models

Motivation: Pretraining learns Linguistically Grounded Representations

Evidence & Opinion: Limitations of Pretraining & Probing-esque Transparency Tools

Application: Transparency

G: Inductive Biases of Generalizing Models

Motivation

Evidence & Opinion: Generalization is tricky to Coherently Model

Application: Memorization

AR: Inductive Biases of Models (more) Robust to Adversaries

Motivation: Worst-Case Robustness is Relevant

Evidence & Opinion: Adversarial Examples might be Inevitable

Application: Security Vulnerabilities in Large AI Systems

Evidence Set

Mistakes