Alex Turner

Alex Turner, independent researcher working on AI alignment. Reach me at turner.alex[at]berkeley[dot]edu.



On applying generalization bounds to AI alignment. In January, Buck gave a talk for the Winter MLAB. He argued that we know how to train AIs which answer on-distribution questions at least as well as the labeller does. I was skeptical. IIRC, his argument had the following structure:


1. We are labelling according to some function f and loss function L.

2. We train the network on datapoints (x, f(x)) ~ D_train.

3. Learning-theory results give (f, L)-generalization bounds for data drawn from the same distribution as D_train.


4. The network should match f's labels, on average, on held-out data from the same distribution as D_train.

5. In particular, when f represents our judgments, the network should be able to correctly answer questions which we ourselves could correctly answer.

I think (5) is probably empirically true, but not for the reasons I understood Buck to give, and I don't think I can have super high confidence in this empirical claim. Anyway, I'm mainly going to discuss premise (3).
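As I understood it, (3) is meant to be something like a finite-class uniform-convergence bound. Here's a toy numeric version of that shape (my reconstruction, not Buck's exact claim), via Hoeffding plus a union bound:

```python
import math

# With probability >= 1 - delta, for every h in a finite class H:
#   |test_err(h) - train_err(h)| <= sqrt((ln|H| + ln(2/delta)) / (2n))
def generalization_gap(n, log_H, delta):
    return math.sqrt((log_H + math.log(2 / delta)) / (2 * n))

# Even with 100k labeled points, a modestly sized class leaves a big gap.
gap = generalization_gap(n=100_000, log_H=10_000, delta=0.01)
print(round(gap, 3))  # 0.224
```

For overparameterized networks the effective |H| is astronomically larger, so bounds of this form tend to be vacuous in practice, which is part of why I'm skeptical.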

It seems like (3) is false (and no one gave me a reference for this result), at least for one really simple task from The Pitfalls of Simplicity Bias in Neural Networks (table 2, p. 8):

In this classification task, a 2-layer MLP with 100 hidden neurons was trained to convergence. It learned a linear boundary down the middle, classifying everything to the right as red, and to the left as blue. Then, it memorized all of the exceptions. Thus, its validation accuracy was only 95%, even though it was validated on more synthetic data from the same crisply specified historical training distribution. 

However, conclusion (4) doesn't hold in this situation. Conclusion 4 would have predicted that the network would learn a piecewise linear classifier (in orange below) which achieves >99% accuracy on validation data from the same distribution:

The MLP was expressive enough to learn this, according to the authors, but it didn't. 
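To make the failure concrete, here's a toy construction in the spirit of the paper's datasets (my own sketch, not the authors' code): feature 1 is a linear rule that is wrong on ~5% of "exception" points, while feature 2 classifies perfectly. A net that latches onto feature 1 and memorizes the exceptions caps out near 95% on fresh draws, even though the feature-2 rule generalizes at 100%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                       # binary labels
# Feature 1: linear rule, deliberately flipped on ~5% "exception" points
x1 = np.where(y == 1, 1.0, -1.0)
x1[rng.random(n) < 0.05] *= -1
# Feature 2: slightly "more complex" but perfectly predictive feature
x2 = np.where(y == 1, rng.uniform(0.1, 1, n), rng.uniform(-1, -0.1, n))

linear_acc = np.mean((x1 > 0) == y)             # ~0.95 on fresh samples
perfect_acc = np.mean((x2 > 0) == y)            # 1.0
print(linear_acc, perfect_acc)
```

The paper's point is that SGD-trained MLPs prefer the first story even when the second is expressible by the architecture.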

Now, it's true that this kind of problem is rather easy to catch in practice -- sampling more validation points quickly reveals the generalization failure. However, I still think we can't establish high confidence (>99%) in (5: the AI has dependable question-answering abilities for questions similar enough to training). From the same paper,

Given that neural networks tend to heavily rely on spurious features [45, 52], state-of-the-art accuracies on large and diverse validation sets provide a false sense of security; even benign distributional changes to the data (e.g., domain shifts) during prediction time can drastically degrade or even nullify model performance.

And I think many realistic use cases of AI Q&A can, in theory, involve at least "benign" distributional changes, where, to our eyes, there hasn't been any detectable distributional shift -- and where generalization still fails horribly. But now I'd anticipate Buck / Paul / co would have some other unknown counterarguments, and so I'll just close off this "short"form for now.

Quintin Pope also wrote:

This paper shows just how strong neural network simplicity biases are, and also gives some intuition for how the simplicity bias of neural networks is different from something like a circuit simplicity bias or Kolmogorov simplicity bias. E.g., neural networks don't seem all that opposed to memorization. The paper shows examples of neural networks learning a simple linear feature which imperfectly classifies the data, then memorizing the remaining noise, despite there being a slightly more complex feature which perfectly classifies the training data (and I've checked, there's no grokking phase transition, even after 2.5 million optimization steps with weight decay).

Thoughts on "Deep Learning is Robust to Massive Label Noise."

We show that deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels. We demonstrate remarkably high test performance after training on corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain test accuracy above 90 percent even after each clean training example has been diluted with 100 randomly-labeled examples. Such behavior holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes. We show that training in this regime requires a significant but manageable increase in dataset size that is related to the factor by which correct labels have been diluted. Finally, we provide an analysis of our results that shows how increasing noise decreases the effective batch size.
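A quick sanity check of why 100x dilution with uniformly random labels is survivable (my arithmetic, for the uniform-noise pattern): per clean example, the true label remains the expected plurality among all the copies.

```python
# One clean copy plus k randomly-labeled copies spread over C classes:
k, C = 100, 10
true_share = (1 + k / C) / (1 + k)    # true label: clean copy + its share of noise
wrong_share = (k / C) / (1 + k)       # any specific wrong label: noise only
print(true_share, wrong_share)        # ~0.109 vs ~0.099 -- still a plurality
```

The margin is thin, which is consistent with the paper's finding that you need a larger dataset (and effectively larger batches) to average the noise out.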

I think this suggests that, given data quality concerns in the corpus, we should look for contexts with low absolute numbers of good examples, or where the good completions are not a "qualitative plurality." For example, if a subset of the data involves instruction finetuning, and there are 10 "personas" in the training data (e.g. an evil clown, a helpful assistant, and so on) -- as long as the helpful dialogues are a plurality, and are numerous "enough" in absolute terms, and the batch size is large enough -- various forms of greedy sampling should still elicit helpful completions.

I wonder if the above is actually true in the LLM regime!

Another interesting result is that the absolute number of correct labels matters a lot, not just their proportion.

Furthermore, this should all be taken relative to the batch size. According to these experiments, in the limit of infinite batch size, greedy sampling will be good as long as the good training completions constitute a plurality of the training set.
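The infinite-batch intuition can be sketched directly (my toy version): the expected softmax cross-entropy gradient targets the average label distribution, so gradient descent drives the model toward it, and greedy decoding then picks out that distribution's argmax -- the plurality label.

```python
import numpy as np

label_dist = np.array([0.40, 0.35, 0.25])  # "helpful" persona is only a plurality
probs = np.full(3, 1 / 3)                  # current model output
expected_grad = probs - label_dist         # averaged softmax-CE gradient wrt logits
# The gradient pushes probability mass toward class 0; greedy sampling selects it.
print(int(np.argmax(label_dist)))          # 0
```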

Thoughts on "The Curse of Recursion: Training on Generated Data Makes Models Forget." I think this asks an important question about data scaling requirements: what happens if we use model-generated data to train other models? This should inform timelines and capability projections.


Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

Before reading: Can probably get around via dynamic templates...; augmentations to curriculum learning... training models to not need chain of thought via token-level distillation/"thought compression"; using model-critiqued (?) oversight / revision of artificial outputs. Data augmentation seems quite possible, and bagging seems plausible. Plus can pretrain for ~4 epochs with negligible degradation compared to having 4x unique tokens; having more epochs need not be disastrous. 

After reading: This paper examines a dumb model (OPT-125m) and finetunes it on (mostly/exclusively) its own outputs, which are going to be lower-quality than the wikitext2 content. In contrast, I think GPT-4 outputs are often more coherent/intelligent than random internet text. So I think it remains an open question what happens if you use a more intelligent bagging-based scheme / other training signals. 

Overall, I like the broad question, but I'm not impressed with their execution. Their theoretical analysis suggests shrinking tails, while their OPT results supposedly show growing tails (though I don't see it, looking at the figures).
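For intuition on the shrinking-tails mechanism, the single-Gaussian case collapses in a few lines (my sketch, not the paper's setup): repeatedly refit a Gaussian by maximum likelihood to samples drawn from the previous generation's fit.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
n_samples, n_generations = 20, 500        # small samples exaggerate the effect
for _ in range(n_generations):
    data = rng.normal(mu, sigma, n_samples)
    mu, sigma = data.mean(), data.std()   # MLE refit on purely generated data
print(sigma)  # far below the initial 1.0: the tails have collapsed
```

The MLE variance estimate is biased low, and with no fresh real data the bias compounds across generations.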

EDIT: Removed a few specifics which aren't as relevant for alignment implications.

a lot of interpretability work that performs act-add like ablations to confirm that their directions are real

Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?

ITI is basically act adds but they compute act adds with many examples instead of just a pair

Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.

LLMs aren't trained to convergence because that's not compute-efficient, so early stopping seems like the relevant baseline. No?

everyone who reads those seems to be even more confused after reading them

I want to defend "Reward is not the optimization target" a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don't think it's true. For some reason, some people really get a lot out of the post; others think it's trivial; others think it's obviously wrong, and so on. See Rohin's comment:

(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya's recent post for similar reasons. I don't think that the people I'm explaining it to literally don't understand the point at all; I think it mostly hasn't propagated into some parts of their other reasoning about alignment. I'm less on board with the "it's incorrect to call reward a base objective" point but I think it's pretty plausible that once I actually understand what TurnTrout is saying there I'll agree with it.)

You write:

In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not 'optimize the reward'?

These algorithms do optimize the reward. My post addresses the model-free policy gradient setting... [goes to check post] Oh no. I can see why my post was unclear -- it didn't state this clearly. The original post does state that AIXI optimizes its reward, and also that:

For point 2 (reward provides local updates to the agent's cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates. 

However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE. 
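To illustrate the claim for the setting the post does cover: in REINFORCE, reward appears only as a scalar weight on the log-probability gradient of whatever was actually sampled -- a local update rule, not an objective the policy network represents. A minimal bandit sketch (my construction):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(3)                   # tabular softmax policy over 3 arms
rewards = np.array([0.0, 1.0, 0.0])    # arm 1 happens to be rewarded
lr = 0.5

action = 1                             # suppose arm 1 was sampled
probs = softmax(logits)
# Gradient of log pi(action) wrt logits; reward merely scales this update.
grad_logp = np.eye(3)[action] - probs
logits += lr * rewards[action] * grad_logp

new_probs = softmax(logits)
print(new_probs[1] > probs[1])         # True: credit assignment upweights arm 1
```

Nothing in the update requires the network to internally represent or "want" reward; reward is the thing that sculpts the computation.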

I don't know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you -- I'm happy to answer more specific questions.

I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.

Strong claim! I'm skeptical (EDIT: if you mean "in the limit" to apply to practically relevant systems we build in the future). If so, do you have a citation for DRL convergence results at this level of expressivity, and reasoning for why realistic early stopping in practice doesn't matter? (Also, of course, even a single optimal policy can be represented by multiple different network parameterizations which induce the same semantics, with e.g. some using the world model and some using heuristics.)

I think the more relevant question is "given a frozen initial network, what are the circuit-level inductive biases of the training process?". I doubt one can answer this via appeals to RL convergence results.

(I skimmed through the value equivalence paper, but LMK if my points are addressed therein.)

a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward.

As a side note, I think this "agent only wants to maximize reward" language is unproductive (see "Reward is not the optimization target", and "Think carefully before calling RL policies 'agents'"). In this case, I suspect that your language implicitly equivocates between "agent" denoting "the RL learning process" and "the trained policy network":

As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.

(The original post was supposed to also have @Monte M as a coauthor; fixed my oversight.)

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts does not mean your concepts can describe how your thoughts are computed.


The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts. 

Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them. 

(Also, all AI-doom content should maybe be expunged as well, since "AI alignment is so hard" might become a self-fulfilling prophecy via sophisticated out-of-context reasoning baked in by pretraining.)
