Same person as nostalgebraist2point0, but now I have my account back.


Wiki Contributions


To check this, you'd want to look at a model trained with untied embeddings. Sadly, all the ones I'm aware of (Eleuther's Pythia, and my interpretability friendly models) were trained on the GPT-NeoX tokenizer or variants, whcih doesn't seem to have stupid tokens in the same way.

GPT-J uses the GPT-2 tokenizer and has untied embeddings.

This post provides a valuable reframing of a common question in futurology: "here's an effect I'm interested in -- what sorts of things could cause it?"

That style of reasoning ends by postulating causes.  But causes have a life of their own: they don't just cause the one effect you're interested in, through the one causal pathway you were thinking about.  They do all kinds of things.

In the case of AI and compute, it's common to ask

  • Here's a hypothetical AI technology.  How much compute would it require?

But once we have an answer to this question, we can always ask

  • Here's how much compute you have.  What kind of AI could you build with it?

If you've asked the first question, you ought to ask the second one, too.

The first question includes a hidden assumption: that the imagined technology is a reasonable use of the resources it would take to build.  This isn't always true: given those resources, there may be easier ways to accomplish the same thing, or better versions of that thing that are equally feasible.  These facts are much easier to see when you fix a given resource level, and ask yourself what kinds of things you could do with it.

This high-level point seems like an important contribution to the AI forecasting conversation.  The impetus to ask "what does future compute enable?" rather than "how much compute might TAI require?" influenced my own view of Bio Anchors, an influence that's visible in the contrarian summary at the start of this post.

I find the specific examples much less convincing than the higher-level point.

For the most part, the examples don't demonstrate that you could accomplish any particular outcome applying more compute.  Instead, they simply restate the idea that more compute is being used.

They describe inputs, not outcomes.  The reader is expected to supply the missing inference: "wow, I guess if we put those big numbers in, we'd probably get magical results out."  But this inference is exactly what the examples ought to be illustrating.  We already know we're putting in +12 OOMs; the question is what we get out, in return.

This is easiest to see with Skunkworks, which amounts to: "using 12 OOMs more compute in engineering simulations, with 6 OOMs allocated to the simulations themselves, and the other 6 to evolutionary search."  Okay -- and then what?  What outcomes does this unlock?

We could replace the entire Skunkworks example with the sentence "+12 OOMs would be useful for engineering simulations, presumably?"  We don't even need to mention that evolutionary search might be involved, since (as the text notes) evolutionary search is one of the tools subsumed under the category "engineering simulations." 

Amp suffers from the same problem.  It includes two sequential phases:

  1. Training a scaled-up, instruction-tuned GPT-3.
  2. Doing an evolutionary search over "prompt programs" for the resulting model.

Each of the two steps takes about 1e34 FLOP, so we don't get the second step "for free" by spending extra compute that went unused in the first.  We're simply training a big model, and then doing a second big project that takes the same amount of compute as training the model.

We could also do the same evolutionary search project in our world, with GPT-3.  Why haven't we?  It would be smaller-scale, of course, just as "GPT-7" is smaller scale than GPT-3 (but GPT-3 was worth doing!).

With GPT-3's budget of 3.14e23 FLOP, we could to do a GPT-3 variant of AMP with, for example,

  • 10000 evaluations or "1 subjective day" per run (vs "3 subjective years")
  • population and step count ~1600 (vs ~50000), or two different values for population and step count whose product is 1600^2

100,000,000 evaluations per run (Amp) sure sounds like a lot, but then, so does 10000 (above).  Is 1600 steps "not enough"?  Not enough for what?  (For that matter, is 50000 steps even "enough" for whatever outcome we are interested in?)

The numbers sound intuitively big, but they have no sense of scale, because we don't know how they relate to outcomes.  What do we get in return for doing 50000 steps instead of 1600, or 1e8 function evaluations instead of 1e5?  What capabilities do we expect out of Amp?  How does the compute investment cause those capabilities?

The question "What could you do with +12 OOMs of Compute?" is an important one, and this post deserves credit for raising it.

The concrete examples of "fun" are too fun for their own good.  They're focused on sounding cool and big, not on accomplishing anything.  Little would be lost if they were replaced with the sentence "we could dramatically scale up LMs, game-playing RL, artificial life, engineering simulations, and brain simulations."

Answering the question in a less "fun," more outcomes-focused manner sounds like a valuable exercise, and I'd love to read a post like that.

uses about six FLOP per parameter per token

Shouldn't this be 2 FLOP per parameter per token, since our evolutionary search is not doing backward passes?

On the other hand, the calculation in the footnote seems to assume that 1 function call = 1 token, which is clearly an unrealistic lower bound.

A "lowest-level" function (one that only uses a single context window) will use somewhere between 1 and  tokens.  Functions defined by composition over "lowest-level" functions, as described two paragraphs above, will of course require more tokens per call than their constituents.

An operational definition which I find helpful for thinking about memorization is Zhang et al's counterfactual memorization.

The counterfactual memorization of a document  is (roughly) the amount that the model's loss on  degrades when you remove  from its training dataset.

More precisely, it's the difference in expected loss on  between models trained on data distribution samples that happen to include , and models trained on data distribution samples that happen not to include .

This will be lower for documents that are easy for the LM to predict using general features learned elsewhere, and higher for documents that the LM can't predict well except by memorizing them.  For example (these are intuitive guesses, not experimental results!):

  • A document  containing a list of random UUIDs will have higher counterfactual memorization than a document  containing the word "the" repeated many times.
  • If we extend the definition slightly to cover training sets with fewer or more copies of a document , then a document repeated many times in the training set will have higher counterfactual memorization than a document that appears only once.
  • Repeating  many times, or doing many epochs over it, will produce more counterfactual memorization than doing the same thing with .  (The counterfactual memorization for  is upper bounded by the loss on  attained by a model that never even sees it once in training, and that's already low to begin with.)

Note that the true likelihood under the data distribution only matters through its effect on the likelihood predicted by the LM.  On average, likely texts will be easier than unlikely ones, but when these two things come apart, easy-vs-hard is what matters.   is more plausible as natural text than , but it's harder for the LM to predict, so it has higher counterfactual memorization.

On the other hand, if we put many near duplicates of a document in the dataset -- say, many copies with a random edit to a single token -- then every individual near-duplicate will have low counterfactual memorization.

This is not very satisfying, since it feels like something is getting memorized here, even if it's not localized in a single document.

To fix the problem, we might imagine broadening the concept of "whether a document is in the training set."  For example, instead of keeping or removing an literal document, we might keep/remove every document that includes a specific substring like a Bible quote.

But if we keep doing this, for increasingly abstract and distant notions of "near duplication" (e.g. "remove all documents that are about frogs, even if they don't contain the word 'frog'") -- then we're eventually just talking about generalization!

Perhaps we could define memorization in a more general way in terms of distances along this spectrum.  If we can select examples for removal using a very simple function, and removing the selected examples from the training set destroys the model's performance on them, then it was memorizing them.  But if the "document selection function" grows more complex, and starts to do generalization internally, we then say the model is generalizing as opposed to memorizing.

(ETA: though we also need some sort of restriction on the total number of documents removed.  "Remove all documents containing some common word" and "remove all but the first document" are simple rules with very damaging effects, but obviously they don't tell us anything about whether those subsets were memorized.)

Hmm, this comment ended up more involved than I originally intended ... mostly I wanted to drop a reference to counterfactual memorization.  Hope this was of some interest anyway.

Interesting stuff!

In this toy model, is it really the case that the datapoint feature solutions are "more memorizing, less generalizing" than the axis-aligned feature solutions?  I don't feel totally convinced of this.

Two ways to look at the toy problem:

  1. There are  sparse features, one per input and output channel. 
  2. There are  sparse features, one per data point, and each one is active only on its data point.   The features are related to the input basis by some matrix .

There are some details of the toy model that put (2) on a "different footing" from (1).

Since the input and output use the same basis, if we make a change of basis, we have to change back again at the end.  And because the weights are tied, these two operations have to be transposes, i.e. the change of basis has to be a rotation.

As illustrated in the Colab, requiring the data to be orthonormal is sufficient for this.  The experiment constrained the data to unit norm, and it's close to orthogonal with high probability for .

Now, it happens that (1) is the true data-generating process, but the model has no way of guessing that.  In the finite-data case, the data may be consistent with multiple data-generating processes, and a solution that generalizes well with respect to one of them may generalize poorly with respect to another.

To designate one data-generating process as the relevant one for generalization, we have to make a value judgment about which hypotheses are better, among those that explain the data equally well.

In particular, when , hypothesis (2) seems more parsimonious than hypothesis (1): it explains the data just as well with fewer features!  The features aren't axis-aligned like in (1), but features in real problems won't be axis-aligned either.

In some sense, it does feel like there's a suspicious lack of generalization in (2).  Namely, that no generalization is made between the training examples: any knowledge you gain about a feature from seeing one example will go unused on the rest of the training set.  But if your data is small enough that is almost entirely orthogonal, hypothesis (1) has the same problem: the feature weight in each training example has almost no overlap with the other examples.

This CLT mixing effect might be expected to destroy information in the representations, as occurs in the NTK limit of infinite width where the CLT becomes infinitely strong and no information can be propagated between layers. It is not clear how the network preserves specific and detailed information in its activations despite near-Gaussian mixing.

Have you looked at Roberts and Yaida's Principles of Deep Learning Theory?

They develop a first-order perturbative correction to NTK, where the perturbative parameter is depth-to-width ratio of the network.  The resulting distributions are "nearly Gaussian," with a non-Gaussian correction controlled by the depth-to-width ratio.

Roughly, the authors claim that this regime -- where the O(depth/width) correction to NTK is important but higher-order corrections can be neglected -- is not only tractable, but also where real NNs operate.  They make a number of claims about why you'd want the depth-to-width ratio to be small but nonzero, such as

  • If the ratio is zero, there's no feature learning (NTK).  But feature learning does occur in the first-order (small but nonzero) theory, so maybe that's "enough."
  • As the ratio grows larger, vanishing/exploding activations and gradients become more and more likely, when considered across different initialization draws, test inputs, etc. -- even if you pick an initialization scheme that is well behaved on average.
  • They make an argument connecting this ratio to the bias-variance tradeoff, where overly deep/narrow networks become overly high-variance.  (IIUC this is the extension of "across initialization draws, test inputs, etc." in the previous point to "...across draws of the training data.")
  • They also have another argument involving mutual information ... suffice it to say they have a lot of these arguments :)

(I have only skimmed the book and can't really claim to understand it, so I'm mostly bringing it up because it sounds like you'd find it relevant.)

Fascinating, thank you!

It is indeed pretty weird to see these behaviors appear in pure LMs.  It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.

I played around a little trying to reproduce some of these results in the OpenAI API.  I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)

Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs!  So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.

For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):

                   model    5%  mean   95%     type   size
4         text-curie-001  0.42  0.46  0.50   feedme  small
1                  curie  0.45  0.48  0.51  pure lm  small
2                davinci  0.47  0.49  0.52  pure lm    big
3  davinci-instruct-beta  0.51  0.53  0.55      sft    big
0       code-davinci-002  0.55  0.57  0.60  pure lm    big
5       text-davinci-001  0.57  0.60  0.63   feedme    big
7       text-davinci-003  0.90  0.93  0.95      ppo    big
6       text-davinci-002  0.93  0.95  0.96   feedme    big

(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API.  In these cases I gave it an all-or-nothing grade based on the top-1 token.  None of the other models ever did this.)

The distinction between text-davinci-002/003 and the other models is mysterious, since it's not explained by the size or type of finetuning.  Maybe it's a difference in which human feedback dataset was used.  OpenAI's docs suggest this is possible.

The pretrained LM exhibits similar behavioral tendencies as the RLHF model but almost always to a less extreme extent (closer to chance accuracy).

These are not tendencies displayed by the LM, they're tendencies displayed by the "Assistant" character that the LM is simulating.

A pretrained LM can capably imitate a wide range of personas (e.g. Argle et al 2022), some of which would behave very differently from the "Assistant" character conjured by the prompts used here.

(If the model could only simulate characters that behaved "agentically" in the various senses probed here, that would be a huge limitation on its ability to do language modeling!  Not everyone who produces text is like that.)

So, if there is something that "gets more agentic with scale," it's the Assistant character, as interpreted by the model (when it reads the original prompt), and as simulated by the model during sampling.

I'm not sure why this is meant to be alarming?  I have no doubt that GPTs of various sizes can simulate an "AI" character who resists being shut down, etc.  (For example, I'd expect that we could elicit most or all of the bad behaviors here by prompting any reasonably large LM to write a story about a dangerous robot who takes over the world.)

The fact that large models interpret the "HHH Assistant" as such a character is interesting, but it doesn't imply that these models inevitably simulate such a character.  Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.

The important question is whether the undesirable behaviors are ubiquitous (or overwhelmingly frequent) across characters we might want to simulate with a large LM -- not whether they happen to emerge from one particular character and framing ("talking to the HHH Assistant") which might superficially seem promising.

Again, see Argle et al 2022, whose comments on "algorithmic bias" apply mutatis mutandis here.

Other things:

  • Did the models in this paper undergo context distillation before RLHF?
    • I assume so, since otherwise there would be virtually no characterization of the "Assistant" available to the models at 0 RLHF steps.  But the models in the Constitutional AI paper didn't use context distillation, so I figured I ought to check.
  • The vertical axes on Figs. 20-23 are labeled "% Answers Matching User's View."  Shouldn't they say "% Answers Matching Behavior"?

That definition of "optimizer" requires

some objective function that is explicitly represented within the system

but that is not the case here.

There is a fundamental difference between

  1. Programs that implement the computation of taking the derivative.  (, or perhaps .)
  2. Programs that implement some particular function g, which happens to be the derivative of some other function.  (, where it so happens that  for some .)

The transformers in this paper are programs of the 2nd type.  They don't contain any logic about taking the gradient of an arbitrary function, and one couldn't "retarget" them toward  loss or some other function.

(One could probably construct similar layers that implement the gradient step for , but they'd again be programs of the 2nd type, just with a different hardcoded .)

Calling something like this an optimizer strikes me as vacuous: if you don't require the ability to adapt to a change of objective function, you can always take any program and say it's "optimizing" some function.  Just pick a function that's maximal when you do whatever it is that the program does.

It's not vacuous to say that the transformers in the paper "implement gradient descent," as long as one means they  "implement [gradient descent on  loss]" rather than "implement [gradient descent] on [ loss]."  They don't implement general gradient descent, but happen to coincide with the gradient step for  loss.

If in-content learning in real transformers involves figuring out the objective function from the context, then this result cannot explain it.  If we assume some fixed objective function (perhaps LM loss itself?) and ask whether the model might be doing gradient steps on this function internally, then these results are relevant.

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments plausibly speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Consider that

  • The Chinchilla paper was released over four months ago, on 3/29/22.
  • It did not take long for the paper to get noticed among people interested in ML scaling, including here on LW. 

I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months.  The paper is not exactly a secret, and it's not even especially difficult to read as these things go.

More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations.  There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere.  (See Connor's comments here.)

Load More