All of nostalgebraist's Comments + Replies

chinchilla's wild implications

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and was shared heavily around Twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Co... (read more)

Two-year update on my personal AI timelines

Now I’m inclined to think that just automating most of the tasks in ML research and engineering -- enough to accelerate the pace of AI progress manyfold -- is sufficient.

This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.

That doesn't seem likely to me.  Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these d... (read more)

chinchilla's wild implications

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on M... (read more)

4Vanessa Kosoy12d
Transformers are Turing complete [https://jmlr.org/papers/volume22/20-302/20-302.pdf], so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It also seems possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.
chinchilla's wild implications

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.  (Arithmetic spelled out in the sketch below.)
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

I... (read more)
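
Spelling out that arithmetic as a quick script, using only the figures already quoted above:

```python
# Quick check of the youtube!minutes estimate quoted above.
years = 15
upload_rate = 50                              # youtube-hours uploaded per real-time minute

real_minutes = years * 365 * 24 * 60          # minutes elapsed in 15 years
youtube_hours = real_minutes * upload_rate    # hours of video uploaded in that time
youtube_minutes = youtube_hours * 60

print(f"{youtube_minutes:.2e} youtube-minutes")  # ~2.4e10, i.e. ~24B rather than 200B
```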

chinchilla's wild implications

I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.

I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.

If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior.   The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).

All that text was already there in the... (read more)

Humans provide an untapped wealth of evidence about alignment

I don't have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quinton) for this post.  I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.

When people present frameworks for thinking about AGIs or generic "intelligent agents," I often want to ask them: "are humans expressible in your framework?"  Often it seems like the answer is "no."

And a common symptom of this is that the framework cannot express entities with human-level capabilities that are a... (read more)

[Link] Training Compute-Optimal Large Language Models

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first
... (read more)
[Link] Training Compute-Optimal Large Language Models

It ought to shorten actual timelines, for the reason you say.  (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed.  (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline.  It's always wait... (read more)

It Looks Like You're Trying To Take Over The World

I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).

Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts.  By "HXU's structure" I mean things like:

  • The researcher is running an "evolutionary search in auto-ML" method.  How many nested layers of inner/outer loop does this method (explicitly) contain?
  • Where in the nested structure
... (read more)
Hard-Coding Neural Computation

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

  1. the coordinates of the layer's input vector
  2. the coordinates of that vector after the first linear map, i.e. the basis in which the nonlinearity operates

?

If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by the first linear map, or which vectors/semes they get mapped to by the second linear map.

But your labels don't correspond to either of these interpretations.  Instead, it looks like you are foll... (read more)

More Christiano, Cotra, and Yudkowsky on AI progress

I agree with Eliezer's recommendation to double-check results in papers that one finds surprising.

So, I looked into the claim of a 10x - 100x gain for transformers, using Table 2 from the paper.  Detailed results are in this Colab.

Briefly, I don't think the claim of 10x - 100x is well supported.  Depending on what exactly you compute, you get anywhere from "no speedup" to "over 300x speedup."  All the estimates you can make have obvious problems, and all show a massive gap between French and German.

In detail:

  • The appearance of a large speedup
... (read more)
larger language models may disappoint you [or, an eternally unfinished draft]

It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.

All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."

  • They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
  • They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that ... (read more)

larger language models may disappoint you [or, an eternally unfinished draft]

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't ... (read more)
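
To make the quoted argument concrete, here's a toy calculation (the numbers are made up purely for illustration): if a task takes k inferential steps and each step succeeds independently with probability p, the whole chain succeeds with probability p^k, so small gains in single-step accuracy produce large gains in multi-step success.

```python
# Toy model of the "multiple inferential steps" argument: chain success = p ** k.
for p in (0.90, 0.95, 0.99):
    for k in (5, 20):
        print(f"single-step acc {p:.2f}, {k:2d} steps -> chain acc {p**k:.3f}")
# e.g. going from 0.95 to 0.99 per step takes 20-step success from ~0.36 to ~0.82
```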

NLP Position Paper: When Combatting Hype, Proceed with Caution

I agree with the critiques you make of specific papers (in section 2), but I'm less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.

IMO, "underclaiming" is ubiquitous in academic papers across many fields -- including fields unrelated to NLP or ML, and fields where there's little to no hype to manage.  Why do academics underclaim?  Common reasons include:

  1. An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior
... (read more)
2Sam Bowman10mo
Yeah, this all sounds right, and it's fairly close to the narrative I was using for my previous draft, which had a section on some of these motives. The best defense I can give of the switch to the hype-centric framing, FWIW:

  • The paper is inevitably going to have to do a lot of chastising of authors. Giving the most charitable possible framing of the motivations of the authors I'm chastising means that I'm less likely to lose the trust/readership of those authors and anyone who identifies with them.
  • An increasingly large fraction of NLP work—possibly even a majority now—is on the analysis/probing/datasets side rather than model development, and your incentives 1 and 2 don't apply as neatly there. There are still incentives to underclaim, but they work differently.
  • Practically, writing up that version clearly seemed to require a good deal more space, in an already long-by-ML-standards paper.

That said, I agree that this framing is a little bit too charitable, to the point of making implausible implications about some of these authors' motives in some cases, which isn't a good look. I also hadn't thought of the wasted effort point, which seems quite useful here. I'm giving a few talks about this over the next few weeks, and I'll workshop some tweaks to the framing with this in mind.
4gwern10mo
Here's an eyerolling example from yesterday or so: Delphi [https://arxiv.org/abs/2110.07574] boasts about their new ethics dataset of n = millions & model which gets 91% vs GPT-3 at chance-level of 52%. Wow, how awful! But wait, we know GPT-3 does better than chance on other datasets like Hendrycks's ETHICS, how can it do so bad where a much smaller model can do so well? Oh, it turns out that that's zeroshot with their idiosyncratic format. The abstract just doesn't mention that when they do some basic prompt engineering (no p-tuning or self-distillation or anything) and include a few examples (ie. a lot fewer than 'millions'), it gets more like... 84%. Oh.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Most complexity measures give roughly similar values for the (relative) complexity of most objects

 

I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those of space B on the t... (read more)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

Consider:

  • If &
... (read more)
2Joar Skalse1y
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about. Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the "ideal" complexity measure would be in the relevant case. In particular, if \(P(f) \simeq 2^{-K(x)+C}\) regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the "ideal" complexity measure for neural networks.

Also: since we're estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we're all ears!

As for the rest of your comment -- what you're saying here seems true to me, but I'm not sure I see how any of this is a counterpoint to anything we're saying?
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

I don't think this step makes sense:

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.

In the picture, it looks like there's something special about having a 1:1 ratio of data to params.  But this is a coincidence due to the authors' choice of units.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data poin... (read more)

1Daniel Kokotajlo2y
Holy shit, mind blown! Then... how are the scaling laws useful at all then? I thought the whole point was to tell you how to divide your compute between... Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used.

I guess this suggests that it would be most convenient to define data as "how long you run the model during training" (which in turn is maybe "how many times the average parameter of the model is activated during training?") Because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * "horizon length."

I'm very interested to hear your thoughts on Ajeya's methodology. Is my sketch of it above accurate? Do you agree it's a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn't have got more easily with a smaller model--regardless of what the horizon length is, or what your training environment is, or what the task is?

... As for the broader point, what do you think of the Carlsmith report? [https://www.openphilanthropy.org/blog/new-report-brain-computation#Conclusions] The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya's report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance... that would probably make her timelines shorter, funnily enough!
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.

Here is what the actual visualization looks like.  More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.

https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png

In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:

  • If yo
... (read more)
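
For anyone who wants to reproduce this kind of picture, here's a minimal matplotlib sketch (not the original plotting code) using the L(N, D) fit from Kaplan et al; the constants are the ones reported in that paper and should be treated as approximate:

```python
import numpy as np
import matplotlib.pyplot as plt

# Kaplan et al (2020) fit: L(N, D) = [(N_c / N)**(a_N / a_D) + D_c / D]**a_D
a_N, a_D = 0.076, 0.095          # fitted exponents (approximate)
N_c, D_c = 8.8e13, 5.4e13        # fitted constants: params, tokens (approximate)

N = np.logspace(6, 12, 200)      # parameter counts
D = np.logspace(6, 12, 200)      # dataset sizes in tokens
NN, DD = np.meshgrid(N, D)
L = ((N_c / NN) ** (a_N / a_D) + D_c / DD) ** a_D

plt.contourf(NN, DD, L, levels=30, cmap="Blues_r")   # darker blue = lower loss
plt.xscale("log"); plt.yscale("log")
plt.xlabel("params N"); plt.ylabel("data D (tokens)")
plt.colorbar(label="L(N, D)")
plt.show()
```
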
1Daniel Kokotajlo2y
OK, wow, I didn't realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya's methodology was great after all -- my worries have been largely dispelled!

Given that the indifference curves are so close to being L-shaped, it seems there's a pretty strong argument to be made that since the human brain has 10e15 params or so, it must be doing some fairly important tasks which can't be done (at least not as well) for much less than 10e15 params. Like, maybe a 10e13 param brain could do the task if it didn't have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren't that big a deal, such that we can be fairly confident that these tasks require a NN of 10e13 or more params.

The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to "within a few OOMs of 10e15." Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also. Because otherwise we could reduce parameter count below 10e15 and keep the same performance.

So I no longer feel weird about this; I feel like this part of Ajeya's analysis makes sense. But I am now intensely curious as to how many "data points" the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10e15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc. Is the second possibility plausible? I guess so. there are 10^9 seconds in a human lifetime,
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

This is a subtle and confusing thing about the Kaplan et al papers.  (It's also the subject of my post that I linked earlier, so I recommend you check that out.)

There are two things in the papers that could be called "optimal compute budgeting" laws:

  • A law that assumes a sufficientl
... (read more)
1Daniel Kokotajlo2y
I've read your linked post thrice now, it's excellent, any remaining confusions are my fault. I didn't confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with: " The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance. " was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you'd disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin. I'm glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.

 

The scaling laws from the Kaplan et al papers do tell you this.

The relevant law is \(L(N, D)\), for the early-stopped test loss given parameter count \(N\) and data size \(D\).  It has the functional form

\[ L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D} \]

with \(\alpha_N \approx 0.076\) and \(\alpha_D \approx 0.095\).

The result that you should scale \(D \propto N^{\alpha_N / \alpha_D}\) comes from trying to keep the two terms in this formula about the same size.

This is not exactly a heuristic for managing compute (s... (read more)
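
As a small numeric sketch of the "keep the two terms the same size" logic (constants again taken from the Kaplan et al fit and therefore approximate; the specific N values are arbitrary):

```python
# Balance (N_c / N)**(a_N / a_D) against D_c / D  =>  D grows like N**(a_N / a_D).
a_N, a_D = 0.076, 0.095          # fitted exponents from Kaplan et al (approximate)
N_c, D_c = 8.8e13, 5.4e13        # fitted constants from Kaplan et al (approximate)

def balanced_D(N):
    # D such that the two terms in L(N, D) are equal
    return D_c / (N_c / N) ** (a_N / a_D)

for N in (1e8, 1e9, 1e10):
    print(f"N = {N:.0e}  ->  D ~ {balanced_D(N):.2e} tokens")
# Each 10x in N raises the balanced D by ~10**(a_N / a_D) ~ 6.3x, i.e. D grows like N**0.8.
```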

2Daniel Kokotajlo2y
Huh, thanks, now I'm more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:

--In my discussion with Rohin I said: Do you agree or disagree? My guess is that you'd disagree, since you say: which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don't think that... OK, yeah, I'm just very confused here, please help!)

2. You say "This is not exactly a heuristic for managing compute (since D is not dependent on compute, it's dependent on how much data you can source)." Well, isn't it both? You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when
the scaling “inconsistency”: openAI’s new insight

I don't think you're completely missing something.  This is the active learning approach, which gwern also suggested -- see that thread for more.

the scaling “inconsistency”: openAI’s new insight

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.


Sure -- my point was to contrast two cases:

  1. a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
  2. the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free.  This includes domains the researchers would never have come up with themselves.

If we switch to manually hunting down large s... (read more)

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall or organizing protests against local officials, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hun... (read more)

the scaling “inconsistency”: openAI’s new insight

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?

 

IIUC, this is trying to make the L(D) curve decline faster by making every data point more impactful (at lowering test loss).  This will help if

  1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
  2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch
... (read more)
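
For concreteness, here is a minimal sketch of the kind of filtering being discussed: score each document with a small pretrained LM and drop what it already predicts easily. It assumes the Hugging Face transformers library and GPT-2; the corpus and threshold are placeholders, not anything from the thread.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_nll(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the small LM."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

corpus = [
    "the the the the the the the",                                            # trivially predictable
    "Proof sketch: apply Zorn's lemma to the poset of partial sections ...",  # harder to predict
]
THRESHOLD = 3.0  # nats/token; arbitrary cutoff, purely illustrative
kept = [doc for doc in corpus if mean_nll(doc) > THRESHOLD]
print(f"kept {len(kept)} of {len(corpus)} documents")
```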

The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Why GPT wants to mesa-optimize & how we might change this

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

 

No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-opti... (read more)

1John Maxwell2y
Your philosophical point is interesting; I have a post in the queue about that. However I don't think it really proves what you want it to. Having John_Maxwell in the byline makes it far more likely that I'm the author of the post. If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle. A perfectly optimal next-step predictor would not be improved by lookahead or anything else, it's perfectly optimal. I'm talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It's still going to be suboptimal after training with current technology, of course.) In orthonormal's post they wrote [https://www.lesswrong.com/posts/3nDR23ksSQJ98WNDm/developmental-stages-of-gpts] : I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don't know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.
Why GPT wants to mesa-optimize & how we might change this

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics.  Language modelin... (read more)

1John Maxwell2y
A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance). My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture. Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution. Lookahead is almost certainly going to do better than random guessing, even topic models can do that. Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?
interpreting GPT: the logit lens

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.

I expect GPT and many other neural models are effectively working in such ... (read more)
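
As a quick numerical illustration of the "many almost-orthogonal directions" point (the dimension 1600 matches GPT-2's embedding width; the number of sampled directions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1600, 10000                      # dimension, number of random directions
V = rng.standard_normal((k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Cosine similarity between one reference direction and all the others
cos = V[1:] @ V[0]
print(f"max |cos| against {k - 1} random directions: {np.abs(cos).max():.3f}")
# Typically ~0.1: in 1600 dimensions you can pack far more than 1600 directions
# that are nearly orthogonal, at the cost of a small reconstruction error.
```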

3Vladimir Mikulik2y
You might want to look into NMF, which, unlike PCA/SVD, doesn't aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don't think it will allow you to find directly the 'larger set of almost orthogonal vectors' you're looking for.
interpreting GPT: the logit lens
One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model.

That's a great idea!

One possible hypothesis that this might let you test
... (read more)
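
In case anyone wants to try it, here is a rough sketch of the proposed experiment: fit a separate linear readout per layer that tries to recover the input token from that layer's activation. The model choice, toy text, and crude train/test split are all just illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The study of language models has become a central topic in machine learning research. " * 20
ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states  # embeddings + one entry per block

tokens = ids[0].numpy()
split = len(tokens) // 2          # first half train, second half test (crude, illustrative only)

for layer, h in enumerate(hidden):
    X = h[0].numpy()              # (seq_len, hidden_dim) activations at this layer
    clf = LogisticRegression(max_iter=2000).fit(X[:split], tokens[:split])
    acc = clf.score(X[split:], tokens[split:])
    print(f"layer {layer:2d}: input-token recovery accuracy {acc:.2f}")
```
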
2Evan Hubinger2y
Thanks! I'd be quite excited to know what you find if you end up trying it. I wasn't thinking you would do this with the natural component basis—though it's probably worth trying that also—but rather doing some sort of matrix decomposition on the embedding matrix to get a basis ordered by importance (e.g. using PCA or NMF—PCA is simpler though I know NMF is what OpenAI Clarity usually uses when they're trying to extract interpretable basis elements from neural network activations) and then seeing what the linear model looks like in that basis. You could even just do something like what you're saying and find some sort of basis ordered by the frequency of the tokens that each basis element corresponds to (though I'm not sure exactly what the right way would be to generate such a basis).
interpreting GPT: the logit lens
Maybe lm_head was set to be equal to wte transpose?

Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.

  • In OpenAI's tensorflow code, see lines 154 and 171 of src/model.py. The variable "wte" is defined on 151, then re-used on 171.
  • In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the
... (read more)
1oekenta2y
Thanks for the info. This was a great read, very informative.
interpreting GPT: the logit lens

Interesting, but not (I think?) the direction I was headed in.

I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

  • The model does poorly at position i, assigning very low probability to
... (read more)
“embedded self-justification,” or something like that

Thanks, the floor/ceiling distinction is helpful.

I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:

  • any resource-bound agent will have ceilings, so an account of embedded rationality needs a "theory of having good ceilings"
  • a "theory of having good ceilings" would be different from the sorts of "theories" we're used to thinking about, involving practical concerns at the fundamental desiderata level rather than as a matter of implementi
... (read more)
When does rationality-as-search have nontrivial implications?
But it seems like the core strategy--be both doing object-level cognition and meta-level cognition about how you're doing object-level cognition--is basically the same.
It remains unclear to me whether the right way to find these meta-strategies is something like "start at the impractical ideal and rescue what you can" or "start with something that works and build new features"; it seems like modern computational Bayesian methods look more like the former than the latter.

I'd argue that there's usually a causal arrow from p... (read more)

Embedded World-Models
OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "... (read more)
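
Since the quoted passage describes a concrete algorithm, here is a generic depth-limited minimax sketch; the game interface and the board-evaluation heuristic are placeholders, not anything from the post:

```python
def minimax(state, depth, maximizing, game, evaluate):
    """Search `depth` plies ahead, then fall back on a heuristic evaluation at the leaves."""
    if depth == 0 or game.is_terminal(state):
        return evaluate(state)            # simple board-evaluation heuristic
    values = (
        minimax(game.apply(state, move), depth - 1, not maximizing, game, evaluate)
        for move in game.legal_moves(state)
    )
    return max(values) if maximizing else min(values)
```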

Embedded World-Models

This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will inevitably make theories of this general type irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:

Embedded agents don’t have the luxury of stepping outside of th
... (read more)

Thanks, this is a very clear framework for understanding your objection. Here's the first counterargument that comes to mind: Minimax search is a theoretically optimal algorithm for playing chess, but is too computationally costly to implement in practice. One could therefore argue that all that matters is computationally feasible heuristics, and modeling an ideal chess player as executing a minimax search adds nothing to our knowledge of chess. OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple boar... (read more)