Same person as nostalgebraist2point0, but now I have my account back.
Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?
No, it's a more philosophical point. Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."
For example, the mere appearance of something like
Title: Why GPT wants to mesa-optimize & how we might change this
does not guarantee that the text following it bears that title, or was written by that author. (As I am illustrating right now.)
Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc. (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)
But even that can only go so far. If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities? A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character? (Note that the character is a much better answer to the question "who does this sound like?") And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?
Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).
My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.
Yeah, not for transformers I think.
Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.
capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead; it should already have those facts priced into its next-step prediction.
If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet. But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search would take a lot of parameters to implement internally.
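For concreteness, here is a toy sketch of the kind of lookahead under discussion -- beam search over a next-token distribution. The `next_logprobs` function is a hypothetical stand-in for whatever the next-step predictor is; nothing here is meant to reflect GPT's internals.

```python
import heapq, math

def beam_search(prefix, next_logprobs, beam_width=2, depth=3):
    # Keep the `beam_width` highest-logprob continuations at each step.
    beams = [(0.0, prefix)]
    for _ in range(depth):
        candidates = [(logp + step_lp, seq + [tok])
                      for logp, seq in beams
                      for tok, step_lp in next_logprobs(seq).items()]
        beams = heapq.nlargest(beam_width, candidates)
    return beams

# Toy "language model": after any prefix, token 0 is likely and token 1 unlikely.
toy_lm = lambda seq: {0: math.log(0.9), 1: math.log(0.1)}
print(beam_search([], toy_lm))
```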
I'm skeptical that internal beam search would help in language modeling.
Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking. So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
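As a toy illustration of how fast this degrades (my arithmetic, assuming independence across steps rather than anything measured):

```python
# If the model matches the author's actual next token with probability p,
# independently at each step, the chance of matching an entire L-token
# continuation decays geometrically in L.
p = 0.8
for L in (1, 5, 10, 20):
    print(L, round(p ** L, 4))   # 0.8, 0.3277, 0.1074, 0.0115
```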
Weather is like this because of chaotic dynamics. Language modeling is like this because
(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.
(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom. So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at. (Or even whether it was a human at all.)
Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go. It's more about trying to guess what the author of this text will do in the very next step. The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.
To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!
And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in. (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )
I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.
What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.
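A quick numerical check of that intuition (a sketch with random vectors, not GPT's actual embeddings): random unit vectors in 1600 dimensions overlap only slightly, so far more than 1600 of them can coexist with small interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1600, 2000                      # embedding dim, number of random directions
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

sims = V @ V.T                         # pairwise cosine similarities
off_diag = sims[~np.eye(n, dtype=bool)]
print(off_diag.std())                  # ~ 1/sqrt(1600) = 0.025
print(np.abs(off_diag).max())          # on the order of 0.1, still far from 1
```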
I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?
One thing which occurred to me that might be interesting to do is to try training a linear model to reconstitute the input from the activations at different layers, to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns, like whether the accuracy increases or decreases as you get further into the model.
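A rough sketch of what that probe could look like (my illustration, not the commenter's code -- the corpus, the choice of logistic-regression probes, and the lack of a held-out set are all arbitrary simplifications):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states  # embeddings + each layer

targets = ids[0].numpy()                   # the input tokens we try to recover
for layer, h in enumerate(hidden):
    X = h[0].numpy()                       # (seq_len, n_embd) activations at this layer
    probe = LogisticRegression(max_iter=1000).fit(X, targets)
    print(layer, probe.score(X, targets))  # per-layer reconstruction accuracy (train-set only)
```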
That's a great idea!
One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly, via what the model's guess is given that input, or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).
Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.
But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.
Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.
Maybe lm_head was set to be equal to wte transpose?
Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.
Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the "head" used to convert the final layer outputs into predictions. The intent is easy swapping between different "heads" with the same "body" beneath.
This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.
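For what it's worth, the tie is easy to check in the huggingface implementation (a minimal sketch, assuming the `transformers` package and PyTorch):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Both matrices have shape (vocab_size, n_embd) ...
print(model.lm_head.weight.shape, model.transformer.wte.weight.shape)
# ... and in the default configuration they are the same tensor (shared storage).
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())
```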
Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn't a critique of the latter; it's trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.
Interesting, but not (I think?) the direction I was headed in.
I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.
The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.
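One way to see what I mean by "looking at its own guesses" (a rough sketch, assuming the huggingface `transformers` package; the prompt and the choice to project through ln_f and lm_head are my own illustration): project each layer's hidden states through the final layer norm and the unembedding, and watch how early the top-1 guess settles.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):            # embeddings, then one entry per block
    logits = model.lm_head(model.transformer.ln_f(h))     # project through final LN + unembedding
    guess = logits[0, -1].argmax().item()
    print(layer, repr(tok.decode([guess])))                # top-1 next-token guess at each layer
```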
Consider this scenario:
My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)
But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well.
I have no idea how it does it -- indeed the connection structure feels weirdly adverse to such operations -- but apparently it does. So it's probably premature to assume it can't do this well, and attempt to "help it out" with extra tricks.
Thanks, the floor/ceiling distinction is helpful.
I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:
In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.
However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.
I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.
But it seems like the core strategy -- doing both object-level cognition and meta-level cognition about how you're doing the object-level cognition -- is basically the same.
It remains unclear to me whether the right way to find these meta-strategies is something like "start at the impractical ideal and rescue what you can" or "start with something that works and build new features"; it seems like modern computational Bayesian methods look more like the former than the latter.
I'd argue that there's usually a causal arrow from practical lore to impractical ideals first, even if the ideals also influence practice at a later stage. Occam's Razor came before Solomonoff; "change your mind when you see surprising new evidence" came before formal Bayes. The "core strategy" you refer to sounds like "do both exploration and exploitation," which is the sort of idea I'd imagine goes back millennia (albeit not in those exact terms).
One of my goals in writing this post was to formalize the feeling I get, when I think about an idealized theory of this kind, that it's a "redundant step" added on top of something that already does all the work by itself -- like taking a decision theory and appending the rule "take the actions this theory says to take." But rather than being transparently vacuous, like that example, these theories are vacuous in a more hidden way, and the redundant steps they add tend to resemble legitimately good ideas familiar from practical experience.
Consider the following (ridiculous) theory of rationality: "do the most rational thing, and also, remember to stay hydrated :)". In a certain inane sense, most rational behavior "conforms to" this theory, since the theory parasitizes on whatever existing notion of rationality you had, and staying hydrated is generally a good idea and thus does not tend to conflict with rationality. And whenever staying hydrated is a good idea, one could imagine pointing to this theory and saying "see, there's the hydration theory of rationality at work again." But, of course, none of this should actually count in the "hydration theory's" favor: all the real work is hidden in the first step ("do the most rational thing"), and insofar as hydration is rational, there's no need to specify it explicitly. This doesn't quite map onto the R/S schema, but captures the way in which I think these theories tend to confuse people.
If the more serious ideals we're talking about are like the "hydration theory," we'd expect them to have the appearance of explaining existing practical methods, and of retrospectively explaining the success of new methods, while not being very useful for generating any new methods. And this seems generally true to me: there's a lot of ensemble-like or regularization-like stuff in ML that can be interpreted as Bayesian averaging/updating over some base space of models, but most of the excitement in ML is in these base spaces. We didn't get neural networks from Bayesian first principles.
OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.
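A minimal sketch of that algorithm, just to have something concrete on the table (the toy "game" and heuristic here are hypothetical, only enough to make it runnable):

```python
def minimax(state, depth, maximizing, moves, apply_move, evaluate):
    options = moves(state)
    if depth == 0 or not options:
        return evaluate(state)              # heuristic guess at the leaf
    scores = [minimax(apply_move(state, m), depth - 1, not maximizing,
                      moves, apply_move, evaluate) for m in options]
    return max(scores) if maximizing else min(scores)

# Toy usage: states are integers, a move adds 1 or 2, and the "board evaluation"
# is just the state's value.
print(minimax(0, depth=3, maximizing=True,
              moves=lambda s: [1, 2] if s < 10 else [],
              apply_move=lambda s, m: s + m,
              evaluate=lambda s: s))
```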
I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "it's a truncation of the optimum," since they all are; the properties of these truncations are facts about the different series, not about the optimum. I illustrate with different series expansions for π.
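To make that concrete (a quick numerical version of the π illustration; the particular pair of series is my choice):

```python
import math

def leibniz(n):                      # pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
    return 4 * sum((-1) ** k / (2 * k + 1) for k in range(n))

def nilakantha(n):                   # pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
    s, sign = 3.0, 1.0
    for k in range(1, n + 1):
        a = 2 * k
        s += sign * 4.0 / (a * (a + 1) * (a + 2))
        sign = -sign
    return s

# Both series converge to pi, but their truncations at N terms behave very
# differently -- a fact about the series, not about pi.
for n in (10, 100):
    print(n, abs(leibniz(n) - math.pi), abs(nilakantha(n) - math.pi))
```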
Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search.
You may be right, and there are interesting conversations to be had about when solutions will tend to look like search and when they won't. But this doesn't feel like it really addresses my argument, which is not about "what kind of algorithm should you use" but about the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the optimization procedures you could ever do. There is a logical / definitional weirdness here that can't be resolved by arguments about what sorts of (logically / definitionally unproblematic) algorithms are good or bad in what domains.
This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will inevitably make theories of this general type irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:
Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations that are similarly as strong as the foundations Bayesianism provides for dualistic agents.
Most "theories of rational belief" I have encountered -- including Bayesianism in the sense I think is meant here -- are framed at the level of an evaluator outside the universe, and have essentially no content when we try to transfer them to individual embedded agents. This is because these theories tend to be derived in the following way:
For example, in Solomonoff, S is defined by computability while R is allowed to be uncomputable. In the LIA construction, S is defined by polytime complexity while R is allowed to run slower than polytime. In logically-omniscient Bayes, finite sets of hypotheses can be manipulated in a finite universe but the full Boolean algebra over hypotheses generally cannot.
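To spell out the Solomonoff case in standard notation (my addition, just to make the S/R split concrete): the Solomonoff prior is

$$M(x) \;=\; \sum_{p \,:\, U(p)\ \text{outputs a string beginning with}\ x} 2^{-|p|},$$

where each program p run on the universal machine U is computable (roughly, an element of S), while the mixture M, which sums over all of them, is uncomputable and so lives at the level of R.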
I hope the framework I've just introduced helps clarify what I find unpromising about these theories. By construction, any agent you can actually design and run is a single element of S (a "practical strategy"), so every fact about rationality that can be incorporated into agent design gets "hidden inside" the individual s∈S, and the only things you can learn from the "ideal theory" R are things which can't fit into a practical strategy.
For example, suppose (reasonably) that model averaging and complexity penalties are broadly good ideas that lead to good results. But all of the model averaging and complexity penalization that can be done computably happens inside some Turing machine or other, at the level "below" Solomonoff. Thus Solomonoff only tells you about the extra advantage you can get by doing these things uncomputably. Any kind of nice Bayesian average over Turing machines that can happen computably is (of course) just another Turing machine.
This also explains why I find it misleading to say that good practical strategies constitute "approximations to" an ideal theory of this type. Of course, since R just says to follow the best strategies in S, if you are following a very good strategy in S your behavior will tend to be close to that of R. But this cannot be attributed to any of the searching over S that R does, since you are not doing a search over S; you are executing a single member of S and ignoring the others. Any searching that can be done practically collapses down to a single practical strategy, and any that doesn't is not practical. Concretely, this talk of approximations is like saying that a very successful chess player "approximates" the rule "consult all possible chess players, then weight their moves by past performance." Yes, the skilled player will play similarly to this rule, but they are not following it, not even approximately! They are only themselves, not any other player.
Any theory of ideal rationality that wants to be a guide for embedded agents will have to be constrained in the same ways the agents are. But theories of ideal rationality usually get all of their content by going to a level above the agents they judge. So this new theory would have to be a very different sort of thing.