# All of gwern's Comments + Replies

"Decision Transformer" (Tool AIs are secret Agent AIs)

Rewards need not be written in natural language as crudely as "REWARD: +10 UTILONS". Something to think about as you continue to write text online.

And what of the dead? I own that I thought of myself, at times, almost as dead. Are they not locked below ground in chambers smaller than mine was, in their millions of millions? There is no category of human activity in which the dead do not outnumber the living many times over. Most beautiful children are dead. Most soldiers, most cowards. The fairest women and the most learned men – all are dead. Their bodi

Agency in Conway’s Game of Life

My immediate impulse is to say that it ought to be possible to create the smiley face, and that it wouldn't be that hard for a good Life hacker to devise it.

I'd imagine it to go something like this. Starting from a Turing machine or simpler, you could program it to place arbitrary 'pixels': either by finding a glider-like construct which terminates at specific distances into a still, so the constructor can crawl along an x/y axis, shooting off the terminating-glider to create stable pixels in a pre-programmed pattern. (If that doesn't exist, then one could... (read more)
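
The two ingredients named above, a stable "pixel" and a glider that travels across the grid, can be illustrated with a minimal sketch of the Life update rule. This is not the constructor itself; the names `step`, `run`, `BLOCK`, and `GLIDER` are illustrative assumptions, and the 2x2 block merely stands in for whatever still life would serve as a pixel.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

def run(live, generations):
    """Apply `step` repeatedly."""
    for _ in range(generations):
        live = step(live)
    return live

# A 2x2 block is a still life: a candidate "stable pixel".
BLOCK = {(0, 0), (1, 0), (0, 1), (1, 1)}
# A glider: it reappears shifted by one cell diagonally every 4 generations.
GLIDER = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
```

Running `run(GLIDER, 4)` returns the same shape moved one cell along each axis, while `step(BLOCK)` returns the block unchanged.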

gwern's Shortform

2-of-2 escrow: what is the exploding Nash equilibrium? Did it really originate with NashX? I've been looking for the history & real name of this concept for years now and have failed to refind it. Anyone?

I claim that if we're clever enough, we can construct a hypothetical training regime T' which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven't been able to find it yet.)

I assume they're referring to data poisoning backdoor attacks like https://arxiv.org/abs/2010.12563 or https://arxiv.org/abs/1708.06733 or https://arxiv.org/abs/2104.09667
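
As a rough illustration of what "data poisoning backdoor" means here, the sketch below stamps a trigger onto a small fraction of training examples and relabels them, leaving the rest of the data untouched. It is a toy under stated assumptions, not the method of any of the linked papers: the function name `poison`, the flat-list image format, and the "last pixel set to 1.0" trigger are all hypothetical.

```python
import random

def poison(dataset, rate, target_label, seed=0):
    """Return a copy of `dataset` with a fraction `rate` of examples backdoored.

    Each example is (pixels, label), with pixels a flat list of floats.
    A poisoned example gets its last pixel set to 1.0 (the hypothetical
    trigger) and its label replaced by `target_label`.
    """
    rng = random.Random(seed)
    n_poison = int(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    out = []
    for i, (pixels, label) in enumerate(dataset):
        if i in chosen:
            pixels = pixels[:-1] + [1.0]  # stamp the trigger
            label = target_label          # flip to the attacker's label
        out.append((list(pixels), label))
    return out

# 100 blank 4-pixel "images" with alternating labels; poison 10% of them.
clean = [([0.0] * 4, i % 2) for i in range(100)]
poisoned = poison(clean, rate=0.1, target_label=1)
```

A model trained on `poisoned` sees almost the same data as one trained on `clean`, which is the sense in which behavior on T can stay nearly identical while trigger-bearing inputs are steered elsewhere.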

2020 AI Alignment Literature Review and Charity Comparison

That's interesting. I did see YC listed as a major funding source, but given Sam Altman's listed loans/donations, I assumed, because YC has little or nothing to do with Musk, that YC's interest was Altman, Paul Graham, or just YC collectively. I hadn't seen anything at all about YC being used as a cutout for Musk. So assuming the Guardian didn't screw up its understanding of the finances there completely (the media is constantly making mistakes in reporting on finances and charities in particular, but this seems pretty detailed and specific and hard to get... (read more)

Against evolution as an analogy for how humans will create AGI

As described above, I expect AGI to be a learning algorithm—for example, it should be able to read a book and then have a better understanding of the subject matter. Every learning algorithm you’ve ever heard of—ConvNets, PPO, TD learning, etc. etc.—was directly invented, understood, and programmed by humans. None of them were discovered by an automated search over a space of algorithms. Thus we get a presumption that AGI will also be directly invented, understood, and programmed by humans.

For a post criticizing the use of evolution for end to end ML, t... (read more)

4 Richard Ngo 3mo: I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.

I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its own right!) and then criticising it. This is most important precisely in the cases where the defenders of an implicit paradigm don't have solid articulations of it, making it particularly difficult to understand what they're actually defending. I think this is such a case.

If you disagree, I'd be curious what you consider a non-strawmanish summary of the RL-evolution analogy. Perhaps Clune's AI-GA paper? But from what I can tell opinions of it are rather mixed, and the AI-GA terminology hasn't caught on.

Thanks for all those great references!

My current thinking is: (1) Outer-loop meta-learning is slow, (2) Therefore we shouldn't expect to get all that many bits of information out of it, (3) Therefore it's a great way to search for parameter settings in a parameterized family of algorithms, but not a great way to do "the bulk of the real design work", in the sense that programmers can look at the final artifact and say "Man, I have no idea what this algorithm is doing and why it's learning anything at all, let alone why it's learning things very effectively... (read more)

3 Adam Shimi 3mo: Just wanted to say that this comment made me add a lot of things on my reading list, so thanks for that (but I'm clearly not well-read enough to go into the discussion).

[AN #142]: The quest to understand a network well enough to reimplement it by hand

It is quite possible that CLIP “knows” that the image contains a Granny Smith apple with a piece of paper saying “iPod”, but when asked to complete the caption with a single class from the ImageNet classes, it ends up choosing “iPod” instead of “Granny Smith”. I’d caution against saying things like “CLIP thinks it is looking at an iPod”; this seems like too strong a claim given the evidence that we have right now.

Yes, it's already been solved. These are 'attacks' only in the most generous interpretation possible (since it does know the difference), and ... (read more)

What happens to variance as neural network training is scaled? What does it imply about "lottery tickets"?

'Variance' is used in an amusing number of ways in these discussions. You use 'variance' in one sense (the bias-variance tradeoff), but "Explaining Neural Scaling Laws", Bahri et al 2021 talks about a different kind of variance limit in scaling, while "Learning Curve Theory", Hutter 2021's toy model provides statements on yet other kinds of variances about scaling curves themselves (and I think you could easily dig up a paper from the neural tangent kernel people about scaling approximating infinite width models which only need to make infinitesimally sma... (read more)
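
For reference, the bias-variance sense of 'variance' is the middle term of the standard decomposition of expected squared error. The notation is an assumption made for this note rather than taken from any of the cited papers: data are generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat f_D$ is the model fit on a random training set $D$.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The other uses of 'variance' mentioned above refer to different quantities and should not be read into this formula.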

2020 AI Alignment Literature Review and Charity Comparison

OpenAI was initially funded with money from Elon Musk as a not-for-profit.

This is commonly said on the basis of his $1b pledge, but AFAICT Musk wound up contributing little or nothing before he resigned ~2018. If you look at the OA Form 990s, Musk is never listed as a donor, only a board member; the only entities that are listed as contributing money or loans are Sam Altman, Y Combinator Research, and OpenAI LP.

He's definitely given some money, and I don't think the 990 absence means much. From here:

in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, YC.org. [...] The Musk Foundation’s grant accounted for the majority of YC.org’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.

Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but ... (read more)

This is commonly said on the basis of his $1b pledge

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was.

Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

https://openai.com/blog/introducing-openai/

Extrapolating GPT-N performance

Finally, the scramble task is about shuffling around letters in the right way, and arithmetic is about adding, subtracting, dividing, and multiplying numbers. The main interesting thing about these tasks is that performance doesn’t improve at all in the beginning, and then starts improving very fast. This is some evidence that we might expect non-linear improvements on particular tasks, though I mostly interpret it as these tasks being quite narrow, such that when a model starts getting the trick, it’s quite easy to systematically get right.

To beat my u... (read more)

the scaling “inconsistency”: openAI’s new insight

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hundreds of gigabytes I would expect to induce met... (read more)

the scaling “inconsistency”: openAI’s new insight

Yes, my hypothesis is that active learning should have a different asymptotic because in a lot of admittedly-simple scenarios like logistic regression, active learning has a much nicer asymptotic. Right now, it's not too hard to run in <=1 epoch, and GPT-3 did, and that's using just CC and books1/2/WP. There's loads of other text datasets.
(I think someone in EleutherAI was saying that Literotica alone was 500GB...?) Even if active learning 'runs out' of useful data before available compute, that will (a) save a whole lot of compute/time, and (b) tell u... (read more)

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Sure -- my point is to contrast two cases

1. a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
2. the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free. This includes domains the researchers would never have come up with themselves. If we switch to manually hunting down large s... (read more)

the scaling “inconsistency”: openAI’s new insight

This makes sense to me and is what I've been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction of meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze bl... (read more)

3 David Krueger 5mo: I found this confusing. It sort of seems like you're assuming that a Bayes-optimal learner achieves the Bayes error rate (are you?), which seems wrong to me.

* What do you mean "the Bayes-limit"? At first, I assumed you were talking about the Bayes error rate (https://en.wikipedia.org/wiki/Bayes_error_rate), but that is (roughly) the error you could expect to achieve with infinite data, and we're still talking about finite data.
* What do you mean "Bayes-optimal learner"?
I assume you just mean something that performs Bayes rule exactly (so depends on the prior/data).
* I'm confused by you talking about "approach[ing] the intrinsic entropy"... it seems like the figure in OP shows L(C) approaching L(D). But is L(D) supposed to represent intrinsic entropy? should we trust it as an estimate of intrinsic entropy?

I also don't see how active learning is supposed to help (unless you're talking about actively generating data)... I thought the whole point you were trying to make is that once you reach the Bayes error rate there's literally nothing you can do to keep improving without more data. You talk about using active learning to throw out data-points... but I thought the problem was not having enough data? So how is throwing out data supposed to help with that?

3 nostalgebraist 7mo: IIUC, this is trying to make L(D) faster by making every data point more impactful (at lowering test loss). This will help if

1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them

I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).

With text data, though, I'm concerned that (2) will fail soon. The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc. The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

Mesa-Search vs Mesa-Control

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

1 Vanessa Kosoy 8mo: I am not sure what you mean by "stop cold?" It has to do with minibatches, because in offline learning your datapoints can be (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I don't see anything unusual about this, the MDP in RL is virtually always allowed to be stochastic.

As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?

Mesa-Search vs Mesa-Control

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

1 Vanessa Kosoy 8mo: I already conceded [https://www.lesswrong.com/posts/WmBukJkEFM72Xr397/mesa-search-vs-mesa-control?commentId=pvjpMDpyzp7icAFEZ] a Transformer can be made stochastic. I don't see a problem with backproping: you treat the random inputs as part of the environment, and there's no issue with the environment having stochastic parts. It's stochastic gradient descent, after all.

Reply to Jebari and Lundborg on Artificial Superintelligence

I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can't deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (ie. humans). They leave themselves an out by handwaving away all such inconveniences as 'iron ore agents', but then it's thoroughly useless and circular; "what's an iron ore agent?" "It's one which has dangerous outcomes due to hidden agency." "OK, which agents are those, how can you tell AlphaZero from GPT-3 from AGI?" "Well, try them and see!"

[AN #120]: Tracing the intellectual roots of AI and AI alignment

The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.

Yeah, I think that's all they mean: the CNN and input/output are the same as Leela the same... (read more)

2 Rohin Shah 8mo: Fair point. In my ontology, "behavior cloning" is always with respect to some expert distribution, so I see the stratified samples as "several instances of behavior cloning with different expert distributions", but that isn't a particularly normal or accepted ontology.

Yeah it does seem like this would have worked better -- if nothing else, the predictions could be more precise (rather than specifying the bucket in which the current player falls, you can specify their exact ELO instead).

Environments as a bottleneck in AGI development

"Blessings of scale" observations aside, it seems like right now, environments are not the bottleneck to DL/DRL work.
No one failed to solve Go because gosh darn it, they just lacked a good Go simulator which correctly implemented the rules of the game; the limits to solving ALE-57 (like Montezuma's Revenge) in general or as a single multi-task agent do not seem to be lack of Atari games where what we really need is ALE-526*; Procgen performance is not weak because of insufficient variation in levels; OpenAI Universe failed not for lack of tasks, to say th... (read more)

The fact that progress on existing environments (Go, ALE-57, etc) isn't bottlenecked by environments doesn't seem like particularly useful evidence. The question is whether we could be making much more progress towards AGI with environments that were more conducive to developing AGI. The fact that we're running out of "headline" challenges along the lines of Go and Starcraft is one reason to think that having better environments would make a big difference - although to be clear, the main focus of my post is on the coming decades, and the claim that enviro... (read more)

Why GPT wants to mesa-optimize & how we might change this

It still is, it's just that beam search (or other search strategies) seem to be mostly useful for closed-end short text generation; translating a sentence apparently is a task with enough of a right-or-wrong-ness to it that beam search apparently taps into no pathologies. But they get exposed for open-ended longform generation.

Why GPT wants to mesa-optimize & how we might change this

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see "The Curious Case of Neural Text Degeneration", Holtzman et al 2019.
And while unlikelihood training seems to help, it's not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).

3 David Krueger 9mo: Seq2seq used beam search and found it helped (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43155.pdf). It was standard practice in the early days of NMT; I'm not sure when that changed.

This blog post gives some insight into why beam search might not be a good idea, and is generally very interesting: https://benanne.github.io/2020/09/01/typicality.html

[AN #116]: How to make explanations of neurons compositional

The composition paper seems to exemplify what I talk about as my intuition for how NNs work. The models are both very small and trained on little data, but image classification seems to be much easier than NLP (which is why the DL revolution came to image classification many years before NLP), so it's enough to train the CNN to have fairly meaningful disentangled representations of the kind we expect; their RNN model, however, continues to grope through relatively superficial associations and tricks, as the text database is relatively tiny. I'd predict tha... (read more)

6 Rohin Shah 9mo: Yup, I generally agree (both with the three predictions, and the general story of how NNs work).

interpreting GPT: the logit lens

Doing it with GPT-3 would be quite challenging just for compute requirements like RAM. You'd want to test this out on GPT-2-117M first, definitely. If the approach works at all, it should work well for the smallest models too.

interpreting GPT: the logit lens

I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts.

Unroll the sampling process: hook up all the individual GPT instances into a single long model, bypass the discretizing/embedding layers to make it differentiable end-to-end, and do gradient ascent to find the sequence which maximizes likelihood conditional on the fixed input.

4 nostalgebraist 10mo: Interesting, but not (I think?) the direction I was headed in. I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

* The model does poorly at position i, assigning very low probability to the true token residing at i+1.
* To retain a clear view of the input sequence, the model now needs to "keep around" the true token at i+1, since its own guess is a poor proxy.
* But early layers don't know that: they can't "look up" and notice the poor prediction. So they just treat i+1 like any other position. (I.e. there's no way to implement a selective "copy when we got it wrong" mechanism)
* In late layers, position i+1 has been converted into a guess about i+2 by the earlier layers, so we can't rely on it to tell us what really occupied i+1.
* And position i has been converted to a bad guess about position i+1, so if we use it as a proxy for i+1 we'll do poorly.

My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)

But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well. I have no idea how it does it -- indeed the connection structure feels weird

0 oceaninthemiddleofanisland 10mo: How far away is this from being implementable?

Forecasting Thread: AI Timelines

I was looking at the NIPS growth numbers last June and I made a joke:

AI researcher anthropics: 'researchers [should] tend to think AI is ~20 years away because given exponential growth of researchers & careers of ~30 years, the final generation of researchers will make up a majority of all researchers, hence, by SSA+Outside View, one must assume 20 years.'

(Of course, I'm making a rather carbon-chauvinistic assumption here that it's only human researchers/researcher-years which matter.)

SDM's Shortform

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:

Defenders o ... (read more)

2 Sammy Martin 10mo: I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer?
I'll point out I didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

Matt Botvinick on the spontaneous emergence of learning algorithms

The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread it to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you're unsu... (read more)

Mesa-Search vs Mesa-Control

And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.

1 Vanessa Kosoy 10mo: No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many "Monte Carlo" iterations until it produces the right history.

Mesa-Search vs Mesa-Control

Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren't the internal activations of a many-layer Transformer perfectly adequate to first encode, 'storing z', and then transform?

1 Vanessa Kosoy 10mo: I'm assuming that either architecture can use a source of random bits.

The transformer produces one bit at a time, computing every bit from the history so far. It doesn't have any state except for the history.
At some stage of the game the history consists of y only. At this stage the transformer would have to compute z from y in order to win. It doesn't have any activations to go on besides those that can be produced from y.

Mesa-Search vs Mesa-Control

I'm a little confused as to why there's any question here. Every algorithm lies on a spectrum of tradeoffs from general to narrow. The narrower a class of solved problems, the more efficient (in any way you care to name) an algorithm can be: a Tic-Tac-Toe solver is going to be a lot more efficient than AIXI.

Meta-learning works because the inner algorithm can be far more specialized, and thus, more performant or sample-efficient than the highly general outer algorithm which learned the inner algorithm.

For example, in Dactyl, PPO trains a RNN to adapt to man... (read more)

Here on LW / AF, "mesa optimization" seems to only apply if there's some sort of "general" learning algorithm, especially one that is "using search", for reasons that have always been unclear to me. Some relevant posts taking the opposite perspective (which I endorse):

* Is the term mesa optimizer too narrow?
* Why is pseudo-alignment "worse" than other ways ML can fail to generalize?

Mesa-Search vs Mesa-Control

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I think that's exactly it. There's no real difference between a history, and a recurrence. A recurrence is a (lossy) function of a history, so anything a recurrent hidden state can encode, a sufficiently large/deep feedforward model given access to the full history should be able to internally represent as we... (read more)

1 David Krueger 9mo: Practically speaking, I think the big difference is that the history is outside of GPT-3's control, but a recurrent memory would be inside its control.

There's no real difference between a history, and a recurrence.

That's true for unbounded agents but false for realistic (bounded) agents. Consider the following two-player zero-sum game:

Player A secretly writes some , then player B says some and finally player B says some . Player A gets reward unless where is a fixed one-way function. If , player A gets a reward in which is the fraction of bits and have in common.

The optimal strategy for player A is producing a random sequence. The op... (read more)

Matt Botvinick on the spontaneous emergence of learning algorithms

Learning still happening after weights are frozen? That’s crazy. I think it’s a big deal because it is evidence for mesa-optimization being likely and hard to avoid.

Sure. We see that elsewhere too, like Dactyl. And of course, GPT-3.

3 Daniel Kokotajlo 10mo: Huh, thanks.

Measuring hardware overhang

Also, older 32-bit CPUs are capped at 4 GB of RAM, making execution of larger models impossible.

Slower, not impossible. I don't think any of the chess or Go models have model sizes >1GB, and even if they did, you don't have to load the entire model into RAM, they're just feedforward CNNs, you only need to be able to fit one layer at a time. With appropriate tricks you could probably even slowly train the models, like https://arxiv.org/abs/2002.05645v5

1 hippke 10mo: Right. My experiment used 1 GB for Stockfish, which would also work on a 486 machine (although at the time, it was almost unheard of [https://www.vogons.org/viewtopic.php?t=56970]...)

Are we in an AI overhang?

But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.

With NNs, 'foom' is not merely possible, it's the default.
If you train a model, then as soon as it's done you get, among other things:

• the ability to run thousands of copies in parallel on the same hardware
• in a context like A ... (read more)

Does the lottery ticket hypothesis suggest the scaling hypothesis?

I wouldn't say the scaling hypothesis is purely about Transformers. Quite a few of my examples are RNNs, and it's unclear how much of a difference there is between RNNs and Transformers anyway. Transformers just appear to be a sweet spot in terms of power while being still efficiently optimizable on contemporary GPUs. CNNs for classification definitely get better with scale and do things like disentangle & transfer & become more robust as they get bigger (example from today), but whether they start exhibiting any meta-learning specifically I don't know.

Are we in an AI overhang?

Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct) and that AI will have to wait for hardware to become so powerful that even a grad student can afford it with $1k (also correct, see AlexNet), and extrapolating from ~1998, estimates:

At the present rate, computers suitable for humanlike robots will appear in the 2020s.

Guess what year today is.

Are we in an AI overhang?

GPT-3 based text embedding should be extremely useful for creating summaries of arbitrary text (such as, web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a less-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they already surely use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.
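
To make "feeding embeddings into search/ad infrastructure" concrete, here is a minimal sketch of embedding-based retrieval: documents and a query are represented as vectors and ranked by cosine similarity. Everything specific is an assumption for illustration: the 3-dimensional vectors are hand-written stand-ins for model-produced embeddings, the document ids are made up, and no real GPT-3 or Google API is being called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)

# Toy vectors standing in for embeddings of an ad and two web pages.
docs = {
    "ad_running_shoes": [0.9, 0.1, 0.0],
    "page_marathon_training": [0.7, 0.3, 0.1],
    "page_tax_law": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # stand-in embedding of a sports-related query
ranking = rank(query, docs)
```

Better embeddings would change only the vectors, not the ranking machinery, which is why they could slot into existing infrastructure.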

Are we in an AI overhang?

Text embeddings for knowledge graphs and ads is the most immediately obvious big bucks application.

2 Daniel Kokotajlo 1y: Can you explain more?

Are we in an AI overhang?

What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?

Are we in an AI overhang?

As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.

The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT's context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model... (read more)

Are we in an AI overhang?

As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.

That's the terrifying thing about NNs and what I dub the ... (read more)

4 Christian Kleineidam 1y: I'm not sure why that's terrifying. It seems reassuring to me because it means that there's no way for the NN to suddenly go FOOM because it can't just quickly retrain.

Can you get AGI from a Transformer?

MuZero does not do MCTS and still outperforms.

It does do (a variant of) MCTS. Check it out for yourself. The paper is here:

https://arxiv.org/pdf/1911.08265.pdf

Appendix B, page 12:

"We now describe the search algorithm used by MuZero. Our approach is based upon Monte-Carlo tree search with upper confidence bounds, an approach to planning that converges asymptotically to the optimal policy in single agent domains and to the minimax value function in zero sum games [22]."

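The "upper confidence bounds" part of the quoted passage can be sketched with the textbook UCB1 selection rule used in plain UCT-style tree search. This is an illustration under assumptions, not MuZero's actual rule: the paper describes a variant that also weights actions by a learned policy prior, and the names `ucb1` and `select_child` and the exploration constant are choices made here.

```python
import math

def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of a child: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus

def select_child(children):
    """Pick the action whose child has the highest UCB1 score.

    `children` maps action -> (total_value, visits).
    """
    parent_visits = sum(v for _, v in children.values())
    return max(children, key=lambda a: ucb1(*children[a], parent_visits))

# Action "a" has the best mean so far, "b" is under-explored, "c" is untried.
stats = {"a": (6.0, 10), "b": (1.0, 2), "c": (0.0, 0)}
choice = select_child(stats)
```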
Can you get AGI from a Transformer?

I'm going to stop at your very first claim and observe: MuZero.

2 LGS 1y: You are aware that MuZero has tree search hardcoded into it, yes? How does that contradict claim 1?

TurnTrout's shortform feed

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for... (read more)