gwern

Comments

Are we in an AI overhang?

But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.
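
As a rough illustration of "training" with examples alone, here is a minimal sketch of how a few-shot prompt might be assembled. The `build_few_shot_prompt` helper, the `Q:`/`A:` layout, and the translation examples are all assumptions made for illustration, not anything specified by the GPT-3 paper or the OpenAI API; the only point is that the "training data" lives in the prompt text and no gradient step is taken.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: solved examples followed by an open query.

    Hypothetical helper: the Q:/A: layout is an illustrative convention only.
    """
    lines = []
    if instruction:
        lines.append(instruction)
        lines.append("")
    for source, target in examples:
        # Each solved example is a demonstration the model can condition on.
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
        lines.append("")
    # The final query is left open for the model to complete.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "cat", "Translate English to French.")
print(prompt)
```

Switching to a different task means swapping the examples, which is the tiny marginal cost being described.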

With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:

  • the ability to run thousands of copies in parallel on the same hardware

    • in a context like AlphaGo, I estimate gains of several hundred Elo in strength if you reuse the same hardware merely to run tree search with exact copies of the original model
  • meta-learning / transfer-learning to any related domain, cutting training requirements by orders of magnitude

  • model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc, but often extreme like 1/100th)

  • reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)

  • learning-by-doing/learning curve effects (highest in information technologies), so the next from-scratch model may be much cheaper (eg OA5 got a, what was it, 5x cost reduction for the second model OA trained from scratch based on the lessons of the first?)

    • baseline for engineering much more efficient ones by ablating and comparing with the original
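
To make the compression/distillation item concrete, here is a minimal sketch of one common form of distillation loss: the KL divergence between temperature-softened teacher and student output distributions. The function names, the temperature of 2.0, and the toy logits are assumptions chosen for illustration; real distillation setups vary widely and usually combine this term with other losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits (illustrative values only).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]    # ranks the classes the other way round

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))
```

A small student trained to push this loss down is imitating the big model's outputs rather than relearning the task from raw data, which is why it can be so much cheaper.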

Does the lottery ticket hypothesis suggest the scaling hypothesis?

I wouldn't say the scaling hypothesis is purely about Transformers. Quite a few of my examples are RNNs, and it's unclear how much of a difference there is between RNNs and Transformers anyway. Transformers just appear to be a sweet spot in terms of power while being still efficiently optimizable on contemporary GPUs. CNNs for classification definitely get better with scale and do things like disentangle & transfer & become more robust as they get bigger (example from today), but whether they start exhibiting any meta-learning specifically I don't know.

Are we in an AI overhang?

Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct) and that AI will have to wait for hardware to become so powerful that even a grad student can afford it with $1k (also correct, see AlexNet), and extrapolating from ~1998, estimates:

At the present rate, computers suitable for humanlike robots will appear in the 2020s.

Guess what year today is.

Are we in an AI overhang?

GPT-3-based text embeddings should be extremely useful for creating summaries of arbitrary text (such as web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a lesser-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they surely already use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.

Are we in an AI overhang?

Text embeddings for knowledge graphs and ads are the most immediately obvious big-bucks application.

Are we in an AI overhang?

What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?

Are we in an AI overhang?
Promoted by Raemon

As far as I can tell, this is what is going on: they do not have any such thing, because GB and DM do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OA do.

GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc's group occasionally surprises you. They'll dabble in something like GShard, but mostly because they expect to be able to deploy it or something like it to production in Google Translate.

DM (particularly Hassabis, I'm not sure about Legg's current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then DMLab is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it'll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM 'bet the company' on any moonshot; Google's cashflow isn't going anywhere, and slow and steady wins the race.

OA, lacking anything like DM's long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: "the scaling hypothesis is true" and so simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. And if OA is wrong to trust in the God of Straight Lines On Graphs, well, they never could compete with DM directly using DM's favored approach, and were always going to be an also-ran footnote.

While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary compute budgets are still trivial in terms of Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the single most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it's too late. This might seem absurd, but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3... (When faced with the choice between having to admit all their fancy hard work is a dead-end, swallow the bitter lesson, and start budgeting tens of millions of dollars for compute, or instead writing a tweet explaining how, "actually, GPT-3 shows that scaling is a dead end and it's just imitation intelligence" - most people will get busy on the tweet!)

What I'll be watching for is whether orgs beyond 'the usual suspects' (MS ZeRO, Nvidia, Salesforce, Allen, DM/GB, Connor/LibreAI, FAIR) start participating or if they continue to dismiss scaling.

Are we in an AI overhang?

As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.

The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT's context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.

Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.

(The problem there is the tradeoff between compute efficiency of training and better architectures. It's not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drastically parameter-inefficient - but efficient to train on GPUs! - mixture-of-expert small Transformers approach.)

Are we in an AI overhang?

As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.

That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there's less of a distinction between training and running, but it is of NNs.) This is part of why there's a hardware overhang - once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.
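
A back-of-the-envelope sketch of that ratio follows. Both dollar figures are placeholder assumptions picked only to match the orders of magnitude stated above ("millions of dollars to train", "pennies to run"); they are not measured costs.

```python
# Assumed placeholder figures, in cents to keep the arithmetic exact.
TRAIN_COST_CENTS = 5_000_000 * 100   # assumption: $5M one-off training cost
RUN_COST_CENTS_PER_100_PAGES = 5     # assumption: $0.05 per hundred pages generated

# How many hundred-page outputs the training budget would buy at inference prices.
train_to_run_ratio = TRAIN_COST_CENTS // RUN_COST_CENTS_PER_100_PAGES

print(f"training costs {train_to_run_ratio:,}x one hundred-page run")
```

Under these assumed numbers, the budget that trained the model once would pay for a hundred million hundred-page runs, which is the sense in which the hardware needed to create the model can already run it many times over.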

Can you get AGI from a Transformer?

I'm going to stop at your very first claim and observe: MuZero.
