[ Question ]

Probability that other architectures will scale as well as Transformers?

by Daniel Kokotajlo | 1 min read | 28th Jul 2020 | 3 comments


GPT, AI Takeoff, AI Timelines, AI

GPT-1, 2, and 3 have shown impressive scaling properties. How likely is it that, in the next five years, many other architectures will also be shown to get substantially better as they get bigger?

EDIT: I am open to discussion of better definitions of the scaling hypothesis. For example, maybe Gwern means something different here, in which case I'm also interested in that.


1 Answers

For some reason here on LW there's a huge focus on "architecture". I don't get it. Here's how I at-this-moment think of the scaling hypothesis:

Weak scaling hypothesis: For a task that has not yet been solved, if you increase data and model capacity, and tune the learning algorithm to make use of it (like, hyperparameter tuning and such, not a fundamentally new algorithm), then performance will improve.

This seems fairly uncontroversial, I think? It probably breaks down in some edge cases (e.g. if you have a 1-layer neural net that you keep making wider and wider) but seems broadly correct to me. It's mostly independent of the architecture (as long as it is possible to increase model capacity). Note also the common wisdom in ML that what your data is matters far more than what your model / learning algorithm are.

What the architecture can influence is where your performance starts out and the rate at which it scales, which matters for:

Strong scaling hypothesis: (Depends on weak scaling hypothesis) There is a sufficiently difficult task T and an architecture A that we know of for that task, such that 1. "solving" T would lead to AGI, 2. it is conceptually easy to scale up the model capacity for A, 3. it is easy to get more data for T, and 4. scaling up a) model capacity and b) data will lead to "solving" T on some not-crazy timescale and resource-scale.

According to me, it is hard to find T that satisfies 1, 3 and 4b, it is trivial to satisfy 2, and hard to find an architecture that satisfies 4a. OpenAI's big contribution here is believing and demonstrating that T="predict language" might satisfy 1, 3 and 4b. I know of no other such T (though multiagent environments are a candidate).

What about 4a? According to me, it just so happens that Transformers are the best architecture for T="predict language", and so that's what we saw get scaled up, but I'd expect you'd see the same pattern of scaling (but not the same absolute performance) from other architectures as well. (For example, I suspect RNNs would also satisfy 4a.) I think the far more interesting question is whether we'll see other tasks T that could plausibly satisfy 1, 3, and 4b.
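To make "same pattern of scaling, but not the same absolute performance" concrete, here is a toy sketch. It assumes loss falls as a power law in model capacity, which is only one convenient functional form chosen for illustration. The function `toy_loss`, the two "architectures", and all constants are made up for this example and are not measurements of Transformers, RNNs, or any real system.

```python
# Toy illustration only: the power-law form and every constant below are
# assumptions invented for this sketch, not measurements of any real model.

def toy_loss(capacity: float, start: float, rate: float, floor: float = 0.0) -> float:
    """Hypothetical loss for a model with `capacity` parameters.

    start: loss above the floor at capacity 1 (where performance starts out)
    rate:  power-law exponent (how quickly loss falls as capacity grows)
    floor: loss that scaling cannot remove
    """
    return floor + start * capacity ** (-rate)

# Two made-up architectures: A starts out better and scales faster than B.
ARCH_A = {"start": 10.0, "rate": 0.10}
ARCH_B = {"start": 12.0, "rate": 0.07}

capacities = [1e6, 1e7, 1e8, 1e9]
curve_a = [toy_loss(n, **ARCH_A) for n in capacities]
curve_b = [toy_loss(n, **ARCH_B) for n in capacities]

for n, a, b in zip(capacities, curve_a, curve_b):
    print(f"capacity={n:.0e}  loss_A={a:.3f}  loss_B={b:.3f}")
```

In this toy setup both curves keep improving as capacity grows (the weak scaling hypothesis holds for both), while the architecture only changes the starting point and the slope. Whether real architectures behave this way is exactly the open question.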

Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.

That seems plausible to me, with the caveat that I think it's still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system the architecture might be more important, no?

If I could go back in time, I'd change the question to be about "Architecture+training setups" instead of just "architectures."

Rohin Shah (7mo): Yes, that's right. I'd be pretty surprised if this were the case after conditioning on the raw capabilities of the architecture, though I can't rule it out.

1 Related Questions

Answer by G Gordon Worley III (7mo): Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it definitely has bottlenecks, even if you haven't scaled large enough to hit them. Transformers definitely require some coordination: no matter how large the models are and how much parallelism their hardware supports, they still produce a single reduced output. So we should expect some scaling limits on Transformers that, at some size, will prevent them from effectively taking advantage of a larger network.

Further, and you point at this a bit, most systems also experience diminishing returns on performance for additional resources because of these constraints. Transformers may just be special in that they have yet to start hitting diminishing returns because we haven't yet run up against their coordination bottlenecks. That doesn't make them too special, though, since we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
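One standard way to picture a coordination bottleneck is Amdahl's law, which bounds the speedup from parallel workers when some fraction of the work must run serially. The answer above does not name this law, and it describes compute speedup rather than model quality, so treat the sketch below as an analogy only. The 5% serial fraction is an arbitrary assumption chosen for illustration.

```python
def amdahl_speedup(workers: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup from `workers` parallel units when
    `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Assumption for illustration: 5% of the work is coordination (serial).
SERIAL_FRACTION = 0.05
worker_counts = [1, 10, 100, 1000, 10000]
speedups = [amdahl_speedup(w, SERIAL_FRACTION) for w in worker_counts]

for w, s in zip(worker_counts, speedups):
    print(f"workers={w:>5}  speedup={s:.2f}")

# The speedup can never exceed 1 / serial_fraction, however many workers are added.
ceiling = 1.0 / SERIAL_FRACTION
```

Under these assumptions each tenfold increase in workers buys less than the previous one, and the speedup stays below the ceiling of 1 / 0.05 = 20. That is the shape of diminishing returns the answer expects any system that is not coordination-free to hit eventually.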