Last year Ajeya Cotra published a draft report on AI timelines. (See also: summary and commentary by Holden Karnofsky, podcast interview with Ajeya.)

I commented at the time (1,2,3) in the form of skepticism about the usefulness of the "Genome Anchor" section of the report. Later I fleshed out those thoughts in my post Against Evolution as an Analogy for how Humans Will Create AGI, see especially the "genome=code" analogy table near the top.

In this post I want to talk about a different section of the report: the "Lifetime Anchor".

1. Assumptions for this post

Here are some assumptions. I don’t exactly believe them—let alone with 100% confidence—but for the purpose of this post let’s say I do. I’m not going to present any evidence for or against them here. Think of it as the Jeff Hawkins perspective or something.

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI). [Note added for clarification: To simplify the discussion, I'm assuming that when this is all happening, we don't already have TAI independently via some unrelated R&D path.]

If you think these assumptions are all absolutely 100% wrong, well, I guess you might not find this post very interesting.

To be clear, Ajeya pretty much explicitly rejected these assumptions when writing her report (cf. discussion of “algorithmic breakthroughs” here), so it's no surprise that I wind up disagreeing with what she wrote. Maybe I shouldn't even be using the word "disagree" in this post. Oh well; her report is still a good starting point / foil for present purposes.

2. Thesis and outline

I will argue that under those assumptions, once we understand that “secret sauce”, it’s plausible that we will then be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI.

I’ll also argue that training these models from scratch will plausibly be easily affordable, as in <$10M—i.e., a massive hardware overhang.

(By “plausible” I mean >25% probability I guess? Sorry, I’m not at the point where I can offer a probability distribution that isn’t pulled out of my ass.)

Outline of the rest of this post: First I’ll summarize and respond to Ajeya’s discussion of the “Lifetime Anchor” (which is not exactly the scenario I’m talking about here, but close). Then I’ll talk (somewhat speculatively) about time and cost involved in refactoring and optimizing and parallelizing and hardware-accelerating and scaling the new algorithm, and in doing training runs.

3. Background: The “Lifetime Anchor” in Ajeya Cotra's draft report

In Ajeya's draft report, one of the four bases for estimating TAI timelines is the so-called “Lifetime Anchor”.

She put it in the report but puts very little stock in it: she gives it only 5% weight.

What is the “Lifetime Anchor”? Ajeya starts by estimating that simulating a brain from birth to adulthood would take a median of 1e24 floating-point operations (FLOP). This comes from 1e24 FLOP ≈ 1e15 FLOP/s × 30 years, with the former being roughly the median estimate in Joe Carlsmith’s report, and 30 years being roughly the time from birth to adulthood (which rounds to a nice even 1e9 seconds). Actually, she uses the term “30 subjective years” to convey the idea that if we do a 10×-sped-up simulation of the brain, then the same training would take 3 years of wall-clock time, for example.
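To spell out that arithmetic, here's a quick back-of-envelope sketch in Python, using just the rough median figures quoted above:

```python
# Back-of-envelope version of the Lifetime Anchor starting point (rough medians only).
SECONDS_PER_YEAR = 3.15e7       # so 30 years is roughly a nice even 1e9 seconds

brain_flop_per_sec = 1e15       # roughly the median estimate in Joe Carlsmith's report
subjective_years = 30           # roughly birth to adulthood

lifetime_seconds = subjective_years * SECONDS_PER_YEAR   # ~9.5e8, i.e. ~1e9 seconds
lifetime_flop = brain_flop_per_sec * lifetime_seconds    # ~1e24 FLOP

print(f"{lifetime_seconds:.1e} subjective seconds, {lifetime_flop:.1e} FLOP")
```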

A 1e24 FLOP computation would cost about $10M in 2019, she says, and existing ML projects (like training AlphaStar at 1e23 FLOP) are already kinda in that ballpark. So 1e24 FLOP is ridiculously cheap for a transformative world-changing AI. (Memory requirements are also relevant, but I don’t think they change that picture, see footnote.[1])

OK, so far she has a probability distribution centered at 1e24 FLOP, proportional to the distribution she derived from Joe Carlsmith’s report. She then multiplies by a, let’s call it, “computer-vs-brain inefficiency factor” that she represents as a distribution centered at 1000. (I’ll get back to that.) Then there’s one more step of ruling out extremely-low-compute scenarios. (She rules them out for reasons that wouldn't apply to the scenario of Section 1 that I'm talking about here.) She combines this with estimates of investment and incremental algorithmic improvements and Moore's law and so on, and she winds up with a probability distribution for what year we'll get TAI. That's her “lifetime anchor”.
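To make that pipeline concrete, here's a toy numerical sketch. To be clear, this is not Ajeya's actual model; only the central values (1e24 FLOP and a 1000× inefficiency factor) come from the text, the spreads and lognormal shapes below are made up purely for illustration, and none of her further adjustments (investment, algorithmic progress, hardware progress, ruling out extreme scenarios) are included.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Toy stand-ins for the report's distributions; only the central values are from the text.
log10_lifetime_flop = rng.normal(loc=24.0, scale=1.5, size=N)   # centered at 1e24 FLOP
log10_inefficiency  = rng.normal(loc=3.0,  scale=1.5, size=N)   # centered at a 1000x factor

# Multiplying the two quantities is the same as adding them in log space.
log10_training_flop = log10_lifetime_flop + log10_inefficiency

lo, med, hi = np.percentile(log10_training_flop, [10, 50, 90])
print(f"training FLOP: median ~1e{med:.1f}, 10th-90th percentile ~1e{lo:.1f} to 1e{hi:.1f}")
```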

4. Why Ajeya puts very little weight on the Lifetime Anchor, and why I disagree

Ajeya cites two reasons she doesn’t like the lifetime anchor.

First, it doesn’t seem compatible with the empirical model size and training estimates for current deep neural networks:

I think the most plausible way for this hypothesis to be true would be if a) it turns out we need a smaller model than I previously assumed, e.g. ~1e11 or ~1e12 FLOP / subj sec with a similar number of parameters, and b) that model could be trained on a very short horizon ML problem, e.g. 1 to 10 seconds per data point. Condition a) seems quite unlikely to me because it implies our architectures are much more efficient than brain architectures discovered by natural selection; I don’t think we have strong reason to expect this on priors and it doesn’t seem consistent with evidence from other technological domains. Condition b) seems somewhat unlikely to me because it seems likely by default that transformative ML problems have naturally long horizon lengths because we may need to select for abilities that evolution optimized for, and possible measures to get around that may or may not work.  

Why I disagree: As in Section 1, the premise of this post is that the human brain algorithm is a fundamentally different type of learning algorithm than a deep neural network. Thus I see no reason to expect that they would have the same scaling laws for model size, training data, etc.

Second, the implication is that training TAI is so inexpensive that we could have been doing it years ago. As she writes:

Another major reason for skepticism is that (even with a median ~3 OOM larger than the human lifetime) this hypothesis implies a substantial probability that we could have trained a transformative model using less computation than the amount used in the most compute intensive training run of 2019 (AlphaStar at ~1e23 FLOP), and a large probability that we could have done so by spending only a few OOMs more money (e.g. $30M to $1B). I consider this to be a major point of evidence against it, because there are many well-resourced companies who could have afforded this kind of investment already if it would produce a transformative model, and they have not done so. See below for the update I execute against it.

Why I disagree: Again as in Section 1, the premise of this post is that nobody knows how the algorithm works. People can’t use an algorithm that doesn’t yet exist.

5. Why Ajeya thinks the computer-vs-brain inefficiency factor should be >>1, and why I disagree

Ajeya mentions a few reasons she wants to center her computer-vs-brain-inefficiency-factor distribution at 1000. I won’t respond to all of these, since some would involve a deep-dive into neuroscience that I don’t want to get into here. But I can respond to a couple.

First, deep neural network data requirements:

Many models we are training currently already require orders of magnitude more data than a human sees in one lifetime.

Why I disagree: Again under the assumptions of Section 1, “many models we are training” are very different from human brain learning algorithms. Presumably human brain-like learning algorithms will have similar sample efficiency to actual human brain learning algorithms, for obvious reasons.

Second, she makes a reference-class argument using other comparisons between biological and human artifacts:

Brain FLOP/s seems to me to be somewhat more analogous to “ongoing energy consumption of a biological artifact” while lifetime FLOP seems to be more analogous to “energy required to manufacture a biological artifact”; Paul’s brief investigation comparing human technologies to natural counterparts, which I discussed in Part 1, found that the manufacturing cost of human-created artifacts tend to be more like ~3-5 OOM worse than their natural counterparts, whereas energy consumption tends to be more like ~1-3 OOM worse.

Why I disagree: Ajeya mentions two reference class arguments here: (1) “human-vs-brain FLOP/s ratio” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact ongoing energy consumption ratio”; and (2) “human-vs-brain lifetime FLOP” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact manufacturing energy”.

Under my assumptions here, the sample efficiency of brains and silicon should be similar—i.e., if you run similar learning algorithms on similar data, you should get similarly-capable trained models at the end. So from this perspective, the two ratios have to agree—i.e., these are two reference classes for the very same quantity. That’s fine; in fact, Ajeya’s median estimate of 3 OOM is nicely centered between the ~1-3 OOM reference class and the ~3-5 OOM reference class.

But I actually want to reject both of those numbers, because I think Joe Carlsmith’s report has already “priced in” human inefficiency by translating from neuron-centric metrics (number of neurons, synapses, etc.) to silicon-centric metrics (FLOPs). (And then we estimated costs based on known $/FLOP of human ML projects.) So when we talk about FLOPs, we’ve already crossed over into human-artifact-world! It would be double-counting to add extra OOMs for human inefficiency.

Here's another way to make this same point: think about energy usage. Joe Carlsmith’s report says we need (median) 1e15 FLOP/s to simulate a brain. Based on existing hardware (maybe 5e9 FLOP/joule? EDIT: …or maybe much lower; see comment), that implies (median) 200kW to simulate a brain. (Hey, $20/hour electricity bills, not bad!) Actual brains are maybe 20W, so we’re expecting our brain simulation to be about 4 OOM less energy-efficient than a brain. OK, fine.

…But now suppose I declare that in general, human artifacts are 3 OOM less efficient than biological artifacts. So we should really expect 4+3=7 OOM less energy efficiency, i.e. 200MW! I think you would say: that doesn’t make sense, it’s double-counting! That’s what I would say, anyway! And I’m suggesting that the above draft report excerpt is double-counting in an analogous way.
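Here's the same energy argument as a few lines of arithmetic. The FLOP/joule figure is the rough one from the text above, and the $0.10/kWh electricity price is an assumption of mine:

```python
import math

brain_sim_flop_per_sec = 1e15      # median from Joe Carlsmith's report
hardware_flop_per_joule = 5e9      # rough figure from the text (possibly too high; see EDIT)

sim_power_watts = brain_sim_flop_per_sec / hardware_flop_per_joule    # 2e5 W = 200 kW
cost_per_hour = sim_power_watts / 1e3 * 0.10                          # ~$20/hr at an assumed $0.10/kWh

brain_power_watts = 20
inefficiency_ooms = math.log10(sim_power_watts / brain_power_watts)   # ~4 OOM worse than a brain

# The double-counting mistake: tacking a further generic "3 OOM of human-artifact
# inefficiency" onto a figure that is already denominated in human hardware.
double_counted_watts = sim_power_watts * 1e3                          # 200 MW, which is not a sensible estimate

print(f"{sim_power_watts/1e3:.0f} kW (~${cost_per_hour:.0f}/hr), i.e. ~{inefficiency_ooms:.0f} OOM worse than a 20 W brain")
```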

5.1 …And indeed why the computer-vs-brain inefficiency factor should be <<1!

My best guess for the inefficiency factor is actually <<1! (…At least, that’s my guess after a few years of people using these algorithms and picking the low-hanging fruit of implementing them efficiently.)

Why? Compare the following two possibilities:

  • We understand the operating principles of the brain-like learning algorithms, and then implement those same learning algorithms on our silicon chips, versus
  • We use our silicon chips to simulate biological neurons which in turn are running those brain-like learning algorithms.

Doing the second bullet point gets us an inefficiency factor of 1, by definition. But the second bullet point is bound to be far more inefficient than the first.

By analogy: If I want to multiply two numbers with my laptop, I can do it in nanoseconds directly, or I can do it dramatically slower by using my laptop to run a transistor-by-transistor simulation of a pocket calculator microcontroller chip.

Or here’s a more direct example: There’s a type of neuron circuit called a “central pattern generator”. (Fun topic by the way, see here.) A simple version might involve, for example, 30 neurons wired up in a particular way so as to send a wave of activation around and around in a loop forever. Let’s say (hypothetically) that this kind of simple central pattern generator is playing a role in an AGI-relevant algorithm. The second bullet point above would be like doing a simulation of those 30 neurons and all their interconnections. The first bullet point above would be like writing the one line of source code, “y = sin(ωt+φ)”, and then compiling that source code into assembly language. I think it’s obvious which one would require less compute!
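To make the contrast concrete, here's a toy sketch. The "neuron" dynamics and all the numbers are made up; the point is just the gap in operation counts between implementing the principle directly and simulating the circuit that implements it:

```python
import numpy as np

# One second of a 1 Hz rhythm, sampled at 1 kHz (arbitrary toy numbers).
t = np.arange(0, 1, 1e-3)

# First bullet point: implement the principle directly.
direct = np.sin(2 * np.pi * 1.0 * t + 0.0)    # ~1 operation per sample

# Second bullet point: simulate 30 "neurons" passing a bump of activation around a
# ring. (Crude toy dynamics, not a real neuron model; the point is the op count.)
n = 30
x = np.zeros(n)
x[0] = 1.0
W = np.roll(np.eye(n), 1, axis=0)             # each neuron drives the next one in the loop
simulated = []
for _ in t:                                   # ~n*n = 900 multiply-adds per sample
    x = W @ x
    simulated.append(x[0])                    # read the repeating wave off one neuron

# Same qualitative behavior (a periodic signal), ~3 OOM more arithmetic per sample,
# before even accounting for simulating each neuron's internal dynamics.
```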

(Silicon chips are maybe 7 OOM faster than brains. A faster but less parallel processor can emulate a slower but more parallel processor, but not vice-versa. So there’s a whole world of possible algorithm implementation strategies that brains cannot take advantage of but that we can—directly calculating sin(ωt+φ) is just one example.)

The scenario I’m talking about (see assumptions in Section 1) is the first bullet point above, not the second. So I consider an inefficiency factor <<1 to be a default expectation, again leaving aside the very earliest thrown-together implementations.

6. Some other timeline-relevant considerations

6.1 How long does it take to get from janky grad-student code to polished, scalable, parallelized, hardware-accelerated, turn-key learning algorithms?

On the assumptions of Section 1, a brain-like learning algorithm would be sufficiently different from DNNs that some of the existing DNN-specific infrastructure would need to be re-built (things like PyTorch, TPU chips, pedagogical materials, a trained workforce, etc.).

How much time would that add?

Well, I’ll try to draw an analogy with the history of DNNs (warning: I’m not terribly familiar with that history).

AlexNet was 2012, DeepMind patented deep Q learning in 2014, the first TensorFlow release was 2015, the first PyTorch release was 2016, the first TPU was 2016, and by 2019 we had billion-parameter GPT-2.

So, maybe 7 years?

But that may be an overestimate. I think a lot of the deep neural net infrastructure will carry over to even quite different future ML algorithms. For example, the building up of people and money in ML, the building up of GPU servers and the tools to use them, the normalization of the idea that it’s reasonable to invest millions of dollars to train one model and to fab ML ASICs, the proliferation of expertise related to parallelization and hardware-acceleration, etc.—all these things would transfer directly to future human-brain-like learning algorithms. So maybe they’ll be able to develop in less time than it took DNNs to develop in the 2010s.

So, maybe the median guess should be somewhere in the range of 3-6 years?

6.2 How long (wall-clock time) does it take to train one of these models?

Should we expect engineers to be twiddling their thumbs for years and years, as their training runs run? If so, that would obviously add to the timeline.

The relevant factor here is limits to parallelization. If there weren’t limits to parallelization, you could make wall-clock time arbitrarily low by buying more processing power. For example, AlphaStar training took 14 days and totaled 1e23 FLOP, so it’s presumably feasible to squeeze a 1e24-FLOP, 30-subjective-year, training run into 14×10=140 days—i.e., 80 subjective seconds per wall-clock second. With more money, and another decade or two of technological progress, and a brain-vs-computer inefficiency factor <<1 as above, it would be even faster. But that case study only works if our future brain-like algorithms are at least as parallelizable as AlphaStar was.
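For what it's worth, here's that scaling arithmetic spelled out. It assumes, as stated, that the hypothetical brain-like algorithm parallelizes at least as well as AlphaStar did:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 3.15e7

alphastar_flop, alphastar_days = 1e23, 14      # the case study from the text
target_flop = 1e24                             # one lifetime-anchor training run

# Naive scaling: 10x the compute at the same sustained throughput -> 10x the wall-clock time.
wallclock_days = alphastar_days * target_flop / alphastar_flop             # 140 days

subjective_seconds = 30 * SECONDS_PER_YEAR                                 # ~9.5e8 s
speedup = subjective_seconds / (wallclock_days * SECONDS_PER_DAY)          # ~80x
print(f"{wallclock_days:.0f} days of wall-clock time, ~{speedup:.0f} subjective seconds per second")
```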

Maybe my starting point should be AI Impacts's “Brain Performance in TEPS” writeup? This comparison implies that existing supercomputers—as of the 2015 writeup—were not quite capable of real-time brain simulations (1 subjective second per wall-clock second), but they were within an order of magnitude. This makes it seem unlikely that we can get orders of magnitude faster than real-time. So, maybe we’ll be running our training algorithms for decades after all??

I’m not so sure. I still think it might well be much faster.

The most important thing is: I’m not a parallelization expert, but I assume that chip-to-chip connections are the bottleneck for the TEPS benchmark, not within-chip connections. (Someone please tell me if I’m wrong!) If I understand correctly, TEPS assumes that data is sent from an arbitrary node in the graph to a randomly-chosen different arbitrary node in the graph. So for a large calculation (more than a few chips), TEPS implicitly assumes that almost all connections are chip-to-chip. However, I think that in a brain simulation, data transmission events would be disproportionately likely to be within-chip.

For example, with adult brain volume of 1e6 mm³, and an AlphaStar-like 400 silicon chips, naively each chip might cover about (13.5mm)³ of brain volume. So any neuron-to-neuron connection much shorter than 13.5mm is likely to translate to within-chip communication, not chip-to-chip. Then the figures at this AI Impacts page imply that almost all unmyelinated fiber transmission would involve within-chip communication, and thus, chip-to-chip communication would mainly consist of:

  • Information carried by long-range myelinated fibers. Using the AI Impacts figure of 160,000km of myelinated fibers, let’s guess that they're firing at 0.1-2 Hz and are typically 5cm long; then I get (3-60)e8 chip-to-chip TEPS from this source;
  • Information carried by short-range fibers that happen to be near the boundary between the simulation zones of two chips. If you make a planar slice through the brain, I guess you would cut through on average ~3.5e11 axons and dendrites per m² of slice (from 850,000km of axons and dendrites in a ≈1.2e6 mm³ brain[2]). (Warning: a different estimation method[3] gave 6e12 per m² instead. Part of the discrepancy is probably that the latter is cortex and the former is the whole brain, including white matter which is presumably much more spaced out. Or maybe the AI Impacts 850,000km figure is wrong. Anyway, take all this with a large grain of salt.) So again if we imagine 400 chips each simulating a little (13.5mm)³ cube of brain, we get ~0.22 m² of total “virtual slices”, and if they’re firing at 0.1-2 Hz, we get something like (0.8-16)e10 chip-to-chip TEPS from this source. (The arithmetic from these two bullets is collected in the sketch just below.)
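Here's the arithmetic from the two bullets above collected into one place. The inputs are the same guesses as in the text (firing rates, fiber lengths, an AlphaStar-like 400 chips), so the outputs inherit all the same caveats:

```python
# Rough chip-to-chip TEPS estimate; all inputs are the guesses from the bullets above.
brain_volume_mm3 = 1e6            # adult brain, roughly
n_chips          = 400            # AlphaStar-like cluster
firing_rate_hz   = (0.1, 2.0)     # assumed typical firing-rate range

chip_side_m = (brain_volume_mm3 / n_chips) ** (1 / 3) * 1e-3        # ~13.5 mm per chip

# Bullet 1: long-range myelinated fibers.
myelinated_fiber_m, typical_length_m = 160_000e3, 0.05
n_long_fibers = myelinated_fiber_m / typical_length_m               # ~3.2e9 fibers
long_teps = [n_long_fibers * f for f in firing_rate_hz]             # ~(3-60)e8

# Bullet 2: short fibers crossing the boundaries between chips' simulation zones.
total_fiber_m, brain_volume_m3 = 850_000e3, 1.2e-3
crossings_per_m2 = (total_fiber_m / brain_volume_m3) / 2            # ~3.5e11 per m^2 (isotropy factor of 2)
internal_area_m2 = n_chips * 6 * chip_side_m ** 2 / 2               # ~0.22 m^2 of "virtual slices"
short_teps = [crossings_per_m2 * internal_area_m2 * f for f in firing_rate_hz]   # ~(0.8-16)e10

total_teps = [a + b for a, b in zip(long_teps, short_teps)]
print(f"chip-to-chip: {total_teps[0]:.1e} to {total_teps[1]:.1e} TEPS")   # vs (1.8-64)e13 for the whole brain
```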

Recall the headline figure of “brain performance in TEPS” was (1.8-64)e13. So the above is ~3 OOM less! If I didn’t mess up, the discrepancy comes from a combination of (1) a disproportionate number of short connections, which turn into within-chip communications, and (2) the fact that a single long-range myelinated axon typically connects to a bunch of neurons near its terminal, which from a chip-to-chip-communications perspective would look like just one connection.

Some other considerations that seem to point in the direction of “wall-clock training time probably won’t be years and years”:

  • Technology is presumably improving, especially around processor-to-processor communications, and presumably it will continue to do so. For example, it looks like the highest-TEPS supercomputer increased from 2.4e13 TEPS to 1.0e14 TEPS between 2014 and 2021, if I’m reading this right. (The second highest is still 2.4e13 though!)
  • Again I’m not a parallelization expert, so maybe this is super-naive, but: whatever algorithms the brain is using, they’ve gotta be extremely parallelizable, right? Remember, we’re working with silicon chips that are ~7 OOM faster than the brain; even if we’re a whopping 100,000× less skillful at parallelizing brain algorithms than the brain itself, we’d still be able to simulate a brain at 100× speedup. So I guess I’d be pretty surprised if wall-clock time winds up being a showstopper, just on general principles.
  • As mentioned above, I’m expecting the computer-vs-brain inefficiency factor to be <<1. I was talking about FLOPs there, but I think the same argument applies to TEPS.
  • This is probably a <1 OOM effect, but I’ll say it anyway: I bet the “30 subjective years” figure is way overkill for TAI. Like, the smartest 15-year-old humans are much better programmers etc. than most adults, and even those smart 15-year-olds sure didn’t spend every minute of those 15 years doing optimally-efficient learning!!
  • Update: See this comment about the possibility of "parallel experiences".

Update to add: Here’s another possible objection: training requires both compute and data. Even if we can muster enough compute, what if data is a bottleneck? In particular, suppose for the sake of argument that the only way to train a model to AGI involves having the model control a real-world robot which spends tens of thousands of hours of serial time manipulating human-sized objects and chatting with humans. (And suppose also that “parallel experiences” wind up being impossible.) Then that would limit model training speed, even if we had infinitely fast computers. However, I view that possibility as highly unlikely—see my discussion of “embodiment” in this post (Section 1.5). My strong expectation is that future programmers will be able to make AGI just fine by feeding it YouTube videos, books, VR environments, and other such easily-sped-up data sources, with comparatively little real-world-manipulation experience thrown in at the very end. (After all, going in the opposite direction, humans can learn very quickly to get around in a VR environment after a lifetime in the real world.)

6.3 How many full-length training runs do we need?

If a “full-length training run” is the 30 subjective years or whatever, then an additional question is: how many such runs will we need to get TAI? I’m inclined to say: as few as 1 or 2, plus lots and lots of smaller-scale studies. For example, I believe there was one and only one full training run of GPT-3—all the hyperparameters were extrapolated from smaller-scale studies, and it worked well enough the first time.

Note also that imperfect training runs don’t necessarily need to be restarted from scratch; the partially-trained model may well be salvageable, I’d assume. And it’s possible to run multiple experiments in parallel, especially when there’s a human in the loop contextualizing the results.

So anyway, combining this and the previous subsection, I think it’s at least plausible for “wall-clock time spent running training” to be a minor contributor to TAI timelines (say, adding <5 years). That’s not guaranteed, just plausible. (As above, "plausible" = ">25% probability I guess").

7. Conclusion

I’ll just repeat what I said in Section 2 above: if you accept the assumptions in Section 1, I think we get the following kind of story:

We can’t train a lifetime-anchor model today because we haven’t pinned down the brain-like learning algorithms that would be needed for it. But when we understand the secret sauce, we could plausibly be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI. And this training could plausibly be easily affordable, as in <$10M—i.e., a MASSIVE hardware overhang.

(Thanks Dan Kokotajlo & Logan Smith for critical comments on drafts.)

  1. ^

    Warning: FLOP is only one of several inputs to an algorithm. Another input worth keeping in mind is memory. In particular, the human neocortex has ≈1e14 synapses. How this number translates into (for example) GB of GPU memory is complicated, and I have some uncertainty, but I think my Section 6.2 scenario (involving an AlphaStar-like 400 chips) does seem to be in the right general ballpark for not only FLOP but also memory storage.

  2. ^

    I assumed the axons and dendrites are locally isotropic (equally likely to go any direction); that gives a factor of 2 from averaging cos θ over a hemisphere.

  3. ^

    I asked Nick Turner and he kindly downloaded three random little volumes from this dataset and counted how many things crossed the z=0 plane, as a very rough estimate. By the way, it was mostly axons not dendrites, like 10:1 ratio or something, in case anyone’s interested.

8 comments

The parallelization discussion seems off-base to me. While it is of course important that any individual instance runs not too absurdly slowly, how much faster than realtime it runs isn't that important, because you would be running many of them in parallel, no? AlphaZero trained in a few wallclock hours not by blazing through games in mere nanoseconds, but by having hundreds or thousands of actors in parallel playing through games at a reasonable speed like 0.05s per turn or something. Or OA5 used minibatches of millions of experiences, and GPT-3 had minibatches of like millions of tokens, IIRC.

If we look at the gradient noise scale, the more complicated the 'task' (ie set of tasks), the larger the batch size you need/can use before you are just wasting compute by overly-precisely estimating the gradient for the next update. Presumably any AGI would be training on a lot of tasks as complicated as Go or English text or DoTA2 or more complicated: generative and discriminatory multimodal training on text, video, and photos, DRL training on a bazillion games and procedurally-generated tasks, and so on, and so the optimal minibatch size would be quite large... Unless the hardware overhang is vastly more extreme than anyone anticipates (in which case the debate would be moot for other reasons), it seems like the most plausible answer for "how much parallel hardware can my seed AGI use?" is going to be "how much ya got?".

This doesn't guarantee a fast wallclock, of course, but it's worth noting that in the limit of (full-batch, not stochastic minibatching) gradient descent, you can generally take large steps and converge in relatively few serial iterations compared to SGD. (Bunch of papers on scaling up CNNs to training on thousands of GPUs simultaneously to converge in minutes to seconds rather than days or weeks on smaller but more efficient clusters; yesterday I saw Geiping et al 2021 whose CNN requires 3,000 serial fullbatch iterations vs SGD's 117,000 serial minibatch iterations, so hypothetically, you could finish in 39x less wallclock if you had ~unlimited compute.)

So even for an incredibly complicated family of tasks, as long as the individual instances can be run at all, the wallclock is potentially quite low because you have model parallelism out the wazoo within and across all of the tasks & modalities & problems, and only need to take relatively few serial updates.

Thanks, that's really helpful. I'm going to re-frame what you're saying in the form of a question:

The parallel-experiences question:

Take a model which is akin to an 8-year-old's brain. (Assume we deeply understand how the learning algorithm works, but not how the trained model works.) Now we make 10 identical copies of that model. For the next hour, we tell one copy to read a book about trucks, and we tell another copy to watch a TV show about baking, and we tell a third copy to build a sandcastle in a VR environment, etc. etc., all in parallel.

At the end of the hour, is it possible to take ALL the things that ALL ten copies learned, and combine them into one model—one model that now has new memories/skills pertaining to trucks AND baking AND sandcastles etc.—and it's no worse than if the model had done those 10 things in series?

What's the answer to this question?

Here are three possibilities:

  • How an ML practitioner would probably answer this question: I think they would say "Yeah duh, we've been doing that in ML since forever." For my part, I do see this as some evidence, but I don't see it as definitive evidence, because the premise of this post (see Section 1) is that the learning algorithms used by ML practitioners today are substantially different from the within-lifetime learning algorithm used in the brain.
  • How a biologist would probably answer this question: I think they would say the exact opposite: "No way!! That's not something brains evolved to do, there's no reason to expect it to be possible and every reason to think it isn't. You're just talking sci-fi nonsense."
    • (Well, they would acknowledge that humans working on a group project could go off and study different topics, and then talk to each other and hence teach each other what they've learned. But that's kind of a different thing than what we're talking about here. In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don't see teams-of-AIs-who talk-to-each-other being all that helpful in getting to superhuman faster.)
  • How I would answer this question: Well I hadn't thought about it until now, but I think I'm in between. On the one hand, I do think there are some things that need to be learned serially in the human brain learning algorithm. For example, there's a good reason that people learn multiplication before exponentiation, and exponentiation before nonabelian cohomology, etc. But if the domains are sufficiently different, and if we merge-and-re-split frequently enough, then I'm cautiously optimistic that we could do parallel experiences to some extent, in order to squeeze 30 subjective years of experience into <30 serial subjective years of experience. How much less than 30, I don't know.

Anyway, in the article I used the biologist answer: "the human brain within-lifetime learning algorithm is not compatible with parallel experiences". So that would be the most conservative / worst-case assumption.

I am editing the article to note that this is another reason to suspect that training might be faster than the worst-case. Thanks again for pointing that out.

The biologist answer there seems to be question-begging. What reason is there to think it isn't? Animals can't split and merge themselves or afford the costs or store datasets for exact replay etc, so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple 'families' of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can't establish that the non-parallelizable one is superior to the others (much less is the only such family).

From the ML or statistics view, it seems hard for parallelization in learning to not be useful. It's a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have, and these gradients are (extremely) noisy, and can be improved by more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks; almost by definition, this seems superior to getting less data one point at a time and doing noisy updates neglecting most of the tasks.

If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don't produce direct gradients (let's handwave an architecture where somehow it'd be bad to feed them all in directly, maybe it's very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. (If you were in front of a death maze and were watching fellow humans run through it and get hit by the swinging blades or acid mists or ironically-named boulders, you'd surely appreciate being able to watch as many runs as possible by your fellow humans rather than yourself running it.)

In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don't see teams-of-AIs-who talk-to-each-other being all that helpful in getting to superhuman faster.

If we look at some of these algorithms, it's even less compelling to argue that there's some deep intrinsic reason we want to lock learning to small serial steps - look at expert iteration in AlphaZero, where the improved estimates that the NN is repeatedly retrained on don't even come from the NN itself, but an 'expert' (eg a NN + tree search); what would we gain by ignoring the expert's provably superior board position evaluations (which would beat the NN if they played) and forcing serial learning? At least, given that MuZero/AlphaZero are so good, this serial biological learning process, whatsoever it may be, has failed to produce superior results to parallelized learning, raising questions about what circumstances exactly yield these serial-mandatory benefits...

The biologist answer there seems to be question-begging

Yeah, I didn't bother trying to steelman the imaginary biologist. I don't agree with them anyway, and neither would you.

(I guess I was imagining the biologist belonging to the school of thought (which again I strongly disagree with) that says that intelligence doesn't work by a few legible algorithmic principles, but is rather a complex intricate Rube Goldberg machine, full of interrelated state variables and so on. So we can't just barge in and make some major change in how the step-by-step operations work, without everything crashing down. Again, I don't agree, but I think something like that is a common belief in neuroscience/CogSci/etc.)

it seems hard for parallelization in learning to not be useful … why am I harmed …

I agree with "useful" and "not harmful". But an interesting question is: Is it SO helpful that parallelization can cut the serial (subjective) time from 30 years to 15 years? Or what about 5 years? 2 years? I don't know! Again, I think at least some brain-like learning has to be serial (e.g. you need to learn about multiplication before nonabelian cohomology), but I don't have a good sense for just how much.

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI).

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true?

I can imagine justifying assumption 2, and maybe also assumption 1, using biology knowledge that I don't have. I don't see how you justify assumptions 3 and 4. Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks.

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true? 

The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

 

Can you clarify what you mean by this assumption? And how is your argument dependent on it?

Is the point that the "secret sauce" algorithm is something that humans can plausibly come up with by thinking hard about it? As opposed maybe to an evolution-designed nightmare that humans cannot plausibly design except by brute-forcing it?

Yes, what you said. The opposite of "a human-legible learning algorithm" is "a nightmarishly-complicated Rube-Goldberg-machine learning algorithm".

If the latter is what we need, we could still presumably get AGI, but it would involve some automated search through a big space of many possible nightmarishly-complicated Rube-Goldberg-machine learning algorithms to find one that works.

That would be a different AGI development story, and thus a different blog post. Instead of "humans figure out the learning algorithm" as an exogenous input to the path-to-AGI, which is how I treated it, it would instead be an output of that automated search process. And there would be much more weight on the possibility that the resulting learning algorithm would be wildly different than the human brain's, and hence more uncertainty in its computational requirements.