Meta learning to gradient hack

Is the meta-learned net still able to learn any other function at all (ie. is it not simply frozen), or is the meta-learned stability tailored to protecting against specific tasks like sin(x)?

Brain-inspired AGI and the "lifetime anchor"

The biologist answer there seems to be question-begging. What reason is there to think it isn't? Animals can't split and merge themselves or afford the costs or store datasets for exact replay etc, so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple 'families' of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can't establish that the non-parallelizable one is superior to the others (much less is the only such family).

From the ML or statistics view, it seems hard for parallelization in learning to not be useful. It's a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have, and these gradients are (extremely) noisy, and can be improved by more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks; almost by definition, this seems superior to getting less data one point at a time and doing noisy updates neglecting most of the tasks.
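A toy illustration of the "more datapoints improve a noisy gradient estimate" point, as a minimal sketch: it assumes a scalar "true" gradient corrupted by i.i.d. Gaussian noise, which is a simplification and says nothing about what real neurons or local learning rules do. The function name and all parameter values are made up for illustration.

```python
import numpy as np

def gradient_estimate_error(n_samples, true_grad=1.0, noise_std=5.0,
                            trials=2000, seed=0):
    """Mean squared error of a gradient estimate averaged over n_samples noisy samples.

    Assumption (not from the original comment): each sample is the true
    gradient plus independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    # Each row is one trial; each column is one noisy per-datapoint gradient.
    samples = true_grad + noise_std * rng.standard_normal((trials, n_samples))
    estimates = samples.mean(axis=1)  # average the parallel workers' gradients
    return float(np.mean((estimates - true_grad) ** 2))

# Error of the averaged estimate for 1, 16, and 256 parallel samples.
errors = {n: gradient_estimate_error(n) for n in (1, 16, 256)}
for n, err in errors.items():
    print(f"n={n:4d}  mean squared error={err:.3f}")
```

Under these assumptions the error should shrink roughly as 1/n, which is the whole argument for pooling rollouts rather than updating on one datapoint at a time.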

If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don't produce direct gradients (let's handwave an architecture where somehow it'd be bad to feed them all in directly, maybe it's very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. (If you were in front of a death maze and were watching fellow humans run through it and get hit by the swinging blades or acid mists or ironically-named boulders, you'd surely appreciate being able to watch as many runs as possible by your fellow humans rather than yourself running it.)

In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don't see teams-of-AIs-who talk-to-each-other being all that helpful in getting to superhuman faster.

If we look at some of these algorithms, it's even less compelling to argue that there's some deep intrinsic reason we want to lock learning to small serial steps - look at expert iteration in AlphaZero, where the improved estimates that the NN is repeatedly retrained on don't even come from the NN itself, but an 'expert' (eg a NN + tree search); what would we gain by ignoring the expert's provably superior board position evaluations (which would beat the NN if they played) and forcing serial learning? At least, given that MuZero/AlphaZero are so good, this serial biological learning process, whatsoever it may be, has failed to produce superior results to parallelized learning, raising questions about what circumstances exactly yield these serial-mandatory benefits...

Brain-inspired AGI and the "lifetime anchor"

The parallelization discussion seems off-base to me. While it is of course important that any individual instance does not run too absurdly slowly, how much faster than realtime it runs isn't that important, because you would be running many of them in parallel, no? AlphaZero trained in a few wallclock hours not by blazing through games in mere nanoseconds, but by having hundreds or thousands of actors in parallel playing through games at a reasonable speed like 0.05s per turn or something. Or OA5 used minibatches of millions of experiences, and GPT-3 had minibatches of like millions of tokens, IIRC.

If we look at the gradient noise scale, the more complicated the 'task' (ie set of tasks), the larger the batch size you need/can use before you are just wasting compute by overly-precisely estimating the gradient for the next update. Presumably any AGI would be training on a lot of tasks as complicated as Go or English text or DoTA2 or more complicated: generative and discriminatory multimodal training on text, video, and photos, DRL training on a bazillion games and procedurally-generated tasks, and so on, and so the optimal minibatch size would be quite large... Unless the hardware overhang is vastly more extreme than anyone anticipates (in which case the debate would be moot for other reasons), it seems like the most plausible answer for "how much parallel hardware can my seed AGI use?" is going to be "how much ya got?".
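For concreteness, here is a rough sketch of the "simple" gradient noise scale as I understand it from the large-batch training literature: the trace of the per-example gradient covariance divided by the squared norm of the mean gradient. The function name and the synthetic "gradients" are assumptions for illustration, and using the sample mean in place of the true gradient makes this a slightly biased estimate.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate tr(Sigma) / |G|^2 from an (n_examples, n_params) array.

    Sketch only: the sample mean stands in for the true gradient G.
    """
    g = np.asarray(per_example_grads, dtype=float)
    mean_grad = g.mean(axis=0)
    trace_cov = g.var(axis=0, ddof=1).sum()  # sum of per-parameter variances
    return float(trace_cov / np.dot(mean_grad, mean_grad))

# Two synthetic "tasks" with the same mean gradient but different noise levels.
rng = np.random.default_rng(0)
mean = np.ones(10)
quiet_task = mean + 0.5 * rng.standard_normal((4096, 10))
noisy_task = mean + 5.0 * rng.standard_normal((4096, 10))

b_quiet = simple_noise_scale(quiet_task)
b_noisy = simple_noise_scale(noisy_task)
print(f"quiet task noise scale: {b_quiet:.2f}")
print(f"noisy task noise scale: {b_noisy:.2f}")
```

The noisier task has the larger noise scale, ie. it can absorb a larger batch before extra samples stop buying a meaningfully better update.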

This doesn't guarantee a fast wallclock, of course, but it's worth noting that in the limit of (full-batch, not stochastic minibatching) gradient descent, you can generally take large steps and converge in relatively few serial iterations compared to SGD. (Bunch of papers on scaling up CNNs to training on thousands of GPUs simultaneously to converge in minutes to seconds rather than days or weeks on smaller but more efficient clusters; yesterday I saw Geiping et al 2021 whose CNN requires 3,000 serial fullbatch iterations vs SGD's 117,000 serial minibatch iterations, so hypothetically, you could finish in 39x less wallclock if you had ~unlimited compute.)
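The 39x figure is just the ratio of serial step counts. A minimal sketch of the arithmetic, assuming (as the "~unlimited compute" hypothetical does) that a full-batch step takes no longer in wallclock than a minibatch step; the variable names are mine:

```python
# Serial iteration counts quoted above for Geiping et al 2021.
fullbatch_steps = 3_000
sgd_steps = 117_000

# Assumption: with enough parallel hardware, each serial step costs the same
# wallclock regardless of batch size, so wallclock scales with step count.
serial_speedup = sgd_steps / fullbatch_steps
print(f"hypothetical wallclock reduction: {serial_speedup:.0f}x")
```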

So even for an incredibly complicated family of tasks, as long as the individual instances can be run at all, the wallclock is potentially quite low because you have model parallelism out the wazoo within and across all of the tasks & modalities & problems, and only need to take relatively few serial updates.

Can you get AGI from a Transformer?

One paper I forgot that bears especially on the question of why you use planning: "On the role of planning in model-based deep reinforcement learning", Hamrick et al 2020:

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. We show that deep, precise planning is often unnecessary to achieve high reward in many domains, with 2-step planning exhibiting surprisingly strong performance even in Go. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

Pathways: Google's AGI

It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. (We also know relatively little about the Wu Dao series of competing multimodal sparse models.) Google papers refuse to name it when they use LaMDA, for unclear reasons (it's not like they're fooling anyone), but they've been doing interesting OA-like research with it: eg "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.

Redwood Research’s current project

Controlling the violence latent would let you systematically sample for it: you could hold the violence latent constant, and generate an evenly spaced grid of points around it to get a wide diversity of violent but stylistically/semantically unique samples. Kinds of text which would be exponentially hard to find by brute force sampling can be found this way easily. It also lets you do various kinds of guided search or diversity sampling, and do data augmentation (encode known-violent samples into their latent, hold the violence latent constant, generate a bunch of samples 'near' it). Even if the violence latent is pretty low quality, it's still probably a lot better as an initialization for sampling than trying to brute force random samples and running into very rapidly diminishing returns as you try to dig your way into the tails.
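To make the grid idea concrete, a minimal sketch under a strong and purely hypothetical assumption: that "violence" corresponds to a single coordinate of a latent vector, and that some decoder (not shown, and not any real model's API) turns latents back into text. The function and argument names are invented for illustration.

```python
import itertools
import numpy as np

def latent_grid(base_latent, fixed_index, fixed_value, vary_indices, offsets):
    """Evenly spaced grid of latents around base_latent.

    The coordinate at fixed_index (the hypothetical 'violence' direction) is
    pinned to fixed_value; the coordinates in vary_indices are shifted by
    every combination of offsets.
    """
    base = np.asarray(base_latent, dtype=float)
    vary = list(vary_indices)
    points = []
    for combo in itertools.product(offsets, repeat=len(vary)):
        z = base.copy()
        z[vary] += np.array(combo)  # move along the style/semantic directions
        z[fixed_index] = fixed_value  # hold the violence coordinate constant
        points.append(z)
    return np.stack(points)

# Example: a 4-d latent from an encoded known-violent sample (made-up numbers).
encoded = np.array([0.2, -1.0, 0.7, 3.0])
grid = latent_grid(encoded, fixed_index=3, fixed_value=3.0,
                   vary_indices=(0, 1), offsets=(-0.5, 0.0, 0.5))
print(grid.shape)  # one row per grid point, each to be fed to the decoder
```

Each row would then be decoded and sent to the classifier or the human labelers; whether a real model's latent space is anywhere near this axis-aligned is exactly the empirical question.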

And if you can't do any of that because there is no equivalent of a violence latent or its equivalent is clearly too narrow & incomplete, that is pretty important, I would think. Violence is such a salient category, so frequent in fiction and nonfiction (news), that a generative model which has not learned it as a concept is, IMO, probably too stupid to be all that useful as a 'model organism' of alignment. (I would not expect a classifier based on a failed generative model to be all that useful either.) If a model cannot or does not understand what 'violence' is, how can you hope to get a model which knows not to generate violence, can recognize violence, can ask for labels on violence, or do anything useful about violence?

Redwood Research’s current project

Similarly, you might think that a promising approach is to look for snippets which cause the generator to generate violent completions with particularly high probability, reasoning that if the classifier says that the first 99 completions were bad but that the 100th was good, there’s perhaps an unusually high chance that it’s wrong about that 100th completion. And again, you can take this into account at eval time, by increasing the conservatism of your classifier based on how many completions it has rejected already...Try cleverer approaches to look for model mistakes, TBD. We’ve done a couple of things here but nothing has panned out super well yet.

Have you tried any of the guided generation approaches like GeDI to make the model generate only violent completions and then calling in the human oracles on all of those guided completions which the classifier misses? Or looking for a 'violence' latent?

[Book Review] "The Alignment Problem" by Brian Christian

I'm curious what animal I would get classified as if people who look like me were removed from Google Photos training dataset. (I hope it's a meerkat.)

If anyone was wondering, no journalists bothered reporting this, but that system classified white people as 'dogs' and 'seals'.

Can you get AGI from a Transformer?

Yeah, I didn't want to just nitpick over "is this tree search a MCTS or not", which is why I added in #2-4, which address the steelman - even if you think MuZero is using MCTS, I think that doesn't matter because one doesn't need any tree search at all, so a fortiori that question doesn't matter.

(I also think the MuZero paper is generally confusing and poorly-written, and that's where a lot of confusion is coming from. I am not the only person to read it through several times and come away confused about multiple things, and people trying to independently reimplement MuZero tell me that it seems to leave out a lot of details. There's been multiple interesting followup papers, so perhaps reading them all together would clarify things.)

Yes, so on your spectrum of #1-6, I would put myself at closer to 3 than 2. I would say that while we have the global compute capacity now to scale up what are the moral equivalents of contemporary models to what the scaling laws would predict is human-equivalence (assuming, as seems likely but far from certain, that they more or less hold - we haven't seen any scaling law truly break yet), at the hundreds of trillions to quadrillion parameter regime of Transformers or MLPs, this is only about the compute for a single training run. The hardware exists and the world is wealthy enough to afford it if it wanted to (although it doesn't).

But we actually need the compute for the equivalent of many runs. The reason hardware progress drives algorithmic software progress is because we are absolutely terrible at designing NNs, and are little more than monkeys banging at giant black boxes with trial-and-error, confabulating or retrospectively cherrypicking theories to explain the observed results. Thus we need enough compute to blow on enough runs that a grad student can go 'what if I added a shortcut connection? Oh' or 'these MLP things never work beyond 3 or 4 layers, everyone knows that... but what if I added any kind of normalization, the way we normalize every other kind of NN? Oh' and figure out the right detail which makes it Just Work.

So, we will need a lot of algorithmic efficiency beyond the bare minimum of '1 training run, once', to afford all the slightly-broken training runs.

(Unless we get 'lucky' and the small prototyping runs are so accurate and the code so solid that you can prototype at a tiny scale and do 1 run; I tend to disbelieve this because there are so many issues that always come up as you scale up several orders of magnitude, both at just the code level and in training.)

On the other hand, it is something that humans deliberately added to the code.

/shrug. If you don't like the TreeQN example, I have others! Just keep making the NN deeper (and/or more recurrent, same thing really, when unrolled...), and it'll keep approximating the value function better at fairly modest additional cost compared to 'real' tree search. (After all, the human brain can't have any symbolic discrete tree in it either, it just passes everything forward for the initial glance and then recurs for System 2 thinking through the game tree.)

I see symbolic vs neural as a bias-variance continuum, per the Bitter Lesson: symbolic learns quickly for little compute, but then it tops out, and eventually, the scissors cross, and the more neural you go, the better it gets. So the question ultimately becomes one of budgets. What's your budget? How much constant-factor performance optimization and ultimate ceiling do you need, and how much hand-engineering of that specialized complicated symbolic architecture are you willing to buy? If you have little compute and don't mind attaining less than superhuman performance and buying a lot of complicated domain-specific code, you will move far down the symbolic end; if you have lots of compute and want the best possible generic code...

and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs".

But that's where the scaling laws become concerning. Can AutoML-Zero successfully search for "code to implement MCTS with pUCT exploration heuristic and domain-specific tuned hyperparameters with heavy playouts using a shallow MLP for value approximation"? Probably no. That's complex, specialized, and fragile (a half-working version doesn't work at all). Can AutoML-Zero learn "add 10 moar layers to $DEFAULT_NN lol"? ...Probably yes.

Can you get AGI from a Transformer?

No. I am very familiar with the paper, and MuZero does not use MCTS, nor does it support the claims of OP.

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.*

Second, using MCTS for the planning is not necessary. As they note, any kind of planning algorithm, not just MCTS, would work ("For example, a naive search could simply select the k step action sequence that maximizes the value function. More generally, we may apply any MDP planning algorithm to the internal rewards and state space induced by the dynamics function.")
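The "naive search" in that quote can be sketched in a few lines. This is not MuZero's code: the dynamics and value functions below are toy stand-ins for the learned networks, intermediate rewards are ignored for simplicity, and all names are hypothetical.

```python
import itertools

def naive_k_step_plan(state, actions, dynamics, value, k):
    """Exhaustively pick the k-step action sequence whose final state has the
    highest value. No tree statistics, no rollouts, no MCTS."""
    best_seq, best_val = None, float("-inf")
    for seq in itertools.product(actions, repeat=k):
        s = state
        for a in seq:
            s = dynamics(s, a)  # stand-in for a learned dynamics function
        v = value(s)  # stand-in for a learned value function
        if v > best_val:
            best_seq, best_val = seq, v
    return best_seq, best_val

# Toy problem: walk along the integers, trying to end up at 5.
toy_dynamics = lambda s, a: s + a
toy_value = lambda s: -abs(s - 5)

best_seq, best_val = naive_k_step_plan(0, (-1, 1, 2), toy_dynamics, toy_value, k=3)
print(best_seq, best_val)
```

It costs |actions|^k evaluations, so it only works for small k, but it makes the point that the planner is a swappable component sitting on top of the learned model.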

Third, NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion. What more do you want? I feel that we should at least acknowledge that TreeQN exists, wasn't insuperably hard to create, and, inasmuch as it runs on current hardware at all, doesn't seem to entail 'a factor of a million slowdown'. (VIN/VPN/Predictron might count as examples here too? There's a lot of model-based RL work which makes the NN learn part of the planning process, like Imagination-based Planner or MCTSnets.)

Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a non-tree-search model trained for longer. (I would also point out the steeply diminishing returns to both depth & number of iterations: AlphaZero or Master, IIRC, used only a few TPUs because the tree-search was a simple one which only descended a few plies; you can also see in the papers like the MuZero appendix referenced that most of the play strength comes from just a few iterations, and they don't even evaluate at more than 800, IIRC. It seems like what tree search does qualitatively is correct the occasional blind spot where the NN thinks forward a few moves for its best move and goes 'oh shit! That's actually a bad idea!'. It's not doing anything super-impressive or subtle. It's just a modest local policy iteration update, if you will. But the NN is what does almost all of the work.) This alone is completely fatal to OP's claims that tree search is an example of useful algorithms neural nets cannot do and that adding orders of magnitude more compute would not make a difference (it totally would - the exact scaling exponent for Go/ALE is unknown but I'd bet that anything you can do with MuZero+tree-search can be done with a larger MuZero's policy alone given another order or three of compute).

So, MuZero does not use MCTS; the symbolic tree planning algorithm(s) it uses are not that important; to the extent that explicit tree planning is useful it can be done in a pure neural fashion; and relatively modest (as these things go) increases in compute can obviate the need for even pure neural tree search.

This refutes Byrne's use of tree search as an example of "Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty." Tree search is not an example because it already has been cast into DNN form without exorbitant performance penalty.

* for more on what AlphaZero MCTS "really" is, & come to mind.
