gwern

Comments

Should we rely on the speed prior for safety?

One way this solution might not hold is the case of a compressed lookup table. Meta-learning could have a shorter description length because it compresses knowledge, as opposed to a classical lookup table, which adds a new entry for each new input. A compressed lookup table could potentially have a shorter description length than a GLUT, even one that grows logarithmically, implying that the speed prior wouldn’t necessarily favour meta-learning; but this requires further investigation.

Why isn't the learned 'compression' algorithm itself an inner algorithm that the meta-learner is learning and thus incentivized? "Compression = intelligence."

A 'dumb' off-the-shelf compressor like xz isn't going to be able to do much with a few petabytes of chess endgame databases; but there are generic algorithms which can compute endgame databases, and those programs are more like kilobytes in size. A space-time-weighted algorithm would then probably look something like endgame databases for the most frequent endgames (megabytes), plus on-the-fly dynamic programming to compute everything missing, with speedups from the cached databases.
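
A minimal sketch of that space-time tradeoff, using Nim rather than chess so it stays self-contained (my own toy, not from the thread): a small precomputed 'endgame table' for the most common positions, plus memoized on-the-fly search for everything else.

    from functools import lru_cache

    # Precomputed "endgame table" for the smallest/most frequent positions.
    # True = the side to move wins (take-1-to-3-stones Nim, last stone wins).
    ENDGAME_TABLE = {0: False, 1: True, 2: True, 3: True, 4: False}

    @lru_cache(maxsize=None)
    def wins(stones: int) -> bool:
        if stones in ENDGAME_TABLE:            # cheap cached table lookup
            return ENDGAME_TABLE[stones]
        # dynamic-programming fallback: a move wins if it leaves the opponent losing
        return any(not wins(stones - take) for take in (1, 2, 3))

    print(wins(20))   # False: multiples of 4 are losses for the side to move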

This is generic to many board games, and so a meta-learner tasked with many board games should learn the template and how to specialize it to specific games.

AI Tracker: monitoring current and near-future risks from superscale models

It’s worth noting that, aside from the ridiculous situation where Googlers aren’t allowed to name LaMDA (despite at least 5 published papers so far), Google has been very coy about MUM & Pathways (to the point where I’m still not sure if ‘Pathways’ is an actual model that exists, or merely an aspirational goal/name of a research programme). You also have the situation where a model like LG’s new 300b Exaone is described in a research paper which makes no mention of Exaone (the Korean coverage briefly mentions the L-Verse arch, but none of the English coverage does), or where we still have little idea what the various Wudao models (the MoE recordholders...?) do. And how about that Megatron-Turing NLG-530b, eh? Is it cool, or not? A blog post, and one paper about how efficiently it can censor tweets, is not much to evaluate it on.

And forget about real evaluation! I'm sure OA's DALL-E successor, GLIDE, is capable of very cool things, which people would find if they could poke at it to establish things like CLIP's* ability to do visual analogies or "the Unreal Engine prompt"; but we'll never know because they aren't going to release it, and if they do, it'll be locked behind an API where you can't do many of the useful things like backpropping through it.

We are much more ignorant about the capabilities of the best models today than we were a year ago.

Increasingly, we’re gonna need range/interval notation and survival/extremes analysis to model things, since exact dates, benchmarks, petaflop/s-days, and parameter counts will be unavailable. Better start updating your data model & graphs now.

* One gets the impression that if OA had realized just how powerful CLIP was for doing more than just zero-shot ImageNet classification or re-ranking DALL-E samples, they probably wouldn't've released the largest models of it. Another AI capabilities lesson: even the creators of something don't always know what it is capable of. "Attacks only get better."

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?


My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind of strategy after an SGD update.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like "change in KL of overall distribution of output from small vanilla supervised-learning NNs", and it definitely does not generalize to even small amounts of confidence in assertions like "SGD updates based on small n cannot meaningfully change a very powerful NN's behavior in a real-world-affecting way".

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it's responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the 'edge of chaos' in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about 'oh, NNs are just smooth and don't change much on one update' are so much hot air against this universal prior.

  • As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg 'grokking', or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly 'gets it'). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that "oh, GPT-3 understands logic at exactly iteration #501,333", but it should definitely make you think about assumptions like "it must take countless SGD iterations for each of these capabilities to emerge." (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell's writings.)

Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN 'smoothness'. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from 'cat' to 'dog', why can't an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as "even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust".)
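
To make the adversarial-example point concrete, here is a minimal FGSM-style sketch (my own illustration of the standard gradient-sign attack, not anything from the post; `model`, `images`, `labels` are placeholders): a perceptually tiny, gradient-aligned nudge to the input is often enough to flip the predicted class.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps=2/255):
        # Perturb x by eps in the sign of the loss gradient w.r.t. the input:
        # a ~1-intensity-level change per channel that maximally increases the loss.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()

    # e.g.: model(fgsm(model, images, labels)).argmax(-1) frequently != labels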

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete, so small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother, easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, "Exterminate all humans" and "usher in the New Jerusalem", which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which, after adjusting all of the parameters, might reverse the ranking? Why, exactly?
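
The numeric example is easy to make concrete. A toy sketch (my own illustration, treating the two Q-values directly as trainable parameters rather than a real network) showing a single small SGD step reversing the argmax:

    import torch

    # Two nearly tied action-values: [exterminate, New Jerusalem]
    q = torch.tensor([0.4999999, 0.5000001], requires_grad=True)
    opt = torch.optim.SGD([q], lr=1e-3)

    print("before:", q.argmax().item())   # 1: acts friendly
    loss = -q[0]                           # any update that nudges action 0 upward
    loss.backward()
    opt.step()
    print("after:", q.argmax().item())    # 0: the ranking reversed after one update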

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that's where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent... until he touches the right piece of paper and 'wakes up'). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.
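
For the MAML point, a minimal second-order sketch (my own toy on random linear-regression tasks, not code from any paper mentioned above): the outer loss is evaluated after a single small inner gradient step, so meta-training deliberately parks the parameters where one tiny update produces a large change in behavior.

    import torch

    def adapted_loss(params, x, y, inner_lr=0.01):
        # Outer loss measured *after* one small inner gradient step on the task.
        task_loss = ((x @ params - y) ** 2).mean()
        grad, = torch.autograd.grad(task_loss, params, create_graph=True)
        adapted = params - inner_lr * grad          # one small first-order update
        return ((x @ adapted - y) ** 2).mean()

    params = torch.zeros(4, requires_grad=True)
    meta_opt = torch.optim.SGD([params], lr=0.1)
    for _ in range(200):                            # meta-train over random linear tasks
        w_task = torch.randn(4)
        x = torch.randn(32, 4)
        meta_opt.zero_grad()
        adapted_loss(params, x, x @ w_task).backward()
        meta_opt.step()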

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the 'parameters', updates to the hidden state / internal activations are quite sufficient.)

More Christiano, Cotra, and Yudkowsky on AI progress

If you said "actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM" then I agree that will give you a hard takeoff, but that's what I'm saying transformers aren't a good example of.

Why not? Here we have a pretty clean break: RNNs are not a tweak or two away from Transformers. We have one large, important family of algorithms which, we can empirically demonstrate, does not usefully absorb the compute that a later, discretely different family does; that later family is responsible for an ever-larger share of compute, and the longer its improvements had been forgone, the more compute overhang there would've been to exploit.

In a world where Transformers did not exist, we would not be talking about GPRNN-3 as a followup to GPRNN-2, which followed up OA's original & much-unloved GPT-1 RNN. What would happen is that OA would put $10m into GPRNN-3, observe that it didn't go anywhere (hard to eyeball the curves, but I wonder if it'd even work as well as GPT-2 did), and the status quo of <100m-parameter RNNs would just keep going. There would not be any Switch Transformer, any WuDao, any HyperClova, any Pangu-Alpha, any Pathways/LaMDA/MUM; FB's scaleup program in audio & translation wouldn't be going... (There probably wouldn't be any MLP renaissance either, as everyone seems to get there by asking 'how much of a Transformer do we need anyway? how much can I ablate away? hm, looks like "all of it" when I start with a modern foundation with normalized layers'.) We know what would've happened without Transformers: nothing. We can observe the counterfactual by simply looking: no magic RNNs dropped out of the sky merely to 'make line go straight brrr'. It would simply be yet another sigmoid ending and an exciting field turning into a 'mature technology': "well, we scaled up RNNs and they worked pretty well, but it'll require new approaches or way more compute than we'll have for decades to come, oh well, let's dick around until then." Such a plateau would be no surprise, any more than it ought to be surprising that in 2021 you or I are not flying around on hypersonic rocket-jet personal pod cars the way everyone in aerospace was forecasting in the 1950s by projecting out centuries of speed increases.

More Christiano, Cotra, and Yudkowsky on AI progress

And I’d reject LSTM → transformer or MoE as an example because the quantitative effect size isn’t that big.

But if something like that made the difference between “this algorithm wasn’t scaling before, and now it’s scaling,” then I’d be surprised.

Hold on, why doesn't LSTM→Transformer count? You've basically never seen an LSTM RNN larger than 100m parameters, I think, and the reason is that their scaling exponent looks bad and past 100m they're floundering: https://www.gwern.net/images/ai/gpt/2020-kaplan-figure7-rnnsvstransformers.png (Kaplan) Or https://arxiv.org/abs/2106.09488#amazon which fits proper scaling laws to the LSTM RNNs & Transformers, and finds that Transformers are already twice as efficient in the range tested (in terms of reducing loss), and getting better asymptotically (better slope: −0.167 vs −0.197*). I doubt you could train an RNN the size of GPT-3 at all, and if you did, it would cost much more (as the 'AI and Compute' trendline has stopped).

* I admit this is not very impressive, but the acoustic scaling paper has the problem that it's almost at the irreducible loss asymptote already: they hit a loss of 0.32 at only 0.1 petaflop/s-days, but the irreducible loss L∞ is apparently 0.30. (Meanwhile, language models like GPT-3 at 3640 petaflop/s-days are still very far from their irreducible loss.) So while the Transformer would only have a 6.38× advantage if I loosely copy over the exponents and imagine scaling by 36400, and compare 1 - 36400^(-0.197) = 0.873 and 1 - (36400*x)^(-0.167) = 0.873, I think this lower-bounds the Transformer advantage in a floor-effect way: their acoustic-modeling problem is just 'too easy' to really show the difference.
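
As a quick check of that multiplier (my own arithmetic, using the rounded slopes quoted above, so it comes out slightly above the 6.38× figure):

    import math

    # Solve 1 - (36400*x)**(-0.167) == 1 - 36400**(-0.197) for x, i.e. how much
    # extra compute the flatter-sloped model needs to match the steeper one.
    rnn_slope, tf_slope, scale = -0.167, -0.197, 36400
    x = math.exp((tf_slope / rnn_slope - 1) * math.log(scale))
    print(round(x, 2))   # ~6.6 with these rounded slopes, vs the ~6.38x quoted above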

Misc. questions about EfficientZero

One thing to note is that you don't know how many games humans are playing in their heads, in some sense. We don't have access to that kind of information about our own algorithms. If you think we don't, merely because we don't consciously experience or remember them, that reasoning is obviously wrong. Every time you have a thought pop out of nowhere or a eureka! moment from the incubation effect, or every time you have a Tetris-effect dream (or all the experience-replay hippocampus neuroscience), you see how it feels to have powerful subconscious algorithms churning away on difficult problems without you having any awareness of it: nothing. But they still take wallclock years to reach levels of performance that something like AlphaZero does in hours...

Misc. questions about EfficientZero

It's a little bit less dramatic than that: the model-based simulated play is interleaved with the ground-truth environment. It's more like you spend a year playing games in your head, then you play one 30-second bullet chess match with Magnus Carlsen (made-up ratio), then go back to playing in your head for another year. Or maybe we should say, "you clone yourself a thousand times, and play yourself at correspondence-chess timescales for 1 game per pair in a training montage, and then go back for a rematch".

(The scenario where you play for 15 minutes at the beginning, and then pass a few subjective eons trying to master the game with no further real data, would correspond to sample-efficient "offline reinforcement learning" (review), eg https://arxiv.org/abs/2104.06294#deepmind / https://arxiv.org/abs/2111.05424 / https://arxiv.org/abs/2006.13888 . Very important in its own right, of course, but it poses challenges of its own related to the quality of those 15 minutes: what if your 15-minute sample doesn't include any games which happen to use en passant, say? How could you ever learn that it should be part of your model of chess? When you interleave and learn online, Carlsen might surprise you with an en passant a few games in, and then all your subsequent imagined games can include that possible move and you can use it yourself.)

But it would maybe deflate some of the broader conclusions that people are drawing from EfficientZero, like how AI compares to human brains…

I don't think it does. If humans can't learn efficiently by imagining hypothetical games like machines can, so much the worse for them. The goal is to win.

Biology-Inspired AGI Timelines: The Trick That Never Works

Moravec says merely that $1k of human-level compute will become available in the '2020s', and he offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don't hit cheap human-compute until the end of the decade. He also doesn't commit to how long it takes to turn compute into powerful systems; it's more of a prerequisite: only once the compute is available can R&D really start, the same way that DL didn't start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.

$100/$50 rewards for good references

What we'd want is some neural-net style design that generates the coin reward and the move-right reward just from the game data, without any previous knowledge of the setting.

So you're looking for curriculum design/exploration in meta-reinforcement-learning? Something like Enhanced POET/PLR/REPAIRED but where it's not just moving-right but a complicated environment with arbitrary reward functions (eg. using randomly initialized CNNs to map state to 'reward')? Or would hindsight or successor methods count as they relabel rewards for executed trajectories? Would relatively complex generative games like Alchemy or LIGHT count? Self-play, like robotics self-play?
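
As one concrete reading of the 'randomly initialized CNNs as reward' parenthetical (my own sketch of how that could be instantiated, not a claim about any particular method): a frozen random network maps an observation to an arbitrary but fixed scalar reward, which then defines a procedurally generated task.

    import torch

    torch.manual_seed(0)
    random_reward_net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, kernel_size=3, stride=2), torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        torch.nn.Linear(16, 1),
    )
    for p in random_reward_net.parameters():
        p.requires_grad_(False)          # frozen: the random weights *are* the task

    obs = torch.rand(1, 3, 64, 64)       # stand-in RGB observation
    reward = random_reward_net(obs).item()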
