Here's what the curves look like if you fit them to the PaLM data-points as well as the GPT-3 data-points.
Keep in mind that this is still based on Kaplan scaling laws. The Chinchilla scaling laws would predict faster progress.
First I gotta say: I thought I knew the art of doing quick-and-dirty calculations, but holy crap, this methodology is quick-and-dirty-ier than I would ever have thought of. I'm impressed.
But I don't think it currently gets to the right answer. One salient thing: it doesn't take into account Kaplan's "contradiction". I.e., Kaplan's laws already suggested that once we were using enough FLOP, we would have to scale data faster than we have to in the short term. So when I made my extrapolations, I used a data-exponent that was larger than the one that's represented in that graph.
I now tried to figure out the answer to this question using Chinchilla's loss curves and Kaplan's adjusted-for-contradiction loss curves, but I realised...
...that Chinchilla's "loss" and Kaplan's "loss" are pretty incomparable.
It's unsurprising that they're somewhat different (they might have used different datasets or something, when evaluating the loss), but I am surprised that Chinchilla's curves use an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7? (E.g. see here for me asking why gwern estimates 0.7, and gwern responding.)
Anyway, this makes it very non-obvious to me how to directly translate my benchmark extrapolations to a chinchilla context. Given that their "loss" is so different, I don't know what I could reasonably assume about the relationship between [benchmark performance as a function of chinchilla!loss] and [benchmark performance as a function of gpt-3!loss].
Ok so I tried running the numbers for the neural net anchor in my bio-anchors guesstimate replica.
Previously the neural network anchor used an exponent (alpha) of normal(0.8, 0.2) (first number is mean, second is standard deviation). I tried changing that to normal(1, 0.1) (smaller uncertainty because 1 is a more natural number, and some other evidence was already pointing towards 1). Also, the model previously said that a 1-trillion parameter model should be trained with 10^normal(11.2, 1.5) data points. I changed that to have a median at 21.2e12 data points, since that's what the chinchilla paper recommends for a 1-trillion parameter model. (See table 3 here.)
The result of this is to increase the median compute needed by ~2.5 OOMs. The 5th percentile increases ~2 OOMs and the 95th percentile increases ~3.5 OOMs.
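As a rough sanity check on the direction and size of that shift, here is a minimal sketch that uses only the medians quoted above and ignores the spread of the distributions and everything else in the model. The function name, the power-law form anchored at 1e12 parameters, and the choice of 1e15 parameters are assumptions for illustration, not the actual guesstimate model.

```python
import math

def median_data_points(n_params, ref_data, alpha, ref_params=1e12):
    """Median data requirement, scaled from a 1e12-parameter reference point.

    Hypothetical helper: assumes data grows as a power law in parameters,
    data = ref_data * (n_params / ref_params) ** alpha.
    """
    return ref_data * (n_params / ref_params) ** alpha

# Illustrative model size (an assumption, not taken from the guesstimate model).
N_PARAMS = 1e15

# Old setting: median exponent 0.8, median 10^11.2 data points at 1e12 parameters.
old = median_data_points(N_PARAMS, ref_data=10 ** 11.2, alpha=0.8)

# New setting: median exponent 1, median 21.2e12 data points at 1e12 parameters.
new = median_data_points(N_PARAMS, ref_data=21.2e12, alpha=1.0)

# Shift in the median data requirement, in orders of magnitude.
oom_shift = math.log10(new / old)
print(f"median data requirement shifts by {oom_shift:.2f} OOMs")
```

With these point estimates the shift is about 2.7 OOMs, which is in the same ballpark as the numbers above. The full model samples from the distributions rather than plugging in medians, so an exact match shouldn't be expected.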
Depends on how you were getting to that +N OOMs number.
If you were looking at my post, or otherwise using the scaling laws to extrapolate how fast AI was improving on benchmarks (or subjective impressiveness), then the chinchilla laws mean you should get there sooner. I haven't run the numbers on how much sooner.
If you were looking at Ajeya's neural network anchor (i.e. the one using the Kaplan scaling-laws, not the human-lifetime or evolution anchors), then you should now expect that AGI comes later. That model anchors the number of parameters in AGI to the number of synapses in the human brain, and then calculates how much compute you'd need to train a model of that size, if you were on the compute-optimal trajectory. With the chinchilla scaling laws, you need more data to train a compute-optimal model with a given number of parameters (data is proportional to parameters instead of parameters^0.7). So now it seems like it's going to be more expensive to train a compute-optimal model with 10^15 parameters, or however many parameters AGI would need.
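A rough sketch of why the exponent matters so much, assuming training compute scales as parameters times data points. The symbols C (compute), N (parameters), D (data points), β (data exponent) and k (scale-up factor) are introduced here for illustration:

$$D \propto N^{\beta} \;\Rightarrow\; C \propto N \cdot D \propto N^{1+\beta}, \qquad \frac{C(kN)}{C(N)} = k^{1+\beta}$$

Under that assumption, scaling parameters up by 1000x multiplies compute by 10^5.1 with β = 0.7, but by 10^6 with β = 1, and the gap keeps widening as the model gets larger.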
In fact, if we think of pseudo-inputs as predicates that constrain X, we can approximate the probability of unacceptable behavior during deployment as
$$P(C(M,x) \mid x \sim \text{deploy}) \approx \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M,x) \mid \alpha(x),\, x \sim \text{deploy})$$
such that, if we can get a good implementation of $P$, we no longer have to worry as much about carefully constraining $X_{\text{pseudo}}$, as we can just let $P$'s prior do that work for us.
Where footnote 7 reads:
Note that this approximation is tight if and only if there exists some $\alpha \in X_{\text{pseudo}}$ such that $\alpha(x) \leftrightarrow C(M,x)$
I think the "if" direction is right, here, but the "only if" direction is wrong. For example, the approximation is also tight in the case where $X_{\text{pseudo}}$ only has a single element $\alpha$ such that $\alpha(x)$ is true for all $x$.
I think the approximation is tight if and only if any of the $\alpha \in X_{\text{pseudo}}$ that maximizes the expression fulfils $C(M,x) \rightarrow \alpha(x)$.
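One way to see this, as a sketch with the conditioning on x∼deploy left implicit: for any single α, the product inside the max is just a joint probability, which can never exceed the probability of unacceptable behavior itself.

$$P(\alpha(x)) \cdot P(C(M,x) \mid \alpha(x)) = P(C(M,x) \wedge \alpha(x)) = P(C(M,x)) - P(C(M,x) \wedge \neg\alpha(x)) \le P(C(M,x))$$

Equality holds exactly when $P(C(M,x) \wedge \neg\alpha(x)) = 0$, i.e. when $C(M,x) \rightarrow \alpha(x)$ holds with probability 1 on the deployment distribution (the "with probability 1" qualifier is my reading of the condition). An α that is true for all x satisfies this trivially, which is the single-element case above.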
I'm at like 30% on fast takeoff in the sense of "1 year doubling without preceding 4 year doubling" (a threshold roughly set to break any plausible quantitative historical precedent).
Huh, AI Impacts looked at one dataset of GWP (taken from Wikipedia, in turn taken from here) and found 2 precedents for "x year doubling without preceding 4x year doubling", roughly during the agricultural revolution. The dataset seems to be a combination of lots of different papers' estimates of human population, plus an assumption of ~constant GWP/capita early in history.
I agree that i does slightly worse than t on consistency-checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator.
Any articulable reason for why i just does slightly worse than t? Why would a 2N-node model fix a large majority of discrepancies between an N-node model and a 1e12*N-node model? I'd expect it to just fix a small fraction of them.
I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.
Yeah, if you can get better-looking consistency than the direct translator in some cases, I agree that a sufficiently high consistency penalty will just push towards exploiting that (even if the intermediate model needs to be almost as large as the full predictor to exploit it properly).
I'm curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?
Hypothesis: Maybe you're actually not considering a reporter i that always uses an intermediate model; but instead a reporter i' that does translations on hard questions, and just uses the intermediate model on questions where it's confident that the intermediate model understands everything relevant. I see three different possible issues with that idea:
1. To do this, i' needs an efficient way (i.e. one that doesn't scale with the size of the predictor) to (on at least some inputs) be highly confident that the intermediate model understands everything relevant about the situation. I think this is a reasonable "worst-case" assumption, but I'm not sure. If you're using it, I'd be curious to know.
2. Even when the reporter gets inputs that the intermediate model fully understands, it seems like the reporter will run into issues if its sampling-process (for generating distributions) runs into some inputs that the intermediate model doesn't fully understand. (I.e., if i' gets an input for which X1 is true, and then simulates many more random inputs for which X1 is true, and by chance X2 is true for one of them, then the reporter will have to do translation on that sample). Which makes it seem unlikely that i' wouldn't have to do translation at least once per input.
3. If the model is confident that the intermediate model understands everything relevant about the situation, it seems more efficient to return the intermediate model's answer to the question at hand than to loop over it many times, trying to fix correlations. So really we should get a reporter i'' that does translation on the predictor on hard questions and returns an intermediate model's latent knowledge on easy questions. That seems like an ok reporter to get.
I don't understand your counterexample in the appendix "Details for penalizing inconsistencies across different inputs". You present a cheating strategy that requires the reporter to run and interpret the predictor a bunch of times, which seems plausibly slower than doing honest translation. And then you say you fix this issue with:
But this dependence could be avoided if there was an intermediate model between the predictor’s Bayes net (which we are assuming is very large) and the human’s Bayes net. Errors identified by the intermediate model are likely to be “real” errors in the human’s model. But we can identify those errors with an amount of computation that depends only on the intermediate model, rather than needing to actually run the full predictor. For example, if the human’s Bayes net has size N, the intermediate model has size 2N, and the predictor’s Bayes net has size 1e12 x N, then the cost of using the intermediate model many times can still be small relative to the cost of direct translation.
Roughly speaking, I don't see how the reporter with an intermediate model systematically wins out over the translator given that the intermediate model will miss many things that the predictor understands.
Taking it more slowly, using the example from the report:
Let's say we have X1='do rowhammer' (which the intermediate model does understand) and X2='do eldritch hacks that the intermediate model isn't good enough to understand'.
Let's say we have two reporters: i, which uses the intermediate model, and t, the direct translator.
Let's say the reporter is asked "Is the diamond safe?" on an input that includes X1, and the human runs consistency checks across many actions containing X1 (and none containing X2). Here, it seems like i and t both pass the consistency checks, and i is faster than t, so i wins.
But now let's say that the reporter is asked "Is the diamond safe?" on an input that includes X2, and the human runs consistency checks across many actions containing X2. In this case, the reporter i will fail the consistency checks (since the intermediate model's predictions won't match the predictor's predictions, and so the reporter will fail to adjust for the revealing correlations), so t will come out ahead.
So if these reporters are the only competitors, it seems like we should be able to tune the regularization to make t win.
It's very easy to construct probability distributions that have earlier timelines, that look more intuitively unconfident, and that have higher entropy than the bio-anchors forecast. You can just take some of the probability mass from the peak around 2050 and redistribute it among earlier years, especially years that are very close to the present, where bio-anchors is reasonably confident that AGI is unlikely.
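A toy illustration of that construction, as a minimal sketch. The buckets and probabilities below are made up for illustration and are not the bio-anchors forecast; the helper names are hypothetical too.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def prob_by(p, year, years):
    """Total probability assigned to buckets at or before `year`."""
    return sum(q for y, q in zip(years, p) if y <= year)

# Made-up forecast over five buckets, peaked around 2050.
YEARS = [2030, 2040, 2050, 2060, 2070]
peaked = [0.02, 0.18, 0.50, 0.20, 0.10]

# Move some mass from the 2050 peak to the earliest bucket,
# where the original forecast was confident nothing would happen.
SHIFT = 0.15
flatter = list(peaked)
flatter[2] -= SHIFT
flatter[0] += SHIFT

print("entropy, peaked: ", entropy(peaked))
print("entropy, flatter:", entropy(flatter))
print("P(by 2040), peaked: ", prob_by(peaked, 2040, YEARS))
print("P(by 2040), flatter:", prob_by(flatter, 2040, YEARS))
```

The modified distribution has both higher entropy and more probability on early years, even though nothing about it is better justified than the original.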