Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
Competence does not seem to aggressively overwhelm other advantages in humans:
[...]
g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).
The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the ELO difference between the best and worst players is less determined by general intelligence than by how much practice people have had.
If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from having domain-specific skills. E.g. I expect elected representatives to be vastly better at politics than medium humans.
If we instead consistently talk about general intelligence: The chess data doesn't falsify the hypothesis that human-level variation in general intelligence is small. To gather data about that, we'd want to analyse the ELO-difference between humans who have practiced similarly much but who have very different g.
(There are some papers on the correlation between intelligence and chess performance, so maybe you could get the relevant data from there. E.g. this paper says that (not controlling for anything) most measurements of cognitive ability correlates with chess performance at about ~0.24 (including IQ iff you exclude a weird outlier where the correlation was -0.51).)
Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.
Could you clarify this bit? It sounds like you're saying that OpenAI's capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think that there's substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which vaguely gestures at the "around 2017"-time). I don't see how those are compatible.
This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer's preferences would be if the AI entirely seized control of the reward.
But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn't reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.
As the main author of the "Alignment"-appendix of the truthful AI paper, it seems worth clarifying: I totally don't think that "train your AI to be truthful" in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:
While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.
In other words: I don't think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.
At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it's good if people (at least/especially people who wouldn't otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)
I don't think you're into this theory of change, because I suspect that you think that anyone who isn't directly aiming at the alignment problem has negligible chance of contributing any useful progress.
I just wanted to clarify that the truthful AI paper isn't evidence that people who try to hit the hard bits of alignment always miss — it's just a paper doing a different thing.
(And although I can't speak as confidently about others' views, I feel like that last sentence also applies to some of the other sections. E.g. Evan's statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all trying to tackle alignment itself. If you want to critique Evan's proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)
Here's what the curves look like if you fit them to the PaLM data-points as well as the GPT-3 data-points.
Keep in mind that this is still based on Kaplan scaling laws. The Chinchilla scaling laws would predict faster progress.
Linear:
Logistic:
First I gotta say: I thought I knew the art of doing quick-and-dirty calculations, but holy crap, this methodology is quick-and-dirty-ier than I would ever have thought of. I'm impressed.
But I don't think it currently gets to right answer. One salient thing: it doesn't take into account Kaplan's "contradiction". I.e., Kaplan's laws already suggested that once we were using enough FLOP, we would have to scale data faster than we have to do in the short term. So when I made my extrapolations, I used a data-exponent that was larger than the one that's represented in that graph.
I now tried to do figure out the answer to this question using Chinchilla's loss curves and Kaplan's adjusted-for-contradiction loss curves, but I realised...
...that Chinchilla's "loss" and Kaplan's "loss" are pretty incomparable.
It's unsurprising that they're somewhat different (they might have used different datasets or something, when evaluating the loss), but I am surprised that Chinchilla's curves uses an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7? (E.g. see here for me asking why gwern estimates 0.7, and gwern responding.)
Anyway, this makes it very non-obvious to me how to directly translate my benchmark extrapolations to a chinchilla context. Given that their "loss" is so different, I don't know what I could reasonably assume about the relationship between [benchmark performance as a function of chinchilla!loss] and [benchmark performance as a function of gpt-3!loss].
Ok so I tried running the numbers for the neural net anchor in my bio-anchors guesstimate replica.
Previously the neural network anchor used an exponent (alpha) of normal(0.8, 0.2) (first number is mean, second is standard deviation). I tried changing that to normal(1, 0.1) (smaller uncertainty because 1 is a more natural number, and some other evidence was already pointing towards 1). Also, the model previously said that a 1-trillion parameter model should be trained with 10^normal(11.2, 1.5) data points. I changed that to have a median at 21.2e12 parameters, since that's what the chinchilla paper recommends for a 1-trillion parameter models. (See table 3 here.)
The result of this is to increase the median compute needed by ~2.5 OOMs. The 5th percentile increases ~2 OOMs and the 95th percentile increases ~3.5 OOMs.
Depends on how you were getting to that +N OOMs number.
If you were looking at my post, or otherwise using the scaling laws to extrapolate how fast AI was improving on benchmarks (or subjective impressiveness), then the chinchilla laws means you should get there sooner. I haven't run the numbers on how much sooner.
If you were looking at Ajeya's neural network anchor (i.e. the one using the Kaplan scaling-laws, not the human-lifetime or evolution anchors), then you should now expect that AGI comes later. That model anchors the number of parameters in AGI to the number of synapses in the human brain, and then calculates how much compute you'd need to train a model of that size, if you were on the compute-optimal trajectory. With the chinchilla scaling laws, you need more data to train a compute-optimal model with a given number of parameters (data is proportional to parameters instead of parameters^0.7). So now it seems like it's going to be more expensive to train a compute-optimal model with 10^15 parameters, or however many parameteres AGI would need.
In fact, if we think of pseudo-inputs as predicates that constrain X, we can approximate the probability of unacceptable behavior during deployment as[7]
P(C(M,x) | x∼deploy)≈maxα∈XpseudoP(α(x) | x∼deploy)⋅ P(C(M,x) | α(x), x∼deploy) such that, if we can get a good implementation of P, we no longer have to worry as much about carefully constraining Xpseudo, as we can just let P's prior do that work for us.
Where footnote 7 reads:
Note that this approximation is tight if and only if there exists some α∈Xpseudo such that α(x)↔C(M,x)
I think the "if" direction is right, here, but the "only if" direction is wrong. For example, the approximation is also tight in the case where Xpseudo only has a single element alpha such that alpha(x) is true for all x.
I think the approximation is tight if and only if any of the α∈Xpseudo that maximizes the expression fulfils C(M,x) –> α(x).
I'm curious if anyone made a serious attempt at the shovel-ready math here and/or whether this approach to counterfactuals still looks promising to Abram? (Or anyone else with takes.)