Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
Here's a proposed operationalization.
For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X if doing X were a valuable instrumental goal for it.
For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.
Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task — not perform it.)
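For concreteness, a minimal sketch of the fine-tuning version of "capable of doing X" could look something like this. Everything here (the helper callables, the dataset cap, the success threshold) is a hypothetical placeholder rather than a reference to any particular eval setup:

```python
from typing import Any, Callable, Sequence

def capable_of_doing_x(
    model: Any,
    demo_dataset: Sequence[Any],
    finetune: Callable[[Any, Sequence[Any]], Any],
    evaluate_on_task: Callable[[Any], float],
    success_threshold: float = 0.5,
) -> bool:
    """Hypothetical check for the fine-tuning operationalization of "capable of doing X".

    demo_dataset: at most ~1000 demonstrations, written by people who know how to
    do the task but who aren't optimizing the data hard for this particular model.
    finetune / evaluate_on_task: supplied by the evaluator; placeholders here.
    """
    # The operationalization caps the hypothetical dataset at ~1000 data points.
    assert len(demo_dataset) <= 1000

    # Fine-tune a copy of the model on the small demonstration dataset.
    finetuned = finetune(model, demo_dataset)

    # The model counts as "capable of doing X" if the fine-tuned copy actually
    # starts doing X, as measured by whatever task-specific check we trust.
    return evaluate_on_task(finetuned) >= success_threshold
```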
I intend to write a lot more on the potential “brains vs brawns” matchup of humans vs AGI. It’s a topic that has received surprisingly little in-depth treatment from AI theorists.
I recommend checking out part 2 of Carl Shulman's Lunar Society podcast for content on how AGI could gather power and take over in practice.
Note that $B$ is $(0.2, 10, -1)$-distinguishable in $P$.
I think this isn't right, because definition 3 requires that $\sup_{s^*} \{B_P^-(s^*)\} \leq \gamma$.
And for your counterexample, $s^* =$ "C" will have $B_P^-(s^*)$ be 0 (because there's 0 probability of generating "C" in the future). So the sup is at least $0 > -1$.
(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)
Are you mainly interested in evaluating deceptive capabilities? I.e., no-holds-barred, can you elicit competent deception (or sub-components of deception) from the model? (Including by eg fine-tuning on data that demonstrates deception or sub-capabilities.)
Or evaluating inductive biases towards deception? I.e. testing whether the model is inclined towards deception in cases when the training data didn't necessarily require deceptive behavior.
(The latter might need to leverage some amount of capability evaluation, to distinguish not being inclined towards deception from not being capable of deception. But I don't think the reverse is true.)
Or do you disagree with that way of cutting up the space?
I assume that's from looking at the GPT-4 graph. I think the main graph I'd look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4, because PaLM-2 is 1-shot and GPT-4 only covers 4 benchmarks instead of 20+.
That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away.
(And since PaLM-2 and GPT-4 seemed roughly on trend in the places where I could check them, probably they wouldn't change that too much.)
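For concreteness, here's a minimal sketch of one way to do that kind of read-off: fit a simple logistic curve to average benchmark performance against log-compute and see where it crosses 90% and 95%. The data points below are made-up placeholders (not the numbers behind the graphs), and the choice of a logistic fit is just an illustrative assumption:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data: (log10 of training compute, average benchmark performance).
# Made-up illustrative numbers, not the actual values from the post.
log_compute = np.array([21.0, 22.0, 23.0, 24.0, 25.0])
performance = np.array([0.25, 0.40, 0.58, 0.72, 0.82])

def logistic(x, mid, scale):
    """Logistic curve in log-compute, running from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-(x - mid) / scale))

(mid, scale), _ = curve_fit(logistic, log_compute, performance, p0=[24.0, 1.5])

def ooms_to_reach(target):
    """Orders of magnitude of compute beyond the last data point to hit `target`."""
    x_target = mid - scale * np.log(1.0 / target - 1.0)  # invert the logistic
    return x_target - log_compute[-1]

print(f"~{ooms_to_reach(0.90):.1f} OOMs to 90%, ~{ooms_to_reach(0.95):.1f} OOMs to 95%")
```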
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:
That description misses out on effects where BNSL-fitting would predict that there's a slow, smooth shift from one power-law to another, and that this gradual shift will continue into the future. I don't know how important that is. Curious for your intuition about whether or not that's important, and/or other reasons for why my above description is or isn't reasonable.
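(For reference, my understanding of the single-break BNSL form is roughly the snippet below, where the log-log slope drifts smoothly from one exponent to another around the break, with a separate parameter controlling how gradual the shift is. I'm going from a skim of the paper, so the exact parameterization is worth double-checking:)

```python
import numpy as np

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    """Smoothly broken power law with a single break (my reading of the BNSL form).

    a: asymptotic offset, b: overall scale,
    c0: initial power-law exponent, c0 + c1: exponent after the break,
    d1: location of the break, f1: how gradual the transition is.
    """
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

# On a log-log plot the slope drifts smoothly from -c0 (for x << d1) to
# -(c0 + c1) (for x >> d1); larger f1 means a slower, more gradual shift.
x = np.logspace(0, 6, 7)
print(bnsl_one_break(x, a=0.0, b=1.0, c0=0.3, c1=0.5, d1=1e3, f1=2.0))
```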
When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the "sigmoid" forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance). Which maybe suggests that the claim "BNSL does better" corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the "sigmoid") isn't informative for how fast they converge to ~maximum performance (top part of the "sigmoid")? That seems plausible.
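Concretely, that "treat the top segment as its own power law" approach could look something like the sketch below. Fitting the power law to the remaining error (1 - performance) rather than to performance itself is an extra assumption on my part, and the data points are again made-up placeholders:

```python
import numpy as np

# Placeholder points (log10 of training compute, average performance); not real data.
log_compute = np.array([21.0, 22.0, 23.0, 24.0, 25.0, 26.0])
performance = np.array([0.30, 0.45, 0.58, 0.70, 0.79, 0.85])

# Keep only the top part of the "sigmoid": points above ~50% performance.
top = performance > 0.5
x, y = log_compute[top], performance[top]

# Fit a power law to the remaining error (1 - performance), i.e. a straight
# line in log(error) vs log(compute), using only the top-segment points.
slope, intercept = np.polyfit(x, np.log10(1.0 - y), deg=1)

def ooms_to_reach(target):
    """OOMs of compute beyond the last point until performance reaches `target`."""
    x_target = (np.log10(1.0 - target) - intercept) / slope
    return x_target - log_compute[-1]

print(f"~{ooms_to_reach(0.90):.1f} OOMs to 90%, ~{ooms_to_reach(0.95):.1f} OOMs to 95%")
```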
I'm curious if anyone made a serious attempt at the shovel-ready math here and/or whether this approach to counterfactuals still looks promising to Abram? (Or anyone else with takes.)
Competence does not seem to aggressively overwhelm other advantages in humans:
[...]
g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).
The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the Elo difference between the best and worst players is determined less by general intelligence than by how much practice people have had.
If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from having domain-specific skills. E.g. I expect elected representatives to be vastly better at politics than the median human.
If we instead consistently talk about general intelligence: The chess data doesn't falsify the hypothesis that human-level variation in general intelligence is small. To gather data about that, we'd want to analyse the Elo difference between humans who have practiced similar amounts but who have very different g.
(There are some papers on the correlation between intelligence and chess performance, so maybe you could get the relevant data from there. E.g. this paper says that (not controlling for anything) most measurements of cognitive ability correlate with chess performance at about ~0.24 (including IQ iff you exclude a weird outlier where the correlation was -0.51).)
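(For a sense of scale on those Elo numbers: under the standard Elo model, a rating gap translates into an expected score for the stronger player as in the sketch below. This is just the generic Elo formula, not anything specific to the chess data above.)

```python
def elo_expected_score(rating_gap):
    """Expected score (win probability, counting draws as half) for the stronger
    player, under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

for gap in [400, 2000]:
    print(f"{gap}-point gap -> expected score {elo_expected_score(gap):.5f}")
# A 400-point gap already gives the stronger player an expected score of ~0.91;
# at a 2000-point gap the weaker player essentially never scores (~0.99999).
```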
Cool paper!
I'd be keen to see more examples of the paraphrases, if you're able to share, to get a sense of the kind of data that lets the model generalize out of context. (E.g., if it's easy, you could take all 300 paraphrases of some statement (ideally one where performance improved), paste them into a Google Doc, and share that. Or lmk if this is on GitHub somewhere.)
I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those occasionally appear among the paraphrases. Curious if you have a prediction about that or if you already ran some experiments that shed some light on this. (I could have missed it even if it was in the paper.)
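To be concrete about the kind of ablation I have in mind, something like the comparison below, holding the total number of fine-tuning examples fixed. The function and parameter names are just placeholders, not from the paper:

```python
import random

def build_training_sets(paraphrases, n_examples=300, n_unique_low_diversity=10, seed=0):
    """Two hypothetical conditions for separating raw diversity from specific features.

    paraphrases: the full pool of paraphrases of one statement (e.g. ~300 of them).
    Returns (high_diversity, low_diversity) training sets of equal size.
    """
    rng = random.Random(seed)

    # Condition A: raw diversity -- as many distinct paraphrases as possible.
    high_diversity = rng.sample(paraphrases, min(n_examples, len(paraphrases)))

    # Condition B: low diversity -- a small subset repeated up to the same size.
    subset = rng.sample(paraphrases, min(n_unique_low_diversity, len(paraphrases)))
    low_diversity = [subset[i % len(subset)] for i in range(n_examples)]

    return high_diversity, low_diversity

# If A beats B at equal dataset size across many random subsets, that points to
# raw diversity; if some lucky subsets in B close most of the gap, that points
# to a few specific features doing much of the work.
```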