[AN #156]: The scaling hypothesis: a plan for building AGI

This comment is inspired by a conversation with Ajeya Cotra.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

To elaborate on this more:

Claim 1: Scaling hypothesis + abundance of data + competitiveness requirement implies that an alignment solution will need to involve pretraining.

Argument: The scaling hypothesis implies that you can get strong capabilities out of abundant effectively-free data. So, if you want your alignment proposal to be competitive, it must also get strong capabilities out of effectively-free data. So far, the only method we know of for this is pretraining.

Note that you could have schemes where you train an actor model using a reward model that is always aligned; in this case your actor model could avoid pretraining (since you can generate effectively-free data from the reward model) but your reward model will need to be pretrained. So the claim is that some part of your scheme involves pretraining; it doesn't have to be the final agent that is deployed.

Claim 2: For a fixed 'reasonable' pretraining objective, there exists some (possibly crazy and bespoke but still reasonably-sized) dataset which would make the resulting model aligned without any finetuning.

(This claim is more of an intuition pump for Claim 3, rather than an interesting claim in its own right)

Argument 1: As long as your pretraining objective doesn't do something unreasonable like say "ignore the data, always say 'hello world'", given the fixed pretraining objective each data point acts as a "constraint" on the parameters of the model. If you have D data points and N model parameters with D > N, then you should expect these constraints to approximately determine the model parameters (in the same way that N linearly independent equations on N variables uniquely determine those variables). So with the appropriate choice of the D data points, you should be able to get any model parameters you want, including the parameters of the aligned model.
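
To make the linear-equations analogy in Argument 1 concrete, here is a minimal sketch of the D = N case for a toy linear model. The names `solve_linear_system`, `target_params`, `inputs`, and `labels` are hypothetical and chosen only for illustration; exact elimination stands in for "training", which is an assumption of the sketch rather than a claim about how real pretraining behaves.

```python
from fractions import Fraction


def solve_linear_system(rows, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(rows)
    # Augmented matrix [A | b] in exact rational arithmetic.
    aug = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        # Find a row with a nonzero pivot in this column and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot equals 1.
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Hypothetical "desired" parameters of a 3-parameter linear model.
target_params = [Fraction(2), Fraction(-1), Fraction(1, 2)]

# Inputs we are free to choose; these three rows are linearly independent.
inputs = [[1, 0, 2], [0, 1, 1], [1, 1, 0]]

# Labels are chosen so that every data point is consistent with the target model.
labels = [sum(x * w for x, w in zip(row, target_params)) for row in inputs]

# "Training" here means finding the parameters that satisfy every data-point constraint.
learned_params = solve_linear_system(inputs, labels)
print(learned_params == target_params)
```

The point of the sketch is only that whoever picks the labels picks the solution; it says nothing about whether nonlinear models trained by gradient descent behave this way.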

Argument 2: There are ~tens of bits going into the choice of pretraining objective, and ~millions of bits going into the dataset, so in some sense nearly all of the action is in the dataset.

Argument 3: For the specific case of next-word prediction, you could take an aligned model, generate a dataset by running that model, and then train a new model with next-word prediction on that dataset.
I believe this is equivalent to model distillation, which has been found to be really unreasonably effective, including for generalization (see e.g. here), so I’d expect the resulting model would be aligned too.
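
The loop described in Argument 3 (run a model to generate data, then train a fresh model with next-word prediction on that data) can be sketched on a toy scale. Everything here is an assumption made for illustration: `TEACHER` is a hypothetical three-word bigram table standing in for the aligned model, and the "student" is a tabular model fit by counting, which is the maximum-likelihood solution for next-word prediction in this tiny setting.

```python
import random
from collections import Counter, defaultdict

# Hypothetical "teacher": a tiny next-word model given as a table of
# next-word probabilities. It stands in for the aligned model in the text.
TEACHER = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def sample_text(model, length, rng, start="a"):
    """Generate a dataset by running the model one word at a time."""
    words = [start]
    for _ in range(length - 1):
        dist = model[words[-1]]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


def fit_next_word_model(words):
    """Maximum-likelihood next-word prediction for a tabular bigram model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }


rng = random.Random(0)
dataset = sample_text(TEACHER, 20_000, rng)
student = fit_next_word_model(dataset)

# Largest disagreement between teacher and student next-word probabilities.
max_gap = max(
    abs(TEACHER[prev][w] - student.get(prev, {}).get(w, 0.0))
    for prev in TEACHER
    for w in TEACHER[prev]
)
print(f"largest teacher/student probability gap: {max_gap:.3f}")
```

Because the student has the same tabular form as the teacher, this only illustrates the data-generation-then-training loop; it is not evidence about how well distillation generalizes for large neural models.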

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Argument: Basically the same as for Claim 2: by far most of the influence on which model you get out is coming from the dataset.

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

Argument: There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

[-]TurnTrout4y40

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.

And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc) to specify a set of programs than does U2. And so "there exists a program in U2-encoding which implements P in U1-encoding" doesn't get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.

Stepping out of the analogy, even though I agree that "reasonable" pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave room that some objectives make alignment far more likely, a priori. And you may be noting as much:

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

[-]Rohin Shah4y60

Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that.

I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

[-]TurnTrout4y20

Sure.

Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

[-]Rohin Shah4y20

what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).
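
One way to see the no-free-lunch worry is a brute-force toy: weight every possible labeling of a small domain equally (no ease-of-specification prior) and check how well a fixed learner does off the training set. The domain, the 6/2 train/test split, and `majority_learner` are all hypothetical choices made for this sketch.

```python
from itertools import product

# All inputs of a tiny domain: 3-bit strings.
inputs = list(product([0, 1], repeat=3))
train_inputs, test_inputs = inputs[:6], inputs[6:]


def majority_learner(train_labels):
    """A fixed learner: predict the majority training label on every unseen input."""
    guess = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return lambda x: guess


correct = 0
total = 0
# Enumerate every possible labeling of the domain, each weighted equally,
# i.e. with no ease-of-specification prior over "ground truths".
for labeling in product([0, 1], repeat=len(inputs)):
    truth = dict(zip(inputs, labeling))
    model = majority_learner([truth[x] for x in train_inputs])
    for x in test_inputs:
        correct += model(x) == truth[x]
        total += 1

held_out_accuracy = correct / total
print(held_out_accuracy)  # 0.5
```

Under the uniform weighting the held-out labels are independent of the training labels, so any deterministic learner lands at exactly chance here; swapping in a different learner changes nothing, which is the sense in which the unweighted question stops being useful.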

[-]ESRogs4y*30

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

I got a bit confused by this section, I think because the word "model" is being used in two different ways, neither of which is in the sense of "machine learning model".

Paraphrasing what I think is being said:

An observer (us) has a model_1 of what GPT-N is doing.
According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
The observer's model_1 makes good predictions about GPT-N's behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
The way that the observer's model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite -- the observer's model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?

[-]Rohin Shah4y60

Yes, that's right, sorry about the confusion.

[-]Daniel Kokotajlo4y20

Huh, it seems I misunderstood you then! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made, and in fact the opposite seems supported by the argument you made. The argument you made was:

There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding "It'll probably just keep filling in missing words as in training" we should conclude "we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know."

[-]Rohin Shah4y*40

The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.

?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs's response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective."

Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?)

we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know.

This seems reasonable to me (and seems compatible with what I said)

[-]Daniel Kokotajlo4y40

OK cool, sorry for the confusion. Yeah, I think ESRogs's interpretation of you was making a somewhat stronger claim than you actually were.

[-]Lukas Finnveden4y70

(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

I'd like to know what this figure is based on. In the linked post, Gwern writes:

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.

But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:

Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre [8]); Hamid Moradi found 1.62-2.28 bits on various books [9]; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits [10]; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character. [11]

I'm not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more... this isn't just lots of uncertainty, but vast amounts of uncertainty, where it's very plausible that GPT-3 has already beaten humans. This wouldn't be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don't know by default.

I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it'd be good to know if that's what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.

[-]gwern4y70

It's based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effects are always overestimated and can be rounded down by 50% to get a more plausible real world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal like slang), don't involve extensive human calibration or training like the models get, don't involve any adversarial examples, don't try to test human reasoning by writing up texts made up of logical riddles and puzzles or complicated cause-and-effect scenarios or even things like Winograd Schemas, are time-biased, etc. We've seen a lot of these issues come up in benchmarking, like ImageNet models underperforming outside ImageNet despite hitting human parity or superiority on it. (If we are interested in truly testing 'compression = intelligence', we need texts which stress all capabilities and remove all of those issues.)

So given that the lower end of Shannon's interval is 0.6, Grassberger's asymptotic estimate is 0.8 (footnote 11), and there is a wide spread of upper bounds going down to 1.3, along with extremely dumb fast algorithms hitting 2, I am comfortable rounding them downish to an estimate of 0.7 bpc for human performance; I expect that, if anything, to still underestimate true human peak performance, so I wouldn't be shocked if it was actually more like 0.6 bpc.
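
For readers less familiar with the unit being argued over, here is a small sketch of what "bits per character" measures. Both function names are hypothetical. The first computes the entropy of a very crude order-0 model (single-character frequencies), which is far weaker than any of the estimators quoted above; the second converts a language-model loss in nats per token into bits per character, assuming you supply an average characters-per-token figure for your tokenizer.

```python
import math
from collections import Counter


def order0_bits_per_character(text):
    """Empirical entropy of single characters, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def nats_per_token_to_bits_per_character(loss_nats, characters_per_token):
    """Convert a cross-entropy loss in nats/token to bits/character.

    characters_per_token is an assumption the caller must supply; it
    depends on the tokenizer and the text being evaluated.
    """
    return loss_nats / math.log(2) / characters_per_token


# Two equally likely symbols carry exactly one bit each.
print(order0_bits_per_character("abab"))  # 1.0
```

Lower is better: a predictor that assigns higher probability to the character that actually comes next needs fewer bits to encode it, which is why the human and model figures can be compared on the same scale at all.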

[-]Aryeh Englander4y40

I'd like to hear more thoughts, from Rohin or anybody else, about how the scaling hypothesis might affect safety work.

[-]Rohin Shah4y30

Wrote a separate comment here (in particular I think claims 1 and 4 are directly relevant to safety)

[-]David Scott Krueger (formerly: capybaralet)4y30

Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.

* I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts. This appears to contradict the "Second" point above somewhat.
* I want people to work on developing the infrastructure for such analyses. This is in keeping with the "Third" point.
* This will likely involve a massive increase in investment of AI talent in the process of certification.

As an example, I think "manipulative" algorithms -- that treat humans as part of the state to be optimized over -- should be banned in many applications in the near future, and that we need expert involvement to determine the propensity of different algorithms to actually optimize over humans in various contexts.

[-]Rohin Shah4y20

I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts. This appears to contradict the "Second" point above somewhat.

The idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, and that you need to certify the analysis itself.

[-]David Scott Krueger (formerly: capybaralet)4y30

I think the contradiction may only be apparent, but I thought it was worth mentioning anyways.
My point was just that we might actually want certifications to say things about specific algorithms.
