Thoughts on the Alignment Implications of Scaling Language Models

leogao

[Epistemic status: slightly rambly, mostly personal intuition and opinion that will probably be experimentally proven wrong within a year considering how fast stuff moves in this field]

This post is also available on my personal blog.

Thanks to Gwern Branwen, Steven Byrnes, Dan Hendrycks, Connor Leahy, Adam Shimi, Kyle and Laria for the insightful discussions and feedback.

Background

By now, most of you have probably heard about GPT-3 and what it does. There’s been a bunch of different opinions on what it means for alignment, and this post is yet another opinion from a slightly different perspective.

Some background: I'm a part of EleutherAI, a decentralized research collective (read: glorified discord server - come join us on Discord for ML, alignment, and dank memes). We're best known for our ongoing effort to create a GPT-3-like large language model, and so we have a lot of experience working with transformer models and looking at scaling laws, but we also take alignment very seriously and spend a lot of time thinking about it (see here for an explanation of why we believe releasing a large language model is good for safety). The inspiration for writing this document came out of the realization that there's a lot of tacit knowledge and intuitions about scaling and LMs that's being siloed in our minds that other alignment people might not know about, and so we should try to get that out there. (That being said, the contents of this post are of course only my personal intuitions at this particular moment in time and are definitely not representative of the views of all EleutherAI members.) I also want to lay out some potential topics for future research that might be fruitful.

By the way, I did consider that the scaling laws implications might be an infohazard, but I think that ship sailed the moment the GPT-3 paper went live, and since we’ve already been in a race for parameters for some time (see: Megatron-LM, Turing-NLG, Switch Transformer, PanGu-α/盘古α, HyperCLOVA, Wudao/悟道 2.0, among others), I don’t really think this post is causing any non-negligible amount of desire for scaling.

Why scaling LMs might lead to Transformative AI

Why natural language as a medium

First, we need to look at why a perfect LM could in theory be Transformative AI. Natural language is an extremely good medium for representing complex, abstract concepts compactly and with little noise. Images, for example, are much less compact and don’t have as strong an intrinsic bias towards the types of abstractions we tend to draw in the world. This is not to say that we shouldn’t include images at all, though, just that natural language should be the focus.

Since text is so flexible and good at being entangled with all sorts of things in the world, to be able to model text perfectly, it seems that you'd have to model all the processes in the world that are causally responsible for the text, to the “resolution” necessary for the model to be totally indistinguishable from the distribution of real text. For more intuition along this line, the excellent post Methods of prompt programming explores, among other ideas closely related to the ideas in this post, a bunch of ways that reality is entangled with the textual universe:

A novel may attempt to represent psychological states with arbitrarily fidelity, and scientific publications describe models of reality on all levels of abstraction. [...] A system which predicts the dynamics of language to arbitrary accuracy does require a theory of mind(s) and a theory of the worlds in which the minds are embedded. The dynamics of language do not float free from cultural, psychological, or physical context[.]

What is resolution

I’m using resolution here to roughly mean how many different possible world-states get collapsed into the same textual result; data being high resolution would mean being able to narrow down the set of possible world states to a very small set.

(Brief sidenote: this explanation of resolution isn’t mandatory for understanding the rest of the post, so you can skip the rest of this section if you’re already convinced that it would be possible to make a LM have an internal world model. Also, this section is largely personal intuition and there isn’t much experimental evidence to either confirm or deny these intuitions (yet), so take this with a grain of salt.)

Here’s an example that conveys some of the intuition behind this idea of resolution and what it means for a system trying to model it. Imagine you’re trying to model wikipedia articles. There are several levels of abstraction this can be done on, all of which produce the exact same output but work very differently internally. You could just model the “text universe” of wikipedia articles—all of the correlations between words—and then leave it at that, never explicitly modelling anything at a higher level of abstraction. You could also model the things in the real world that the wikipedia articles are talking about as abstract objects with certain properties—maybe that’s sunsets, or Canadian geese, or cups of coffee—and then model the patterns of thought at the level of emotions and beliefs and thoughts of the humans as they observe these objects or phenomena and then materialize those thoughts into words. This still gets you the same distribution, but rather than modelling at the level of text, you’re modelling at the level of human-level abstractions, and also the process that converts those abstractions into text. Or you could model the position and momentum of every single particle in the Earth (ignoring quantum effects for the sake of this example), and then you could construct a ground up simulation of the entire Earth, which contains sunsets and geese and wikipedia editors.

Obviously, the easiest way is to just model at the level of the text. Humans are just really complicated, and certainly modelling the entire universe is hugely overkill. But let's say we now add a bunch of data generated by humans that has little mutual information with the wikipedia articles, like tweets. This increases the resolution of the data, because there are a bunch of world states you couldn't tell apart before that you now can. For the text model, as far as it's concerned, this data is totally unrelated to the wikipedia data, and it has to model it mostly separately except for maybe grammar. Meanwhile, the human-modelling system can largely reuse its model with only minor additions. (The universe model just flips a single flag corresponding to whether twitter data is recorded in the corpus).

As you keep increasing the resolution by adding more and more data (such as books, images, videos, and scientific data) generated by the kinds of systems we care about, the cost of modelling the text universe rapidly increases, but the cost of modelling abstractions closer to reality increases much more slowly. An analogy here is that if you only model a record of which side a flipped coin lands on, it's cheaper to just use a random number generator to pick between heads and tails, instead of simulating all the physics of a coin toss. However, if you're trying to model a 4k60fps video of a coin being flipped, you might instead want to actually simulate a coin being flipped and then deduce what the video should look like.

Another way of looking at this is that resolution is how much Bayesian evidence about the universe can be obtained by updating on the dataset from a universal prior.
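To make the "Bayesian evidence" framing a bit more concrete, here is a minimal sketch. It swaps the universal prior for a uniform prior over a tiny, made-up set of world states (the states and both `describe_*` functions are purely hypothetical), and measures how many bits a piece of text gives you about which world you are in.

```python
import math

# Hypothetical toy world: a state is (weather, bird). The prior is uniform,
# which is a stand-in assumption for the universal prior mentioned in the text.
WORLD_STATES = [(w, b) for w in ("sunny", "rainy") for b in ("goose", "duck", "none")]


def describe_coarse(state):
    """Low-resolution text: only the weather leaks into the words."""
    weather, _bird = state
    return f"It was {weather}."


def describe_fine(state):
    """Higher-resolution text: both parts of the world state leak into the words."""
    weather, bird = state
    return f"It was {weather} and I saw {bird}."


def bits_of_evidence(describe, observed_text):
    """Bits gained about the world state by updating a uniform prior on the text."""
    consistent = [s for s in WORLD_STATES if describe(s) == observed_text]
    return math.log2(len(WORLD_STATES)) - math.log2(len(consistent))


true_state = ("sunny", "goose")
coarse_bits = bits_of_evidence(describe_coarse, describe_coarse(true_state))
fine_bits = bits_of_evidence(describe_fine, describe_fine(true_state))
print(f"coarse text: {coarse_bits:.3f} bits, fine text: {fine_bits:.3f} bits")
```

The coarse description collapses three world states into the same sentence, while the fine one pins down a single state; in the sense used here, the second dataset is higher resolution.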

You might be wondering why it matters what the model's internal representation looks like if it outputs the exact same thing. Since having useful, disentangled internal models at roughly human-level abstractions is crucial for ideas like Alignment By Default, a model which can be harnessed through querying but which doesn’t contain useful internal models would be pretty good for capabilities (since it would be indistinguishable from one that does, if you just look at the output) and pretty bad for alignment, which is the worst case scenario in terms of the alignment-capabilities tradeoff. Also, it would be more brittle and susceptible to problems with off-distribution inputs, because it's right for the wrong reasons, which also hurts safety. If models only ever learn to model the text universe, this problem might become very insidious because as LMs get bigger and learn more, more and more of the text universe will be "on-distribution" for them, making it feel as though the model has generalized to all the inputs we can think of giving it, but breaking the moment the world changes enough (possibly because of the actions of an AI built using the LM, for example). Therefore, it's very important for safety for our models to have approximately human level abstractions internally. Of course, having internal models at the right level of abstraction isn't a guarantee that the model will generalize to any given degree (Model Splintering will probably become an issue when things go really off-distribution), but the failure cases would likely be more predictable.

In theory, you could also go too high resolution and end up with a model that models everything at the atomic level, which is also kind of useless for alignment. In practice, though, this is likely less of a tradeoff and more of a “more resolution is basically always better” situation since even the highest resolution data we can possibly get is peanuts in the grand scheme of things.

Another minor hitch is that the process through which physical events get picked up by humans and then converted into words is quite lossy and makes the resolution somewhat coarse; in other words, there are a lot of physical phenomena that we'd want our LM to understand but modelling what humans think about them is way easier than actually modelling the phenomena. This probably won't be a problem in practice, though, and if it is we can always fix it by adding non-human-generated data about the phenomena in question.

Thankfully, I don't expect resolution to be a hard problem to solve in practice. Making the data harder to model is pretty easy. We could look towards massively multilingual data, since texts in all languages describe similar concepts at human abstraction levels but in a way that increases the complexity of the text itself significantly. We could even use images or video, which are pretty high resolution (pun not intended), available in large supply, and already being pursued by a bunch of people in the form of multimodal models. The existence of Multimodal Neurons implies that it’s possible for models to internally unify these different modalities into one internal representation.

If that’s not enough, we could do some kind of augmentation or add synthetic data to the training set, designed in such a way that modelling the underlying processes that cause the synthetic data is significantly easier than directly modelling the distribution of text. There are a huge host of ways this could potentially be done, but here are a few random ideas: we could generate tons of sentences by randomly composing facts from something like Cyc; augment data by applying different basic ciphers; or add automatically generated proofs of random statements using proof assistants like Lean/Coq.

Where scaling laws fit in

Of course, we don’t have perfect LMs (yet). The main evidence in practice for scaling LMs being a viable path to this ideal case is in scaling laws (Scaling Laws for Neural Language Models). Essentially, what the scaling laws show is that empirically, you can predict the optimal loss achievable for a certain amount of compute poured into the model with striking accuracy, and that this holds across several orders of magnitude.

If you extrapolate the line out, it approaches zero as compute goes to infinity, which is totally impossible because natural language must have some nonzero irreducible loss. Therefore, at some point the scaling law must stop working. There are two main ways I could see this happen.
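As a rough illustration of why the extrapolation has to break somewhere, here is a minimal sketch of a pure power law in compute next to the same curve with a constant floor added. The constants `C_SCALE`, `ALPHA` and `IRREDUCIBLE` are made-up values chosen only to make the shape visible, not fitted numbers from the paper, and the additive floor is just one simple assumption about how an irreducible loss might enter.

```python
def power_law_loss(compute, c_scale, alpha):
    """Pure power law L(C) = (c_scale / C) ** alpha; tends to 0 as C grows."""
    return (c_scale / compute) ** alpha


def loss_with_floor(compute, c_scale, alpha, irreducible):
    """Same power law stacked on a constant irreducible loss (an assumed form)."""
    return irreducible + power_law_loss(compute, c_scale, alpha)


# Hypothetical constants, for illustration only.
C_SCALE = 1.0
ALPHA = 0.05
IRREDUCIBLE = 1.5

for exponent in (0, 10, 20, 40, 80):
    compute = 10.0 ** exponent
    pure = power_law_loss(compute, C_SCALE, ALPHA)
    floored = loss_with_floor(compute, C_SCALE, ALPHA, IRREDUCIBLE)
    print(f"compute=1e{exponent:<3d} pure={pure:.4f} with_floor={floored:.4f}")
```

The pure curve keeps shrinking towards zero no matter how far you push compute, whereas the floored curve flattens out just above the irreducible loss. The two scenarios below are about how and where real models leave the straight line.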

Scenario 1: Weak capabilities scenario

One possible way is that the loss slowly peels away from the scaling law curve, until it either converges to some loss that isn't the irreducible loss and refuses to go down further, or it approaches the irreducible loss but takes forever to converge, to the point that the resources necessary to get the loss close enough to irreducible for the level of capabilities we talk about in this post would be astronomically huge and not worth it. Of course, in practice since we don't actually know what the irreducible loss is, these two cases will be pretty hard to tell apart, but the end result is the same: we keep scaling, and the models keep getting better, but never quite enough to do the world modelling-y things well enough to be useful, and eventually people give up and move on to something else.

One main reason this would happen is that negative log likelihood loss rewards learning low order correlations much more than high order correlations; learning how to write a grammatically correct but factually incorrect paragraph achieves significantly better loss than a factually correct but grammatically incorrect paragraph. As such, models will learn all the low order correlations first before even thinking about higher order stuff, sort of like an Aufbauprinzip for LMs. This is why models like GPT-3 have impeccable grammar and vocabulary but their long term coherence leaves something to be desired. At the extreme, a model might spend a ton of capacity on memorizing the names of every single town on Earth or learning how to make typos in exactly the same way as humans do before it even budges on learning anything more high-order like logical reasoning or physical plausibility. Since there are just so many low order correlations to be learned, it might be that any model that we could reasonably ever train would get stuck here and would therefore never get to the high order correlations. (More on this idea of models learning low-order correlations first: The Scaling Hypothesis.)
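A toy calculation shows why the loss is lopsided in this way. The sketch below scores two hypothetical models on the same 100-token paragraph: one has grammar down but misses the single fact-bearing token, the other knows the fact but fumbles grammar on 30 tokens. All of the probabilities and token counts are invented for illustration.

```python
import math


def nll_bits(token_probs):
    """Total negative log likelihood, in bits, of the true tokens."""
    return -sum(math.log2(p) for p in token_probs)


# Hypothetical per-token probabilities assigned to the correct next token.
P_GOOD = 0.5   # tokens the model handles well
P_BAD = 0.05   # tokens the model gets wrong

N_TOKENS = 100
N_GRAMMAR_SENSITIVE = 30  # assumed number of tokens that depend on grammar

# Model A: good grammar everywhere, wrong on the one fact-bearing token.
grammar_only = [P_GOOD] * (N_TOKENS - 1) + [P_BAD]

# Model B: gets the fact right, but is wrong on every grammar-sensitive token.
fact_only = [P_GOOD] * (N_TOKENS - N_GRAMMAR_SENSITIVE) + [P_BAD] * N_GRAMMAR_SENSITIVE

loss_grammar_only = nll_bits(grammar_only)
loss_fact_only = nll_bits(fact_only)
print(f"grammar-only model: {loss_grammar_only:.1f} bits")
print(f"fact-only model:    {loss_fact_only:.1f} bits")
```

Because grammar touches many tokens and a given fact touches very few, the grammar-only model comes out far ahead on loss, so gradient descent has much more to gain from grammar first.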

Plus, once the model gets to high order correlations, it’s not entirely guaranteed it would actually be able to learn them easily. It would depend heavily on humans (and the things humans model, so basically most of the known world) being easy to model, which.. doesn’t seem to be the case, but I leave open the possibility that this is my anthropocentrism speaking.

Also, it might be that transformers and/or SGD just can't learn high order correlations and reasoning at all, possibly because there are architectural constraints of transformers that start kicking in at some point (see Can you get AGI from a Transformer?). Some of these limitations might be fairly fundamental to how neural networks work and prevent any replacement architecture from learning high order reasoning.

If this scenario happens, then LMs will never become very capable or dangerous and therefore will not need to be aligned.

Scenario 2: Strong capabilities scenario

The other way is that the model could follow the scaling law curve perfectly until it gets to the irreducible loss, and then forever cease to improve. This scenario could also happen if there’s an asymptotic approach to the irreducible loss that’s fast enough that with a reasonable amount of compute we can get the level of capabilities discussed in this post, which is arguably more likely for higher resolution data because it would be harder to model. This case would happen if past a certain critical size, the model is able to entirely learn the process that generates the text (i.e. the humans writing text, and the systems that these humans are able to observe and model) down to the resolution permissible by text. This scenario would be... kind of scary, because it would mean that scaling is literally all we need, and that we don’t even need that much more scaling before we're "there".

In this scenario, the prioritization of low-order correlations due to the negative log likelihood loss implies that high order traits will only improve slowly for potentially many orders of magnitude as the model is slowly able to memorize more and more inane things like the names of every famous person ever until suddenly it runs out of low hanging fruit and starts improving at constructing flawless logical deductions. This is, needless to say, totally different from how humans learn language or reasoning, and has led to a lot of both under and overestimating of GPT-3's capabilities. In fact, a lot of arguments about GPT-3 look like one person arguing that it must already be superintelligent because it uses language flawlessly and someone else arguing that since it has the physical/logical reasoning capabilities of a 3 year old, it's actually extremely unintelligent. This is especially dangerous because we will severely underestimate the reasoning potential of LMs right up until the last moment when the model runs out of low order correlations. There would be very little warning even just shortly before it started happening, and probably no major change in train/val loss trend during, though there would be a huge difference in downstream evaluations and subjective generation quality.

And that’s all assuming we keep using negative log likelihood. There remains the possibility that there exists a loss function that does actually upweight logical coherence, etc., which would completely invalidate this intuition, and bring highly intelligent LMs even faster. On the bright side there’s the possibility this might be a positive, because it would likely lead to models with richer, more disentangled models inside of them, which would be useful for alignment as I mentioned earlier, though I don’t think this is nearly enough to cancel out the huge negative of advancing capabilities so much. Thankfully, it seems like the reason such a loss function doesn’t exist yet isn’t from lack of trying; it’s just really hard to make work.

To see just how much a different loss function could help, consider that cherrypicking from n samples is actually just a way of applying roughly log n bits of optimization pressure to the model, modulo some normalization factors (I also talk about this intuition in footnote 2 of Building AGI Using Language Models). Since even lightly cherrypicked samples of GPT-3 are a lot more coherent than raw samples, I expect that this means we're probably not too far from a huge boost in coherence if we figure out how to apply the right optimization. This of course only provides a lower bound for how much the negative log likelihood loss needs to improve to get the same level of coherence, because it doesn’t align with our notion of quality exactly, as previously mentioned: the model could happily improve log n bits by memorizing random stuff rather than improving coherence.
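The counting argument behind the cherrypicking intuition can be sketched in a few lines. Keeping the best of n samples is roughly like conditioning on landing in the top 1/n of the model's own output distribution, which is log2(n) bits of selection. The simulation below is only a toy: "quality" is modelled as a uniform quantile, and the function names are mine, not from any library.

```python
import math
import random


def selection_bits(n):
    """Bits of selection applied by keeping the best of n samples: log2(n)."""
    return math.log2(n)


def mean_best_quantile(n, trials=20000, seed=0):
    """Monte Carlo estimate of the average quality quantile of the best of n draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.random() for _ in range(n))
    return total / trials


for n in (1, 4, 16, 64):
    print(
        f"n={n:<3d} bits={selection_bits(n):.1f} "
        f"simulated={mean_best_quantile(n):.3f} exact={n / (n + 1):.3f}"
    )
```

The expected quantile of the best of n uniform draws is n / (n + 1), so each doubling of n roughly halves the remaining gap to the very best output the model can produce, at the cost of one more bit.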

I also think it’s very unlikely that we will run into some totally insurmountable challenge with transformers or SGD that doesn't get patched over within a short period of time. This is mostly because historically, betting that there's something NNs fundamentally can't do is usually not a good bet, and every few years (or, increasingly, months) someone comes up with a way to surmount the previous barrier for NNs.

As one final piece of evidence, the scaling law and GPT-3 papers show that as your model gets bigger, it actually gets more sample efficient. To me this is a priori very counterintuitive given the curse of dimensionality. This seems to imply that bigger models are favored by the inductive biases of SGD, and to me suggests that bigger is the right direction to go.

If this scenario happens, we’re probably screwed if we don’t plan for it, and even if we do plan we might still be screwed.

What should we do?

There are a number of possible alignment strategies depending on how LMs scale and what the properties of bigger LMs are like. This list is neither exhaustive nor objective; these are just a few possible outcomes that take up a good chunk of my probability mass, along with my opinions, at this moment in time.

Reward model extraction

Possibly the most interesting (and risky) direction in my opinion, which I mentioned briefly earlier, is trying to extract some reward model for human values or even something like CEV out of a big LM that has already learned to model the world. This idea comes from Alignment by Default (henceforth AbD for short), where we build some model and train it in a way such that it builds up an internal world model of what the AbD post calls “natural abstractions” (in the language of this post, those would be human-level abstractions that happen to be described in high resolution in natural language and are therefore more easily learned), and then we find some way to point to or extract the reward model we want inside that LM’s world model. One way this could be accomplished, as outlined in the AbD post, is to find some proxy for the reward signal we want and fine tune the LM on that proxy and hope that the model decides to point to an internal reward model. It might also turn out that there are other ways to do this, maybe by using interpretability tools to identify the subnetworks responsible for the LM’s understanding of CEV, or perhaps involve optimizing a continuous prompt to “select” the right part of the LM (The Power of Scale for Parameter-Efficient Prompt Tuning, Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm).

I expect making a LM that forms a usable internal model of some useful reward signal inside itself to be very difficult in practice and require a lot of trial and error, possibly requiring custom training data or methods to make the resolution for that reward signal much higher. As an aside, I think this is one major advantage of thinking in terms of resolution: since resolution is relative to the data, we don’t have to just hope that any given abstraction is natural; we can actually make certain abstractions “more natural” for AbD if we want, just by changing the data! The other major issue with this approach is that there’s no guarantee that a model extracted from a LM will be at all robust to being subject to optimization pressures. I certainly wouldn’t expect this to work as a strong alignment solution, but rather just as a way to bootstrap strong alignment, since the goal of any LM based system, in my opinion, would be to build a weakly-aligned system with the main purpose of solving strong alignment (i.e. I think Bootstrapped Alignment is a good idea). That being said, I still think it would be significantly more robust than anything handcrafted because human values are very complex and multifaceted (Thou Art Godshatter), and learned systems generally do better at understanding complex systems than handcrafted systems (The Bitter Lesson). Also, it’s likely that in practice this extracted reward model wouldn’t be optimized against directly, but rather used as a component of a larger reward system.

This may sound similar to regular value learning, but I would argue this has the potential to be significantly more powerful because we aren't just confined to the revealed preferences of humans but theoretically any consistent abstract target that can be embedded in natural language, even if we can't formalize it, and the resulting model would hopefully be goodhart-robust enough for us to use it to solve strong alignment. Of course, this sounds extremely lofty and you should be skeptical; I personally think this has a low chance (10%ish) of actually working out, since a lot of things need to go right, but I still think more work in this direction would be a good idea.

Some very preliminary work in this direction can be seen in GPT-3 on Coherent Extrapolated Volition, where it is shown that GPT-3 has at least some rudimentary understanding of what CEV is, though not nearly enough yet. The ETHICS dataset also seems like a good proxy for use in some kind of fine-tuning based AbD scheme.

Some concrete research directions for this approach: interpretability work to try and identify useful models inside current LMs like GPT-2/3; implementing some rudimentary form of AbD to try and extract a model of something really simple; messing around with dataset compositions/augmentations to improve performance on world-modelling tasks (e.g. PiQA, Winograd schemas), which is hopefully a good proxy for internal model quality.

Human emulation

We could also just have the LM emulate Einstein, except with the combined knowledge of humanity, and at 1000x speed, or emulate a bunch of humans working together, or something similar. How aligned this option is depends a lot on how aligned you think humans are, and on how high fidelity the simulation is. Humans are not very strongly aligned (i.e. you can’t apply a lot of optimization pressure to the human reward function and get anything remotely resembling CEV): humans are susceptible to wireheading and cognitive biases, have their own inner alignment problems (Inner alignment in the brain), and so on (also related: The Fusion Power Generator Scenario). Still, emulated humans are more aligned than a lot of other possibilities, and it’s possible that this is the best option we have if a lot of the other options don’t pan out. Another limitation of human emulation is that there’s an inherent ceiling to how capable the system can get, but I wouldn’t worry too much about it because it would probably be enough to bootstrap strong alignment. Related posts in this direction: Solving the whole AGI control problem, version 0.0001 - Imitation.

A minor modification to this would be to emulate other nonhuman agents, whether through prompting, RL finetuning, or whatever the best way to do so is. The major problems here are that depending on the details, nonhuman agents might be harder to reason about and harder to get the LM to emulate since they’d be off distribution, but this might end up as a useful tool as part of some larger system.

Some concrete research directions for this approach: exploring prompt engineering or training data engineering to see how we can reliably get GPT-3 and similar LMs to act human-like; developing better metrics that capture human emulation quality better.

Human amplification

This option encompasses things like IDA and (to a lesser extent) Debate where we rely on human judgement as a component of the AI system, and use the LMs as the magic black box that imitates humans in IDA or the debaters in Debate. How aligned this option is depends a lot on how aligned IDA and Debate are (and how aligned people are). Also, depending on how true the factored cognition hypothesis is, there may be serious limits to how capable these systems can get, though conditioning on IDA working at all, I don’t think it’s likely that this will be a bottleneck for the same reasons as in the previous sections. Some kind of human amplification strategy does feel like the most established and “down to earth” of the proposals despite all this, however. Overall, I’m cautiously optimistic about IDA/Debate-like systems using LMs.

The feasibility of this option is also correlated with the last option (human emulation) because this is essentially putting (partial) human emulation in a loop.

Some concrete research directions for this approach: implementing IDA for some kind of toy task using GPT-3 or similar LMs, possibly doing some kind of augmentation and/or retrieval system to squeeze the most out of the human data from the amplification step.
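For readers who haven't seen the amplify-then-distill loop written down, here is a deliberately tiny sketch on a toy task (summing a list). Everything in it is a stand-in: `weak_answer` plays the role of the human, or of an LM imitating one, who can only handle very small questions, and `distill` is a lookup table rather than a trained model. It illustrates the shape of the loop under those assumptions and is not an implementation of IDA proper.

```python
def weak_answer(numbers):
    """Stand-in for a human (or LM imitating one) who can only add up to two numbers."""
    if len(numbers) > 2:
        raise ValueError("too hard to answer directly")
    return sum(numbers)


def amplify(numbers, answer_small):
    """Amplification: split the question, answer the pieces, combine the answers."""
    if len(numbers) <= 2:
        return answer_small(numbers)
    mid = len(numbers) // 2
    left = amplify(numbers[:mid], answer_small)
    right = amplify(numbers[mid:], answer_small)
    return answer_small([left, right])


def distill(examples):
    """Distillation stand-in: memorise the amplified answers in a lookup table."""
    table = {tuple(question): answer for question, answer in examples}

    def fast_model(numbers):
        return table[tuple(numbers)]

    return fast_model


questions = [[1, 2, 3, 4], [5, 6, 7]]
examples = [(q, amplify(q, weak_answer)) for q in questions]
model = distill(examples)
print(model([1, 2, 3, 4]), model([5, 6, 7]))
```

The toy only works because summing decomposes cleanly into small sub-questions, which is exactly the factored cognition assumption discussed above; a task that doesn't decompose would leave `weak_answer` stuck.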

Oracle / STEM AI

Another option that has been proposed for making safe AI systems is to make an "oracle" only able to answer questions and with only the goal of answering those questions as accurately as possible. A specific variant of this idea is the STEM AI proposal (name taken from An overview of 11 proposals for building safe advanced AI, though this idea is nothing new), which is essentially an oracle with a domain limited to only scientific questions, with the hope that it never gets good enough at modelling humans to deceive us.

A bunch of people have argued for various reasons why oracle/tool AI isn’t necessarily safe or economically competitive (for a small sampling: The Parable of Predict-O-Matic, Analysing: Dangerous messages from future UFAI via Oracles, Reply to Holden on 'Tool AI', Why Tool AIs Want to Be Agent AIs). I would argue that most of the safety concerns only apply to extremely strong alignment, and that oracles are probably safe enough for weak alignment, though I’m not very confident in this and I’m open to being proven wrong. This is the option that is most likely to happen by default, though, since even GPT-3 exhibits oracle-ish behavior with minimal prompting, whereas the other options I discussed are all various stages of theoretical. As such it makes sense to plan for what to do just in case this is the only option left.

Some concrete research directions for this approach: exploring prompt engineering to see how we can reliably get GPT-3 and similar LMs to give their best guess rather than what they think the median internet user would tend to say; exploring ways to filter training data for STEM-only data and looking at whether this actually pushes the model’s understanding of various topics in the direction we expect (Measuring Massive Multitask Language Understanding introduces a very fine grained evaluation which may be useful for something like this); interpretability work to see how the model represents knowledge and reasoning, so that we could possibly extract novel knowledge out of the model without even needing to query it (something like Knowledge Neurons in Pretrained Transformers seems like a good place to start).

Conclusion

Natural language is an extremely versatile medium for representing abstract concepts, and therefore, given sufficiently high resolution data, a language model will necessarily have to learn to represent and manipulate abstract concepts to improve loss beyond a certain point. Additionally, from the evidence we have from scaling laws, there is a fairly good chance that we will find ourselves in the strong capabilities scenario where this point is crossed only a few orders of magnitude in size from now, resulting in large language models quickly gaining capabilities, and making the first transformative AI a large language model. Finally, there are a number of existing prosaic alignment directions that are amenable to being applied to language models, which I hope will be explored further in the future.

Comment from Charlie Steiner:

Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.

The fundamental example of this is probably optimizability: is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good), without just ending up in the equivalent of DeepDream's pictures of Maximum Dog?
