Lukas Finnveden


Imitative Generalisation (AKA 'Learning the Prior')
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future

I don't think this is right. I've put my proposed modifications in cursive:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don't have ground-truth for the future, so we can't test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.

(It might be a good idea to share some parameters between the second and first network.)

Prediction can be Outer Aligned at Optimum

Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.

Instead of an image-classifier, lets take GPT-3, which has a wide enough action-space to take over the world. Lets assume that:

1. GPT-3 is currently being tested on on a validation set which have some correct answers. (I'm fine with "optimal performance" either requiring that GPT-3 magically returns these correct answers; or requiring that it returns some distribution along the lines that I defined in my post.)

2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.

In this case, if we define optimal performance as "correctly predicting as many words as possible" or "achieve minimum total loss over the entire history of the world", I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is "Correctly predicts every word it's asked to predict", because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).

To make that last point more clear; I'm claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn't yielded easy questions. However, every time X is chosen, the nn is directionally pushed away from choosing it again; so in the infinite data limit, I think it would learn to not do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immeidately lead to 10 hard questions, I don't think the model would learn to avoid label Y (though I'm unsure if the learning process would converge to choosing Y or just be unstable and never converge). I'm actually very curious if you agree with this; it seems like an important question.

(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)

Now, by defining optimal behavior as "Correctly predicts every word it's asked to predict", I'm saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I'm saying it couldn't, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn't be optimal.

If we also consider side-channels, this gets messier, because my chosen definition doesn't imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn't outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:

  • As mentioned above, taking side channels into account would mean that any model with powerful side channels is classified as outer misaligned, even if there's no incentive to use these side channels in any particular way.
  • Separately, I suspect that supervised learning normally doesn't incentivise neural networks to use side channels in any particular way (absend inner alignment concerns).
  • Finally, It just seems kind of useful to talk about the outer alignment properties of abstract agent-models, since not all abstract agent-models are outer aligned. Side-constraints can be handled separately.

(Btw I'd say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I'm sympathetic to the view that it doesn't make sense to talk about its alignment properties at all.)

Prediction can be Outer Aligned at Optimum
That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.

I think this would be true for some way to train a STEM AI with some loss functions (especially if it's RL-like, can interact with the real world, etc) but I think that there are some setups where this isn't the case (e.g. things that look more like alphafold). Specifically, I think there exists some setups and some parsimonious definition of "optimal performance" such that optimal performance is aligned: and I claim that's the more useful definition.

To be more concrete, do you think that an image classifier (trained with supervised learning) would have convergent instrumental goals that goes against human interests? For image classifiers, I think there's a natural definition of "optimal performance" that corresponds to always predicting the true label via the normal output channel; and absent inner alignment concerns, I don't think a neural network trained on infinite data with SGD would ever learn anything less aligned than that. If so, it seems like best definition of "at optimum" is the definition that says that the classifier is outer aligned at optimum.

2020 AI Alignment Literature Review and Charity Comparison

He's definitely given some money, and I don't think the 990 absence means much. From here:

in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, [...] The Musk Foundation’s grant accounted for the majority of’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.

Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but will continue to donate and advise the organization". The same blog post lists multiple other donors than Sam Altman, so donating to OpenAI without showing up on the 990s must be the default, for some reason.

Extrapolating GPT-N performance

This has definitely been productive for me. I've gained useful information, I see some things more clearly, and I've noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!

Extrapolating GPT-N performance
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable.

"actually it's quite gameable" = "actually it's quite easy" ;)

More seriously, I agree that a full blown turing test is hard, but this is because the interrogator can choose whatever question is most difficult for a machine to answer. My statement about "ordinary conversation" was vague, but I was imagining something like sampling sentences from conversations between humans, and then asking questions about them, e.g. "What does this pronoun refer to?", "Does this entail or contradict this other hypothesis?", "What will they say next?", "Are they happy or sad?", "Are they asking for a cheeseburger?".

For some of these questions, my original claim follows trivially. "What does this pronoun refer to?" is clearly easier for randomly chosen sentences than for winograd sentences, because the latter have been selected for ambiguity.

And then I'm making the stronger claim that a lot of tasks (e.g. many personal assistant tasks, or natural language interfaces to decent APIs) can be automated via questions that are similarly hard as the benchmark questions; ie., that you don't need more than the level of understanding signalled by beating a benchmark suite (as long as the model hasn't been optimised for that benchmark suite).

Extrapolating GPT-N performance

Cool, thanks. I agree that specifying the problem won't get solved by itself. In particular, I don't think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven't been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:

    • Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more narrow than the things we want to automate, and it's easier to game more narrow benchmarks, I don't trust trends based on narrow, fine-tuned benchmarks that much.
    • However, in a few-shot setting, there's not enough data to game the benchmarks in an overly narrow way. Instead, they can be fairly treated as a sample from all possible questions you could ask the model. If the model can answer some superglue questions that seem reasonably difficult, then my default assumption is that it could also answer other natural language questions that seem similarly difficult.
      • This isn't always an accurate way of predicting performance, because of our poor abilities to understand what questions are easy or hard for language models.
      • However, it seems like should at least be an unbiased prediction; I'm as likely to think that benchmark question A is harder than non-benchmark question B as I am to think that B is harder than A (for A, B that are in fact similarly hard for a language model).
    • However, when automating stuff in practice, there are two important problems that speak against using few-shot prompting:
      • As previously mentioned, tasks-to-be-automated are less narrow than the benchmarks. Prompting with examples seems less useful for less narrow situations, because each example may be much longer and/or you may need more prompts to cover the variation of situations.
      • Finetuning is in fact really powerful. You can probably automate stuff with finetuning long before you can automate it with few-shot prompting, and there's no good reason to wait for models that can do the latter.
    • Thus, I expect that in practice, telling the model what to do will happen via finetuning (perhaps even in an RL-fashion directly from human feedback), and the purpose of the benchmarks is just to provide information about how capable the model is.
    • I realise this last step is very fuzzy, so to spell out a procedure somewhat more explicitly: When asking whether a task can be automated, I think you can ask something like "For each subtask, does it seem easier or harder than the ~solved benchmark tasks?" (optionally including knowledge about the precise nature of the benchmarks, e.g. that the model can generally figure out what an ambiguous pronoun refers to, or figure out if a stated hypothesis is entailed by a statement). Of course, a number of problem makes this pretty difficult:
      • It assumes some way of dividing tasks into a number of sub-tasks (including the subtask of figuring out what subtask the model should currently be trying to answer).
      • Insofar as that which we're trying to automate is "farther away" from the task of predicting internet corpora, we should adjust for how much finetuning we'll need to make up for that.
      • We'll need some sense of how 50 in-prompt-examples showing the exact question-response format compares to 5000 (or more; or less) finetuning samples showing what to do in similar-but-not-exactly-the-same-situation.

Nevertheless, I have a pretty clear sense that if someone told me "We'll reach near-optimal performance on benchmark X with <100 examples in 2022" I would update differently on ML progress than if they told me the same thing would happen in 2032; and if I learned this about dozens of benchmarks, the update would be non-trivial. This isn't about "benchmarks" in particular, either. The completion of any task gives some evidence about the probability that a model can complete another task. Benchmarks are just the things that people spend their time recording progress on, so it's a convenient list of tasks to look at.

for us to know the exact thing we want and precisely characterize it is basically the condition for something being subject to automation by traditional software. ML can come into play where the results don't really matter that much, with things like search/retrieval, ranking problems,

I'm not sure what you're trying to say here? My naive interpretation is that we only use ML when we can't be bothered to write a traditional solution, but I don't think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)

My take is that for us to know the exact thing we want and precisely characterize it is indeed the condition for writing traditional software; but for ML, it's sufficient that we can recognise the exact thing that we want. There are many problems where we recognise success without having any idea about the actual steps needed to perform the task. Of course, we also need a model with sufficient capacity, and a dataset with sufficiently many examples of this task (or an environment where such a dataset can be produced on the fly, RL-style).

Extrapolating GPT-N performance

Re 3: Yup, this seems like a plausibly important training improvement. FWIW, when training GPT-3, they did filter the common crawl using a classifier that was trained to recognise high-quality data (with wikipedia, webtext, and some books as positive examples) but unfortunately they don't say how big of a difference it made.

I've been assuming (without much thoughts) that doing this better could make training up to ~10x cheaper, but probably not a lot more than that. I'd be curious if this sounds right to you, or if you think it could make a substantially bigger difference.

Extrapolating GPT-N performance
Benchmarks are filtered for being easy to use, and useful for measuring progress. (...) So they should be difficult, but not too difficult. (...) Only very recently has this started to change with adversarial filtering and evaluation, and the tasks have gotten much more ambitious, because of advances in ML.

That makes sense. I'm not saying that all benchmarks are necessarily hard, I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).

many of these ambitious datasets turn out ultimately to be gameable

My intuition is that this is far less concerning for GPT-3 than for other models, since it gets so few examples for each benchmark. You seem to disagree, but I'm not sure why. In your top-level comment, you write:

While it seems to be an indicator of generality, in the particular case of GPT-3's few-shot learning setting, the output is controlled by the language modeling objective. This means that even though the model may not catch on to the same statistical regularities as task-specific trained models do from their datasets, it essentially must rely on statistical regularities that are in common between the language modeling supervision and the downstream task.

If for every benchmark, there were enough statistical regularities in common between language modeling supervision and the benchmark to do really well on them all, I would expect that there would also be enough statistical regularities in common between language modeling supervision and whatever other comparably difficult natural-language task we wanted to throw at it. In other words, I feel more happy about navigating with my personal sense of "How hard is this language task?" when we're talking about few-shot learning than when we're talking about finetuned models, becase finetuned models can entirely get by with heuristics that only work on a single benchmark, while few-shot learners use sets of heuristics that cover all tasks they're exposed to. The latter seem far more likely to generalise to new tasks of similar difficulty (no matter if they do it via reasoning or via statistics).

You also write "It stands to reason that this may impose a lower ceiling on model performance than human performance, or that in the task-specific supervised case." I don't think this is right. In the limit of training on humans using language, we would have a perfect model of the average human in the training set, which would surely be able to achieve human performance on all tasks (though it wouldn't do much better). So the only questions are:

  • How fast will more parameters + more data increase performance on the language modeling task? (Including: Will performance asymptote before we've reached the irreducible entropy?)
  • As the performance on language modeling increases, in what order will the model master what tasks?

There are certainly some tasks were the parameter+data requirements are far beyond our resources; but I don't see any fundamental obstacle to reaching human performance.

I think this is related to your distinction between a "general-purpose few-shot learner" a "general-purpose language model", which I don't quite understand. I agree that GPT-3 won't achieve bayes-optimality, so in that sense it's limited in its few shot learning abilities; but it seems like it should be able to reach human-level performance through pure human-imitation in the limit of excelling on the training task.

Extrapolating GPT-N performance
Take for example writing news / journalistic articles. [...] I think similar concerns apply to management, accounting, auditing, engineering, programming, social services, education, etc. And I can imagine many ways in which ML can serve as a productivity booster in these fields but concerns like the ones I highlighted for journalism make it harder for me to see how AI of the sort that can sweep ML benchmarks can play a singular role in automation, without being deployed along a slate of other advances.

Completely agree that high benchmark performance (and in particular, GPT-3 + 6 orders of magnitude) is insufficient for automating these jobs.

(To be clear, I believe this independent about concerns of accountability. I think GPT-3 + 6 OOM just wouldn't be able to perform these jobs as competently as a human.)

On 1b and economically useful tasks: you mention customer service, personal assistant, and research assistant work. [...] But beyond the restaurant setting, retail ordering, logistics, and delivery seems already pretty heavily automated by, e.g., the likes of Amazon. So it's hard for me to see what exactly could be "transformative" here.
For personal assistant and research assistant work, it also seems to me that an incredible amount of this is already automated. [...] Again, here, I'm not sure exactly what "transformation" by powerful function approximation alone would look like.

I strongly agree with this. I think predictions of when we'll automate what low-level tasks is informative for general trends in automation, but I emphatically do not believe that automation of these tasks would constitute transformative AI. In particular, I'm honestly a bit surprised that the internet hasn't increased research productivity more, and I take it as pretty strong evidence that time-saving productivity improvements needs to be extremely good and general if they are to accelerate things to any substantial degree.

Load More