Training goals for large language models

Adam Jermyn (3y)

The section on fixed points was interesting! I wonder if there's a way to avoid the recursion altogether though? Specifically, is there a way to condition the model such that the world it simulates doesn't contain humans who use the model (or one very like it)? I'm not sure, and would be interested in your thoughts on this.

Johannes Treutlein (3y)

Thank you!

It does seem like simulating text generated by using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful unless they are updated on the newest research.

In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transparent boxes. This might make coordination between the models less likely? Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.

The other possibility would be to not rely on IDA at all, instead just training a superhuman model and using it directly. Maybe one could extract superhuman knowledge from it safely via some version of microscope AI? Of course, in this case, the model might still reason about humans using similar models, based on its generalization ability alone. Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?

Adam Jermyn (3y)

"Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy."

Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).

"Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?"

I was imagining that we train the model to predict e.g. tomorrow's newspaper given today's. The fact that it's not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.
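To make the time-stamp idea concrete, here is a minimal sketch of how such training pairs might be laid out. Everything in it is an assumption for illustration: the `Document` class, the `make_training_pairs` helper, the bracketed time-stamp format, and the 48-hour cutoff are hypothetical, not something specified in the thread.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    """A dated text, e.g. one day's newspaper (hypothetical container)."""
    published: datetime
    text: str


def make_training_pairs(docs, max_gap_hours=48.0):
    """Pair each document with the next one in time.

    The prompt holds the earlier text, its time-stamp, and the elapsed
    hours until the target; the target is the later text. The bracketed
    format is an illustrative assumption.
    """
    ordered = sorted(docs, key=lambda d: d.published)
    pairs = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap_hours = (later.published - earlier.published).total_seconds() / 3600
        if gap_hours > max_gap_hours:
            continue  # skip pairs that are too far apart in time
        prompt = (
            f"[{earlier.published.isoformat()}]\n"
            f"{earlier.text}\n"
            f"[+{gap_hours:.1f}h]\n"
        )
        pairs.append((prompt, later.text))
    return pairs


if __name__ == "__main__":
    docs = [
        Document(datetime(2022, 7, 2, 8, 0), "Tuesday's front page ..."),
        Document(datetime(2022, 7, 1, 8, 0), "Monday's front page ..."),
    ]
    for prompt, target in make_training_pairs(docs):
        print(prompt + target)
```

The only point of the sketch is that the elapsed time is an explicit part of the conditioning, so the prediction target is "what was actually written X hours later" rather than an undated continuation.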

Charlie Steiner (3y)

I was confident that on this very site there would be an example of someone writing an essay with the framing device that it was a blog post from 5 years in the future. Sadly, I only had enough attention span to google "site:lesswrong.com from the future" and click the first link. It was a writing game called Wikipedia Articles from the Future.

My point with this is I'm real pessimistic about generating the AI alignment textbook from 100 years in the future with prompt engineering. Why expect that you're going to get something far outside the training distribution, rather than the most likely continuation that could have come from the training distribution, which already contains people pretending to be from the future?

I would have been even more pessimistic before Minerva, but even so, we don't have a couple billion tokens of training data of people completely solving close relatives of the alignment problem to fine-tune on. Minerva is still shocking to me, but it's clear that an active ingredient in it is having a training distribution that demonstrates many copies of the reasoning you want the AI to do, and few copies of bad reasoning. And if you say the AF is such a dataset I am going to laaaugh.

Johannes Treutlein (3y)

Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.

As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.

Overall I'm more optimistic about using the model in an IDA-like scheme. One way this might fail on capability grounds is if solving alignment is blocked by a lack of genius-level insights, and if it is hard to get a model to come up with/speed up such insights (e.g. due to a lack of training data containing such insights).


There are of course also other possible well-specified prediction targets for oracles.


An AI may still have to model human intent implicitly insofar as that is important for generating text.


The fact that we have access to distributions, instead of, e.g., maximum likelihood estimates, is important for several reasons. First, maximum likelihood estimates can be very atypical. For instance, when throwing a pair of dice repeatedly, the maximum likelihood estimate for the sum on each throw is 7. However, in most worlds, the sum won’t be 7 every single time. Second, we want to be able to incentivize the model to be uncertain in a calibrated way; otherwise, the model might choose to focus on some versions of an output it knows well how to produce, even if some harder-to-model version would be equally likely given the prompt. For instance, a model may be uncertain whether it is supposed to write an honest news article or a fictional story. If both are equally likely, and there is only one plausible fictional story, but many different possible news articles, then a model outputting a maximum likelihood estimate might consistently produce the fictional story. A model sampling from a distribution incentivized by a proper scoring rule would output news articles and fictional stories with equal probability. Third, some proposals might depend on getting multiple samples. E.g., one may be able to implement a version of the consensus algorithm using samples from a large language model.
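The first two points can be checked numerically. The sketch below is only illustrative: the function names, the choice of 100 possible news articles, and the use of Python's `random` module as the sampler are assumptions, and "maximum likelihood estimate" is read here as the single most probable outcome.

```python
import random
from collections import Counter
from itertools import product


def dice_sum_distribution():
    """Exact distribution of the sum of two fair six-sided dice."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: count / 36 for total, count in counts.items()}


def mode(dist):
    """The single most probable outcome of a distribution."""
    return max(dist, key=dist.get)


def prob_all_mode(dist, n):
    """Probability that n independent draws all equal the mode."""
    return dist[mode(dist)] ** n


def completion_distribution(n_articles):
    """Hypothetical prompt: half the mass on one fictional story,
    half spread evenly over n_articles distinct news articles."""
    dist = {"fictional story": 0.5}
    for i in range(n_articles):
        dist[f"news article {i}"] = 0.5 / n_articles
    return dist


def sample(dist, rng):
    """Draw one outcome in proportion to its probability."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def genre(completion):
    return "fiction" if completion == "fictional story" else "news"


if __name__ == "__main__":
    dice = dice_sum_distribution()
    print("most likely sum:", mode(dice))
    print("P(ten throws all show that sum):", prob_all_mode(dice, 10))

    completions = completion_distribution(n_articles=100)
    print("most likely completion:", mode(completions))

    rng = random.Random(0)
    genres = Counter(genre(sample(completions, rng)) for _ in range(10_000))
    print("genres when sampling:", dict(genres))
```

Always emitting the mode yields the fictional story every time, because each individual news article is far less likely than the one story, while sampling yields news and fiction in roughly equal proportion.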

AI ALIGNMENT FORUM