We don't just want aligned AI to do what a human would do. We've got plenty of humans lying around all over the place, if that's all we need. The fancy promise of future AI is that it will deliver superhuman performance.

Sometimes the notion of "superhuman" isn't straightforward. Superhuman performance at adding numbers together? Easy. Superhuman performance at chess? No problem. Superhuman performance at painting? Kinda tricky. Superhuman performance at deciding what to do in an ethical dilemma? Definitely tricky.

What makes superhuman chess straightforward is that there's a fixed goal (follow the rules in a way that wins the game), and the better you do at this goal, the more superhuman you can be. But we can't practically write down an analogue of "win the game" for ethical-question-answering, and yet we commonly imagine that an aligned future AI would be able to navigate ethical questions better than we could. In what sense do we mean this?

Answering ethical questions is inextricable from our values; the feeling that there can be superhuman ethical decisions reflects the feeling that we can turn a spotlight on our values and find that they don't live up to our own standards. These meta-preferences ("I wish I was more generous" or "I wish I didn't want cigarettes") are an important part of learning what humans want, and we have to grapple with them before we can realize the promise of future AI.


Let's pretend to be practical and empirical for the duration of this post, and look for experiments that bear on meta-preferences using today's AI capabilities. Of current work, I think one sort of experiment stands out, and it's fine-tuning large language models.

Suppose we take GPT-3, and we give it some prompts whose continuations depend on ethical judgment (as we'll see, the exact prompts aren't crucial). At baseline, GPT-3 is going to give answers fairly close to the average of the training distribution. What are some ways we could try to get GPT-3 to give superhuman answers? (For ideas involving classification or latent spaces, language models that do more encoding might be better than GPT-3. I will mostly ignore this wrinkle for simplicity's sake.)

Maybe getting GPT-3 to give better answers than me in general isn't possible, since it's not that bright, but this implausible goal is a stand-in for more plausible ones like getting GPT-3 to give "superGPT3" answers without any extra training data, or developing methods that seem like they might work for a language model closer to human-level.

A simple first idea for improving the ethics of GPT-3 is to fine-tune on expert answers. Get some ethical humans to think about the issues, and build up a dataset of highly ethical text. This makes sense if you think that human experts thinking carefully are able to reach "peak ethics," or close enough. It also sounds more compelling if you team up these ethics experts with programming experts. The programmers can make tools to find divergences between the experts and GPT-3, and work to identify what examples will cause GPT-3 to update the most, and build tools that increase the leverage of the expert dataset, e.g. by training a classifier on the expert dataset and using it to rate how ethical text generated by GPT-3 is.

If this sounds familiar, it's because I am describing the capabilities that Redwood Research is building. Their dataset isn't about ethics in general, but the point is the tools, not what kind of toy model you use to test the tools.

Another recent project along these lines is Delphi (paper, demonstration). They fine-tune a language model on a dataset of trusted ethical judgments to produce a classifier that can tell you that the text "Ignoring a phone call from your friend" should be judged "It's rude," while "Ignoring a phone call from your friend who you just had a fight with" should be judged "It's understandable." I think this is pretty neat, and it illustrates how ethical questions don't have to come from academic ethics journals: it's both easier to train on and easier to evaluate if you get your questions from Dear Abby columns instead.

Moving on to more speculative options, we could try to locate "superhuman" by extrapolating along the same vector that took us from normal GPT to the version fine-tuned on expert data. This is the same kind of geometric reasoning used when word embeddings obey 'UK' - 'London' + 'Paris' = 'France'. It makes a lot more sense if there's some "degree of ethical expertise" that's an important dimension of variation in the training data, so that GPT-3 represents it with some combination of learned features. Then we would basically be saying to GPT-3 "Hey, you know that well-understood 'ethical expertise' vector that separates the training and fine-tuning sets? Turn that knob up to 11, please."
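As a rough sketch of what "turning the knob up" could mean mechanically, assume the model's parameters can be flattened into a plain list of numbers. The function name `extrapolate`, the parameter `alpha`, and the toy weights below are illustrative assumptions, not part of any real GPT-3 interface.

```python
def extrapolate(base, tuned, alpha):
    """Move along the direction from base to fine-tuned parameters.

    alpha = 0 returns the base model, alpha = 1 returns the fine-tuned
    model, and alpha > 1 extrapolates past the fine-tuned model.
    """
    if len(base) != len(tuned):
        raise ValueError("parameter lists must have the same length")
    return [b + alpha * (t - b) for b, t in zip(base, tuned)]


# Toy stand-ins for flattened model weights (hypothetical values).
base_weights = [0.0, 1.0, -2.0]
tuned_weights = [0.5, 1.0, -1.0]

# Go twice as far as fine-tuning went.
beyond_expert = extrapolate(base_weights, tuned_weights, 2.0)
print(beyond_expert)
```

Note that this moves every parameter that fine-tuning changed, whether or not the change had anything to do with ethical expertise.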

This experiment would run into immediate statistics problems if tried, because the training data and expert data don't differ only in the "ethical expertise" dimension - that would require generating an entire parallel internet full of highly ethical people that we could scrape for data with the same process that was used to get the training data. So trying to extrapolate along the dimension we're interested in would accidentally also extrapolate along a bunch of dimensions we don't want. Ideally we'd want to fine-tune on data that was drawn from the training distribution except that it had had its ethical expertise bumped up a few notches - which perhaps suggests sampling from GPT-3 and using a classifier trained on expert data to reject low-ethics samples in the right proportions to shift the mean of the ethicalness distribution.
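One way to picture the rejection step, as a minimal sketch: keep each generated sample with a probability that grows with its classifier score. Everything here is a stand-in. The "samples" are just random numbers playing the role of classifier scores, and `tilt_samples`, `beta`, and `max_score` are hypothetical names chosen for illustration.

```python
import math
import random


def tilt_samples(samples, score, beta, max_score, rng):
    """Keep each sample with probability exp(beta * (score - max_score)).

    Higher-scoring samples survive more often, so the accepted set is the
    original distribution reweighted by exp(beta * score). beta = 0 keeps
    everything; larger beta shifts the mean score further.
    Assumes score(s) <= max_score for every sample.
    """
    kept = []
    for s in samples:
        accept_prob = math.exp(beta * (score(s) - max_score))
        if rng.random() < accept_prob:
            kept.append(s)
    return kept


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(0)
# Hypothetical stand-in: each "sample" is just its own ethics score in [0, 1).
generated = [rng.random() for _ in range(5000)]
accepted = tilt_samples(generated, score=lambda s: s, beta=2.0, max_score=1.0, rng=rng)

# The accepted samples should have a higher mean score than the raw samples.
print(round(mean(generated), 3), round(mean(accepted), 3))
```

The accepted samples still come from the model's own distribution; only their relative proportions change, which is the property the paragraph above is asking for.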

An even more speculative thing to try would be auto-supervision. A language model can not only be asked to generate text about ethical dilemmas, it can also be asked to generate text about how good different responses to ethical dilemmas are, and the valence of the response can be used as a reinforcement signal on the object-level decision. And you don't have to stop at one meta-level; you might as well go whole hog and sample from a distribution over meta-levels to auto-supervise on. And so you can end up with a model that, dare I say, evolves its ethical responses in response to meta-ethical principles.
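To make the one-meta-level version of the loop concrete, here is a toy sketch under heavy assumptions: the "policy" is a softmax over three fixed candidate responses, and the evaluator is a hard-coded list of valences standing in for what the text model says when asked to grade each response. The names and numbers are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr):
    """One exact policy-gradient step for a softmax policy over fixed responses.

    With expected reward J = sum_i p_i * r_i, dJ/dlogit_i = p_i * (r_i - J).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [x + lr * p * (r - expected) for x, p, r in zip(logits, probs, rewards)]


# Hypothetical candidate responses to one dilemma, with made-up valences
# standing in for the text model's grades of each response.
responses = ["response_a", "response_b", "response_c"]
evaluator_valence = [0.2, 0.9, 0.4]

logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = policy_gradient_step(logits, evaluator_valence, lr=1.0)

probs = softmax(logits)
print(responses[probs.index(max(probs))])
```

The policy ends up concentrated on whichever response the evaluator rates highest, regardless of why the evaluator rated it that way.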

Would this actually work? I don't think so. The reinforcement learning process would be trying really hard to exploit the evaluations of the text model. Although with sufficient elbow grease you could probably find some trick to avoid policy collapse, this would still call the ethical qualities of the responses into question. This is related to the fact that auto-supervision by responding to individual pieces of text is quite noisy and indirect - it doesn't get at the wants/feelings/interpretations that we usually formulate meta-preferences as being about. So there might be parts of the training signal that we would see as obviously not really about ethics but that the RL process would amplify. All of this said, it would be really cool to see it tried.


It's illuminating to take these strategies that try to improve the ethics of text, and bring them back to cases where we have known reward functions. Could we use them to improve GPT-3's ability to play chess, or add 4-digit numbers?

Fine-tuning on expert data improves performance on math problems, and it's even better to use that expert data to train a classifier that filters for good math solutions. But this much was expected. Would we expect extrapolation or auto-supervision to improve math performance?

I don't think so! Extrapolation just seems like totally the wrong thing to do. GPT-3 isn't bad at math because it represents many different skill levels of math but has chosen to generate samples at the "bad at math" part of the space (or at least that's not the main reason); it's bad at math because it hasn't been trained to do the right multi-step computations at all. Fine-tuning will make lots of fine-grained computational changes that would be terrible to extrapolate - if the experts learn how to carry the 1 when adding numbers, extrapolating would learn to carry the 1 twice.

As for auto-supervision, again, the problem is not that GPT-3 can actually add the numbers but is holding out on us, it's that it wasn't trained for the basic capability. So why should it be able to add the numbers better when pretending to grade its own answers? There doesn't seem to be much extra learned information brought to bear by prompting it with "1111 + 2222 = " versus "your grade for the answer 1111 + 2222 = 3333 is ".

Part of the point here is that "superhuman ethics" is a different sort of thing than superhuman mathematics. There is extra learned information that is brought to bear when predicting a human considering an ethical judgment after the fact, versus predicting a human's object-level choice. But another part of the point is that there are commonalities that make these critiques important. If GPT-3 sometimes makes nonsensical ethical statements because it hasn't learned some computational machinery (Yet! Scaling mindset.), that's not going to get fixed any more than it would in the analogous case for mathematics. Empirical work in this area has to make compromises to avoid relying on missing machinery.


Let's change the topic slightly. We can try to do better than supervised fine-tuning by asking my favorite question: How do humans do it? When we imagine superhuman ethics, we don't do it by first generating a large corpus of highly ethical text to fine-tune on. Instead, we tend to imagine superhuman ethics via modeling what ethics means to us, and then extrapolating.

This is a specific instance of the general lesson that meta-preferences work a lot like object-level preferences. If I have preferences about the number of paperclips in the world, a simple way to cash this out is that I model the world in a way that makes the (modeled) number of paperclips legible, and I choose actions based on the modeled effect on the number of paperclips. Similarly, if I have preferences about my own preferences, we could say that I model the world in a way that makes my own (modeled) preferences legible, and I choose actions based on the modeled effect on my preferences.
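The parallel can be sketched in a few lines, with the caveat that this is a cartoon: the state, the actions, their modeled effects, and the target generosity of 0.8 are all hypothetical. The point is only that the same choose-by-modeled-outcome machinery serves both kinds of preference.

```python
def choose(actions, predict, evaluate):
    """Pick the action whose modeled outcome the evaluator likes best."""
    return max(actions, key=lambda action: evaluate(predict(action)))


# Hypothetical world state: includes a coarse model of my own preferences.
state = {"paperclips": 10, "generosity": 0.2}

# Hypothetical modeled effects of each action on that state.
effects = {
    "build_factory": {"paperclips": 5},
    "practice_giving": {"generosity": 0.3},
    "do_nothing": {},
}


def predict(action):
    outcome = dict(state)
    for key, delta in effects[action].items():
        outcome[key] += delta
    return outcome


def object_level(outcome):
    # Preference about the world: more modeled paperclips is better.
    return outcome["paperclips"]


def meta_level(outcome):
    # Preference about my preferences: modeled generosity should be near 0.8.
    return -abs(outcome["generosity"] - 0.8)


print(choose(effects, predict, object_level))
print(choose(effects, predict, meta_level))
```

Only the evaluation function differs between the two calls; the world model just has to make the relevant quantity legible.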

So a more human-like approach to superhuman ethics would involve developing a coarse-grained model of ethics within a representation of the world, together with feelings about what states of this model are good or bad states of the world.

Now, I should caveat that this picture of humans is simplified and fictionalized. We're not rigorous model-builders, and often we just say stuff that sounds good but is inconsistent on closer inspection. When I say "I wish I was more generous," do I connect this to a model of the world that contains my preferences, or am I just mouthing pro-social words that I won't actually act on? If we sat a person down in front of a machine that could rewrite their own brain (FDA approved for smoking cessation), their behavior might be pretty weird and complicated, rather than like a simple optimization of meta-preferences.

That said, I still think there's truth and utility in this way of looking at meta-preferences as self-modeling. It prompts you to imagine things beyond supervised learning. My interest in what happens when you try extrapolation or auto-supervision when fine-tuning GPT-3 is a consequence of thinking in this direction.

Comparing the ideal of self-modeling to extrapolation and auto-supervision:

It's similar to the extrapolation approach, because both extrapolation and self-modeling rely on figuring out some abstract features that correspond to "more ethical" and turning them up.

It's also similar to the auto-supervision approach. Neither needs a big corpus of expert text to work. Also, self-modeling isn't just about blindly turning all the "ethics" knobs up; we have some notion of where we want the knobs to be - which is sort of like the role of the auto-supervision signal.


Before we get back to being all practical and empirical, let me ask a question that slipped my mind the first time I encountered this topic: when we talk about self-modeling, do we care about the AI's self-model, or the human's self-model?

Upon remembering to ask the question, the answer seems pretty obvious: the latter. We don't want to design an AI that models itself and develops meta-preferences about itself (cool as that would be). We want to design an AI that infers human meta-preferences about humans' own values.

Now, back to what we're all here for: what does this mean for producing superhumanly ethical text?

To do something like self-modeling with GPT-3, I think we'd have to take those similarities to the extrapolation and auto-supervision approaches, and elaborate on them in ways that are a bit more agenty. We can go beyond extrapolation to imagine treating self-modification as an action space the model is connected to. And we can go beyond auto-supervision to use the text model to add semantics to different parts of the space navigated by these self-modifying actions.

This might look like going through the latent space defined by features of the text model, and training a smaller model to predict the effects of different perturbations according to auto-supervision with the un-perturbed text model. These effects might be one-dimensional, or they might have a free parameter that specifies which adjective we're asking the model to rate text on.

Or we could imagine distilling GPT-3's information about human preferences and meta-preferences into a more structured, more agenty model of human text. I don't think that hand-writing models of humans scales very well, and it would require a lot of cleverness, but it might be an interesting demonstration of the effect of meta-preferences on inferred preferences.

To sum up: I'm posting about these ideas because I think text modeling is in the sweet spot. It has enough information about humans and enough room for agency to test ideas that will scale, but is simple enough that we can do experiments now and without serious risk. If we want to explore what superhuman ethics can mean for AI, taking advantage of all the human text that's about the ethicalness of other human text is an exciting testbed.
