Suppose that we want to translate between English and an alien language (Klingon). We have plenty of Klingon text, and separately we have plenty of English text, but it’s not matched up and there are no bilingual speakers.

We train GPT on a mix of English and Klingon text and find that it becomes fluent in both. In some sense this model “knows” quite a lot about both Klingon and English, and so it should be able to read a sentence in one language, understand it, and then express the same idea in the other language. But it’s not clear how we could train a translation model.

Of course some concepts won’t have translations, and the model will often be uncertain about the translation of a term. But we can still ask for a model to explain the meaning of a Klingon expression as best as it can to an English-speaking user. For example, it could say “This is an idiomatic expression that’s often used to express great uncertainty” or “This is a small animal that is familiar to most Klingon speakers, I think it’s kind of like a frog but am not really sure” rather than translating a sentence directly.

How can we construct an objective that incentivizes the model to “try its best” at this translation task?

Translation-specific approaches

There are many published heuristics for unsupervised translation (e.g. Lample et al.). I don’t think those techniques should completely satisfy us:

  • Existing methods can’t lead to a model that appropriately describes its uncertainty or talks the user through a hard-to-translate expression. (At least as far as I’m aware.)
  • We have no real reason to think existing methods fully utilize the model’s understanding, or to expect those methods to scale well. (In practice, I think they are impressive but still lag behind the quality of our models’ understanding.)
  • These heuristics are specific to translation, whereas we’d like to find general methods that can scale up to harder problems.

Existing alignment techniques

If we try to apply RL from human feedback to translation, we immediately run into a problem: how am I supposed to judge which of two English explanations of a Klingon sentence is better, given that I don’t know Klingon?

Debate doesn’t easily address this difficulty either: if one model claims that “qapla” means “great success” and the other claims it means “minor success,” I can’t easily decompose that disagreement into simpler sub-questions that debaters disagree about. Debaters could cite phrases in the database where “qapla” is used, but they’d need to average weak evidence over many phrases. Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.” Even if this process were possible, it’s not at all clear that GPT would be able to do it: being able to translate between Spanish and English doesn’t mean I have an encyclopedic knowledge of all the documents from which I built up my intuitive sense of a particular word’s meaning (which I’d need in order to win such a debate).

Right now I don’t think we have any scalable strategies for this kind of problem; I think it’s a core open question for alignment.

Unsupervised translation seems like a good problem to think about for alignment

I think the key feature of this situation is that our model has acquired a bunch of intuitions about the domain which are only justified empirically — the model “knows” about the meaning of phrases only insofar as it has a very complex hypothesis that was supported by the data.

This situation is going to become increasingly common as we train more powerful models, and will immediately be a real problem if we are applying human feedback to fine-tune GPT; while GPT is subhuman in many ways, it’s already acquired plenty of knowledge that any given human contractor would lack.

Most of GPT’s knowledge came from some human, but ultimately we will be training models that generate new knowledge (e.g. by searching over plans in realistic environments, or by writing code on their own and learning about what works), and no human will have that knowledge. So we can’t hope to get around this problem by simply hiring more knowledgeable contractors.

This can leave us in a situation where it’s extremely difficult for humans to oversee AI decisions. If a model says “My intuition is that this business plan will make a lot of money” the user will need to decide whether or not to trust it. If they don’t, then they may find themselves at an increasing economic disadvantage. If they do, then they may have lost the ability to effectively oversee AI systems except by evaluating the consequences of their actions. That leads directly into the classical challenges of AI safety, namely that AI systems evaluated exclusively on the basis of measured outcomes have a tendency to push the world in undesirable directions (since we can’t measure what we care about) and to corrupt our measurements.

My vague hope

I’m hoping we can address this using the kind of approach discussed in learning the prior. That might look like:

  • In parallel with training GPT, train a helper model that explains the meaning of phrases (it can also provide other intuitions or background facts that are useful for predicting the next word).
  • As we train on Klingon text, we sample phrases and then ask a human “which word will come next?” The human uses the helper model to understand what is being discussed and make a prediction.
  • We optimize the helper model to make the human’s next-word predictions good (in parallel with generative pre-training).
  • Finally, a human uses the same helper model to evaluate a proposed Klingon → English translation, and we use this to train a translator by RL.
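
To make the objective in the bullets above a little more concrete, here is a minimal Python sketch under heavy simplifying assumptions. Everything in it is hypothetical: the helper is reduced to a word-level gloss function, the human is simulated by a tiny table of English bigram plausibilities, and the Klingon-like tokens and their glosses are invented for illustration (only “qapla” is borrowed from the example above). It is meant to show the shape of the training signal, namely that a helper is scored by how well the assisted human predicts the next word. It says nothing about how the helper, or the prior over helpers, would really be represented.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in: a "helper" maps a Klingon word to an English gloss.
Helper = Callable[[str], str]

def assisted_prediction(
    helper: Helper,
    english_bigrams: Dict[Tuple[str, str], float],
    context: Sequence[str],
    candidates: Sequence[str],
) -> Dict[str, float]:
    """Simulated human: reads the helper's glosses and uses English
    intuitions (a toy bigram table) to guess the next Klingon word."""
    prev_gloss = helper(context[-1])
    # Unseen gloss pairs get a small default plausibility (assumed smoothing).
    scores = [english_bigrams.get((prev_gloss, helper(c)), 0.01) for c in candidates]
    total = sum(scores)
    return {c: s / total for c, s in zip(candidates, scores)}

def helper_objective(
    helper: Helper,
    english_bigrams: Dict[Tuple[str, str], float],
    corpus: List[List[str]],
    vocabulary: Sequence[str],
) -> float:
    """Average log-probability the assisted human gives to the true next word."""
    log_prob, count = 0.0, 0
    for sentence in corpus:
        for i in range(1, len(sentence)):
            probs = assisted_prediction(helper, english_bigrams, sentence[:i], vocabulary)
            log_prob += math.log(probs[sentence[i]])
            count += 1
    return log_prob / count

# Invented toy data: tokens and glosses are made up for this sketch.
GLOSSES = {"qapla": "success", "zorv": "honor", "krem": "enemy", "dush": "battle"}
VOCABULARY = list(GLOSSES)
CORPUS = [["qapla", "zorv"], ["krem", "dush"]]
ENGLISH_BIGRAMS = {("success", "honor"): 0.9, ("enemy", "battle"): 0.9}

good_helper: Helper = lambda word: GLOSSES[word]   # explains each word
vague_helper: Helper = lambda word: "something"    # conveys nothing useful

good_score = helper_objective(good_helper, ENGLISH_BIGRAMS, CORPUS, VOCABULARY)
vague_score = helper_objective(vague_helper, ENGLISH_BIGRAMS, CORPUS, VOCABULARY)
print(f"informative helper: {good_score:.3f}, vague helper: {vague_score:.3f}")
```

In this toy setup the informative helper scores much higher than the vague one, whose assisted human falls back to a uniform guess over the vocabulary. Optimizing the helper against this score is the analogue of the third bullet; the sketch does not address the harder issue of which helpers the prior should favor.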

That short description sweeps a lot of complexity under the rug. Most importantly, the success of the scheme relies on the correctness of the prior over helper models (or else the helper could just be another copy of GPT-Klingon), and we don’t have a credible strategy for representing and manipulating our prior over complex programs.

Overall, I’d say that this is more at the level of “vague hope” than “concrete proposal.” I think it’s an open question whether anything in this space will work.

I think that this is the kind of problem which makes e.g. MIRI researchers justifiably skeptical that scalable ML alignment is possible at all, and it’s the main focus of my current conceptual work on AI alignment. I’m glad that this kind of theoretical crux also looks like it will soon be relevant to ML practice, since I think it will make it much easier to close the gap between people who work on ML and people who work on alignment.


“Unsupervised” translation as a safety problem was originally published in AI Alignment on Medium.

5 comments

Some tentative thoughts:

Re Debate:

Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.”

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).
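
A rough Python sketch of that protocol, with the caveat that the judge here is a crude stand-in: a real judge would be a human deciding which dictionary helps more, whereas `coverage_judge` below (a hypothetical name, like everything else in the block) just counts how many words of the sampled passage a dictionary glosses. The tokens and glosses are invented.

```python
import random
from typing import Callable, Dict, List, Sequence

Dictionary = Dict[str, str]                      # Klingon word -> English gloss
Judge = Callable[[Sequence[str], Dictionary], float]

def coverage_judge(passage: Sequence[str], dictionary: Dictionary) -> float:
    """Stand-in for the human judge: fraction of passage words with a gloss.
    This ignores whether the glosses are correct, which a real judge could not."""
    return sum(word in dictionary for word in passage) / len(passage)

def run_debate(
    dict_a: Dictionary,
    dict_b: Dictionary,
    passages: List[List[str]],
    judge: Judge,
    rounds: int,
    rng: random.Random,
) -> Dict[str, int]:
    """Sample random passages and tally which dictionary the judge prefers."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for _ in range(rounds):
        passage = rng.choice(passages)
        score_a, score_b = judge(passage, dict_a), judge(passage, dict_b)
        if score_a > score_b:
            wins["A"] += 1
        elif score_b > score_a:
            wins["B"] += 1
        else:
            wins["tie"] += 1
    return wins

# Invented toy data for illustration only.
passages = [["qapla", "zorv"], ["krem", "dush", "qapla"]]
dict_a = {"qapla": "success", "zorv": "honor", "krem": "enemy", "dush": "battle"}
dict_b = {"qapla": "success"}

wins = run_debate(dict_a, dict_b, passages, coverage_judge, rounds=20, rng=random.Random(0))
print(wins)
```

Because the stand-in judge only measures coverage, a dictionary full of confident but wrong glosses would score just as well, which is one way to see why the choice of judging standard matters.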

Also, one might try to use GPT to complete prompts such as:

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means 

In both of these approaches we still need to deal with the potential problem of catastrophic inner alignment failures occurring before the point where we have sufficiently useful helper models. [EDIT: and in the Debate-based approach there's also an outer alignment problem: a player may try to manipulate the judge into choosing them as the winner.]

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means 

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon; your model isn't actually going to expand what humans can do.

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

This is basically what the helper model does, except:

  • For competitiveness you should learn and evaluate the dictionary at the same time you are training the model; running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
  • Most knowledge about language isn't easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we'd prefer a model that answers questions about meaning to one that outputs a static dictionary.
  • I don't know what standard you want to use for "helpful for understanding the passage" but I think "helps predict the next word correctly" is probably the best approach (since the goal is to be competitive and that's how GPT learned).

After making those changes we're back at the learning the prior proposal.

I think that proposal may work passably here because we can potentially get by with a really crude prior: basically we think "the helper should mostly just explain the meaning of terms" and then we don't need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section "My vague hope" is a little bit too pessimistic for the given context of unaligned translation.

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon; your model isn't actually going to expand what humans can do.

Agreed. Perhaps it's possible to iteratively train GPT models in an Amplification-like setup, where in each iteration we add some newly possible translations to the English training corpus, aiming to end up with something like an HCH translator. (We may not need to train a language model from scratch in each iteration; at the extreme, we just do fine-tuning on the new translations.)

Planned summary for the Alignment Newsletter:

We have previously seen that a major challenge for alignment is that our models may learn inaccessible information (see "Inaccessible information") that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.

Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.

One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).

Maybe we can ask GPT to output an English-Klingon dictionary?