“Unsupervised” translation as an (intent) alignment problem

Some tentative thoughts:

Re Debate:

Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase — -which isn’t necessarily any simpler than the original disagreement about “qapla.”

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

Also, one might try to use GPT to complete prompts such as:

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means

In both of these approaches we still need to deal with the potential problem of catastrophic inner alignment failures occurring before the point where we have sufficiently useful helper models. [EDIT: and in the Debate-based approach there's also an outer alignment problem: a player may try to manipulate the judge into choosing them as the winner.]

[-]paulfchristiano5y50

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

This is basically what the helper model does, except:

For competitiveness you should learn and evaluate the dictionary at the same time you are training the model, running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
Most knowledge about language isn't easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we'd prefer have a model that answers questions about meaning than have a model that outputs a static dictionary.
I don't know what standard you want to use for "helpful for understanding the passage" but I think "helps predict the next word correctly" is probably the best approach (since the goal is to be competitive and that's how GPT learned).

After making those changes we're back at the learning the prior proposal.

I think that proposal may work passably here because we can potentially get by with a really crude prior---basically we think "the helper should mostly just explain the meaning of terms" and then we don't need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section "A vague hope" is a little bit too pessimistic for the given context of unaligned translation.

[-]Ofer5y10

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.

Agreed. Perhaps it's possible to iteratively train GPT models in an Amplification-like setup, where in each iteration we add to the English training corpus some newly possible translations; aiming to end up with something like an HCH translator. (We may not need to train a language model from scratch in each iteration; at the extreme, we just to do fine-tuning on the new translations.)

[-]Rohin Shah5y20

Planned summary for the Alignment Newsletter:

We have previously seen that a major challenge for alignment is that our models may learn <@inaccessible information@>(@Inaccessible information@) that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.
Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.
One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).

[-]avturchin5y20

Maybe we can ask GPT to output English-Klingon dictionary?

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

28

“Unsupervised” translation as an (intent) alignment problem

28

Translation-specific approaches

Existing alignment techniques

Unsupervised translation seems like a good problem to think about for alignment

My vague hope