A possible training procedure for human-imitators

1Paul Christiano

0Paul Christiano

0Jessica Taylor

0Paul Christiano

New Comment

4 comments, sorted by Click to highlight new comments since: Today at 5:41 AM

This looks a lot like a variational autoencoder, unless I'm missing some distinction (Note that once you have a bit-by-bit predictor, you can assume WLOG that the distribution on A is uniform). Thinking about how variational autoencoders work in the superintelligent case seems worthwhile if we want to scale up something like imitation. I also wouldn't be that surprised if it produced some practically useful insight.

(Incidentally, I think that variational autoencoders and generative adversarial nets are the leading approaches to generative modeling right now.)

I agree that the steganography problem looks kind of bad for adversarial methods in the limit. For coping with the analogous problem with approval-maximization, I think the best bet is to try to make the generative model state transparent to the discriminative model. But this is obviously not going to work for generative adversarial models, since access to the generator state would make distinguishing trivial.

Actually I'm not sure exactly what you mean by importance sampling here.

The variational lower bound would be to draw samples from and compute . The log probability of the output under is bounded by the expectation of this quantity (with equality iff is the correct conditional distribution over ).

I'm just going to work with this in my other comments, I assume it amounts to the same thing.

It would be great to see an analysis of this from a complexity-theoretic / cryptographic perspective. Are there distributions that can't be imitated correctly in this way, even when they should be within the power of your model? Are there distributions where you get potentially problematic behavior like in the steganography case?

(That's also surely of interest to the mainstream ML community given the recent prominence of variational autoencoders, so it seems quite likely someone has done it.)

After that there is a more subtle question about learnability, but it would be good to start with the easy part.

Some proposed ways to train a system to imitate a human imitate a human involve having one system imitate a human, while another system tries to tell the first system apart from an actual human. If the first system can get classified as human by the second system, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.

I described a possible problem with these approaches in a LessWrong thread:

As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing.

Here's a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit in a human-produced string, then this implied probability distribution will be an accurate prediction of the human string.

Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here's an extension that makes the problem slightly easier.

We want some way to represent a distribution over strings. One way of representing a distribution over strings S is to introduce some auxiliary data A and function f:A→S, and represent a distribution p over A. For example, if the model for generating strings is a PCFG, then the auxiliary data A is the actual parse tree. We could assume that p is represented by a bit-by-bit predictor (which takes a prefix of the data A and returns a probability for the next bit). In the PCFG example, this is easy; just order the data A in a "generative" order, and these bit-by-bit predictions will be easy.

Given the actual string S, it is possible to show a lower bound on this string's marginal probability under p by importance sampling possible parse trees consistent with a given string. That is, the system will represent some distribution q over A (after seeing the string S), which only outputs A values for which f(A)=S. This sampler can be used to estimate the total probability under p of A values for which f(A)=S.

Note that this is only one method of estimating the probability. Another method is model-counting using hash functions.

So in order to predict a human-produced string, the AI:

We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs probabilities for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 and then run the function f on the result.

This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.