A possible training procedure for human-imitators

[-]paulfchristiano10y10

This looks a lot like a variational autoencoder, unless I'm missing some distinction. (Note that once you have a bit-by-bit predictor, you can assume WLOG that the distribution on A is uniform.) Thinking about how variational autoencoders work in the superintelligent case seems worthwhile if we want to scale up something like imitation. I also wouldn't be that surprised if it produced some practically useful insight.
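One way to read the parenthetical about bit-by-bit predictors is as a reparameterization: if each bit can be predicted from the bits before it, the whole string can be generated as a deterministic function of uniform noise. Below is a minimal sketch of that reading. The names `sample_bits` and `toy_predictor` are hypothetical and do not come from the post.

```python
def sample_bits(uniforms, prob_one):
    """Turn uniform draws in [0, 1) into bits using a bit-by-bit predictor.

    prob_one(prefix) is assumed to return P(next bit = 1 | prefix).
    """
    bits = []
    for u in uniforms:
        # Each bit is a deterministic function of its own uniform draw
        # and the bits generated so far.
        bits.append(1 if u < prob_one(bits) else 0)
    return bits


def toy_predictor(prefix):
    # Hypothetical predictor: the next bit tends to repeat the previous one.
    if not prefix or prefix[-1] == 1:
        return 0.75
    return 0.25


example = sample_bits([0.5, 0.5, 0.9], toy_predictor)
```

Under this reading, all of the randomness lives in the uniform draws, and the predictor is absorbed into the deterministic map from noise to output.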

(Incidentally, I think that variational autoencoders and generative adversarial nets are the leading approaches to generative modeling right now.)

I agree that the steganography problem looks kind of bad for adversarial methods in the limit. For coping with the analogous problem with approval-maximization, I think the best bet is to try to make the generative model state transparent to the discriminative model. But this is obviously not going to work for generative adversarial models, since access to the generator state would make distinguishing trivial.

[-]paulfchristiano10y00

Actually I'm not sure exactly what you mean by importance sampling here.

The variational lower bound would be to draw samples from $q$ and compute $\log (p / q)$. The log probability of the output under $p$ is bounded below by the expectation of this quantity (with equality iff $q$ is the correct conditional distribution over $A$).

I'm just going to work with this in my other comments; I assume it amounts to the same thing.

[-]jessicata10y00

What I mean is: compute $E_{q} [[f (A) = x] \, p (A) / q (A)]$, which is a probabilistic lower bound on $P_{p} (f (A) = x)$.

The variational score gives you a somewhat worse lower bound if $q$ is different from $p (A | f (A) = x)$. Due to Jensen's inequality, $E_{q} [\log ([f (A) = x] \, p (A) / q (A))] \leq \log E_{q} [[f (A) = x] \, p (A) / q (A)] \leq \log P_{p} (f (A) = x)$.

It probably doesn't make a huge difference either way.
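To make the comparison concrete, here is a small sketch that computes both quantities exactly (by summing over $A$ rather than sampling) on a toy problem. Everything specific here is an assumption for illustration only: $A$ uniform on $\{0, \dots, 7\}$, $f(a) = a \bmod 3$, the target $x = 1$, and the two choices of $q$.

```python
import math


def f(a):
    # Hypothetical generator function, chosen only for illustration.
    return a % 3


def importance_weighted_value(p, q, x):
    """Exact E_q[[f(A) = x] p(A) / q(A)]."""
    return sum(qa * (p[a] / qa) for a, qa in q.items() if qa > 0 and f(a) == x)


def variational_bound(p, q, x):
    """Exact E_q[log([f(A) = x] p(A) / q(A))]."""
    total = 0.0
    for a, qa in q.items():
        if qa == 0:
            continue
        if f(a) != x:
            # q puts mass where the indicator is 0, so the log is -infinity.
            return -math.inf
        total += qa * math.log(p[a] / qa)
    return total


p = {a: 1 / 8 for a in range(8)}   # assumed uniform prior over A
x = 1                              # f(a) == 1 for a in {1, 4, 7}
true_log_prob = math.log(3 / 8)    # log P_p(f(A) = x)

q_skewed = {1: 0.5, 4: 0.3, 7: 0.2}        # right support, wrong weights
q_exact = {1: 1 / 3, 4: 1 / 3, 7: 1 / 3}   # equals p(A | f(A) = x)

iw_log = math.log(importance_weighted_value(p, q_skewed, x))
vb_skewed = variational_bound(p, q_skewed, x)
vb_exact = variational_bound(p, q_exact, x)
```

In this toy case the importance-weighted expectation recovers $\log (3/8)$ even with the skewed $q$, because that $q$ covers the whole preimage of $x$, while the variational score falls short by the KL divergence from $q$ to the true conditional and matches only when $q$ is exact.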

[-]paulfchristiano10y00

It would be great to see an analysis of this from a complexity-theoretic / cryptographic perspective. Are there distributions that can't be imitated correctly in this way, even when they should be within the power of your model? Are there distributions where you get potentially problematic behavior like in the steganography case?

(That's also surely of interest to the mainstream ML community given the recent prominence of variational autoencoders, so it seems quite likely someone has done it.)

After that there is a more subtle question about learnability, but it would be good to start with the easy part.
