Summary: in approximating a scheme like HCH, we would like some notion of "the best the prediction can be given available AI capabilities". There's a natural notion of "the best prediction of a human we should expect to get". In general this doesn't yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.
(thanks to Ryan Carey, Paul Christiano, and some people at the November veteran's workshop for helping me work through these ideas)
Suppose we would like an AI system to predict what HCH would do. The AI system is limited; it doesn't have a perfect prediction of a human. What's the best we should expect it to do?
As a simpler sub-question, we can ask what the best prediction for a single query to a human is. Let H:String→ΔString be the "true human": a stochastic function mapping a question to a distribution over answers (say, over quantum uncertainty). How "good" of a prediction function ^H:String→ΔString should we expect to get?
The short answer is that, for any question x, we should expect ^H(x) to be within ϵ of some pretty good prediction of H(x).
(feel free to skip this section if you're willing to buy the previous paragraph)
We will create an online prediction system that on each iteration i takes in a question Xi:String and outputs either a distribution over answers Qi:ΔString, or ⊥ to indicate ambiguity. If outputting ⊥, the prediction system observes Yi∼H(Xi). We will construct this online prediction system from a bunch of untrusted experts P1,...,PK:Δ(String→ΔString), each of whom is a probability distribution over the human H.
Suppose one expert is "correct" in that in fact H∼Pk for some k. Then KWIK learning will succeed in creating an online prediction system such that, with high probability, for each i in which Qi (and not ⊥) is output,
∥Qi−Pk(Yi|the data known at time i)∥₁<ϵ.
That is, the predictions Qi will be close in total variation distance to the "correct predictions" that Pk makes. Furthermore, ⊥ must be output only Õ(K/ϵ²) times; this measures the amount of training data required.
For the rest of this post we will assume that, after setting up the KWIK learner, we do active learning (finding inputs x on which the learner outputs ⊥) until the KWIK learner no longer outputs ⊥, and then obtain ^H from the current state of the KWIK learner. If we didn't do this, there would be no fixed stochastic function ^H, because the state of the learner would keep changing over time.
The assumptions in this section (especially that one expert is correct) are pretty sketchy, but I expect the basic picture of "predictions should be good within ϵ" to work out.
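To make the "query until there is no more ⊥, then freeze" procedure concrete, here is a toy sketch. It is not the actual KWIK algorithm and carries none of its guarantees (in particular not the Õ(K/ϵ²) bound). It makes simplifying assumptions of my own: each expert is a fixed function from questions to answer distributions (rather than a distribution over such functions), the question pool is finite, and an expert stops being "credible" once its likelihood falls below a hypothetical `drop_ratio` times the best expert's. The names `DisagreementLearner` and `fit` are made up for illustration.

```python
def l1(p, q):
    """L1 distance between two distributions given as {answer: prob} dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


class DisagreementLearner:
    """Toy stand-in for a KWIK learner over a finite set of experts.

    Assumption: each expert is a plain function question -> {answer: prob}.
    """

    def __init__(self, experts, eps, drop_ratio=1e-3):
        self.experts = list(experts)
        self.eps = eps
        self.drop_ratio = drop_ratio  # hypothetical credibility threshold
        self.weights = [1.0] * len(self.experts)  # relative likelihoods

    def credible(self):
        top = max(self.weights)
        return [i for i, w in enumerate(self.weights) if w >= self.drop_ratio * top]

    def predict(self, x):
        """Return a distribution, or None (standing in for ⊥) on disagreement."""
        preds = [self.experts[i](x) for i in self.credible()]
        for p in preds:
            for q in preds:
                if l1(p, q) >= self.eps:
                    return None
        # All credible experts are within eps of each other; return one of them.
        return preds[0]

    def observe(self, x, y):
        """Multiply each expert's weight by the probability it gave to y."""
        self.weights = [w * e(x).get(y, 0.0) for w, e in zip(self.weights, self.experts)]
        top = max(self.weights)
        if top == 0.0:
            raise ValueError("every expert assigned probability 0 to the observed answer")
        self.weights = [w / top for w in self.weights]  # rescale to avoid underflow


def fit(learner, questions, ask_human, max_queries=1000):
    """Ask the human about ambiguous questions until none remain or the budget runs out."""
    queries = 0
    while queries < max_queries:
        ambiguous = [x for x in questions if learner.predict(x) is None]
        if not ambiguous:
            break
        x = ambiguous[0]
        learner.observe(x, ask_human(x))
        queries += 1
    return queries
```

After `fit` returns with no ambiguous questions left, `learner.predict` plays the role of the frozen ^H on that question pool.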
Now that we have an approximate prediction of a human, we can use this to approximate a collection of humans. For example, we might want to predict H(“a”)+H(“b”), i.e. the result of asking H the questions “a” and “b” and summing the answers. In general we can consider any function g:(String→ΔString)×String→ΔString which computes something by querying some stochastic function a bunch of times, and consider the problem of predicting g(H).
The obvious way to predict g(H) is g(^H); in this case, ^H(“a”)+^H(“b”). But this can be highly inaccurate even if ^H is accurate!
Let us say that either H(“a”)=0 and H(“b”)=1, or H(“a”)=1 and H(“b”)=0. The AI does not have enough information to distinguish these possibilities; under this uncertainty, it is reasonable to think they are equally likely, so we have ^H(“a”)=^H(“b”)=Bernoulli(0.5).
The AI has enough information to conclude that H(“a”)+H(“b”)=1. But the distribution ^H(“a”)+^H(“b”) will put 0.25 probability mass on 0, 0.5 on 1, and 0.25 on 2.
In general we shouldn't expect replacing H with ^H to work very well; it does not take into account any correlation between H(x1) and H(x2) for x1≠x2.
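The two-question example can be checked numerically. The sketch below represents distributions as plain dicts and compares the sum computed from independent marginals against the sum computed from the joint distribution the AI actually believes. The helper names (`pushforward_independent`, `pushforward_joint`) are my own, chosen for illustration.

```python
from itertools import product


def pushforward_independent(dists, combine):
    """Distribution of combine(y1, ..., yn) when each yi is drawn independently."""
    out = {}
    for combo in product(*(d.items() for d in dists)):
        values = tuple(v for v, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        key = combine(*values)
        out[key] = out.get(key, 0.0) + prob
    return out


def pushforward_joint(joint, combine):
    """Distribution of combine(y1, ..., yn) under a joint {(y1, ..., yn): prob}."""
    out = {}
    for values, prob in joint.items():
        key = combine(*values)
        out[key] = out.get(key, 0.0) + prob
    return out


def add(a, b):
    return a + b


# Marginal predictions: each answer alone looks like Bernoulli(0.5).
h_hat_a = {0: 0.5, 1: 0.5}
h_hat_b = {0: 0.5, 1: 0.5}

# The AI's actual beliefs: the two answers are perfectly anti-correlated.
joint = {(0, 1): 0.5, (1, 0): 0.5}

independent_sum = pushforward_independent([h_hat_a, h_hat_b], add)
true_sum = pushforward_joint(joint, add)
```

Here `independent_sum` spreads its mass over 0, 1 and 2, while `true_sum` puts all of its mass on 1, even though both are built from the same marginals.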
Can we do better by taking into account the fact that the experts described in the previous section have a full joint distribution over all the H(x) values? Not really; the good expert Pk cannot be distinguished from ^H by more than ϵ with a single experiment (i.e. a single question-answering round). So if some other expert Pk′ makes the same predictions as ^H, this expert will not be proven wrong in a single experiment.
Now, the expert might be proven wrong if we expand the notion of "experiment" to include asking H multiple questions. But this becomes more expensive, and actually doesn't help much when we're considering pretty big collections of H calls (where correlations over a greater number of H(x) values are important).
I've basically given up on the problem of predicting large computations made out of calls to H using untrusted expert advice. There's just too much room for an expert to mess with the predictions by changing the correlation structure in ways that are hard to detect with short experiments.
Despite the fact that predicting large computations made of calls to H seems intractable, large computations made of calls to ^H are useful in their own right. We can think of g(^H) as a collaboration among many "clones" of a single human, each of whom has a personality sampled from the AI's distribution over that human's personality traits. That is, each call to ^H is considered to be asking a question to an independent sampling of the human's psychological parameters (sampled from the AI's information state).
For example, if the AI does not know Bob's favorite color, then ^Bob(“what is your favorite color?”) will be stochastic. If we consider the computation g(H):=[H(“what is your favorite color?”)=H(“what is your favorite color?”)] which asks for H's favorite color twice and checks if they are equal, then g(^Bob) will return false a non-negligible fraction of the time.
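As a quick check on how often this happens: if the two calls are independent draws from the same distribution, they agree with probability equal to the sum of squared answer probabilities. The color distribution below is made up purely for illustration.

```python
def agreement_probability(dist):
    """P(two independent samples from dist are equal) = sum of squared probabilities."""
    return sum(p * p for p in dist.values())


# Hypothetical AI beliefs about Bob's favorite color (illustrative numbers only).
bob_hat_color = {"blue": 0.5, "green": 0.3, "red": 0.2}

p_equal = agreement_probability(bob_hat_color)  # 0.25 + 0.09 + 0.04
p_false = 1.0 - p_equal  # how often the equality check returns false
```

With these numbers the equality check fails well over half the time, whereas for the real Bob it would never fail (assuming his answer is stable). The more the AI knows about Bob, the more concentrated the distribution and the closer `p_equal` gets to 1.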
If we define g such that g(H)=HCH (i.e. g asks its argument how to spawn more copies and so on), then g(^H) is the equivalent of HCH for clones sampled from the AI's information state. (See also the notation for HCH variants in this post). The issue with psychological parameters is pretty weird but doesn't seem to present serious difficulties for most uses of HCH I can think of. I haven't thought about it a ton, but in general it seems like it should be possible to collaborate with clones of yourself that have slightly different psychological parameters (they'll only be slightly different if the AI knows a lot about you). I confirmed with Paul Christiano that he is optimistic about g(^H) being useful and pessimistic about predictions of HCH proper that take correlation into account.
When considering very large computations g(^H), we might be concerned that local errors could propagate throughout the computation. But it's possible to mitigate this by doing something like taking multiple samples of ^H(x) for some question x and taking a majority vote, as described in this post.
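For a rough sense of how much a majority vote helps, here is a sketch under strong simplifying assumptions of my own (not necessarily the scheme in the linked post): each sample of ^H(x) is independently "good" with probability `p_good`, all good samples give the same answer, and the vote succeeds whenever more than half of the samples are good.

```python
from math import comb


def majority_good_probability(p_good, n_samples):
    """P(more than half of n independent samples are good), for odd n."""
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_good**k * (1 - p_good) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )
```

For example, with `p_good = 0.9` a single sample is good 90% of the time, while a 3-way vote succeeds with probability 0.972 under these assumptions. The independence assumption matters: correlated errors across samples would not be voted away.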
(feel free to skip this section)
Paul Christiano told me about an idea to get our predictions ^H(x) to not overestimate the probability of any action by more than a factor of 1+ϵ, i.e. ∀y:Qi(y)≤(1+ϵ)Pk(Yi=y|the data known at time i).
Roughly, this can be done by taking the minimum probability of y according to all the credible experts, then renormalizing. This seems useful if we're concerned about Q predicting rare bad things that Pk wouldn't predict. It doesn't change the nature of the analysis much, though.
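The "minimum, then renormalize" step might look like the sketch below. This is my own minimal reading of the rough description, not a worked-out version of Paul's proposal; the function name `pessimistic_prediction` and the example numbers are hypothetical. One thing the sketch makes easy to see: if Z is the total mass of the pointwise minimum, then the result overestimates any credible expert by a factor of at most 1/Z.

```python
def pessimistic_prediction(expert_preds):
    """Pointwise minimum over credible experts' predictions, renormalized.

    expert_preds: list of {answer: prob} dicts, one per credible expert.
    Returns (q, z), where z is the total mass of the pointwise minimum.
    """
    answers = set().union(*expert_preds)
    floor = {y: min(p.get(y, 0.0) for p in expert_preds) for y in answers}
    z = sum(floor.values())
    if z == 0.0:
        raise ValueError("experts have disjoint support; nothing to renormalize")
    q = {y: m / z for y, m in floor.items()}
    return q, z


# Two hypothetical credible experts that nearly agree (illustrative numbers only).
preds = [
    {"ok": 0.98, "bad": 0.02},
    {"ok": 0.99, "bad": 0.01},
]
q, z = pessimistic_prediction(preds)
```

Since q(y) = min over k of Pk(y), divided by Z, we get q(y) ≤ Pk(y)/Z for every credible expert k. So when the credible experts nearly agree (Z close to 1), the overestimation factor 1/Z is close to 1.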
Note: I currently think that the basic picture of getting within ϵ of a good prediction is actually pretty sketchy. I wrote about the sample complexity here. In addition to the sample complexity issues, the requirement is for predictors to be Bayes-optimal, but Bayes-optimality is not possible for bounded reasoners. This is important because e.g. some adversarial predictor might make very good predictions on some subset of questions (because it's spending its compute on those specifically), causing other predictors to be filtered out (if those questions are used to determine who the best predictor is). I don't know what kind of analysis could get the ϵ-accuracy result at this point.