Last year Stuart Armstrong announced a contest to come up with the best questions to ask an Oracle AI. Wei Dai wrote,
Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer).
He later related his answer to Paul Christiano's posts "Human-in-the-counterfactual-loop" and "Elaborations on apprenticeship learning". Here I'm interested in concrete things that could be expected to go wrong in the near future if we gave GPT-N this task.
To provide a specific example, suppose we provided the prompt,
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models.
If we assume that GPT-N is at least as good as a human expert team at generating blog posts, we could presumably expect it to produce a very high-quality post explaining how to inspect machine learning models. We would therefore have a way to automate alignment research at a high level. But a number of important questions remain, such as:
- How large would GPT-N need to be before it started producing answers comparable to those of a human expert team?
- Given the size of the model, what high-level incentives should we expect to guide its training? In other words, what mesa-optimization-like instantiations, exactly, can we expect to result from training?
- Is there a clear and unambiguous danger that the model would be manipulative? If so, why?
- Is the threat model more that we don't know what we don't know, or that we have a specific reason to believe the model would be manipulative in a particular direction?
How many samples did you prune through to get this? Did you do any re-rolls? What was your stopping procedure?