I'm offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: ones that allow us to get good results out of the Oracle without increasing risk.
To enter, put your suggestion in the comments below. The contest ends at the end of the 31st of August, 2019.
A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.
This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.
- A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
- A low bandwidth Oracle is one that must select its answers from a relatively small list. Though its answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted, because there are only a few possible answers available.
Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
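As a rough illustration, the two reward structures above can be sketched in code. This is a toy model only: the `world_outcome` function, the numbers, and the exact scoring rules are hypothetical placeholders, not part of either actual design.

```python
def world_outcome():
    """Stand-in for the real-world quantity being predicted
    (e.g. a measurement taken after the episode ends). Hypothetical."""
    return 100.0

def counterfactual_reward(prediction, erased):
    """Counterfactual Oracle (sketch): the reward is non-trivial only in
    episodes where the erasure event fires and humans never see the answer;
    otherwise it is a constant zero, so the Oracle gains nothing by
    manipulating readers through the content of its answer."""
    if not erased:
        return 0.0
    return -(prediction - world_outcome()) ** 2

def low_bandwidth_reward(chosen_option, options):
    """Low bandwidth Oracle (sketch): the answer must come from a small
    fixed list; here the reward is minus the number of options that would
    have scored strictly better than the one chosen."""
    actual = world_outcome()
    better = sum(1 for o in options
                 if abs(o - actual) < abs(chosen_option - actual))
    return -better
```

For example, `counterfactual_reward(90.0, erased=False)` is zero regardless of the prediction, while `counterfactual_reward(90.0, erased=True)` scores the prediction against the unseen outcome.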
Getting useful answers
The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, one that we don't inhabit.
Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a thousand projects on AI safety, but are unsure which one would perform best. We can't directly ask either Oracle, but there are indirect ways of getting advice:
- We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
- The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don't see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask for which team A it expects to most consistently outperform any team B.
Both these answers get around some of the restrictions by deferring to the judgement of our future selves, averaged across many randomised universes.
But can we do better? Can we do more?
Your better questions
This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work out of questions.
So I'm offering $1,000 for interesting new questions we can ask of these Oracles. Of this:
- $350 for the best question to ask a counterfactual Oracle.
- $350 for the best question to ask a low bandwidth Oracle.
- $300 to be distributed as I see fit among the non-winning entries; I'll be mainly looking for innovative and interesting ideas that don't quite work.
Questions and entries
Put your suggested questions in the comments below.
There has been some recent progress in designing safe AI Oracles. See for example the counterfactual Oracle and the low-bandwidth Oracle (online answering with very limited answer options); see Good and safe uses of AI Oracles for more details. Both these designs offer the hope of getting safe answers from a superintelligent AI, despite the many difficulties in getting AIs aligned with human values in general.
This safety comes at a cost, however, as both designs are limited in what questions can be asked and how they can be answered. But there are imaginative ways to partially get around these limitations and get good answers out of such Oracle designs. One can't ask the low-bandwidth Oracle any direct questions about designing AI safety; but one could ask "out of these 100,000 AI safety projects, which would we judge as having made the most progress in a year, if we fund that project and one other random one on the list?". And that question would give us an indirect way of making much faster AI safety progress.
So I'm organising a competition, to find the best and most high-impact use of either of these two Oracle designs to either enhance AI safety or dramatically improve humanity in general. There will be a £250 prize for the winning entry in each of the counterfactual and low-bandwidth Oracle categories - unless no entries are sufficiently good in one category, in which case the other category will get a first and a second prize. Depending on the answers, there may be a third £250 prize, awarded either to a) an interesting design that doesn't work but that could be fixed to work well, or b) an imaginative question design that, while not high-impact in itself, opens up whole new areas of potential questions.
Requirements for entries:
- For the counterfactual Oracle: the question asked, and an explanation as to how the reward is calculated in the world in which the erasure event happens (eg: "What is Google's stock price next month (given that we don't see this prediction)?", reward calculated as ||predicted price - actual price after erasure||^2).
- For the low-bandwidth Oracle: the question asked, phrased as a list of options for the Oracle to choose from, and how (and when) humans will reward or punish the Oracle for its answer (eg "What is Google's stock price next month (given that we see this prediction)?", options being a list of prices from $0 to $100,000 in $10 increments, reward calculated as negative of the number of options that were more accurate than the answer provided).
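The two example reward calculations above can be written out explicitly. This is a sketch of the examples as stated, not a prescription for entries; the Google stock-price question and the $0-$100,000 option list come from the examples, and treating the squared error as a loss to minimise is an assumption.

```python
def counterfactual_price_reward(predicted, actual):
    """Counterfactual example: squared error between the predicted and
    realised price, computed only in worlds where the erasure event
    happened (interpreted here as a loss the Oracle minimises)."""
    return (predicted - actual) ** 2

def low_bandwidth_price_reward(answer, actual):
    """Low-bandwidth example: options are prices from $0 to $100,000 in
    $10 increments; reward is minus the number of options strictly more
    accurate than the chosen answer."""
    options = range(0, 100_001, 10)
    return -sum(1 for o in options
                if abs(o - actual) < abs(answer - actual))
```

So an exactly correct answer scores 0 (no option beats it), and each more accurate option on the list costs one point of reward.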
Recommendations for entries:
- Especially for more complicated entries, it is recommended that the submission explain why its design will a) maintain safety, and b) give useful/accurate/high-impact answers. For example, what does knowing Google's counterfactual price actually tell us?
Please write your entries in the comments of this thread, labelling that you are submitting an entry. Others can answer/respond to the entry submitted, but each entry must be its own separate comment, and not an answer to another comment. There is no limit to the number of entries submitted per username; if the entries from the same username are similar, I may amalgamate them and consider them as a whole.
The contest will be judged by me, Stuart Armstrong, exclusively, though I may talk over some of the design ideas with colleagues. The criteria will be my best interpretation of what I have explained above. My decision will be final; however, before the prizes are awarded, if you feel (maybe due to a comment I left) that I am misinterpreting your suggestion, then a) message me about this, and b) re-write your submission to correct the misinterpretation.