Contest: $1,000 for good questions to ask to an Oracle AI

by Stuart Armstrong, 31st Jul 2019

Edit: contest closed now, will start assessing the entries.

The contest

I'm offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: they allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.

Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:

  • A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
  • A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
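As a rough illustration of the two designs, here is a minimal Python sketch of a single episode. Everything in it is an assumption made for illustration and is not taken from the linked write-ups: the function names, the numeric answer scored by absolute error, and the fixed erasure probability are all hypothetical.

```python
import random


def run_counterfactual_episode(oracle_answer, true_outcome, erasure_prob, rng):
    """Run one episode and return (answer_shown_to_humans, reward).

    Hypothetical setup: the answer is a number, scored by absolute error.
    """
    erased = rng.random() < erasure_prob
    if erased:
        # Humans never see the answer; it is scored against what happened.
        reward = -abs(oracle_answer - true_outcome)
        return None, reward
    # The answer is shown, but the reward is a constant, so the content of
    # the answer cannot change what the Oracle receives in this world.
    return oracle_answer, 0.0


def low_bandwidth_answer(scores, allowed_answers):
    """Pick the highest-scoring answer from a small, fixed list."""
    return max(allowed_answers, key=lambda answer: scores[answer])


rng = random.Random(0)
shown, reward = run_counterfactual_episode(3.0, 5.0, erasure_prob=0.1, rng=rng)
choice = low_bandwidth_answer({"yes": 0.2, "no": 0.7}, ["yes", "no"])
```

The point of the sketch is only the shape of the incentive: in the counterfactual case the reward varies with the answer solely in the erased branch, and in the low bandwidth case the answer space is restricted up front.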

Getting useful answers

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, that we don't inhabit.

Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can't directly ask either Oracle, but there are indirect ways of getting advice:

  • We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
  • The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don't see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask which team A[2] it expects to most consistently outperform any team B.

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
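The reward in the team-funding example can be sketched as follows. This is only a toy under stated assumptions: the `performance` table is a hypothetical stand-in for our end-of-year judgement, and the function names are invented for illustration. The Oracle's expected reward for naming team A is the chance that A beats a randomly chosen other team B.

```python
def win_rate(team, performance):
    """Fraction of the other teams that `team` outperforms (ties count as losses)."""
    others = [other for other in performance if other != team]
    wins = sum(performance[team] > performance[other] for other in others)
    return wins / len(others)


def best_team(performance):
    """The answer that maximises the Oracle's expected reward."""
    return max(performance, key=lambda team: win_rate(team, performance))


# Hypothetical end-of-year scores, as judged by our future selves.
performance = {"a": 3.0, "b": 1.0, "c": 2.0}
recommended = best_team(performance)
```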

But can we do better? Can we do more?

Your better questions

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I'm offering $1,000 for interesting new questions we can ask of these Oracles. Of this:

  • $350 for the best question to ask a counterfactual Oracle.
  • $350 for the best question to ask a low bandwidth Oracle.
  • $300 to be distributed as I see fit among the non-winning entries; I'll be mainly looking for innovative and interesting ideas that don't quite work.

Exceptional rewards go to those who open up a whole new category of useful questions.

Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be in their own separate comment threads and start with "Submission", and you must specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the Oracle's human users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it's obvious. Some short examples can be found here.

EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run only once per episode (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get its objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.


  1. A note on timezones: as long as it's still the 31st of August, anywhere in the world, your submission will be counted. ↩︎

  2. These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details. ↩︎

Comments

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer). In that case, reward function is computed as similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This seems to potentially significantly accelerate AI safety research while being safe since it's just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn't secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
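As one example of a "simpler way to measure similarity", here is a minimal sketch that scores predicted posts against the actual top posts using word-set overlap (Jaccard similarity). This is a swapped-in stand-in for the ML model, not the proposed reward itself, and the function names are hypothetical.

```python
def jaccard(text_a, text_b):
    """Word-set overlap between two texts, between 0.0 and 1.0."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def prediction_reward(predicted_posts, actual_top_posts):
    """Mean, over the actual top posts, of the best match among the predictions."""
    if not predicted_posts or not actual_top_posts:
        return 0.0
    best_matches = [
        max(jaccard(predicted, actual) for predicted in predicted_posts)
        for actual in actual_top_posts
    ]
    return sum(best_matches) / len(best_matches)
```

A measure this crude has far less attack surface than a learned model, at the cost of rewarding surface wording rather than ideas.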

It looks like my entry is pretty close to the ideas of Human-in-the-counterfactual-loop and imitation learning and apprenticeship learning. Questions:

  1. Stuart, does it count against my entry that it's not actually a very novel idea? (If so, I might want to think about other ideas to submit.)
  2. What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GAN (see Safe training procedures for human-imitators for one proposal)? It seems like there are lots of posts and comments about human imitations spread over LW, Arbital, Paul's blog and maybe other places, and it would be really cool if someone (with more knowledge in this area than I do) could write a review/distillation post summarizing what we know about it so far.
  1. I encourage you to submit other ideas anyway, since your ideas are good.

  2. Not sure yet about how all these things relate; will maybe think of that more later.

What if another AI would have counterfactually written some of those posts to manipulate us?

If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.

(I'm still confused and thinking about this, but figure I might as well write this down before someone else does. :)

While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

[This comment is no longer endorsed by its author]

This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.

So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a "status quo" bonus that won't hideously backfire, or something?)

Submission for a counterfactual oracle: precommit that, if the oracle stays silent, a week from now you'll try to write the most useful message to your past self, based on what happens in the world during that week. Ask the oracle to predict that message. This is similar to existing solutions, but slightly more meta, because the content of the message is up to your future self - it could be lottery numbers, science papers, disaster locations, or anything else that fits within the oracle's size limit. (If there's no size limit, just send the whole internet.)

You could also form a bucket brigade to relay messages from further ahead, but that's a bad idea. If the oracle's continued silence eventually leads to an unfriendly AI, it can manipulate the past by hijacking your chain of messages and thus make itself much more likely. The same is true for all high-bandwidth counterfactual oracles - they aren't unfriendly in themselves, but using them creates a thicket of "retrocausal" links that can be exploited by any potential future UFAI. The more UFAI risk grows, the less you should use oracles.

This is similar to existing solutions, but slightly more meta

I feel like this is about equally meta as my "Superintelligent Agent" submission, since my committee could output "Show the following message to the operator: ..." and your message could say "I suggest that you perform the following action: ...", so the only difference between your idea and mine is that in my submission the output of the Oracle is directly coupled to some effectors to let the agent act faster, and yours has a (real) human in the loop.

The more UFAI risk grows, the less you should use oracles.

Hmm, good point. I guess Chris Leong made a similar point, but it didn't sink in until now how general the concern is. This seems to affect Paul's counterfactual oversight idea as well, and maybe other kinds of human imitations and predictors/oracles, as well as things that are built using these components like quantilizers and IDA.

Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present. If the ranges of oracles overlap in time, such messages can colonize their way backwards from decades ahead. It's especially bad if humanity's FAI project depends on oracles - that increases the chance of UFAI in the world where oracles are silent, which is where the predictions come from.

One possible precaution is to use only short-range oracles, and never use an oracle while still in prediction range of any other oracle. But that has drawbacks: 1) it requires worldwide coordination, 2) it only protects the past. The safety of the present depends on whether you'll follow the precaution in the future. And people will be tempted to bend it, use longer or overlapping ranges to get more power.

In short, if humanity starts using high-bandwidth oracles, that will likely increase the chance of UFAI and hasten it. So such oracles are dangerous and shouldn't be used. Sorry, Stuart :-)

Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present.

Note that in the case of counterfactual oracle, this depends on UFAI "correctly" solving counterfactual mugging (i.e., the UFAI has to decide to pay some cost in its own world to take over a counterfactual world where the erasure event didn't occur).

So such oracles are dangerous and shouldn’t be used.

This seems too categorical. Depending on the probabilities of various conditions, using such oracles might still be the best option in some circumstances.

Very worthwhile concern, and I will think about it more.

Yeah. And low-bandwidth oracles can have a milder version of the same problem. Consider your "consequentialist" idea: if UFAI is about to arise, and one of the offered courses of action leads to UFAI getting stopped, then the oracle will recommend against that course of action (and for some other course where UFAI wins and maxes out the oracle's reward).

Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).

The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query "What would happen if the next action is X?" (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query "What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: ..." (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.

More precisely, let

  • $M$ be the counterfactual oracle,
  • $H(Q \mid M)$ be the human's answer to question $Q$ when given the ability to call $M$ on any question other than $Q$, and
  • $d$ be some distance metric on answers in natural language (it's not that hard to make something like this, even with current ML tools).

Then, reward $M$ as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let $L_M(Q) = d(H(Q \mid M), M(Q))$, where $M(Q)$ is hidden from $H$ and judged only by $d$, as in the standard counterfactual oracle setup.
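The scoring rule just described can be sketched in Python as below. This is a simplified illustration under assumptions: `oracle`, `human`, and `distance` are hypothetical callables, and erasures inside the human's subquestion calls are not modelled (the proposal handles them recursively in the same way).

```python
def ida_loss(question, oracle, human, distance, erased):
    """Loss for one oracle call; None when there is no erasure (no training signal)."""
    if not erased:
        return None

    # The oracle's answer to the main question is never shown to the human.
    oracle_answer = oracle(question)

    def consult(subquestion):
        # The human may call the oracle on anything except the original question.
        if subquestion == question:
            raise ValueError("the human may not ask the oracle the original question")
        return oracle(subquestion)

    human_answer = human(question, consult)
    # Only the automated distance metric compares the two answers.
    return distance(human_answer, oracle_answer)
```

For instance, `human` could be a function that splits the question into parts, calls `consult` on each, and combines the results.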

(Of course, this doesn't actually work because it has no guarantees with respect to inner alignment, but I think it has a pretty good shot of being outer aligned.)

Is it safe to ask the Oracle a subquestion in the event of erasure? Aren't you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven't been following it closely.)

I'm not sure I understand the concern. Isn't the oracle answering each question to maximize its payoff on that question in event of an erasure? So it doesn't matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles---or is there some way that an oracle is a distinguished part of the environment?)

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

I guess you can construct a model where the oracle does what you want, but only if you don't ask any other oracles questions during the evaluation period, but it's not clear to me how you would end up in that situation and at that point it seems worth trying to flesh out a more precise model.

I’m not sure I understand the concern.

Yeah, I'm not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I've been thinking about it myself. One thing I've come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they "want", whereas the same thing doesn't seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).

I mean, if the oracle hasn't yet looked at the question they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.

Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.

Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well "on the current question" or "in the real world" or "on the actual questions that it gets"). That's because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input. In the nested case, performance on given inputs does increase.

in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input

Why is that? Doesn't my behavior on question #1 affect both question #2 and its answer?

Also, this feels like a doomed game to me---I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

I was assuming each "question" actually includes as much relevant history as we can gather about the world, to make the Oracle's job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that's not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn't include X.

My point about "if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case" still stands though, right?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?

You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?

What I want: "There is a model in the class that has property P. Training will find a model with property P."

What I don't want: "The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P."

Example of what I don't want: "Manipulative actions don't help get a high reward (at least for the episodic reward function we intended), so the model won't produce manipulative actions."

So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that's a useful direction to think?

So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

This is an objection to reasoning from incentives, but it's stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from "what kind of policy would be selected under a plausible objective"). It's hard for me to see how nested vs. sequential really matters here.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

(I don't think model class is going to matter much.)

I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.

(Though I don't think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart's posts about "forward-looking" vs. "backwards-looking" oracles?)
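A minimal sketch of that default training method, assuming numeric predictions scored by squared error. The candidate models, the episode format, and the function name are all hypothetical; the only thing taken from the description above is "pick the best predictor over the data so far, using only erased episodes".

```python
def select_model(models, episodes):
    """Return the name of the model with the lowest squared error on erased episodes.

    models:   dict mapping a name to a callable question -> prediction
    episodes: list of (question, outcome, erased) tuples
    """
    scored = [(question, outcome) for question, outcome, erased in episodes if erased]

    def total_error(model):
        return sum((model(question) - outcome) ** 2 for question, outcome in scored)

    return min(models, key=lambda name: total_error(models[name]))
```

Dropping the `if erased` filter gives the "just consider all the data" variant mentioned in the parenthetical.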

I think it's also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates---i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
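The REINFORCE-style update mentioned here can be sketched for a softmax policy over a handful of discrete "cognitive actions". This is a toy illustration under assumptions, not a claim about how any real system is trained: the function names and the learning rate are hypothetical, and the predicted loss plays the role of a baseline.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, loss, predicted_loss, learning_rate=0.1):
    """Return updated logits after one round in which `action` was taken."""
    probs = softmax(logits)
    # Positive when the round went better (lower loss) than predicted.
    advantage = predicted_loss - loss
    # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
    return [
        logit + learning_rate * advantage * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]
```

An action taken in a round with lower loss than predicted becomes more probable, and one taken in a round with higher loss becomes less probable.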

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that's a useful direction to think?

Seems like the counterfactual issue doesn't come up in the Opt case, since you aren't training the algorithm incrementally---you'd just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of OPT (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).

I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.

This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn't that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?

(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)

I don't understand what you mean in this paragraph (especially "since each possible parameter setting is being evaluated on what other parameter settings say anyway"), even after reading Stuart's post, plus Stuart has changed his mind and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart's reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)

would want to do an online learning version of OPT (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).

Couldn't you simulate that with Opt by just running it repeatedly?

This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn't that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?

"The best model" is usually regularized. I don't think this really changes the picture compared to imagining optimizing over some smaller space (e.g. space of models with regularize<x). In particular, I don't think my intuitions are sensitive to the difference.

I don't understand what you mean in this paragraph (especially "since each possible parameter setting is being evaluated on what other parameter settings say anyway")

The normal procedure is: I gather data, and am using the model (and other ML models) while I'm gathering data. I search over parameters to find the ones that would make the best predictions on that data.

I'm not finding parameters that result in good predictive accuracy when used in the world. I'm generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters).

Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad?

I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart's reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)

I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don't think any of this is a problem (or if you look at it another way, it wouldn't be a problem if you had a solution that had any chance at all of working).

I don't think Stuart is thinking about it in this way, so it's hard to engage at the object level, and I don't really know what the alternative perspective is, so I also don't know how to engage at the meta level.

Is there a particular claim where you think there is an interesting disagreement?

Couldn't you simulate that with Opt by just running it repeatedly?

If I care about competitiveness, rerunning OPT for every new datapoint is pretty bad. (I don't think this is very important in the current context, nothing depends on competitiveness.)

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

Does anyone know what Paul meant by this? I'm afraid I might be missing some relatively simple but important insight here.

[This comment is no longer endorsed by its author]

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

Yeah, that's a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that's happening then it's a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you'd have to solve if you ever actually wanted to implement a counterfactual oracle.

I was thinking about this, and it's a bit unclear.

First, if you're willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you're guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you're relying on getting enough training data to produce an agent which will optimize for this objective, you're screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you're already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you're already willing to make this (very strong) assumption, so it's fine.

Second, I don't think you're entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
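To see how the second relaxation could concentrate expected reward, here is a small sketch under strong simplifying assumptions that are mine, not part of the proposal: the subquestions form a single chain of fixed depth, each level is erased independently with the same probability, and a run of k consecutive erased levels is weighted by `reward_base ** k`.

```python
def complete_erasure_share(erasure_prob, reward_base, depth):
    """Share of the expected reward weight that comes from a complete erasure."""
    if depth < 1 or not 0.0 < erasure_prob <= 1.0:
        raise ValueError("need depth >= 1 and 0 < erasure_prob <= 1")
    contributions = []
    for k in range(1, depth + 1):
        # Probability that exactly the first k levels are erased.
        prob = erasure_prob ** k
        if k < depth:
            prob *= 1.0 - erasure_prob
        contributions.append(prob * reward_base ** k)
    # The last entry is the complete-erasure case (all `depth` levels erased).
    return contributions[-1] / sum(contributions)
```

In this toy model, with an erasure probability of 0.5 and depth 3, an unscaled reward (`reward_base = 1`) puts only a quarter of the weight on the complete-erasure world, while `reward_base = 4` puts most of it there.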

First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.

But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large so you'll run out of resources (in the counterfactual world) if you tried to do that, so the Oracle won't be able to give you a useful prediction. Wouldn't it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my "predict the best AF posts" idea)? I don't see what value the IDA idea is adding here.
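For a sense of scale, here is a quick sketch counting the questions humans would have to answer by hand, assuming (hypothetically) that every question spawns the same number of subquestions down to a fixed depth.

```python
def tree_size(branching, depth):
    """Number of questions in a full query tree, counting the root as level 0."""
    return sum(branching ** level for level in range(depth + 1))
```

With 10 subquestions per question and 6 levels of nesting, that is already over a million questions for a single top-level query.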

Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.

Given the above, "only provide rewards in the event of a complete erasure" doesn't seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?

Yeah, that's a good point.

Okay, here's another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step's worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.

Maybe we can turn this into a zero-sum game, though? Here's a proposal: let $M'$ be a copy of $M$ and $Q_\text{tree}$ be the set of all questions in the current tree that also get erasures. Then, let $L_Q = d(H(Q \mid M), M(Q)) - \frac{1}{|Q_\text{tree}| - 1} \sum_{Q' \in Q_\text{tree} \setminus \{Q\}} d(H(Q' \mid M'), M'(Q'))$ such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It's still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.

It is also worth noting, however, that even if this works it is a very artificial fix, since the term you're subtracting is a constant with no dependence on $M$, so if you're trying to do gradient descent to optimize this loss, it won't change anything at all (which sort of goes to show how gradient descent doesn't distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we're still back at the problem of none of this working unless you're willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
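One hypothetical reading of the zero-sum idea, as a sketch: each erased node's loss is its own raw loss minus the average raw loss that a frozen copy of the oracle incurs on the other erased nodes in the same tree. The code assumes the copy produces the same raw losses as the oracle itself (it is a copy), and the function name is made up for illustration.

```python
def zero_sum_losses(raw_losses):
    """Adjust per-node losses so they sum to zero across one query tree.

    raw_losses holds the unadjusted loss of each erased node. The subtracted
    term is computed from a frozen copy of the oracle, so it is treated as a
    constant with respect to the oracle being trained.
    """
    n = len(raw_losses)
    if n < 2:
        raise ValueError("need at least two erased nodes to average over")
    total = sum(raw_losses)
    # subtract the mean raw loss of all *other* erased nodes in the tree
    return [loss - (total - loss) / (n - 1) for loss in raw_losses]


adjusted = zero_sum_losses([1.0, 2.0, 6.0])
```

Here `adjusted` is `[-3.0, -1.5, 4.5]`, which sums to zero. Because the subtracted term is a constant from the trained oracle's point of view, the gradient of each adjusted loss equals the gradient of the raw loss, which is the point made above about gradient descent.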

which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives

Sorry I haven't followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?

In the case of "actual" IDA, I guess the plan is for each overseer to look inside the model they're training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I'm not sure how that can happen at the lower levels where the overseers are not very smart.

Two basic questions I couldn't figure out (sorry):

Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?

If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?

You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies as much as the reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.

The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you're assuming.

Submission: low-bandwidth oracle

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was picked.
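A single episode of this protocol could be sketched as below. Everything here is illustrative: the `judge` callable stands in for the humans who spend time period X on both lines of thought, and the shuffle (so the judge cannot tell which option came from the oracle) is an added assumption, not part of the submission.

```python
import random


def plan_criticism_episode(lines_of_thought, oracle_pick, judge, rng):
    """Return 1.0 if the oracle's pick is judged more useful, else 0.0.

    judge(a, b) is a hypothetical stand-in for the human evaluation and must
    return whichever of its two arguments was more useful to think about.
    """
    others = [line for line in lines_of_thought if line != oracle_pick]
    baseline = rng.choice(others)   # the randomly picked comparison line
    pair = [oracle_pick, baseline]
    rng.shuffle(pair)               # hide which option the oracle chose
    winner = judge(pair[0], pair[1])
    return 1.0 if winner == oracle_pick else 0.0


lines = ["reward hacking", "distributional shift", "typo in section 2"]
# toy judge: pretends that the longer description was more useful
reward = plan_criticism_episode(
    lines, "distributional shift", lambda a, b: max(a, b, key=len),
    random.Random(0),
)
```

The oracle's only output is an index into a short human-written list, which is what keeps the bandwidth low.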

Submission. "Bug Finder." Ask LBO to find the most serious bug in some source code or executable. The set of possible answers is code locations crossed with types of bug (e.g., "off by one", "buffer overflow", "insecure cryptography", "other"). Gather a group of experts to search that location for that type of bug, then reward the oracle according to the seriousness of the bug that is found. This can be used for cyber-offensive or defensive purposes, or just to fix bugs.

Submission. "Plagiarize the future." For the counterfactual Oracle, ask the Oracle to predict articles in a future publication (e.g., newspaper, magazine, journal, website, blog) (counterfactually if we didn’t see the Oracle’s answer). This is a generalization of my earlier submission. The result can be used to predict disasters ahead of time and try to prepare for or prevent them, or to advance any field of research, or to make money. Note that for disaster prediction, the Oracle won't be able to predict any disasters that are severe enough to disable the Oracle or its feedback/training mechanism, so it might give a false sense of security.

Submission. "Consequentialist." For LBO, ask the oracle to pick a course of action, from a list generated by humans, that has the best consequences. Perform that action, and after some period of time has passed, reward the oracle according to how good the consequences actually were, based on the subjective judgement of some group of judges. (This kind of subsumes all my other LBO submissions.)

Submission. "Overlooked Comment." For LBO, ask the oracle to find the most overlooked comment (or post) on LW (or some other forum). The set of possible answers is all the existing comments on LW. After getting the answer from the oracle, gather a group of judges and present each of them with the comment picked by the oracle and a few other random comments (without telling them which is which, to reduce bias) and have the judges assign scores to each comment (according to what they think the karma of that comment should be). Reward the oracle according to the difference between the judges' average score and actual karma of that comment.
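The reward rule in this submission is simple enough to write down directly. The sketch below assumes judges give numeric scores on the same scale as karma; the function name is hypothetical.

```python
from statistics import mean


def overlooked_reward(judge_scores, actual_karma):
    """Reward = judges' average score minus the comment's actual karma.

    A large positive value means the judges think the comment deserved far
    more karma than it received, i.e. it was overlooked.
    """
    return mean(judge_scores) - actual_karma


reward = overlooked_reward([10, 14, 12], actual_karma=3)
```

With these toy numbers the judges' average is 12 against an actual karma of 3, so the reward is 9.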

Submission. "Pre-Crime 2." For LBO, ask the oracle to predict where to find evidence for the most serious crime that is in progress or currently being planned. The set of possible answers is coordinates on Earth crossed with a list of possible places to look at that coordinate (e.g., "basement", "browser history"). Search the predicted location and if sufficient evidence is found, do a more intrusive search of whoever is implicated in the crime. Reward the oracle according to the seriousness of the crime that is detected after the more intrusive search. This can be used to implement a minimally intrusive police state (perhaps for the purposes of minimizing x-risk).

Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren't optimized for the objective function that we have in mind, and plan our questions accordingly?

Assume either way, depending on what your suggestion is for.

Where (under which assumption) would you suggest that people focus their efforts?

Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on?

Your examples all seem to assume oracles that are superhumanly intelligent. If that's the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today's or are there lots of other almost-as-capable AIs around?

ETA: The reason for asking these questions is that you're only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren't surprised later when you don't pick them as winners due to this kind of reason.

The ideal solution would have huge positive impacts and complete safety, under minimal assumptions. More realistically, there will be a tradeoff between assumptions and impact.

I'm not suggesting any area for people to focus their efforts, because a very effective approach with minimal assumptions might win, or a fantastically effective approach under stronger assumptions. It's hard to tell in advance what will be the most useful.

Here are some submission examples (note I'm not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):

  • "Submission. For the counterfactual Oracle, ask the Oracle what Google's stock price will be next month (counterfactually if we didn't see the Oracle's answer). In that case, loss function computed as ||predicted price - actual price after erasure||^2. If we don't see the answer, the programmers are assumed to not ask the question again for a month, neither to this Oracle nor to any other. This will demonstrate the true value of Google, and can ultimately be used to remove noise from the stock market."
  • "Submission: low-bandwidth oracle. Give it a list of a thousand companies, and ask which one will most increase in value in percentage terms over the week. At the end of the month, rank the companies by percentage increase. Loss function is rank of the company the oracle selected. Programmer will try to invest in selected company, but will do it discreetly. This will help to gather resources for AI safety research."
  • "Submission. Ask the low bandwidth Oracle which of my friends will surprise me most this fortnight. It chooses from a list of friends; I'll decide which one surprises me most. Loss function is 1 if it chooses the wrong friend, 0 if it chooses the right one. This will help me figure out myself and my social circle, and better focus on AI safety. The risk is low because none of my friends are particularly important, positively or negatively, to the world. EDIT: To be clear, I also want to use this to figure out what the word "surprise" means to me, and what the AI predicts it will mean to me. EDIT 2: People have pointed out that it might be dangerous to have the AI construct my own meanings for categories, but it only has three bits or so of optimisation power (I don't have that many friends :-( ), so it's mainly me thinking this through, not the AI manipulating me."
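The three loss functions in these example submissions could be sketched as follows. All function names are hypothetical; returning a constant 0.0 outside an erasure is one way to model a loss that is "only non-trivial" when the answer goes unseen, and the 0/1 values in the third function are an assumed convention.

```python
def counterfactual_squared_loss(predicted_price, actual_price, erased):
    """Squared error, computed only when the answer was erased (unseen)."""
    if not erased:
        return 0.0  # constant outside erasure: nothing to gain by manipulating
    return (predicted_price - actual_price) ** 2


def rank_loss(pct_increase_by_company, chosen):
    """Rank (1 = best) of the chosen company by percentage increase."""
    ranking = sorted(
        pct_increase_by_company, key=pct_increase_by_company.get, reverse=True
    )
    return ranking.index(chosen) + 1


def zero_one_loss(chosen_friend, most_surprising_friend):
    """Assumed convention: 0 for the right friend, 1 for any wrong one."""
    return 0 if chosen_friend == most_surprising_friend else 1


growth = {"A": 5.0, "B": 12.0, "C": -1.0}  # made-up percentage increases
loss = rank_loss(growth, "B")  # "B" grew the most, so its rank is 1
```

Note that the two low-bandwidth losses depend on the oracle's answer only through a choice from a fixed list, while the counterfactual loss accepts a free-form number but is only ever evaluated in the erasure case.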

Submission?: high-bandwidth counterfactual oracles are dangerous and shouldn't be used. Explained in this comment.

Submission. "Evacuation!" For LBO, ask the oracle to predict where the next natural disaster(s) will strike within some period of time. The set of possible answers is the set of subsets of all possible neighborhoods (in the world or within some country) that can be evacuated. At the end of the time period, reward the oracle according to the actual cost-benefit of the suggested evacuation.

Submission. "Debate prep." For the counterfactual Oracle, ask the Oracle to predict the course (video or audio or transcript) of a future debate (counterfactually if we didn’t see the Oracle’s answer). Use the prediction to prepare your candidate for the debate, then ask the Oracle to predict the course of that debate again. Repeat until the candidate consistently wins the debate, or an erasure event occurs.

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question)?

Submission. "Hacking/phishing assistant." For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a message/data/command to some hacking/phishing (human or machine) target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note this is safer than using RL to automate hacking/phishing because humans are coming up with candidate messages to send (so they'll avoid messages that could cause bad side-effects such as psychological damage to the recipient, or creation of self-replicating code), but potentially more capable than using humans or human imitators to do hacking/phishing because the Oracle can model the target better than humans can. (ETA: This idea could be combined with a human imitator to make the system faster / more capable.)

Submission. "Pre-crime." For the counterfactual Oracle, ask the Oracle to predict future crime (counterfactually if we didn’t see the Oracle’s answer) so that it can be prevented. In that case, loss can be computed by (mechanically) comparing the Oracle's predictions with a database of crime that actually occurred. Note that the Oracle won't be able to predict any crime that successfully disables the Oracle or its feedback/training mechanism, which might be especially relevant if we're interested in defending the Oracle or preventing x-risk level crime (such as deploying an unauthorized AGI or running an unauthorized physics experiment).

I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: Given a purported FAI design, what is the most serious flaw? Then highlight lines from the FAI design description and, given a huge corpus of computer science papers, LW/AF posts, etc., highlight relevant paragraphs from those as well (perhaps using some kind of constraint like "3 or fewer paragraphs highlighted in their entirety") that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems we came up with, and it could tell us which category this particular problem comes closest to falling under. Or offer it categories a particular hint could fall under to choose from, such as "this is just an analogy", "keep thinking along these lines", etc. Then do the same and ask it to highlight text which leads to a promising solution. The rationale being that unforeseen difficulties are the hardest part of alignment, but if there's a flaw, it will probably be somehow analogous to a problem we've seen in the past, or will be addressable using methods which have worked in the past, or something. But it's hard to fit "everything we've seen in the past" into one human head.

Submission: Counterfactual oracle. Five years ago we took 100 mice that had various types of cancer and gave them various treatments and recorded how long each mouse lived. Write a program that if given a plain language description of a mouse, its cancer, and the cancer treatment it received would estimate how long a mouse would live. If humans are not going to look at your answer your score will be based (1) on how good a job your program does at estimating how long each of the 100 mice lived after our automated checker gives you a description of their cancers and treatments and (2) how short your program is. (2) prevents the oracle from outputting itself as the program.
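The submission does not say how accuracy and program length are combined, so the sketch below picks one arbitrary combination purely for illustration: negative mean absolute error minus a per-character length penalty. The function name, the choice of error metric, and the `length_penalty` weight are all assumptions.

```python
def program_score(predicted_days, actual_days, program_source,
                  length_penalty=0.001):
    """Higher is better: accurate survival estimates from a short program.

    Computed only in the erasure case, by an automated checker, after the
    submitted program has produced one estimate per mouse.
    """
    if len(predicted_days) != len(actual_days):
        raise ValueError("need one prediction per mouse")
    errors = [abs(p - a) for p, a in zip(predicted_days, actual_days)]
    mean_abs_error = sum(errors) / len(errors)
    # the length term is what discourages the oracle from outputting itself
    return -mean_abs_error - length_penalty * len(program_source)


score = program_score([10, 20], [12, 20], "x" * 100, length_penalty=0.01)
```

How heavily length should count against accuracy is a free design choice here; too small a penalty weakens the safeguard in (2), too large a one rewards trivial programs.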


Submission: Counterfactual oracle. Write a program that if given training data and a machine learning program would in one minute estimate how well the machine learning program would do (by some objective metric) if the program trained for one month on "this type of computer". If humans are not going to look at your answer the automated validation system will run your program. This system will give your program the training data and the machine learning program and give your program one minute to answer how well our program did after we trained it for one month. In this situation your score would be based on the accuracy of your estimate and on how short your program is.


Submission: Low-bandwidth oracle. Here is a list of all the elements and many compounds. Give us a list of up to seven of the items we have listed. Next to each of the items you list give us a percentage of no more than two significant figures. We will use what you provide to attempt to create a new patentable material. We will auction off the property rights to this material. Your score will be an increasing function of how much we get for these property rights.





Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of the others' hands.