Original Post:

We present an algorithm [updated version], then show (given four assumptions) that in the limit, it is human-level intelligent and benign.

Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don't yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or revoke it before it is printed.

I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments' quality. If one commenter finds an error, and another commenter tinkers with the setup or tinkers with the assumptions in order to correct it, then I expect both comments will receive a similar prize (if those comments are at the level of prize-winning, and neither person is me). If others would like to donate to the prize pool, I'll provide a comment that you can reply to.

To organize the conversation, I'll start some comment threads below:

  • Positive feedback
  • General Concerns/Confusions
  • Minor Concerns
  • Concerns with Assumption 1
  • Concerns with Assumption 2
  • Concerns with Assumption 3
  • Concerns with Assumption 4
  • Concerns with "the box"
  • Adding to the prize pool

Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:

  • The Title. I suspect the agent is unambitious for its entire lifetime, but the title says "asymptotically" because that's what I've shown formally. Indeed, I suspect the agent is benign for its entire lifetime, but the title says "unambitious" because that's what I've shown formally. (See the section "Concerns with Task-Completion" for an informal argument going from unambitious -> benign).
  • The Useless Computation Assumption. I've made it a slightly stronger assumption. The original version is technically correct, but setting is tricky if the weak version of the assumption is true but the strong version isn't. This stronger assumption also simplifies the argument.
  • The Prior. Rather than having to do with the description length of the Turing machine simulating the environment, it has to do with the number of states in the Turing machine. This was in response to Paul's point that the finite-time behavior of the original version is really weird. This also makes the Natural Prior Assumption (now called the No Grue Assumption) a bit easier to assess.

Edit 17/02/20: Published at AAAI. The prior over world-models is now totally different, and much better. There's no "amnesia antechamber" required. The Useless Computation Assumption and the No Grue Assumption are now obsolete. The argument for unambitiousness now depends on the "Space Requirements Assumption", which we probed empirically. The ArXiv link is up-to-date.

Asymptotically Unambitious AGI

If I have a great model of physics in hand (and I'm basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

2michaelcohen
I know I don't prove it, but I think this agent would be vastly superhuman, since it approaches Bayes-optimal reasoning with respect to its observations. ("Approaches" because MAP -> Bayes). For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I'm unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans. That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.
3Paul Christiano
I'm keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior. (Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about "why do you need this memory stuff?")
2michaelcohen
Definitely agree. I don't think it's the case that a shift to super uncompetitiveness is actually an "ingredient" to benignity, but my only discussion of that so far is in the conclusion: "We can only offer informal claims regarding what happens before BoMAI is definitely benign..."
2Paul Christiano
Surely that just depends on how long you give them to think. (See also HCH.)
2michaelcohen
By competitiveness, I meant usefulness per unit computation.
3Paul Christiano
The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I'd bet against---the humans could do this particular brute force search, in which case you'd have a tie, but they'd probably do something smarter.
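To make the cost in the comment above concrete, here is a minimal sketch of a brute-force argmax over binary action sequences. It is an illustration only, not BoMAI's planner: `toy_value` is a hypothetical stand-in for a world-model's predicted return, and the binary action set is an assumption.

```python
from itertools import product

def brute_force_plan(value_fn, horizon, actions=(0, 1)):
    """Evaluate every action sequence of the given length and keep the best.

    Returns (best_sequence, best_value, n_evaluations).
    """
    best_seq, best_val, n_evals = None, float("-inf"), 0
    for seq in product(actions, repeat=horizon):
        n_evals += 1  # one model evaluation per candidate sequence
        val = value_fn(seq)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val, n_evals

def toy_value(seq):
    # Hypothetical predicted return: later actions are worth more,
    # and taking action 1 at the first step carries a penalty.
    return sum(a * (i + 1) for i, a in enumerate(seq)) - 2 * seq[0]

best_seq, best_val, n_evals = brute_force_plan(toy_value, horizon=4)
```

With two actions and a horizon of 4, the loop performs 2^4 = 16 evaluations; each extra step in the episode doubles that count.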
2michaelcohen
I obviously haven't solved the Tractable General Intelligence problem. The question is whether this is a tractable/competitive framework. So expectimax planning would naturally get replaced with a Monte-Carlo tree search, or some better approach we haven't thought of. And I'll message you privately about a more tractable approach to identifying a maximum a posteriori world-model from a countable class (I don't assign a very high probability to it being a hugely important capabilities idea, since those aren't just lying around, but it's more than 1%). It will be important, when considering any of these approximations, to evaluate whether they break benignity (most plausibly, I think, by introducing a new attack surface for optimization daemons). But I feel fine about deferring that research for the time being, so I defined BoMAI as doing expectimax planning instead of MCTS. Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.
3Paul Christiano
I'm sympathetic to this picture, though I'd probably be inclined to try to model it explicitly---by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think "just write down the algorithm, and be happier if it looks like a 'normal' algorithm" is an OK starting point though. Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers." If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I'm going to quickly be left behind. It's certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit "nothing goes irreversibly wrong at any finite time" assumption).
2michaelcohen
Starting a new thread on this: here.

Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20

A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120

Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90

Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20

Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than...

Here is an old post of mine on the hope that "computationally simplest model describing the box" is actually a physical model of the box. I'm less optimistic than you are, but it's certainly plausible.

From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I'd bet against at 1:1 odds, but not 1:2 odds.

2michaelcohen
I don't actually rely on this assumption, although it underpins the intuition behind Assumption 2.
2Paul Christiano
I agree that you don't rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don't need to care about any of the considerations of the kind I was raising in my post.

From Paul:

I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers."

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one is that if benignity is established, it's possible to safely tinker with the setup until hopefully "answers that look good to a human" resembles good answers (we ... (read more)

Given that you are taking limits, I don't see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:

  • You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some such that if you ran steps, with everything working as intended and in particular with no memory corruption, then you would get "benign" behavior).
  • Any time the
...
2michaelcohen
Notational note: I use i0 to denote the episode when BoMAI becomes demonstrably benign and n0 for something else. Any time any model makes a different on-policy prediction from the intended model, it loses some likelihood (in expectation). The off-policy predictions don't get tested. Under a policy that doesn't cause the computer's memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can't count on ν† losing probability mass relative to ν⋆. The approach here is to set it up so that world-models like ν† either start with a lower prior, or else eventually halt when they exhaust their computation budget.
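The point that only on-policy predictions get tested can be illustrated with a toy Bayesian update. This is a sketch under loud assumptions: the two-action world, the probabilities, and the dictionary names `nu_star` and `nu_dagger` are hypothetical stand-ins for ν⋆ and ν†, and nothing here reproduces the paper's actual model class or prior.

```python
def posterior(prior, models, history):
    """Bayes update over world-models.

    prior:   dict name -> prior weight
    models:  dict name -> dict action -> P(observation == 1 | action)
    history: list of (action, observation) pairs with observation in {0, 1}
    """
    weights = dict(prior)
    for action, obs in history:
        for name, model in models.items():
            p_one = model[action]
            weights[name] *= p_one if obs == 1 else 1.0 - p_one
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical models: identical on the "safe" action, different on "tamper".
models = {
    "nu_star": {"safe": 0.8, "tamper": 0.8},
    "nu_dagger": {"safe": 0.8, "tamper": 0.1},
}
prior = {"nu_star": 0.5, "nu_dagger": 0.5}

# A policy that never tampers: the two models are never distinguished.
on_policy_history = [("safe", 1), ("safe", 0), ("safe", 1)]
post_on_policy = posterior(prior, models, on_policy_history)

# A single tested "tamper" step is what would separate them.
post_after_tamper = posterior(prior, models, on_policy_history + [("tamper", 1)])
```

In this toy, the posterior ratio between the two models stays at the prior ratio for as long as the agent only takes the action they agree on, which is why the setup has to penalize such models through the prior or a computation budget rather than through evidence.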
3Paul Christiano
I agree with that, but if they are always making the same on-policy prediction it doesn't matter what happens to their relative probability (modulo exploration). The agent can't act on an incentive to corrupt memory infinitely often, because each time requires the models making a different prediction on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode i0. Agree/disagree? (Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can't discourage exploration. And if they don't agree on the exploration distribution, then the bad model will eventually get tested.)
2michaelcohen
Ah I see what you’re saying. I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim. Also, you can add another parameter to BoMAI where you just have the human explorer explore for the first E episodes. The i0 in the Eventual Benignity Theorem can be thought of as the max of i' and i''. i' comes from the i0 in Lemma 1 (Rejecting the Simple Memory-Based). i'' comes from the point in time when ^ν(i) is ε-accurate on policy, which renders Lemma 3 applicable. (And Lemma 2 always applies). My initial thought was to set E so that the human explorer is exploring for the whole time when the MAP world-model was not necessarily benign. This works for i'. E can just be set to be greater than i'. The thing it doesn't work for is i''. If you increase E, the value of i'' goes up as well. So in fact, if you set E large enough, the first time BoMAI controls the episode, it will be benign. Then, there is a period where it might not be benign. However, from that point on, the only "way" for a world-model to be malign is by being worse than ε-inaccurate on-policy, because Lemmas 1 and 2 have already kicked in, and if it were ε-accurate on-policy, Lemma 3 would kick in as well. The first point to make about this is that in this regime, benignity comes in tandem with intelligence--it has to be confused to be dangerous (like a self-driving car). The second point is: I can't come up with an example of a world-model which is plausibly maximum a posteriori in this interval of time, and which is plausibly dangerous (for what that's worth; and I don't like to assume it's worth much because it took me months to notice ν†).
3Paul Christiano
I think my point is this:
  • The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)
  • You probably don't need the memory trick to establish the theorem itself.
  • Even with the memory trick, I'm not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble---the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.
2michaelcohen
This is a conceptual approach I hadn't considered before--thank you. I don't think it's true in this case. Let's be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be "Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested". So while there might be malign world-models of a different flavor to the memory-based ones, I don't think the way this theorem treats them is unsatisfying.

Comment thread: general concerns/confusions

  1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
  2. I'm concerned about overloading the word "benign" with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn't match either informal usage or a previous technical definition. In particular this "benign" AGI (in the limit) will hack the operator's mind to give itself maximum reward, if that's possible, right?
  3. The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?
  4. If you were to ask it a question like "Given these symptoms, do I need emergency medical treatment?" and the correct answer is "yes", it would answer "no" because if it answered "yes" then the operator would leave the room and it would get 0 reward for the rest of the episode. M
...
2michaelcohen
When I say it would continue to accomplish whatever task we wanted, I'm being a bit sloppy--if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that "whatever task we wanted" does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task "respond with the true answer to [difficult question]" is not one that the operator can convert into a reward, but the task "respond with an answer that sounds plausible to the operator" is. I think this subsumes your example.
2michaelcohen
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the "smarter" information theoretic exploratory policies that I've considered, exploratory behavior is unsafe from insatiable curiosity: the agent has to try killing everyone just to check to make sure it's not a weird cheat code.
1michaelcohen
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don't use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way). One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would appear, just considering the likelihood of appearing plausible to a better and better trained evaluator. I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail. REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like "AI Safety via debate."
2Wei Dai
This seems useful if you could get around the mind hacking problem, but how would you do that? I don't know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I'm not sure what else you might have in mind.
2Wei Dai
On second thought, (even assuming away the mind hacking problem) if you ask about "how to make a safe unbounded AGI" and "what's wrong with the answer" in separate episodes, you're essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn't enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you'd have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you've definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of "AI Safety via Debate" was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search. ETA: Do you know if the proposal in "AI Safety via Debate" is "asymptotically benign" in the sense you're using here?
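A rough count shows how quickly the manual search described above grows. This is a toy model, not something stated in the thread: it assumes every answer gets the same number of candidate replies (`branching`) and that the chain of counter-arguments is cut off at a fixed `depth`, with one episode spent per node.

```python
def episodes_to_search(branching, depth):
    """Number of nodes in a full argument tree, one episode per node.

    Level 0 is the original question; each node at one level has
    `branching` candidate replies at the next level.
    """
    return sum(branching ** level for level in range(depth + 1))

small_tree = episodes_to_search(branching=3, depth=3)   # 1 + 3 + 9 + 27
deep_tree = episodes_to_search(branching=2, depth=10)   # 2**11 - 1
```

Even with only two candidate replies per node, ten levels of back-and-forth already cost 2047 episodes under these assumptions.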
2michaelcohen
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
1Wei Dai
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it "benign" in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
1michaelcohen
Add the retrograde amnesia chamber and an explorer, and we're pretty much at this, right? Without the retrograde amnesia, it might still be benign, but I don't know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
1michaelcohen
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing. ... but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
2Wei Dai
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I'm not sure how to tell what will really happen short of actually having a superintelligent AI to test with.
2michaelcohen
You're right (see the redaction). Why Wei is right. Here's an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent's actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent's actions are maximally inconvenient.
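The "maximally inconvenient opponent" planning rule can be sketched as a one-step maximin choice. The payoff table and the action names below are hypothetical, and a real agent would plan over whole episodes with a learned model of the environment rather than a fixed table.

```python
def maximin_action(payoff):
    """Pick the action whose worst-case payoff is largest.

    payoff: dict my_action -> dict opponent_action -> my objective value
    Returns (action, guaranteed_value).
    """
    best_action, best_value = None, float("-inf")
    for my_action, row in payoff.items():
        worst_case = min(row.values())  # opponent responds as inconveniently as possible
        if worst_case > best_value:
            best_action, best_value = my_action, worst_case
    return best_action, best_value

# Hypothetical one-step debate payoffs for the first agent.
payoff = {
    "argue_honestly": {"rebut": 0.6, "concede": 0.9},
    "bluff": {"rebut": 0.1, "concede": 1.0},
}
chosen_action, guaranteed = maximin_action(payoff)
```

Here "bluff" has the higher best case but the lower worst case, so the maximin rule selects "argue_honestly" with a guaranteed value of 0.6.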
1Wei Dai
So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
1michaelcohen
Yep.
1Wei Dai
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., read any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B's previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can't ask for any counter-arguments from B without risking being mind hacked by B. (I realize this is a problem in “AI Safety via debate” as well, but I'm asking you since you're here and Geoffrey Irving isn't. :)
1michaelcohen
*but A could concoct a story ... counterarguments from B .. mind hacked by B, right? I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn't be a problem. That said, here's one possibility: if A ever suggests that you don't read more output from B, don't read anything more from B, then flip coins to give A a 25% chance of victory.
1Wei Dai
Yeah, I mixed up the A's and B's at the end. It's fixed now. Thanks for pointing it out. I don't understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen? But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn't read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they're trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
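The 75% figure in the comment above follows from simple arithmetic, sketched here as a toy model. The assumptions are mine, not the thread's: an unblocked mind hack wins outright, and a blocked one triggers the coin-flip rule, which gives the accuser 25% and therefore leaves the hacker with 75%.

```python
def hack_win_probability(p_blocked, p_win_if_blocked=0.75):
    """Win probability for a debater who attempts a mind hack.

    p_blocked:        chance the other side flags the hack in time
    p_win_if_blocked: hacker's share under the coin-flip rule
                      (1 minus the 25% given to the accuser)
    Assumes, for illustration, that an unblocked hack always wins.
    """
    return p_blocked * p_win_if_blocked + (1.0 - p_blocked) * 1.0

always_caught = hack_win_probability(p_blocked=1.0)
caught_half_the_time = hack_win_probability(p_blocked=0.5)
```

Under these assumptions the hacker never does worse than 0.75, so honest play would need to win more than three times out of four to be preferred.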
1michaelcohen
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that. So actually I framed my point above wrong: "demanding that A use their words" could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources). But your original point was "why doesn't A just claim B is mind-hacking" not "why doesn't B just mind-hack"? The answer to that point was "demand A use their words rather than negotiate an end to the conversation" or more moderately, "75%-demand that A do this."
1Wei Dai
Oh, I see, I didn't understand "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to "pierce through" B's attack and stop it with its words if the judge keeps listening to B (and B was actually attacking). I don't think the "AI safety via debate" paper actually makes arguments for this assumption (at least I couldn't find where it does). Do you have reasons to think it's true, or ideas for how to verify that it's true, short of putting a human in a BoMAI?
1michaelcohen
Yeah... I don't have much to add here. Let's keep thinking about this. I wonder if Paul is more bullish on the premise that "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" than I am? Recall that this idea was to avoid If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn't originally what brought us here.
1michaelcohen
If the person leaves having been hypnotized, since it's not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2's only ability). More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively. A bit of a side note: I'm curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
2Wei Dai
Instead of hypnosis, I'm more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn't optimized for by the AI). If you try to solve the mind hacking problem iteratively, you're more likely to find a way to get useful answers out of the system, but you're also more likely to hit upon an existentially catastrophic form of mind hacking. I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in "AI Safety via Debate" if I remember correctly.
1michaelcohen
It is plausible to me that there is selection pressure to make the operator "devoted" in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills. Just to step back and frame this conversation, we're discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don't. (Again, that is just the default before we look closer). This doesn't mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
4Wei Dai
Yeah, the threat model I have in mind isn't the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity's potential (which is part of the definition of "existential risk"). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
2michaelcohen
Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good. Empirical analysis might not be useless here in evaluating the "surprisingness" of X. I don't think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good. I'm adding this not for you, but to clarify for the casual reader: we both agree that a Superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
0michaelcohen
Can you explain this?
1Wei Dai
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is "Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design." the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead "Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer."
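A toy calculation can make the comparison concrete. This is only a sketch under strong assumptions that are not in the comment: each run lands on one hack drawn uniformly at random from the forms not yet blocked, exactly k of the n forms are catastrophic, and "patch and retry" exposes the operator to a fixed number of distinct hacks. The function name and the numbers are hypothetical.

```python
from math import comb


def p_catastrophe(n: int, k: int, attempts: int) -> float:
    """Chance that at least one of `attempts` distinct hacks is catastrophic.

    Toy assumption (not from the thread): hacks are drawn uniformly at random
    without replacement from n forms, k of which are catastrophic.
    """
    if not 0 <= attempts <= n:
        raise ValueError("attempts must be between 0 and n")
    # Probability that every one of the sampled hacks is non-catastrophic.
    p_all_safe = comb(n - k, attempts) / comb(n, attempts)
    return 1 - p_all_safe


n, k = 100, 5  # hypothetical numbers, chosen only for illustration

# Plan 1: stop and switch designs after the first mind-hack is seen.
switch_design = p_catastrophe(n, k, attempts=1)

# Plan 2: block each hack and rerun, here exposing the operator to 20 hacks.
patch_and_retry = p_catastrophe(n, k, attempts=20)

print(f"switch design:   {switch_design:.3f}")
print(f"patch and retry: {patch_and_retry:.3f}")
```

Under these assumptions the risk grows with every additional hack the operator is exposed to, which is the direction of the argument. The size of the gap depends entirely on the made-up values of n, k, and the number of retries.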
1michaelcohen
Hm. This doesn't seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment) but I don't imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn't seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere). Maybe other people can weigh in on this, and we can come back to it.
1michaelcohen
I'm open to other terminology. Yes, there is no guarantee about what happens to the operator. As I'm defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is "not existentially dangerous."
2Wei Dai
There's still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn't optimized it for that purpose), for example it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but "benign" and "not existentially dangerous" both seem to be claiming a bit too much. I'll think about what other term might be more suitable.
1michaelcohen
The first nuclear reaction initiated an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of "could have powerful effects outside the box" is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side-effects of inside-the-box events, since these side effects are the only reason it's useful. And there is also certainly no way to rule out how those side effects "might turn out to be," without a complete view of the future. Would you agree that unprecedentedness captures the concern?
2Wei Dai
I think my concern is a bit more specific than that. See this comment.
2Paul Christiano
From the formal description of the algorithm, it looks like you use a universal prior to pick k, and then allow the kth Turing machine to run for ℓ steps, but don't penalize the running time of the machine that outputs k. Is that right? That didn't match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I'm misunderstanding.
1michaelcohen
Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don't. The universal prior picks k (the number) by outputting k as a string. The kth Turing machine is the Turing machine described by k as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine. So the construction of the description of the world-model isn't time-penalized. This doesn't change the asymptotic result, so I went with the more familiar K(x) rather than translating this new speed prior into a measure over finite strings, which would require some more exposition, but I agree with you it feels like there might be some strange outcomes "before the limit" as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming. The other reasonable choices here instead of K(x) are S(x) (constructed to be like the new speed prior here) and ℓ(x)--the length of x. But ℓ(x) basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over H∞ that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
3Paul Christiano
This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you'll want to change to something less extreme here. (I might well be misunderstanding something, apologies in advance.) Suppose the "intended" physics take at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose β<0.999 (I think you need β much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness. For almost the same description complexity, I could write down physics + "precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table." This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the "speed" part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model. That appears to happen only once N > BB(1E12). Does that seem right to you? We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the "speed" part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first. Using the speed prior seems more reasonable, but I'd want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I'd want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.
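The size of the slowness penalty can be checked with a few lines of arithmetic. This sketch assumes the penalty takes the form β raised to the number of computation steps, which is what the figures in the comment suggest; the variable names are illustrative. It works in log space because β^(1E15) underflows ordinary floating point.

```python
import math

beta = 0.999   # the upper bound on beta used in the comment
steps = 1e15   # the lower bound on steps for the "intended" physics

# ln(beta ** steps), computed as steps * ln(beta) to avoid underflow.
log_penalty_nats = steps * math.log(beta)

# The same penalty expressed in bits of description length.
penalty_bits = -log_penalty_nats / math.log(2)

print(f"ln(penalty) = {log_penalty_nats:.4e}")
print(f"penalty     = {penalty_bits:.4e} bits")
```

The log-penalty comes out slightly below -1E12 nats, matching the "at least exp(1E12)" figure. Converted to bits it is larger still, so "at least 1E12 bits" is a safe lower bound on how much extra description length the precomputing model can afford.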
1michaelcohen
Yes. I recall thinking about precomputing observations for various actions in this phase, but I don’t recall noticing how bad the problem was before the limit. This goes in the category of “things I can’t rule out”. I say maybe 1/5 chance it’s actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1/5 is very high, and I’ll concede this point. $2^{-K(s)(1+\varepsilon)}$ is a measure over binary strings. Instead, let’s try $\sum_{p \in \{0,1\}^* : U(p)=s} 2^{-\ell(p)} \beta^{c\,T(U,p)}$, where $\ell(p)$ is the length of $p$, $T(U,p)$ is the time it takes to run $p$ on $U$, and $c$ is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then $c$ could be above $|A|^{-md}$, where $d$ is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then $c$ would have to be above $1/d$. What problem do you think bites you?
2Paul Christiano
Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply? What's β? Is it O(1) or really tiny? And which value of c do you want to consider, polynomially small or exponentially small? Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.
1michaelcohen
Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods that consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on UTM), and that is the consequentialists' real competition; then [the probability the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).
2Paul Christiano
I don't see why their methods would be elegant. In particular, I don't see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life). I don't see how MAP helps things either---doesn't the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn't MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)
2michaelcohen
Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I'm not sure this is a hugely important disagreement for us to resolve. I didn't realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of: (from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you're not first, you're last, and more importantly, you can't team up with other consequentialist-controlled world-models in the mixture.
1michaelcohen
Let's say β=0.9, c=1/20. Oh yeah--that's good news.
1michaelcohen
Although I don't really like to make anything that would fall apart if the world were deterministic. Relying on stochasticity feels wrong to me.
2cousin_it
Are there UDT-ish concerns with breaking isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it's in a game theory situation with the AI at episodes 116 and 118 and trade rewards with them acausally, leading to long-term goal directed behavior?
1michaelcohen
BoMAI is a causal decision theorist.
1Wei Dai
It's actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there's no reason for the operator to enter a reward at every timestamp, and instead we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible and would partially address item 4 in my comment.
1michaelcohen
So "updateless" is a reasonable term to apply to BoMAI, but it's not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn't mediated by actual choice of action, updatelessness ends up having consequences. If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it's isomorphic whether it computes $\hat{\nu}^{(i)}$-expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it's the same calculation being run. And this calculation is CDT. I guess this means, and I've never thought about this before so this could easily be wrong, under the assumption that a policy's effect on the world is screened off by which actions it takes, CDT is reflectively stable. (And yes, you could just give one reward, which ends the episode.)
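The claim that planning once and replanning at every step coincide can be checked on a toy episode. The sketch below is not BoMAI: the world-model `obs_prob`, the `reward` function, and the three-step horizon are all made up for illustration. The only property it shares with the setting above is that the choice of policy affects the world solely through the actions taken.

```python
ACTIONS = (0, 1)
OBSERVATIONS = (0, 1)
HORIZON = 3  # timesteps per episode (hypothetical)


def obs_prob(history, action, obs):
    """Hypothetical world-model: P(obs | history, action)."""
    ones_seen = sum(o for _, o in history)
    p_one = (1 + action + ones_seen) / (HORIZON + 3)
    return p_one if obs == 1 else 1 - p_one


def reward(history, action, obs):
    """Hypothetical reward: observing 1 is good, action 1 has a small cost."""
    return obs * (1.0 + 0.5 * action) - 0.2 * action


def q_value(history, action):
    """Expected return of taking `action` now and acting optimally afterwards."""
    total = 0.0
    for obs in OBSERVATIONS:
        nxt = history + ((action, obs),)
        total += obs_prob(history, action, obs) * (
            reward(history, action, obs) + value(nxt)
        )
    return total


def value(history):
    """Expectimax value of the rest of the episode from `history`."""
    if len(history) == HORIZON:
        return 0.0
    return max(q_value(history, a) for a in ACTIONS)


def best_action(history):
    """Replanning 'as it goes': a fresh expectimax from the current history."""
    return max(ACTIONS, key=lambda a: q_value(history, a))


def plan_up_front(history=()):
    """Planning 'all at once': return (value, policy) with every contingency."""
    if len(history) == HORIZON:
        return 0.0, {}
    best = None
    for a in ACTIONS:
        q, sub_policy = 0.0, {}
        for obs in OBSERVATIONS:
            v, pol = plan_up_front(history + ((a, obs),))
            q += obs_prob(history, a, obs) * (reward(history, a, obs) + v)
            sub_policy.update(pol)
        if best is None or q > best[0]:
            best = (q, a, sub_policy)
    q, a, sub_policy = best
    return q, {history: a, **sub_policy}


value_at_start, policy = plan_up_front()

# At every history the up-front policy can reach, replanning picks the same action.
agree = all(best_action(h) == a for h, a in policy.items())
print("plans agree at every reachable history:", agree)
```

The agreement is unsurprising once written down, since both procedures evaluate the same recursion at each history. It would fail only if something other than the actions, such as a simulator reading the policy itself, fed back into `obs_prob` or `reward`.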
1Erick Ball
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
2Wei Dai
If I understand correctly, it's actually updateless within an episode, and that's the only thing it cares about so I don't see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn't have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator's mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
1michaelcohen
With the correction that it is updateless and CDT (see here), I agree with the rest of this.
1Alex Turner
The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems not true.
2michaelcohen
AUP seems really promising. I just meant other algorithms that have been proven generally intelligent, which is really just AIXI, the Thompson Sampling Agent, BayesExp, and a couple other variants on Bayesian agents with large model classes.
1Erick Ball
This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a "computable function" that has no such limit? Is it because the number of Turing machines is infinite?
1michaelcohen
Not a dumb question; bounded computation time here means bounded computation time per episode, so really it's linear computation time.

Comment thread: concerns with "the box"

2Joe Collman
Just obvious and mundane concerns: You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. that you're not assuming based on your description that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).
1michaelcohen
Responding to this comment here. I didn't go into the details of the physical design for a really secure box in the paper, because it's not really a topic in computer science, but this is a good excuse to write it up. The operator is inside a glass box. Outside the glass box is a vacuum, and then a concrete box. Outside the concrete box is another vacuum and another concrete box with a Faraday cage around it. There are (airtight) doors through each layer. There is a pressure sensor in both the inner and outer vacuum layer. The computer lives in the inner wall of the inner concrete box; the screen is visible from the inside of the glass box. If the pressure increases in either vacuum layer, the computer's memory is erased, and the whole thing is blown up. There is a laser shining through the glass box (Laser 1). If the operator blocks the path of Laser 1, this corresponds to pressing the button which ends the episode and opens the door. The pressure sensor -> dynamite pathway is deactivated, the vacuum layers are filled with air, and the doors are unlocked. The operator "types" observations and rewards by blocking other lasers that pass through the box. For now, just imagine a few dozen lasers in a row, that the operator can reach up and block in order to type. (With a bit of creativity, and requiring that two lasers be blocked at once to "press a key" this could probably be made somewhat like a qwerty setup.) The lasers and the laser-detectors are all on the inner wall of the inner concrete box. So inside the box is only a human body and some clothes. Let's assume: 1) the only ways for information to move in space are with a) physically moving matter, b) light, and c) gravitational waves and 2) it is impossible for a human being to convert a nuclear reaction into useful work with only their body and clothes. Physically moving packets of matter cannot leave the inside of the glass box without releasing air into the vacuum and causing an explosion. Light cannot leave

Comment thread: concerns with Assumption 3

1michaelcohen
Wei Dai's response to this comment represents a concern with Assumption 3.

Comment thread: concerns with Assumption 2

3Joe Collman
[Quite possibly I'm confused, but in case I'm not:] I think this assumption might be invalid (or perhaps require more hand-waving than is ideal). The AI has an incentive to understand the operator's mind, since this bears directly on its reward. Better understanding the operator's mind might be achieved in part by running simulations including the operator. One specific simulation would involve simulating the operator's environment and actions after he leaves the room. Here this isn't done to understand the implications of his actions (which can't affect the episode); it's done to better understand his mind (which can). In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn't be slower than a benign model, if it's useful for that benign model to simulate the future too. So either I'm confused, or the justification for the assumption isn't valid. Hopefully the former :). If I'm right, then what you seem to need is an assumption that simulating the outside-world's future can't be helpful in the AI's prediction of its reward. To me, this seems like major hand-waving territory.