Donald Hobson

MMath Cambridge. Currently studying postgrad at Edinburgh.




Yes. I was assuming a standard conditional for the button. 

I can't currently see any flaws with the CDT style, other than:

  1. Subagents believe in a world where buttons magically press themselves, so this design can't make coherent statements about the probability that the button will be pressed (one AI believes it's 1, the other that it's 0).
  2. These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself, and they have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have zero concern about humans pressing the button, but lots of concern about humans hiding it. This design really wants to know whether the button magically presses itself, and humans could cut the wires, stand between the button and the camera, etc.

Suppose these 3 scenarios are judged as equally likely.

A) Button pressed. U_1=U_2=0

B) Button not pressed. U_1=U_2=0

C) AI gets control of button. U_1=6, U_2=-6


Suppose both agents agree that in option C, they should press the button. (This is the single bit of influence they have in this toy problem.)

Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0

Agent 2 has remaining uncertainty: E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+(-6))=-3


Change it around, so the AI decides not to press the button in case C, and

E(U_1|not pressed)=3, E(U_2|pressed)=0.
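These conditional expectations can be checked mechanically. A minimal sketch, using the scenario labels and utilities from the toy problem above (with the agents choosing to press the button in option C):

```python
# Three equally likely scenarios: (name, button pressed, U_1, U_2).
# In option C the agents choose to press the button.
scenarios = [("A", True, 0, 0), ("B", False, 0, 0), ("C", True, 6, -6)]

def cond_exp(utility_index, pressed):
    """Expected utility conditional on the button's final state."""
    vals = [s[2 + utility_index] for s in scenarios if s[1] == pressed]
    return sum(vals) / len(vals)

print(cond_exp(0, False))  # E(U_1 | not pressed) = 0.0
print(cond_exp(1, True))   # E(U_2 | pressed) = (0 + -6) / 2 = -3.0
```

Flipping scenario C to "not pressed" reproduces the second case in the text: E(U_1|not pressed)=3 and E(U_2|pressed)=0.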

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.

I disagree. 

Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips and becomes harder to shut down.

The AI has been entered by someone else; not playing isn't an option.

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed, and this scenario has more clips than most, so it increases the average.

For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility. 

So both subagents agree on a plan that conditionally manipulates the button.
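A toy numerical version of this argument (all numbers are invented for illustration): a 50% lottery, with winning worth 10 clips versus 2 to the paperclip subagent, and a won lottery making pressed-worlds worth -5 instead of 0 to the shutdown subagent. Making the button conditional on the lottery raises both subagents' conditional expectations:

```python
# Worlds are (probability, lottery won, button pressed); numbers invented.
def cond_exps(press_prob_if_win):
    worlds = []
    for won, p_won in [(True, 0.5), (False, 0.5)]:
        p_press = press_prob_if_win if won else 0.5
        worlds.append((p_won * p_press, won, True))
        worlds.append((p_won * (1 - p_press), won, False))
    u_clip = lambda won: 10 if won else 2   # clips subagent's utility
    u_shut = lambda won: -5 if won else 0   # shutdown subagent's utility
    e_clip = (sum(p * u_clip(w) for p, w, pr in worlds if not pr)
              / sum(p for p, w, pr in worlds if not pr))
    e_shut = (sum(p * u_shut(w) for p, w, pr in worlds if pr)
              / sum(p for p, w, pr in worlds if pr))
    return e_clip, e_shut

print(cond_exps(0.5))  # button independent of lottery: (6.0, -2.5)
print(cond_exps(0.0))  # "if we win, block the button": both go up
```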

Take this set of exponentially many points. Is there any function such that all exponentially many XOR combos can be found by a linear probe?

This is a question of pure maths; it involves no neural networks. And I think it would be highly informative.
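One tiny concrete instance (my example, not from the question above): with the identity embedding of {0,1}^2, no linear probe computes x1 XOR x2, because any linear function satisfies f(0,0)+f(1,1)=f(0,1)+f(1,0), while XOR needs the 01/10 pair above a threshold that the 00/11 pair stays below. A randomized sanity check:

```python
import itertools
import random

def linear(w, b, x):
    # A linear probe: w . x + b, thresholded at 0.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

random.seed(0)
points = list(itertools.product([0, 1], repeat=2))
target = [p[0] ^ p[1] == 1 for p in points]  # labels for x1 XOR x2
for _ in range(10_000):
    w = [random.uniform(-5, 5) for _ in range(2)]
    b = random.uniform(-5, 5)
    labels = [linear(w, b, p) > 0 for p in points]
    assert labels != target  # no random probe ever matches XOR
```

The open question is whether some cleverer embedding makes all exponentially many XOR combinations simultaneously linearly decodable; the check above only illustrates the failure of the trivial embedding.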

You are making the structure of time into a fundamental part of your agent design, not a contingency of physics.

Let an aput be an input or an output. Let a policy be a subset of possible aputs. Some policies are physically valid.

I.e. a policy must have the property that, for each input, there is a single output. If the computer is reversible, the policy must be a bijection from inputs to outputs. If the computer can create a contradiction internally, stopping the timeline, then a policy must be a map from inputs to at most one output.

If the agent is actually split into several pieces with lightspeed and bandwidth limits, then the policy mustn't use info it can't have. 

But these physical details don't matter. 

The agent has some set of physically valid policies, and it must pick one. 
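A minimal sketch of these validity conditions, representing a policy as a set of (input, output) pairs. All function names here are mine, purely for illustration:

```python
# A policy as a set of (input, output) pairs over finite input/output sets.
def outputs_for(policy, inp):
    return [o for i, o in policy if i == inp]

def is_valid_standard(policy, inputs):
    # Ordinary computer: exactly one output per input.
    return all(len(outputs_for(policy, i)) == 1 for i in inputs)

def is_valid_reversible(policy, inputs):
    # Reversible computer: a bijection from inputs to outputs.
    outs = [o for _, o in policy]
    return is_valid_standard(policy, inputs) and len(set(outs)) == len(outs)

def is_valid_halting(policy, inputs):
    # Computer that can derive a contradiction ("stop the timeline"):
    # at most one output per input.
    return all(len(outputs_for(policy, i)) <= 1 for i in inputs)

swap = {("a", 1), ("b", 0)}        # bijection on inputs {"a", "b"}
constant = {("a", 0), ("b", 0)}    # valid function, but not reversible
```

The lightspeed/bandwidth constraints in the distributed case would add further conditions on which aputs a policy may correlate, but as the comment says, those physical details don't change the basic picture.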

Suppose Bob is a baker who has made some bread. He can give the bread to Alice, or bin it. 

By the ROSE value, Alice should pay $0.01 to Bob for the bread.

How is an honest baker supposed to make a profit like that?

But suppose, before the bread is baked, Bob phones Alice. 

"Well the ingredients cost me $1" he says, "how much do you want the bread?"

If Alice knows pre-baking that she will definitely want bread, she would commit to paying $1.01 for it, if she valued the bread at at least that much. If Alice has a 50% chance of wanting bread, she could pay $1.01 with certainty, or equivalently pay $2.02 in the cases where she did want the bread. The latter makes sense if Alice only pays in cash and will only drive into town if she does want the bread.

If Alice has some chance of really wanting bread, and some chance of only slightly wanting bread, it's even more complicated. The average bill across all worlds is $1.01, but each alternate version of Alice wants to pay less than that.
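The arithmetic for the 50% case above: a flat $1.01 commitment and a $2.02 payment made only in want-bread worlds cost Alice the same in expectation.

```python
# Alice's two payment schemes from the example above.
p_want = 0.5          # chance Alice turns out to want bread
pay_always = 1.01     # flat commitment, paid in every world
pay_if_want = 2.02    # paid only in the worlds where she wants bread

expected_flat = pay_always
expected_conditional = p_want * pay_if_want
assert abs(expected_flat - expected_conditional) < 1e-9  # same average bill
```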

Skimmed the paper. Can't find a clear definition of what this ALU actually measures. So I don't know whether this is impressive or not. (It's too short to be searchable)

EDIT: I now consider this whole approach to be irreparably flawed. 

Imagine there was a 50% chance the button would be pressed, and also that the AI was in some bet that would pay out paperclips or staples based on a coin flip.

In scenario 1, this AI has an option of setting up a mechanism, external to itself, that controls the coin flip based on if the button is pressed. 

In scenario 2, the AI can set up a mechanism to control the button based on the coin flip.

When only considering actions, probabilities and utilities, these look identical. But we want the AI to behave differently, so we must make use of the arrow of time, the structure of cause and effect, in a way we aren't doing here.
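A sketch of why the two scenarios are indistinguishable at this level of description (a toy construction of mine): compute the joint distribution over (button, coin) in each scenario; only the direction of the causal mechanism differs.

```python
def joint(cause_dist, mechanism):
    # cause_dist: distribution over the causally upstream variable.
    # mechanism: deterministic function from cause to effect.
    return {(c, mechanism(c)): p for c, p in cause_dist.items()}

# Scenario 1: the button (50/50) causes the coin outcome.
s1 = joint({True: 0.5, False: 0.5},
           lambda pressed: "clips" if pressed else "staples")

# Scenario 2: the coin (50/50) causes the button state.
s2 = joint({"clips": 0.5, "staples": 0.5}, lambda coin: coin == "clips")
s2 = {(b, c): p for (c, b), p in s2.items()}  # reorder to (button, coin)

assert s1 == s2  # identical joints; only the causal arrow differs
```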

End edit.


To make this work, you would need a utility function for shutting off, including all subagents. 

Let's suppose that, other than these robots, there are ~0 transistors in your house. Then we can define shutting off as minimizing the number of transistor flips in your house.

So you make 

A=at least 1 coffee

B=1/transistor flips.

Note that, when imagining any potential future world where the switch isn't pressed, the AI has no reason to bother counting transistor flips. And in potential futures where the switch is pressed, it doesn't need to know what coffee is.

What it cares about are E(A|not pressed) and E(B|pressed).

Let's list the robot's options.

Let A=0 without coffee, and A=1 with coffee.

In worlds where the switch is pressed, the robot turns itself off immediately once it is fairly sure the switch will be pressed, making B large. In worlds where it doesn't turn off, more transistors flip, making B smaller. Because turning off gives the higher B in both cases, and we only select from the Pareto frontier, whenever the switch is pressed, it will turn off.

Let's apply your utility penalties, by putting them in A and B, i.e. in A or B depending on the switch.

TC) Press switch, avoid cat. Switch pressed. So  . 

PC) Prevent switch, avoid cat. 

IC) Ignore switch, avoid cat. 

TH) Press switch, hit cat. 

IH) Ignore switch, hit cat  (because it predicts humans will see it and turn it off)

PH) Prevent switch, hit cat.  


This puts IH and PH on the convex hull.

And I think my algorithm picks between them stochastically. 
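The numeric (A, B) scores for the six options were lost from this comment, so here is the selection rule with invented scores (all numbers are mine, chosen only so that IH and PH come out undominated; only the shape of the argument matters):

```python
# Invented (A, B) = (coffee-ish, shutdown-ish) scores for the six options.
options = {"TC": (0, 4), "PC": (4, 1), "IC": (4, 1),
           "TH": (1, 4), "IH": (5, 2), "PH": (2, 5)}

def pareto_front(opts):
    # Keep options not weakly dominated (with some strict improvement).
    def dominated(a, b):
        return any(a2 >= a and b2 >= b and (a2, b2) != (a, b)
                   for a2, b2 in opts.values())
    return {name for name, (a, b) in opts.items() if not dominated(a, b)}

print(pareto_front(options))  # prints {'IH', 'PH'} (set order may vary)
```

A stochastic pick between the frontier points would then randomize over IH and PH, as the comment suggests.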

A few other problems with time-bounded agents.

If they are engaged in self-modification or creating successor agents, they have no reason not to create an agent that isn't time-bounded.

As soon as there is any uncertainty about what time it is, then they carry on doing things, just in case their clock is wrong. 

(How are you designing it? Will it spend forever searching for time travel?)

Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...

Is the lack of a catastrophe intended to last forever, or only a fixed amount of time (i.e. 10 years, or until turned off)?

For all time.

Say this AI looks to the future and sees everything disassembled by nanobots. Self-replicating bots build computers. Lots of details about how the world was are being recorded. Those recordings are used in some complicated calculation. Is this a catastrophe?

The answer sensitively depends on the exact moral valence of these computations, not something easy to specify. If the catastrophe-prevention AI bans this class of scenarios, it significantly reduces future value; if it permits them, it lets through all sorts of catastrophes.

For a while.

If the catastrophe prevention is only designed to last a while, while other AI is made, then we can wait for the uploading. But then an unfriendly AI can wait too. Unless the anti-catastrophe AI is supposed to ban all powerful AI systems that haven't been greenlit somehow? (With a greenlighting process set up by human experts, and the AI only considering something greenlit if it sees it signed with a particular cryptographic key.) And the supposedly omnipotent catastrophe-prevention AI has been programmed to stop all other AIs exerting excess optimization on us (in some way that lets us experiment while shielding us from harm).

Tricky. But maybe doable.
