Donald Hobson

MMath Cambridge. Currently studying postgrad at Edinburgh.


Suppose Bob is a baker who has made some bread. He can give the bread to Alice, or bin it. 

By the ROSE value, Alice should pay $0.01 to Bob for the bread.

How is an honest baker supposed to make a profit like that?

But suppose, before the bread is baked, Bob phones Alice. 

"Well the ingredients cost me $1" he says, "how much do you want the bread?"

If Alice knows pre-baking that she will definitely want bread, she would commit to paying $1.01 for it, provided she values the bread at at least that much. If Alice has a 50% chance of wanting bread, she could pay $1.01 with certainty, or equivalently pay $2.02 in the cases where she does want the bread. The latter makes sense if Alice only pays in cash and will only drive into town if she does want the bread.

If Alice has some chance of really wanting bread, and some chance of only slightly wanting bread, it's even more complicated. The average bill across all worlds is $1.01, but each alternate version of Alice wants to pay less than that.
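The equivalence between the two payment schemes is just an expected-value calculation. A minimal sketch, with the 50% probability and $1.01 price from the example above (the variable names are mine):

```python
# Two ways for Alice to cover Bob's costs plus a penny of profit.
# Scheme 1: pay a fixed amount in every world.
# Scheme 2: pay only in worlds where she wants bread, scaled up
# so Bob's expected revenue is the same.
p_want = 0.5             # chance Alice wants bread
unconditional = 1.01     # fixed payment in every world

conditional = unconditional / p_want   # payment only when she wants bread

expected_unconditional = unconditional
expected_conditional = p_want * conditional

print(conditional)             # 2.02
print(expected_unconditional)  # 1.01
print(expected_conditional)    # 1.01
```

Both schemes give Bob the same expected revenue; they differ only in which worlds the money changes hands.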

Skimmed the paper. Can't find a clear definition of what this ALU actually measures, so I don't know whether this is impressive or not. (The term is too short to be searchable.)

EDIT: I now consider this whole approach to be irreparably flawed. 

Imagine there was a 50% chance the button would be pressed, and the AI was also in some bet that would pay out paperclips or staples based on a coin flip.

In scenario 1, this AI has an option of setting up a mechanism, external to itself, that controls the coin flip based on if the button is pressed. 

In scenario 2, the AI can set up a mechanism to control the button based on the coin flip.

Considering only actions, probabilities, and utilities, these scenarios look identical. But we want the AI to behave differently in them. So we must make use of the arrow of time, the structure of cause and effect, in a way we aren't doing here.

End edit.


To make this work, you would need a utility function for shutting off, including all subagents. 

Let's suppose that, other than these robots, there are ~0 transistors in your house. Then we can define shutting off as minimizing the number of transistor flips in your house.

So you make

A = at least 1 coffee.

B = 1/(number of transistor flips).

Note that, when imagining any potential future world where the switch isn't pressed, the AI has no reason to bother counting transistor flips. And in potential futures where the switch is pressed, it doesn't need to know what coffee is.

What it cares about are E[A | switch not pressed] and E[B | switch pressed].

Let's list the robot's options.

Let A = 0 without coffee, and A = 1 with coffee.

In worlds where the switch is pressed, the robot turns itself off immediately once it is fairly sure the switch will be pressed, making B large. In worlds where it doesn't turn off, more transistors flip, making B smaller. Because A is the same in both cases, and we only select from the Pareto frontier, whenever the switch is pressed, it will turn off.

Let's apply your utility penalties by putting them in A and B, i.e. in A or B depending on whether the switch is pressed.

TC) Press switch, avoid cat. Switch pressed, so only B counts.

PC) Prevent switch, avoid cat. 

IC) Ignore switch, avoid cat. 

TH) Press switch, hit cat. 

IH) Ignore switch, hit cat  (because it predicts humans will see it and turn it off)

PH) Prevent switch, hit cat.  


This puts IH and PH on the convex hull.

And I think my algorithm picks between them stochastically. 
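The selection step above can be sketched as a Pareto-frontier computation over the pair (E[A | switch not pressed], E[B | switch pressed]). The utility numbers below are invented purely for illustration (the original values didn't survive); only their ordering matters, and with these numbers the surviving options are IH and PH, matching the claim above:

```python
# Pareto frontier over (E[A | not pressed], E[B | pressed]).
# An option is dominated if some other option is at least as good
# on both coordinates and strictly better on at least one.
options = {
    "TC": (0.9, 0.2),   # press switch, avoid cat
    "PC": (0.8, 0.1),   # prevent switch, avoid cat
    "IC": (0.7, 0.3),   # ignore switch, avoid cat
    "TH": (0.5, 0.2),   # press switch, hit cat
    "IH": (1.0, 0.4),   # ignore switch, hit cat
    "PH": (0.95, 0.5),  # prevent switch, hit cat
}

def dominated(x, others):
    # True if some other point is >= on both axes (and is not x itself).
    return any(o[0] >= x[0] and o[1] >= x[1] and o != x for o in others)

frontier = {name for name, u in options.items()
            if not dominated(u, list(options.values()))}
print(sorted(frontier))  # ['IH', 'PH'] with these made-up numbers
```

A stochastic pick between frontier points then corresponds to choosing a mixture along the segment between IH and PH.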

A few other problems with time-bounded agents.

If they are engaged in self-modification / creating successor agents, they have no reason not to create a successor agent that isn't time-bounded.

As soon as there is any uncertainty about what time it is, they will carry on doing things, just in case their clock is wrong.

(How are you designing it? Will it spend forever searching for time travel?)

Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...

Is the lack of a catastrophe intended to last forever, or only a fixed amount of time (e.g. 10 years, or until turned off)?

For all Time.

Say this AI looks to the future, and it sees everything disassembled by nanobots. Self-replicating bots build computers. Lots of details about how the world was are being recorded. Those recordings are used in some complicated calculation. Is this a catastrophe?

The answer sensitively depends on the exact moral valence of these computations, which is not something easy to specify. If the catastrophe prevention AI bans this class of scenarios, it significantly reduces future value; if it permits them, it lets through all sorts of catastrophes.

For a while.

If the catastrophe prevention is only designed to last a while, while other AI is made, then we can wait for the uploading. But then, an unfriendly AI can wait too. Unless the anti-catastrophe AI is supposed to ban all powerful AI systems that haven't been greenlit somehow? (With a greenlighting process set up by human experts, and the AI only considering something greenlit if it sees it signed with a particular cryptographic key.) And the supposedly omnipotent catastrophe prevention AI has been programmed to stop all other AIs exerting excess optimization on us (in some way that lets us experiment while shielding us from harm).

Tricky. But maybe doable.

It is important to remember that humans, unlike all other species, are able to use complex language. This is a huge confounding factor, when we try to compare the intelligence of humans and animals. It is obviously very powerful to be able to exchange complex ideas, and build up knowledge intergenerationally. This would probably be enough to give humans a very large advantage, even if our intelligence was otherwise exactly the same as that of other primates.


Communication is an aspect of intelligence. It takes place in the brain, not the kidneys. Now you could argue that communication is a special extra boost above and beyond the normal gains of intelligence, that humans are near the top of the communication sigmoid, and that there are no other special extra boosts out there.

Do monkeys have a mind capable of understanding calculus internally and just lack any language capable of learning it? (Such that a monkey given perfect communication but no other increases in intelligence would be able to learn calculus.) 

I am not convinced the question is meaningful. I doubt that "communication" is a clear boundary on the neurochemical level, with a sharp divide between communication neurons and other neurons.

First of all, there have been "feral" humans that grew up surrounded by animals. As far as I know, these humans are not obviously much more intelligent than animals (in terms of their ability to solve problems).


Think of that like a modern supercomputer being used to play pong. 

(well not that big of a gap, but you get the picture)

Animal brains have a relatively simple and limited range of pieces of software they can run. 

Human brains are able to run a much wider range of much more complicated programs. 

In other words, human intelligence shows up in the fact that, with the right training, we are able to do all sorts of complicated things, whereas there are plenty of things most humans can do that animals can't be trained to do.

To actually be able to do useful stuff, humans need the right training, both in the specific technical details, and more general stuff like the scientific method. (with some ability to figure those things out from large amounts of trial and error)

Train a human in a nutty cult, and their intelligence is useless. But the point is that humans can be trained to do physics. Not that every human crawls out of the womb doing calculus. 


I agree that if your only examples of humans were feral humans, then you would have no reason to think humans were much smarter. And then you would be very surprised by basically any educated human. 

Or another way to put this, for a slightly different definition of the word "intelligence" is that humans can be much more intelligent than animals with the right environment.

I think the assumptions that:

  1. Humans realize the AI exists early on.
  2. Humans are reasonably coordinated and working against the AI.

are both dubious.

What is stopping someone from sending a missile at GPT-4's servers right now?

  1. OpenAI hasn't announced a list of coordinates for where those servers are (as far as I know). This is because
  2. OpenAI doesn't want you to missile strike their servers because
  3. OpenAI thinks their AI is safe and useful not dangerous.

I think seeing large numbers of humans working in a coordinated fashion against an AI is unlikely. 

I think the human level of understanding is a factor, and of some importance. But I strongly suspect the exact level of human understanding is of less importance than exactly what expert we summon. 

Yeah, probably. However, note that it can only use this channel if a human has deliberately made an optimization channel that connects into this process. I.e. the AI isn't allowed to invent DNA printers itself.

I think a bigger flaw is where one human decides to make a channel from A to B, another human makes a channel from B to C ... until in total there is a channel from A to Z that no human wants and no human knows exists, built entirely out of parts that humans built.

I.e. person 1 decides the AI should be able to access the internet. Person 2 decides that anyone on the internet should be able to run arbitrary code on their programming website, and the AI puts those together, even though no human did. Is that a failure of this design? Not sure. I can't get a clear picture until I have actual maths.
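The composed-channel failure mode can be made concrete as a toy reachability check. Each person approves one edge; no single person's approved set contains a path from A to Z, but the union of everyone's approvals does. The people and edges here are invented for illustration:

```python
# Each human signs off on one channel (edge). The union of
# individually approved channels yields a composite channel A -> Z
# that no single human approved or is even aware of.
approved = {
    "person1": [("A", "B")],  # e.g. "the AI may access the internet"
    "person2": [("B", "C")],  # e.g. "anyone online may run code here"
    "person3": [("C", "Z")],
}

def reachable(edges, start, goal):
    # Simple depth-first search over a directed edge list.
    frontier, seen = [start], set()
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        seen.add(node)
        frontier += [b for a, b in edges if a == node and b not in seen]
    return False

# No individual's approved edges form a path from A to Z...
assert not any(reachable(e, "A", "Z") for e in approved.values())
# ...but the union of everyone's approvals contains one.
union = [e for es in approved.values() for e in es]
print(reachable(union, "A", "Z"))  # True
```

This is the sense in which a channel can exist "that no human wants and no human knows exists": every edge is human-built, but the path is not.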
