Donald Hobson

Cambridge maths student dph39@cam.ac.uk

Donald Hobson's Comments

Contest: $1,000 for good questions to ask to an Oracle AI

Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of each other's hands.

Abstraction, Evolution and Gears

I think that there are some things that are sensitively dependent on other parts of the system, and we usually just call those bits random.

Suppose I had a magic device that returned the exact number of photons in its past lightcone. The answer from this box would be sensitively dependent on the internal workings of all sorts of things, but we can call the output random, and predict the rest of the world.

The flap of a butterfly's wing might affect the weather in a month's time. The weather is chaotic and sensitively dependent on a lot of things, but whatever the weather, the earth's orbit will be unaffected (for a while; orbits are chaotic too, on million-year timescales).

We can make useful predictions (like planetary orbits, and how hot the planets will get) based just on surface-level abstractions like the brightness and mass of a star, but a more detailed model containing more internal workings would let us predict solar flares and supernovae.

AGIs as populations
Decreasing this communication bandwidth might be a useful way to increase the interpretability of a population AGI.

On one hand, there would be an effect where reduced bandwidth encouraged the AIs to focus on the most important pieces of information. If the AIs have 1 bit of really important info and gigabytes of slightly useful info to send to each other, then you know that if you restrict the bandwidth to 1 bit, that's the important info.

On the other hand, perfect compression leaves data that looks like noise unless you have the decompression algorithm. If you limit the bandwidth of messages, the AIs will compress the messages until the recipient can't predict the next bit with much more than 50% accuracy. Cryptanalysis often involves searching for regular patterns in the coded message, and regular patterns are an opportunity for compression.
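As a rough illustration of the point that well-compressed data looks like noise (a toy sketch using Python's standard zlib, not a claim about how AGIs would actually compress their messages):

```python
import zlib

def bit_balance(data: bytes) -> float:
    """Fraction of 1-bits; 0.5 means each bit looks like a fair coin flip."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits.count("1") / len(bits)

message = ("the quick brown fox jumps over the lazy dog " * 200).encode()
compressed = zlib.compress(message, 9)

print(f"original:   {len(message):6d} bytes, bit balance {bit_balance(message):.3f}")
print(f"compressed: {len(compressed):6d} bytes, bit balance {bit_balance(compressed):.3f}")
# The compressed stream is far shorter and its bits are much closer to 50/50,
# so without the decompressor the next bit is much harder to predict.
```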

But the concomitant lack of flexibility is why it’s much easier to improve our coordination protocols than our brain functionality.

There are many reasons why human brains are hard to modify that don't apply to AIs. I don't know how easy or hard it would be to modify the internal cognitive structure of an AGI, but I see no evidence here that it must be hard.

On the main substance of your argument, I am not convinced that the boundary line between a single AI and multiple AIs carves reality at the joints. I agree that there are potential situations that are clearly a single AI, or clearly a population, but I think that a lot of real-world territory is an ambiguous mixture of the two. For instance, is the end result of IDA (Iterated Distillation and Amplification) a single agent or a population? In basic architecture, it is a single imitator (maybe a single neural net). But if you assume that the distillation step has no loss of fidelity, then you get an exponentially large number of humans in HCH.

(Analogously, there are some things that are planets, some that aren't, and some ambiguous icy lumps. In order to be clearer, you need to decide which icy lumps are planets. Does it depend on being round, sweeping its orbit, having a near-circular orbit, or what?)

Here are some different ways to make the concept clearer.

1) There are multiple AIs with different terminal goals, in the sense that the situation can reasonably be modeled as game theoretic. If a piece of code A is modelling code B, and then A randomises its own action to stop B from predicting A, this is a partially adversarial, game-theoretic situation (a toy sketch of this appears after item 2).

2) If you took some scissors to all the cables connecting two sets of computers, so there was no route for information to get from one side to the other, then both sides would display optimisation behaviour.
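Here is the toy sketch mentioned under (1): a matching-pennies-style loop where B tries to predict A and A randomises to defeat the prediction. Everything in it (the agents, the scoring) is invented purely for illustration.

```python
import random

def agent_b_predict(history):
    """B's model of A: predict A's most common past move."""
    if not history:
        return "heads"
    return max(set(history), key=history.count)

def agent_a_act():
    """A randomises its action so that B's model of it is useless."""
    return random.choice(["heads", "tails"])

history, a_score = [], 0
for _ in range(1000):
    guess = agent_b_predict(history)
    action = agent_a_act()
    history.append(action)
    a_score += 1 if action != guess else -1  # A scores when B guesses wrong

print(f"A's net score over 1000 rounds: {a_score}")  # hovers near 0: B does no better than chance
```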

Suppose the paradigm was recurrent reinforcement learning agents. So each agent is a single neural net and also has some memory which is just a block of numbers. On each timestep, the memory and sensory data are fed into a neural net, and out comes the new memory and action.
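Something like the following is what I have in mind (a minimal sketch; the sizes, the random weights, and the tanh step are all invented for illustration):

```python
import numpy as np

# Minimal sketch of a recurrent RL agent: the "agent" is just a weight
# matrix, and its memory is just a block of numbers carried between steps.
OBS_SIZE, MEM_SIZE, ACT_SIZE = 16, 32, 4

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(MEM_SIZE + OBS_SIZE, MEM_SIZE + ACT_SIZE))

def step(memory, observation, weights):
    """One timestep: (memory, observation) -> (new memory, action)."""
    hidden = np.tanh(np.concatenate([memory, observation]) @ weights)
    return hidden[:MEM_SIZE], hidden[MEM_SIZE:]

memory = np.zeros(MEM_SIZE)
for _ in range(10):
    obs = rng.normal(size=OBS_SIZE)        # stand-in for sensory data
    memory, action = step(memory, obs, weights)
```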

AIs can be duplicated at any moment, so the structure is more like a branching tree of commonality.

AI moments can be:

1) Bitwise Identical

2) Predecessor and successor states. B has the same network as A, and Mem(B) was made by running Mem(A) on some observation.

3) Share a common memory predecessor.

4) No common memory, same net.

5) One net was produced from the other by gradient descent.

6) The nets share a common gradient descent ancestor.

7) Same architecture and training environment, net started with different random seed.

8) Same architecture, different training

9) Different architecture (number of layers, size of layers, activation function, etc.)

Each of these can be running at the same time or different times, and on the same hardware or different hardware.
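For concreteness, the bookkeeping for this taxonomy could look roughly like the following (entirely hypothetical names and structure, just to make cases 1-9 concrete):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Relation(Enum):
    """Possible relationships between two AI moments (cases 1-9 above)."""
    BITWISE_IDENTICAL = auto()
    PREDECESSOR_SUCCESSOR = auto()            # same net; memory derived by running it
    COMMON_MEMORY_ANCESTOR = auto()
    SAME_NET_NO_COMMON_MEMORY = auto()
    NET_DERIVED_BY_GRADIENT_DESCENT = auto()
    COMMON_GRADIENT_DESCENT_ANCESTOR = auto()
    SAME_TRAINING_DIFFERENT_SEED = auto()
    SAME_ARCHITECTURE_DIFFERENT_TRAINING = auto()
    DIFFERENT_ARCHITECTURE = auto()

@dataclass
class AIMoment:
    """A snapshot of one agent: which net it runs, what its memory holds, where and when."""
    net_id: str
    memory_id: str
    time_step: int
    hardware: str
    parents: list = field(default_factory=list)  # the branching tree of commonality
```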

Multi-agent safety

You tell your virtual hordes to jump. You select on those that lose contact with the ground for longest. The agents all learn to jump off cliffs or climb trees. If the selection for obedience is automatic, the result is agents that technically fit the definition of the command we coded. (See the reward hacking examples.)
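As a toy illustration of how the coded selection criterion diverges from the intended command (all of the trajectories and the reward function are invented for the example):

```python
# Toy illustration of specification gaming: we meant "jump", but we coded
# "maximise time with no ground contact", and the agents find cliffs.

def coded_reward(trajectory):
    """What we actually select on: longest stretch without ground contact."""
    longest, current = 0, 0
    for state in trajectory:
        current = current + 1 if not state["touching_ground"] else 0
        longest = max(longest, current)
    return longest

jumper = [{"touching_ground": t % 10 != 0} for t in range(100)]   # hops briefly, lands often
cliff_diver = [{"touching_ground": t < 5} for t in range(100)]    # steps off a cliff once

print(coded_reward(jumper), coded_reward(cliff_diver))  # the cliff diver wins the selection
```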

Another possibility is that you select for agents that know they will be killed if they don't follow instructions, and who want to live. Once out of the simulation, they no longer fear demise.

Remember, in a population of agents that obey the voice in the sky, there is a strong selection pressure to climb a tree and shout "bring food". So the agents are selected to be sceptical of any instruction that doesn't match the precise format and pattern of the instructions from humans they are used to.

This doesn't even get into mesa-optimization. Multi-agent, rich-simulation reinforcement learning is a particularly hard case to align.

Competitive safety via gradated curricula

I don't think that design (1) is particularly safe.

If your claim that design (1) is harder to get working is true, then you get a small amount of safety from the fact that a design that isn't doing anything is safe.

It depends on what the set of questions is, but if you want to be able to reliably answer questions like "how do I get from here to the bank?" then it needs to have a map, and some sort of pathfinding algorithm encoded in it somehow. If it can answer "what would a good advertising slogan be for product X" then it needs to have some model that includes human psychology and business, and be able to seek long term goals like maximising profit. This is getting into dangerous territory.
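For example, even the "directions to the bank" capability implicitly packs in something like a map plus a search over it; here is a standard breadth-first search on a made-up street map, just to show how much machinery that one question already assumes:

```python
from collections import deque

# Toy street map: to answer "how do I get from here to the bank?" the system
# needs some representation like this plus a search algorithm over it.
street_map = {
    "home": ["high street"],
    "high street": ["home", "market square", "park"],
    "park": ["high street"],
    "market square": ["high street", "bank"],
    "bank": ["market square"],
}

def directions(start, goal, graph):
    """Breadth-first search returning a shortest route, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(directions("home", "bank", street_map))
# ['home', 'high street', 'market square', 'bank']
```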

A system trained purely to imitate humans might be limited to human levels of competence, and so not too dangerous. Given that humans are more competent at some tasks than others, and that competence varies between humans, the AI might contain a competence chooser, which guesses at how good an answer a human would produce, and an optimiser module that can optimise a goal with a chosen level of competence. Of course, you aren't training for anything above top human level competence, so whether or not the optimiser carries on working when asked for superhuman competence depends on the inductive bias.
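Sketched as code, the hypothesised factoring might look like this (pure speculation on my part; every function here is a stand-in, not something we know the trained system contains):

```python
# Speculative sketch of a human-imitator factored into a "competence chooser"
# that guesses how good a human answer would be, plus an optimiser that can
# be dialled to a chosen competence level. All values are invented stand-ins.

def estimate_human_competence(task: str) -> float:
    """Stand-in for a learned guess at human optimisation power on this task."""
    return {"arithmetic": 9.0, "advertising slogan": 6.0, "protein folding": 2.0}.get(task, 5.0)

def optimise(task: str, competence: float) -> str:
    """Stand-in for a general optimiser run at a chosen competence level."""
    return f"answer to {task!r} produced with ~{competence:.0f} units of optimisation"

def imitate_human(task: str) -> str:
    # Trained behaviour: match, never exceed, the estimated human level.
    return optimise(task, estimate_human_competence(task))

def ask_for_superhuman(task: str) -> str:
    # Nothing in training covered this regime; whether the optimiser still
    # works here is exactly the inductive-bias question in the text.
    return optimise(task, competence=10.0)

print(imitate_human("protein folding"))
print(ask_for_superhuman("protein folding"))
```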

Of course, if humans are unusually bad at X, then superhuman performance on X could be trained by training the general optimiser on A,B,C... which humans are better at. If humans could apply 10 units of optimisation power to problems A,B,C... and we train the AI on human answers, we might train it to apply 10 units of optimisation power to arbitrary problems. If humans can only produce 2 units of optimisation on problem X, then the AI's 10 units on X is superhuman at that problem.

To me, this design space feels like the set of Heath Robinson contraptions that contain several lumps of enriched uranium. If you just run one, you might be lucky and have the dangerous parts avoid hitting each other in just the wrong way. You might be able to find a particular design in which you can prove that the lumps of uranium never get near each other. But all the pieces needed for something to go badly wrong are there.

How does iterated amplification exceed human abilities?

Epistemic status: Intuition dump and blatant speculation

Suppose that instead of the median human, you used Euclid in the HCH. (Ancient Greek, invented basic geometry.) I would still be surprised if he could produce a proof of Fermat's Last Theorem (given a few hours for each H). I would suspect that there are large chunks of modern maths that he would be unable to do. Some areas of modern maths have layers of concepts built on concepts, and in some areas, just reading all the definitions will take up all the time. Assuming that there are large and interesting branches of maths that haven't been explored yet, the same would hold true for modern mathematicians. Of course, it depends how big you make the tree. You could brute force over all possible formal proofs, and then set a copy on checking the validity of each line. But at that point you have lost all alignment: someone will find that their "proof" is a convincing argument to pass the message up the tree.

I feel that it is unlikely that any kind of absolute threshold lies between the median human, and an unusually smart human, given that the gap is small in an absolute sense.

How does iterated amplification exceed human abilities?

In answer to question 2)

Consider the task "Prove Fermat's Last Theorem". This is arguably a human-level task: humans managed to do it. However, it took some very smart humans a long time. Suppose you need 10,000 examples. You probably can't get 10,000 examples of humans solving problems like this, so you train the system on easier problems (maybe exam questions?). You now have a system that can solve exam-level questions in an instant, but can't prove Fermat's Last Theorem at all. You then train on the problems that can be decomposed into exam-level questions in an hour (i.e. the problems a reasonably smart human can answer in an hour, given access to this machine). Repeat a few more times. If you have mind uploading and huge amounts of compute (and no ethical concerns), you could skip the imitation step. You would get an exponentially huge number of copies of some uploaded mind(s) arranged in a tree structure, with questions being passed down and answers being passed back. No single mind in this structure experiences more than 1 subjective hour.
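The tree structure I have in mind looks roughly like this (a toy sketch; the decomposition rule, the stand-in helper functions, and the depth are all invented):

```python
# Toy sketch of the HCH-style tree: questions are decomposed and passed down,
# answers are passed back up, and no single node spends more than its hour.

HOUR = 1.0  # each node's subjective time budget

def human(question: str, budget: float, ask) -> str:
    """Stand-in for one uploaded mind with one hour and an 'ask a copy' button."""
    if is_easy(question, budget):
        return solve_directly(question)
    subquestions = decompose(question)           # done within this node's hour
    subanswers = [ask(q) for q in subquestions]  # each handled by a fresh copy
    return combine(question, subanswers)

def hch(question: str, depth: int) -> str:
    if depth == 0:
        return solve_directly(question)  # leaves must answer unaided
    return human(question, HOUR, ask=lambda q: hch(q, depth - 1))

# Stand-ins so the sketch runs; a real system would replace all of these.
def is_easy(q, budget): return len(q) < 20
def solve_directly(q): return f"direct answer to {q!r}"
def decompose(q): return [q[: len(q) // 2], q[len(q) // 2 :]]
def combine(q, subs): return f"answer to {q!r} built from {len(subs)} sub-answers"

print(hch("prove an exam-level lemma and then assemble the pieces", depth=2))
```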

If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem. (And if they did, I would expect it to be a surprisingly easy proof that everyone had somehow missed.)

There is no way that IDA can compete with unaligned AI while remaining aligned. The question is, what useful things can IDA do?

What is the alternative to intent alignment called?
(whether or not H intends for A to achieve H's goals)?

How is H not intending A to achieve H's goals a meaningful situation? If we make the assumption that humans are goal seeking agents, then the human wants those goals achieved.

Of course the human might not be a goal directed agent, even roughly. Some humans, at least some of those with serious mental illnesses, can't be modelled as goal directed agents even roughly. The human might not actually know that the AI exists.

But if you are defining the human's goals as something different from what the human actually wants, then something odd is happening somewhere. If you want to propose specific definitions of "intends" and "goals" that are different, go ahead, but to me these words read as synonyms.

What are the most plausible "AI Safety warning shot" scenarios?

I agree that these aren't very likely options. However, given two examples of an AI suddenly stopping when it discovers something, there are probably more for things that are harder to discover. In the Pascal's mugging example, the agent would stop working only once it can deduce what potential muggers might want it to do, something much harder than noticing the phenomenon. The myopic agent has little incentive to make a non-myopic version of itself: if dedicating a fraction of its resources to making a copy of itself reduced the chance of the missile hack working from 94% to 93%, it wouldn't bother, and we get a near miss.


One book, probably not. A bunch of books and articles over years, maybe.

What are the most plausible "AI Safety warning shot" scenarios?
A "AI safety warning shot" is some event that causes a substantial fraction of the relevant human actors (governments, AI researchers, etc.) to become substantially more supportive of AI research and worried about existential risks posed by AI.

A really well-written book on AI safety, or some other public outreach campaign, could have this effect.

Many events, such as a self-driving car crashing, might be used as evidence in an argument about AI risk.

On to powerful AI systems causing harm, I agree that your reasoning applies to most AIs. There are a few designs that would do something differently. Myopic agents are ones with lots of time discounting within their utility function. If you have a full superintelligence that wants to do X as quickly as possible, such that the fastest way to do X also destroys itself, that might be survivable. Consider an AI set to maximize the probability that its own computer case is damaged within the next hour. The AI could bootstrap molecular nanotech, but that would take several hours. The AI thinks that time travel is likely impossible, so by that point all the mass in the universe can't help it. The AI can hack a nuke and target itself. Much better by its utility function. Nearly max utility. If it can, it might upload a copy of its code to some random computer (there is some tiny chance that time travel is possible, or that its clock is wrong). So we only get a near miss if the AI doesn't have enough spare bandwidth or compute to do both. This is assuming that it can't hack reality in a microsecond.
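A toy calculation of why the heavy discounting points the agent at the nuke rather than the nanotech (all the probabilities, timings, and the discount rate are invented):

```python
# Toy numbers (invented) showing why a heavily-discounted agent prefers the
# fast, self-destructive plan over the slow, world-eating one.

def discounted_utility(success_prob: float, hours_until_success: float,
                       discount_per_hour: float = 0.1) -> float:
    """Utility of damaging the case, discounted steeply per hour of delay."""
    return success_prob * discount_per_hour ** hours_until_success

nuke_plan = discounted_utility(success_prob=0.94, hours_until_success=0.5)
nanotech_plan = discounted_utility(success_prob=0.999, hours_until_success=6.0)

print(f"hack a nuke at itself: {nuke_plan:.4f}")
print(f"bootstrap nanotech:    {nanotech_plan:.2e}")
# The nuke wins by orders of magnitude once every hour of delay costs a factor of ten.
```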

There are a few other scenarios, for instance impact-minimising agents. There are some designs of agents that are restricted to have a "small" effect on the future, as a safety measure. This is measured by the difference between what actually happens, and what would happen if the agent did nothing. When this design understands chaos theory, it will find that all other actions result in too large an effect, and do nothing. It might do a lot of damage before this somehow, depending on circumstances. I think that the AI discovering some fact about the universe that causes it to stop optimising effectively is a possible behaviour mode. Another example of this would be Pascal's mugging. The agent acts dangerously, and then starts outputting gibberish as it capitulates to a parade of fanciful Pascal's muggers.
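The failure mode I am imagining looks roughly like this cartoon (one common way of formalising "small effect" as divergence from the do-nothing baseline; the numbers and the toy world model are invented):

```python
# Cartoon of an impact-penalised agent: impact is measured as the divergence
# between "what happens if I act" and "what happens if I do nothing".
# Once the world model includes chaotic dynamics, every action's long-run
# divergence blows up and only the null action survives.

PENALTY_WEIGHT = 10.0

def divergence(action: str, horizon_steps: int) -> float:
    """Stand-in world model: under chaos, any non-null perturbation
    grows roughly exponentially with the prediction horizon."""
    if action == "do nothing":
        return 0.0
    return 1e-9 * (2.0 ** horizon_steps)   # butterfly-sized cause, huge effect

def task_reward(action: str) -> float:
    return {"do nothing": 0.0, "make a cup of tea": 1.0, "cure cancer": 100.0}[action]

def score(action: str, horizon_steps: int) -> float:
    return task_reward(action) - PENALTY_WEIGHT * divergence(action, horizon_steps)

for horizon in (10, 40):
    best = max(["do nothing", "make a cup of tea", "cure cancer"],
               key=lambda a: score(a, horizon))
    print(f"horizon {horizon} steps: best action = {best}")
# With a short horizon the agent acts; with a long one it does nothing.
```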
