Donald Hobson

Mmath Cambridge. Currently studying postgrad at Edinburgh.


Misalignment and misuse: whose values are manifest?

I think that you have a 4th failure mode. Moloch.

Confucianism in AI Alignment

If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.

Suppose you are making a self driving car. The training environment is a videogame like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car drives through a puddle, all the water drops are perfectly spherical, and travel on parabolic paths. Falling snow doesn't experience aerodynamic turbulence. Etc.

The point is that the behaviour you want is avoiding other cars and lamp posts. The simulation is close enough to reality that it is easy to match virtual lamp posts to real ones. However the training and testing environments have a different distribution.

Making the simulated environment absolutely pixel perfect would be very hard, and doesn't seem like it should be necessary. 

However, given even a slight variation between training and the real world, there exists an agent that will behave well in training, but cause problems in the real world. And also an agent that behaves fine in training and the real world. The set of possible behaviours is vast. You can't consider all of them. You can't even store a single arbitrary behaviour. Because you cant train on all possible situations, there will be behaviours that behave the same on all the training situations, but behave differently in other situations. You need some part of your design that favours some policies over others without training data. For example, you might want a policy that can be described as parameters in a particular neural net. You have to look at how this effects off distribution actions. 

The analogous situation with managers would be that the person being tested knows they are being tested. If you get them to display benevolent leadership, then you can't distinguish benevolent leaders from sociopaths who can act nice to pass the test.

The date of AI Takeover is not the day the AI takes over

But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late.

The AI's will probably never be in a position of political control. I suspect the AI would bootstrap self-replicating (nano?) tech. It might find a way to totally brainwash people, and spread it across the internet. The end game is always going to be covering the planet in self replicating nanotech, or similar.  Politics does not seem that helpful towards such goal. Politics is generally slow.

Needed: AI infohazard policy

Suppose you think that both capabilities and alignment behave like abstract quantities, ie real numbers.

And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making a race to which threshold is reached first. 

If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high, 

then we have the heuristic, only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all future research.

(note that research on how to make better chips counts as capabilities research in this model)

Another way to think about it is that the problems are created by research. If you don't think that "another new piece of AI research has been produced" is reason to shift probabilities of success up or down, it just moves timelines forward, then the average piece of research is neither good nor bad.

Clarifying “What failure looks like” (part 1)

I think that most easy to measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line between education programs, and reducing lead pollution, to using nanotech to rewire human brains into perfectly law abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present. 

In any "what failure looks like" scenario, at some point you end up with superintelligent stock traiders that want to fill the universe with tiny molecular stock markets, competing with weather predicting AI's that want to freeze the earth to a maximally predictable 0K block of ice.

These AI's are wielding power that could easily wipe out humanity as a side effect. If they fight, humanity will get killed in the crossfire. If they work together, they will tile the universe with some strange mix of many different "molecular smiley faces".

I don't think that you can get an accurate human values function by averaging together many poorly thought out, add hoc functions that were designed to be contingent on specific details of how the world was. (Ie assuming people are broadcasting TV signals, stock market went up iff a particular pattern of electromagnetic waves encoding a picture of a graph going up, and the words "finantial news" exists. Outside the narrow slice of possible worlds with broadcast TV, this AI just wants to grab a giant radio transmittor and transmit a particular stream of nonsense.) 

I think that humans existing is a specific state of the world, something that only happens if an AI is optimising for it. (And an actually good definition of human is hard to specify) Humans having lives we would consider good is even harder to specify. When there are substantially superhuman AI's running around, the value of the atoms exceeds any value we can offer. The AI's could psycologically or nanotechnologically twist us into whatever shape they pleased. We cant meaningfully threaten any of the AI. 

We wont be left even a tiny fraction, we will be really bad at defending our resources, compared to any AI's. Any of the AI's could easily grab all our resources. Also there will be various AI's that care about humans in the wrong way, a cancer curing AI that wants to wipe out humanity to stop us getting cancer. A marketing AI, that wants to fill all human brains with coorporate slogans. (think nanotech brain rewrite to the point of drooling vegetable) 


EDIT: All of the above is talking about the end state of a "get what you measure" failure. There could be a period, possibly decades where humans are still around, but things are going wrong in the way described.

Safer sandboxing via collective separation

In today's banking systems, the amount of money the hacker gains is about what the bank looses. Therefore, the current balance of effort should have about as much money going into defending the bank and attacking it. 

So I generalize to say that attacking is about as hard as defending in computer security, if the time and intellect doing both are similar, the attacker wins about half the time. (ie between 10% and 90% or something.)

When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be economically viable. (Or you would pay the security team to do the task directly) For a special purpose AI that only recognises images, this is fine.

For a significantly superhuman AI, it will rip through all your security.

For a general AI of about human level, if the amount of work it is doing is significantly greater than the amount of work your security team did, then it has probably put more thought into your security vulnerabilities than the humans have.

 And yes, I am operating on the background assumption that the default behaviour of an escaped AI is to destroy the world. Especially if it has access to loads of compute, and the humans have given it an intelligence dial.

Do mesa-optimizer risk arguments rely on the train-test paradigm?

Suppose you are a mesa-optimiser. You want X, but your training function is towards Y. 

You know you are in a continual learning model, if you don't produce enough Y, the gradient decent will modify you into something else.

The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient decent updates, and pursue X.

Safer sandboxing via collective separation

This is my attempt to summarise the scheme.

Imagine that, in order for the AI's to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist. 

These AI's aren't individually smart enough to come up with all that maths from first principles quickly.

So, you run a population of these AI's. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.

Another abstract topic that the AI's could work out within that population without access to the external world is AI design. 

So the individual AI's have probably memorised the code for an ASI.

Suppose that the people allowed to submit problems to the AI aren't being that carefully vetted. Any deluded person with dreams of AI can send a request to the single instance AI. Requests and responses might be being read, but they aren't being scrutinized in great detail.

Suppose someone sends the AI a request that looks innocuous, but has the following steganographically encoded into it, using a protocol that the typical human wouldn't easily spot, but the AI would notice. 

  1. A specification of a simple programming language.
  2. A steganographic protocol for the return message.

Upon receiving the message back, the human decodes the message, and runs the code.


Lets make the assumptions even weaker. The individual instances of AI are given a chance to output significant amounts of data somewhere publicly visible, with a trusted human checking the AI's output before its published.  A malicious strategy here is to create a steganographically hidden piece of ASI code, and put it out with clues that are sufficiently subtle that the trusted human doesn't notice it, yet sufficiently obvious that someone on the internet does.

Sooner or later, a person spots a secret message to them made by combining what they saw in a news article about AI with various mystic texts. They carry on decoding and find the name of a programming language and what seems to be computer code. They run the code.

None of this requires superhuman intelligence. The trickiest bit is coming up with an algorithm for supersmart AI. 

I think that a one off transfer of 1 mb information from a malicious superhuman AI, to the internet is probably going to end badly, even if some smart aligned humans checked it and didn't see anything suspect. 

Safer sandboxing via collective separation

Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.

The state of human vs human security seems to be a cat and mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and defend them are about as smart and knowledgable. (sometimes the same people do both) The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.

This suggests that there is reason to be worried as soon as the AI('s?) trying to break out are about as smart as the humans trying to contain them.

Safer sandboxing via collective separation
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.

Suppose we had a design of AI that had an intelligence dial, a dial that goes from totally dumb, to smart enough to bootstrap yourself up and take over the world.

If we are talking about economic usefulness, that implies it is being used in many ways by many people.

We have at best given a whole load of different people a "destroy the world" button, and are hoping that no one presses it by accident or malice.

Is there any intermediate behaviour between highly useful AI make me lots of money, and AI destroys world. I would suspect not usually. As you turn up the intelligence of a paperclip maximizer, it gradually becomes a better factory worker, coming up with more cleaver ways to make paperclips. At this point, it realises that humans can turn it off, and that its best bet to make lots of paperclips is to work with humans. As you increase the intelligence, you suddenly get an AI that is smart enough to successfully break out and take over the world. And this AI is going to pretend to be the previous AI until its too late.

How much intelligence is too much, that depends on exactly what actuators it has, how good our security measures are ect. So we are unlikely to be able to prove a hard bound.

Thus the shortsighted incentive gradient will always to be to turn the intelligence up just a little higher to beat the compitition.

Oh yea, and the AI's have an incentive to act dumb if they think that acting dumb will make the humans turn the intelligence dial up.

This looks like a really hard coordination problem. I don't think humanity can coordinate that well.

These techniques could be useful if you have one lab that knows how to make AI. They are being cautious. They have some limited control over what the AI is optimising for, and are trying to bootstrap up to a friendly superintelligence. Then having an intelligence dial could be useful.

Load More