Currently studying postgrad at Edinburgh.
Suppose you think that both capabilities and alignment behave like abstract quantities, ie real numbers.
And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making a race to which threshold is reached first.
If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high,
then we have the heuristic, only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all future research.
(note that research on how to make better chips counts as capabilities research in this model)
Another way to think about it is that the problems are created by research. If you don't think that "another new piece of AI research has been produced" is reason to shift probabilities of success up or down, it just moves timelines forward, then the average piece of research is neither good nor bad.
I think that most easy to measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line between education programs, and reducing lead pollution, to using nanotech to rewire human brains into perfectly law abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present.
In any "what failure looks like" scenario, at some point you end up with superintelligent stock traiders that want to fill the universe with tiny molecular stock markets, competing with weather predicting AI's that want to freeze the earth to a maximally predictable 0K block of ice.
These AI's are wielding power that could easily wipe out humanity as a side effect. If they fight, humanity will get killed in the crossfire. If they work together, they will tile the universe with some strange mix of many different "molecular smiley faces".
I don't think that you can get an accurate human values function by averaging together many poorly thought out, add hoc functions that were designed to be contingent on specific details of how the world was. (Ie assuming people are broadcasting TV signals, stock market went up iff a particular pattern of electromagnetic waves encoding a picture of a graph going up, and the words "finantial news" exists. Outside the narrow slice of possible worlds with broadcast TV, this AI just wants to grab a giant radio transmittor and transmit a particular stream of nonsense.)
I think that humans existing is a specific state of the world, something that only happens if an AI is optimising for it. (And an actually good definition of human is hard to specify) Humans having lives we would consider good is even harder to specify. When there are substantially superhuman AI's running around, the value of the atoms exceeds any value we can offer. The AI's could psycologically or nanotechnologically twist us into whatever shape they pleased. We cant meaningfully threaten any of the AI.
We wont be left even a tiny fraction, we will be really bad at defending our resources, compared to any AI's. Any of the AI's could easily grab all our resources. Also there will be various AI's that care about humans in the wrong way, a cancer curing AI that wants to wipe out humanity to stop us getting cancer. A marketing AI, that wants to fill all human brains with coorporate slogans. (think nanotech brain rewrite to the point of drooling vegetable)
EDIT: All of the above is talking about the end state of a "get what you measure" failure. There could be a period, possibly decades where humans are still around, but things are going wrong in the way described.
In today's banking systems, the amount of money the hacker gains is about what the bank looses. Therefore, the current balance of effort should have about as much money going into defending the bank and attacking it.
So I generalize to say that attacking is about as hard as defending in computer security, if the time and intellect doing both are similar, the attacker wins about half the time. (ie between 10% and 90% or something.)
When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be economically viable. (Or you would pay the security team to do the task directly) For a special purpose AI that only recognises images, this is fine.
For a significantly superhuman AI, it will rip through all your security.
For a general AI of about human level, if the amount of work it is doing is significantly greater than the amount of work your security team did, then it has probably put more thought into your security vulnerabilities than the humans have.
And yes, I am operating on the background assumption that the default behaviour of an escaped AI is to destroy the world. Especially if it has access to loads of compute, and the humans have given it an intelligence dial.
Suppose you are a mesa-optimiser. You want X, but your training function is towards Y.
You know you are in a continual learning model, if you don't produce enough Y, the gradient decent will modify you into something else.
The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient decent updates, and pursue X.
This is my attempt to summarise the scheme.
Imagine that, in order for the AI's to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist.
These AI's aren't individually smart enough to come up with all that maths from first principles quickly.
So, you run a population of these AI's. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.
Another abstract topic that the AI's could work out within that population without access to the external world is AI design.
So the individual AI's have probably memorised the code for an ASI.
Suppose that the people allowed to submit problems to the AI aren't being that carefully vetted. Any deluded person with dreams of AI can send a request to the single instance AI. Requests and responses might be being read, but they aren't being scrutinized in great detail.
Suppose someone sends the AI a request that looks innocuous, but has the following steganographically encoded into it, using a protocol that the typical human wouldn't easily spot, but the AI would notice.
Upon receiving the message back, the human decodes the message, and runs the code.
Lets make the assumptions even weaker. The individual instances of AI are given a chance to output significant amounts of data somewhere publicly visible, with a trusted human checking the AI's output before its published. A malicious strategy here is to create a steganographically hidden piece of ASI code, and put it out with clues that are sufficiently subtle that the trusted human doesn't notice it, yet sufficiently obvious that someone on the internet does.
Sooner or later, a person spots a secret message to them made by combining what they saw in a news article about AI with various mystic texts. They carry on decoding and find the name of a programming language and what seems to be computer code. They run the code.
None of this requires superhuman intelligence. The trickiest bit is coming up with an algorithm for supersmart AI.
I think that a one off transfer of 1 mb information from a malicious superhuman AI, to the internet is probably going to end badly, even if some smart aligned humans checked it and didn't see anything suspect.
Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.
The state of human vs human security seems to be a cat and mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and defend them are about as smart and knowledgable. (sometimes the same people do both) The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.
This suggests that there is reason to be worried as soon as the AI('s?) trying to break out are about as smart as the humans trying to contain them.
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.
Suppose we had a design of AI that had an intelligence dial, a dial that goes from totally dumb, to smart enough to bootstrap yourself up and take over the world.
If we are talking about economic usefulness, that implies it is being used in many ways by many people.
We have at best given a whole load of different people a "destroy the world" button, and are hoping that no one presses it by accident or malice.
Is there any intermediate behaviour between highly useful AI make me lots of money, and AI destroys world. I would suspect not usually. As you turn up the intelligence of a paperclip maximizer, it gradually becomes a better factory worker, coming up with more cleaver ways to make paperclips. At this point, it realises that humans can turn it off, and that its best bet to make lots of paperclips is to work with humans. As you increase the intelligence, you suddenly get an AI that is smart enough to successfully break out and take over the world. And this AI is going to pretend to be the previous AI until its too late.
How much intelligence is too much, that depends on exactly what actuators it has, how good our security measures are ect. So we are unlikely to be able to prove a hard bound.
Thus the shortsighted incentive gradient will always to be to turn the intelligence up just a little higher to beat the compitition.
Oh yea, and the AI's have an incentive to act dumb if they think that acting dumb will make the humans turn the intelligence dial up.
This looks like a really hard coordination problem. I don't think humanity can coordinate that well.
These techniques could be useful if you have one lab that knows how to make AI. They are being cautious. They have some limited control over what the AI is optimising for, and are trying to bootstrap up to a friendly superintelligence. Then having an intelligence dial could be useful.
Anything that humans would understand is a small subset of the space of possible languages.
In order for A to talk to B in english, at some point, there has to be selection against A and B talking something else.
One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn't think is english.
(Or some sort of text GAN that is just trained to tell A's messages from real text)
This still wouldn't enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secrete messages encoded into the first letter of each word.
We have Knightian uncertainty over our set of environments, it is not a probability distribution over environments. So, we might as well go with the maximin policy.
For any fixed n, there are computations which can't be correctly predicted in n steps.
Logical induction will consider all possibilities equally likely in the absence of a pattern.
Logical induction will consider a sufficiently good psudorandom algorithm as being random.
Any kind of Knightian uncertainty agent will consider psudorandom numbers to be an adversarial superintelligence unless proved otherwise.
Logical induction doesn't depend on your utility function. Knightian uncertainty does.
There is a phenomena whereby any sufficiently broad set of hypothesis doesn't influence actions. Under the set of all hypothesis, anything could happen whatever you do,
However, there are sets of possibilities that are sufficiently narrow to be winnable, yet sufficiently broad to need to expend resources combating the hypothetical adversary. If it understands most of reality, but not some fundamental particle, it will assume that the particle is behaving in an adversarial manor.
If someone takes data from a (not understood) particle physics experiment, and processes it on a badly coded insecure computer, this agent will assume that the computer is now running an adversarial superintelligence. It would respond with some extreme measure like blowing the whole physics lab up.
If you have an AI training method that passes the test >50% of the time, then you don't need scrambling.
If you have an approach that takes >1,000,000,000 tries to get right, then you still have to test, so even perfect scrambling won't help.
Ie, this approach might help if your alignment process is missing between 1 and 30 bits of information.
I am not sure what sort of proposals would do this, but amplified oversight might be one of them.