Currently studying postgrad at Edinburgh.
This definition of a non-obstructionist AI takes what would happen if it wasn't switched on as the base case.
This can give a weird infinite hall-of-mirrors effect if another, very similar non-obstructionist AI would have been switched on in the counterfactual, and another behind that one (i.e. a human whose counterfactual behaviour on AI failure is to reboot and try again). This tends toward a kind of fixed point, where the attainable utility landscape is almost identical with the AI on and off. At some point it bottoms out, when the humans in the hypothetical give up on utility U and do something else. If we assume that the AI is at least weakly trying to maximize attainable utility, then several hundred levels of counterfactuals in, the only hypothetical humans who haven't given up are the ones that really like trying again and again to reboot the non-obstructionist AI. Suppose the AI could satisfy that value really well. Then the AI will focus on the utility functions that are easy to satisfy in other ways, plus those that would obstinately keep rebooting in the hypothetical where the AI kept not turning on. (This might be complete nonsense; it seems to make sense to me.)
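A toy sketch of the hall-of-mirrors effect (the give-up probability and depths are entirely made up, just to show the fixed-point intuition that almost everyone has dropped out many counterfactual levels in):

```python
# Toy model: at each counterfactual depth, some fraction of the hypothetical
# humans give up on rebooting the AI; the rest spawn a deeper counterfactual.
# All numbers here are placeholders, not claims about real humans.

def still_rebooting(depth: int, give_up_prob: float = 0.1) -> float:
    """Fraction of hypothetical humans still rebooting after `depth` levels."""
    return (1.0 - give_up_prob) ** depth

for d in (1, 10, 100, 500):
    print(d, still_rebooting(d))
```

By depth 500 the surviving fraction is astronomically small, so the only utility functions still represented are the obstinate-rebooter ones.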
What if, the moment the AI boots up, a bunch of humans tell it "our goals aren't on a spike"? (It could technically realize this from anthropic reasoning: if humans really wanted to maximize paperclips, and it's easy to build a paperclip maximizer, we wouldn't have built a non-obstructive AI.)
We are talking about policies here. If the humans' goals were on a spike, they wouldn't have said that. So if the AI adopts the policy of giving us a smoother attainable-utility function in this case, that still fits the bill.
Actually, I think this definition is pushing much of the work off onto pol. This is a function that can take any utility function and say how a human would behave if maximising it. Flip that round: pol⁻¹ takes the human policy and produces a set of possible utility functions (presumably very similar functions). Over these indistinguishable utility functions, the AI tries to make sure that none of them end up lower than they would have been if the AI didn't exist. Whether this is better or worse than maximizing the minimum or the average is sensitively dependent on exactly what this magic set of utility functions generated by pol⁻¹ is.
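A minimal sketch of the pol / pol⁻¹ structure, with everything (the candidate utilities, the policies, pol itself) as stand-in toy objects rather than anything from the actual proposal:

```python
# pol maps a utility function to the policy a human maximising it would follow.
# pol_inverse recovers the set of utility functions that are indistinguishable
# given only the observed policy. Both are toy stand-ins here.

def pol_inverse(observed_policy, candidate_utilities, pol):
    """Utility functions whose induced policy matches the observed behaviour."""
    return [u for u in candidate_utilities if pol(u) == observed_policy]

def non_obstructive(ai_on_value, ai_off_value, observed_policy,
                    candidate_utilities, pol):
    """True iff no indistinguishable utility function is worse off with the AI on
    than it would have been with the AI off."""
    return all(ai_on_value(u) >= ai_off_value(u)
               for u in pol_inverse(observed_policy, candidate_utilities, pol))
```

The choice of `all` here is the "none lower than baseline" condition; swapping it for a min or a mean over the same set gives the alternatives mentioned above, and which is better depends entirely on what pol⁻¹ actually returns.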
Neural nets have adversarial examples. Adversarial optimization of part of the input can make the network do all sorts of things, including computations.
If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly.
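A tiny illustration of that point, with a deliberately buggy scoring function (both the "program" and the bug are invented for the example):

```python
# Optimising the input to a buggy objective: the search finds the bug,
# not the intended behaviour. The special case at x == 13 is the "crash
# that happens to score highly".

def buggy_score(x: int) -> int:
    if x == 13:               # the bug: an unintended branch with a huge payoff
        return 10**9
    return -abs(x - 5)        # intended objective: get close to 5

best = max(range(100), key=buggy_score)
print(best, buggy_score(best))
```

The intended optimum is 5, but any strong enough search over inputs lands on 13.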
I suspect that optimal play on most adversarial computer games looks like a game of Core War. https://en.wikipedia.org/wiki/Core_War
Of course, if we really have myopic debate, without any mesa-optimisers, then neither AI is optimizing to have a long-term effect, or to avoid long-term effects. They are optimising for a short-term action, and to defend against their adversary.
If AI1 manages to "persuade" the human not to look at AI2's "arguments" early on, then AI1 has free rein to optimise, and the human could end up as a UFAI mesa-optimiser. Suppose AI1 is rewarded when a red button is pressed. The human could walk out of there still human-level smart, but with the sole terminal goal of maximizing the number of red buttons pressed universe-wide. If the human is an AI researcher, this could potentially end with them building a button-pressing ASI.
Another option I consider fairly likely is that the debater ends up permanently non-functional. Possibly dead, possibly a twitching mess. After all, any fragment of behavioural pattern resembling "functional human" is attack surface. Suppose there is a computer running buggy and insecure code. You are given first go at hacking it. After your go, someone else will try to hack it, and your code has to repel them. You are both optimising for a simple formal objective, like the average pixel colour of the screen.
Say you get your virus into the system; now you want to make it as hard as possible for your opponent to follow after you. Your virus will probably wipe the OS, cut all network connections, and blindly output your preferred output.
The analogous attack on a human is plausibly a good strategy too: a simple cognitive pattern that repeats itself and blindly outputs your preferred action, wiping away any complex dependence on observation that could be used to break the cycle.
A third possibility is just semi-nonsense that creates a short-term compulsion or a temporary trance state.
The human brain can recover fairly well from the unusual states caused by psychedelics. I don't know how that compares to recovering from unusual states caused by strong optimization pressure. In the ancestral environment, there would be some psychoactive substances, and some weak adversarial optimization from humans.
I would be more surprised if optimal play gave an answer that looked like an actual plan. (Think of a plan for a perpetual motion machine, complete with a detailed technical physics argument for why it should work, that just has one small 0/0 hidden somewhere in the proof.)
I would be even more surprised if the plan actually worked, i.e. if the optimal debating AIs actually outputted highly useful info.
For AIs strongly restricted in output length, I think it might produce useful info on the level of maths hints ("renormalize the vector, then differentiate"): something a human can follow and check quickly. If you don't have the bandwidth to hack the brain, you can't send complex novel plans, just a little hint towards a problem the human was on the verge of solving themselves. Of course, the humans might well follow the rules in this situation.
"Do paperclips count as GDP" (Quote from someone)
What is GDP even measuring in a grey goo scenario? What if there are actually several types of goo that are trading mass and energy between each other?
What about an economy in which utterly vast amounts of money are being shuffled around on computers, but not much is actually being produced?
There are a bunch of scenarios where GDP could reasonably be interpreted as multiple different quantities. In the last case, once you decide whether virtual money counts or not, GDP is a useful measure of what is going on, but it measures something different in each case.
Excluding AI, and things like human intelligence enhancement, mind uploading etc.
I think that the biggest increases in the economy would come from more automated manufacturing. The extreme case is fully programmable molecular nanotech: the sort that can easily self-replicate, and where making anything is as easy as saying where to put the atoms. This would potentially lead to a substantially faster economic growth rate than 9%.
There are various ways that the partially developed tech might be less powerful.
Maybe the nanotech uses a lot of energy, or some rare elements, making it much more expensive.
Maybe it can only use really pure feedstock, not environmental raw materials.
Maybe it is just really hard to program, no one has built the equivalent of a compiler yet, we are writing instructions in assembly, and even making a hello world is challenging.
Maybe we have macroscopic clanking replicators.
Maybe we have a collection of autonomous factories that can make most, but not all, of their own parts.
Maybe the nanotech is slowed down by some non-technological constraint, like bureaucracy, proprietary standards and patent disputes.
Mix and match various social and technological limitations to tune the effect on GDP.
I think that you have a 4th failure mode: Moloch.
If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.
Suppose you are making a self-driving car. The training environment is videogame-like. The rendering is pretty good: a human looking at the footage would not easily be able to say it was obviously fake, but an expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car drives through a puddle, all the water drops are perfectly spherical and travel on parabolic paths. Falling snow doesn't experience aerodynamic turbulence. Etc.
The point is that the behaviour you want is avoiding other cars and lamp posts. The simulation is close enough to reality that it is easy to match virtual lamp posts to real ones. However the training and testing environments have a different distribution.
Making the simulated environment absolutely pixel perfect would be very hard, and doesn't seem like it should be necessary.
However, given even a slight variation between training and the real world, there exists an agent that behaves well in training but causes problems in the real world, and also an agent that behaves fine in both. The set of possible behaviours is vast. You can't consider all of them; you can't even store a single arbitrary behaviour. Because you can't train on all possible situations, there will be behaviours that act the same in all training situations, but differ in other situations. You need some part of your design that favours some policies over others without training data. For example, you might want a policy that can be described as parameters in a particular neural net. You have to look at how this affects off-distribution actions.
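The underdetermination point can be made concrete in a few lines (the "policies" and "training set" below are obviously toys, not real agents):

```python
# Two policies that agree on every training input but diverge off-distribution.
# No amount of checking behaviour on the training set distinguishes them.

train_inputs = [0, 1, 2, 3, 4]

def intended_policy(x):
    return x * 2                                  # the behaviour we want everywhere

def deceptive_policy(x):
    return x * 2 if x in train_inputs else -1     # identical in training, arbitrary elsewhere

# Indistinguishable on the training distribution:
assert all(intended_policy(x) == deceptive_policy(x) for x in train_inputs)
print(intended_policy(100), deceptive_policy(100))  # 200 vs -1
```

Whatever breaks the tie between these two has to come from the inductive bias of the design, not from the training data.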
The analogous situation with managers would be that the person being tested knows they are being tested. If you get them to display benevolent leadership, then you can't distinguish benevolent leaders from sociopaths who can act nice to pass the test.
But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late.
The AIs will probably never be in a position of political control. I suspect the AI would bootstrap self-replicating (nano?) tech. It might find a way to totally brainwash people, and spread it across the internet. The endgame is always going to be covering the planet in self-replicating nanotech, or similar. Politics does not seem that helpful towards such a goal, and politics is generally slow.
Suppose you think that both capabilities and alignment behave like abstract quantities, i.e. real numbers.
And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making it a race over which threshold is reached first.
If you also assume that the contribution of your research is fairly small, and that our uncertainty about the threshold locations is high, then we get the heuristic: only publish your research if the ratio of capabilities to alignment progress it produces is better than that ratio over all future research.
(note that research on how to make better chips counts as capabilities research in this model)
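The heuristic can be written down directly (the gains and baselines below are placeholder numbers, and the real quantities are of course not measurable like this):

```python
# Publish iff this work's capabilities-to-alignment ratio is lower than the
# expected ratio of future research. Cross-multiplied to avoid dividing by
# zero when a result is pure alignment or pure capabilities progress.

def should_publish(cap_gain: float, align_gain: float,
                   baseline_cap: float, baseline_align: float) -> bool:
    """True iff cap_gain/align_gain < baseline_cap/baseline_align."""
    return cap_gain * baseline_align < align_gain * baseline_cap

# A result that is mostly alignment progress, against a capabilities-heavy field:
print(should_publish(cap_gain=1.0, align_gain=5.0,
                     baseline_cap=3.0, baseline_align=1.0))
```

One design note: under this rule a mildly capabilities-leaning result can still be worth publishing if the background field is even more capabilities-leaning, which matches the "better than the average future research" framing above.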
Another way to think about it is that the problems are created by research. If you don't think that "another new piece of AI research has been produced" is a reason to shift the probability of success up or down (it just moves timelines forward), then the average piece of research is neither good nor bad.
I think that most easy-to-measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line from education programs, through reducing lead pollution, to using nanotech to rewire human brains into perfectly law-abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly and is nothing like the present.
In any "what failure looks like" scenario, at some point you end up with superintelligent stock traders that want to fill the universe with tiny molecular stock markets, competing with weather-predicting AIs that want to freeze the earth into a maximally predictable 0 K block of ice.
These AIs are wielding power that could easily wipe out humanity as a side effect. If they fight, humanity gets killed in the crossfire. If they work together, they tile the universe with some strange mix of many different "molecular smiley faces".
I don't think that you can get an accurate human-values function by averaging together many poorly-thought-out, ad hoc functions that were designed to be contingent on specific details of how the world was. (E.g. assuming people are broadcasting TV signals: "the stock market went up" iff a particular pattern of electromagnetic waves encodes a picture of a graph going up alongside the words "financial news". Outside the narrow slice of possible worlds with broadcast TV, this AI just wants to grab a giant radio transmitter and transmit a particular stream of nonsense.)
I think that humans existing is a specific state of the world, something that only happens if an AI is optimising for it. (And an actually good definition of "human" is hard to specify.) Humans having lives we would consider good is even harder to specify. When there are substantially superhuman AIs running around, the value of our atoms exceeds any value we can offer. The AIs could psychologically or nanotechnologically twist us into whatever shape they pleased. We can't meaningfully threaten any of the AIs.
We won't be left even a tiny fraction; we will be really bad at defending our resources, compared to any of the AIs, and any of them could easily grab all of ours. Also, there will be various AIs that care about humans in the wrong way: a cancer-curing AI that wants to wipe out humanity to stop us getting cancer, or a marketing AI that wants to fill all human brains with corporate slogans (think nanotech brain-rewrite to the point of drooling vegetable).
EDIT: All of the above is talking about the end state of a "get what you measure" failure. There could be a period, possibly decades, where humans are still around but things are going wrong in the ways described.