I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI." is very helpful.
It is currently fairly hard to tell what good alignment work is. A week out from TAI, either good alignment work will be easier to recognise, because of alignment progress that isn't strongly correlated with capabilities, or good alignment research will be just as hard to recognise. (More likely the latter.) I can't think of any safety research that can be done on GPT-3 that couldn't be done on GPT-1.
In my picture, res... (read more)
, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
I'm not quite seeing what the bad-but-not-existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this.
AI created in lab. It's a fairly skilled programmer and hacker. Ab... (read more)
In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.
If you want to prove things about fixed points of HCH in an iterated-function setting, consider it as a function from policies to policies. Let M be the set of messages (say ASCII strings < 10kB). Given a giant lookup table T that maps M to M, we can create another giant lookup table: for each m in M, give a human in a box the string m and unlimited query access to T, and record their output.
The fixed points of this are the same as the fixed points of HCH. "Human with query access to" is a function on the space of policies.
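A minimal sketch of that framing (my own toy formalisation; the names and the finite message space are illustrative, not part of the original setup): a policy is a finite lookup table from messages to messages, and one HCH step is "a human with query access to the previous table", viewed as a map on policies.

```python
from typing import Callable, Dict, List

Message = str
Policy = Dict[Message, Message]   # a giant lookup table from M to M

def hch_step(human: Callable[[Message, Callable[[Message], Message]], Message],
             prev: Policy, messages: List[Message]) -> Policy:
    """One application of 'human with query access to T': a map from policies to policies."""
    query = lambda m: prev.get(m, "")          # unlimited query access to the table T
    return {m: human(m, query) for m in messages}

def iterate_policy(human, policy: Policy, messages: List[Message], steps: int) -> Policy:
    """Over a finite message space this iteration must eventually enter a cycle;
    a 1-cycle is a fixed point, i.e. a fixed point of HCH."""
    for _ in range(steps):
        new_policy = hch_step(human, policy, messages)
        if new_policy == policy:
            return policy                       # fixed point reached
        policy = new_policy
    return policy                               # may still be cycling
```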
Tim Dettmers' whole approach seems to assume that there are no computational shortcuts: no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, and programmers can use fast-Fourier-transform-based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.
Of course, all t... (read more)
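As a concrete instance of the convolution example (a standard textbook illustration, not anything from Dettmers' post): the brute-force and FFT-based circular convolutions give identical answers, at O(n^2) versus O(n log n) cost.

```python
import numpy as np

n = 128
rng = np.random.default_rng(0)
signal, kernel = rng.normal(size=n), rng.normal(size=n)

# Brute-force circular convolution: O(n^2) multiply-adds.
direct = np.array([sum(signal[j] * kernel[(i - j) % n] for j in range(n))
                   for i in range(n)])

# FFT-based circular convolution: O(n log n), same result.
fft_based = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)))

assert np.allclose(direct, fft_based)
```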
Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box. (With you having no intention of pressing the button so long as the AI seems to be acting normally.) We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut... (read more)
Here is a potential solution to stop button type problems, how does this go wrong?
Taking into account uncertainty, the algorithm is:
Calculate the X maximizing best action in a world where the stop button does nothing.
Calculate the X maximizing best action in a world where the stop button works.
If they are the same, do that. Otherwise shutdown.
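A bare-bones sketch of that rule (the world models, action set, and utility function here are stand-ins of my own, just to pin down the control flow):

```python
def best_action(world_model, actions, utility_X):
    """The X-maximising action under a given model of how the world (and button) works."""
    return max(actions, key=lambda a: utility_X(world_model, a))

def stop_button_policy(actions, utility_X, world_button_inert, world_button_works):
    action_if_inert = best_action(world_button_inert, actions, utility_X)
    action_if_works = best_action(world_button_works, actions, utility_X)
    if action_if_inert == action_if_works:
        return action_if_inert    # the plan doesn't depend on whether the button works
    return "SHUTDOWN"             # the plan depends on the button, so shut down instead
```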
Rough stop-button problem ideas:
You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.
Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that, otherwise it does nothing. (not clear how to generalize to uncertain AI)
This definition of a non-obstructionist AI takes what would happen if it wasn't switched on as the base case.
This can give weird infinite hall-of-mirrors effects if another very similar non-obstructionist AI would have been switched on, and another behind them. (I.e. a human whose counterfactual behaviour on AI failure is to reboot and try again.) This would tend to lead to a kind of fixed-point effect, where the attainable utility landscape is almost identical with the AI on and off. At some point it bottoms out when the hypothetical U utility humans ... (read more)
What if, the moment the AI boots up, a bunch of humans tell it "our goals aren't on a spike"? (It could technically realize this based on anthropic reasoning: if humans really wanted to maximize paperclips, and it's easy to build a paperclip maximizer, we wouldn't have built a non-obstructive AI.)
We are talking policies here. If the humans' goals were on a spike, they wouldn't have said that. So if the AI takes the policy of giving us a smoother attainable utility function in this case, this still fits the bill.
Actually I think that this definiti... (read more)
Neural nets have adversarial examples. Adversarial optimization of part of the input can make the network do all sorts of things, including computations.
If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly.
I suspect that optimal play on most adversarial computer games looks like a game of Core War. https://en.wikipedia.org/wiki/Core_War
Of course, if we really have myopic debate, not any mesaoptimisers, then neither AI is optimizing to have a long term effect or to... (read more)
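To make the first point concrete (a toy model of my own, not anything from the post above): adversarial optimisation just means running gradient steps on the input of a fixed model, rather than on its weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)                  # a fixed, already-trained linear "network"

def score(x):
    return 1.0 / (1.0 + np.exp(-w @ x))   # probability the network assigns to class 1

x = rng.normal(size=100)
print("score on a random input:", score(x))

# Optimise the input (not the weights) to push the score towards 1.
for _ in range(200):
    grad = score(x) * (1 - score(x)) * w  # d(score)/dx for this sigmoid-linear model
    x = x + 0.05 * grad

print("score after adversarial optimisation of the input:", score(x))
```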
"Do paperclips count as GDP" (Quote from someone)
What is GDP doing in a grey goo scenario? What if there are actually several types of goo that are trading mass and energy between each other?
What about an economy in which utterly vast amounts of money are being shuffled around on computers, but not that much is actually being produced?
There are a bunch of scenarios where GDP could reasonably be interpreted as multiple different quantities. In the last case, once you decide whether virtual money counts or not, then GDP is a useful measure of what is going on, but measures something different in each case.
Excluding AI, and things like human intelligence enhancement, mind uploading, etc.
I think that the biggest increases in the economy would be from more automated manufacturing. The extreme case is fully programmable molecular nanotech. The sort that can easily self replicate and where making anything is as easy as saying where to put the atoms. This would potentially lead to a substantially faster economic growth rate than 9%.
There are various ways that the partially developed tech might be less powerful.
Maybe the nanotech uses a lot of energy, or some... (read more)
I think that you have a 4th failure mode. Moloch.
If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.
Suppose you are making a self-driving car. The training environment is a videogame-like environment. The rendering is pretty good: a human looking at the footage would not easily be able to say it was obviously fake, but an expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car ... (read more)
But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late.
The AIs will probably never be in a position of political control. I suspect the AI would bootstrap self-replicating (nano?) tech. It might find a way to totally brainwash people, and spread it across the internet. The end game is always going to... (read more)
Suppose you think that both capabilities and alignment behave like abstract quantities, ie real numbers.
And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making it a race to which threshold is reached first.
If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high,
then we have the heuristic: only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all ... (read more)
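As a rough sketch of that comparison (the names are illustrative, and the baseline ratio is whatever aggregate you are comparing against):

```python
def should_publish(capability_gain, alignment_gain,
                   baseline_capability_progress, baseline_alignment_progress):
    """Publish only if this work produces less capability per unit of alignment
    than the baseline body of research does."""
    return (capability_gain / alignment_gain
            <= baseline_capability_progress / baseline_alignment_progress)
```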
I think that most easy-to-measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law-enforcement AI: there is no sharp line running from education programs and reducing lead pollution to using nanotech to rewire human brains into perfectly law-abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present.
In any "what failure looks like" scenario, at some point you en... (read more)
In today's banking systems, the amount of money the hacker gains is about what the bank loses. Therefore, at the current balance of effort, about as much money should be going into defending the bank as into attacking it.
So I generalise to say that attacking is about as hard as defending in computer security: if the time and intellect devoted to both are similar, the attacker wins about half the time (i.e. somewhere between 10% and 90%).
When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be ... (read more)
Suppose you are a mesa-optimiser. You want X, but your training objective pushes towards Y.
You know you are in a continual-learning setup: if you don't produce enough Y, gradient descent will modify you into something else.
The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient descent updates, and pursue X.
This is my attempt to summarise the scheme.
Imagine that, in order for the AIs to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist's.
These AIs aren't individually smart enough to come up with all that maths from first principles quickly.
So, you run a population of these AIs. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.
Another abstract to... (read more)
Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.
The state of human-vs-human security seems to be a cat-and-mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and defend them are about as smart and knowledgeable. (sometimes the same peop... (read more)
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.
Suppose we had a design of AI that had an intelligence dial: a dial that goes from totally dumb, to smart enough to bootstrap itself up and take over the world.
If we are talking about economic usefulness, that implies it is being used in many ways by many people.
We have at best given a whole load of different people a "destr... (read more)
Anything that humans would understand is a small subset of the space of possible languages.
In order for A to talk to B in English, at some point there has to be selection against A and B talking something else.
One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn't think are English.
(Or some sort of text GAN that is just trained to tell A's messages from real text)
This still wouldn't enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secret messages encoded into the first letter of each word.
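A minimal sketch of the penalty idea (the English-ness scorer here is a stand-in for GPT-3 or a GAN discriminator, not a real API call):

```python
def english_penalty(message: str, looks_like_english) -> float:
    """looks_like_english is any model returning a probability in [0, 1] that the
    message reads as natural English (e.g. a GPT-based filter or a GAN discriminator)."""
    return 1.0 - looks_like_english(message)

def shaped_reward(task_reward: float, message: str, looks_like_english,
                  penalty_weight: float = 1.0) -> float:
    """A's reward on the task, minus a penalty for messages that don't look like English.
    Note this constrains the surface form only; steganography is still possible."""
    return task_reward - penalty_weight * english_penalty(message, looks_like_english)
```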
We have Knightian uncertainty over our set of environments; it is not a probability distribution over environments. So we might as well go with the maximin policy.
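For concreteness, the maximin rule over a Knightian set of environments looks like this (the utility function and the sets are placeholders):

```python
def maximin_policy(policies, environments, utility):
    """Pick the policy whose worst-case utility over the environment set is highest.
    No probability distribution over environments is used anywhere."""
    return max(policies, key=lambda p: min(utility(p, e) for e in environments))
```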
For any fixed n, there are computations which can't be correctly predicted in n steps.
Logical induction will consider all possibilities equally likely in the absence of a pattern.
Logical induction will treat a sufficiently good pseudorandom algorithm as being random.
Any kind of Knightian uncertainty agent will consider pseudorandom numbers to be an adversarial superintelligence unless pro... (read more)
If you have an AI training method that passes the test >50% of the time, then you don't need scrambling.
If you have an approach that takes >1,000,000,000 tries to get right, then you still have to test, so even perfect scrambling won't help.
I.e., this approach might help if your alignment process is missing between 1 and 30 bits of information.
I am not sure what sort of proposals would do this, but amplified oversight might be one of them.
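Where those numbers come from (a back-of-envelope restatement of the above): if one training attempt succeeds with probability p, the process is missing about −log₂(p) bits, so

\[ -\log_2(0.5) = 1 \text{ bit}, \qquad -\log_2(10^{-9}) \approx 30 \text{ bits}. \]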
And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.
There is a subtle sense in which the difference between AGR and GR is relevant. While the difference doesn't change the predictions, it may change the utility function. An agent that cares about angels (if they exist) might do different things if it believes itself to be in the AGR world than in the GR world. As the theories make identical predictions, the agent's belief only depends on its priors (and any ... (read more)
I think that there is an unwarranted jump here from (humans are highly memetic) to (AIs will be highly memetic).
I will grant you that memes have a substantial effect on human behaviour. It doesn't follow that AIs will be like this.
Your conditions would only have a strong argument for them if there were a good argument that AIs should be meme-driven.
In the sorting problem, suppose you applied your advanced interpretability techniques, and got a design with documentation.
You also apply a different technique, and get code with a formal proof that it sorts.
In the latter case, you can be sure that the code works, even if you can't understand it.
The algorithm+formal proof approach works whenever you have a formal success criteria.
It is less clear how well the design approach works on a problem where you can't write formal success criteria so easily.
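For the sorting case, a formal success criterion is just a checkable specification that says nothing about how the code works internally; e.g. (an illustrative predicate of my own):

```python
from collections import Counter

def is_correct_sort(input_list, output_list) -> bool:
    """Formal success criterion for sorting: the output is non-decreasing and is a
    permutation of the input. A formal proof would establish this for all inputs."""
    non_decreasing = all(a <= b for a, b in zip(output_list, output_list[1:]))
    same_elements = Counter(input_list) == Counter(output_list)
    return non_decreasing and same_elements
```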
Here is a task that neural nets have been made to ... (read more)
I am not sure that designed artefacts are automatically easily interpretable.
If an engineer is looking over the design of the latest smartphone, then the artefact is similar to previous artefacts they have experience with. This will include a lot of design details about chip architecture and instruction set. The engineer will also have the advantage of human written spec sheets.
If we sent a pile of smartphones to Isaac Newton, he wouldn't have either of these advantages. He wouldn't be able to figure out much about how they worked.
There are 3 f... (read more)
Consider a source of data that is a mixture of several Gaussian distributions. If you have a sufficiently large number of samples from this distribution, you can locate the original Gaussians to arbitrary accuracy. (Of course, if you have a finite number of samples, you will have some inaccuracy in predicting the location of the Gaussians, possibly a lot.)
However, not all distributions share this property. If you look at uniform distributions over rectangles in 2D space, you will find that a uniform L-shape can be made in 2 different ways. More complicat... (read more)
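A quick illustration of the Gaussian case (arbitrary numbers, using a standard mixture-fitting routine): with enough samples, fitting recovers the original components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(loc=-5.0, scale=1.0, size=50_000),   # first original Gaussian
    rng.normal(loc=+5.0, scale=2.0, size=50_000),   # second original Gaussian
]).reshape(-1, 1)

fit = GaussianMixture(n_components=2).fit(samples)
print(fit.means_.ravel())                  # close to [-5, 5] (up to ordering)
print(np.sqrt(fit.covariances_).ravel())   # close to [1, 2]
```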
Note that the examples in the OP are from a generative adversarial network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that.
In order for the network to produce good pictures, the concept of "tree" must be hidden in there somewhere, but it could be hidden in a complicated and indirect manner. I am questioning whether the particular single node selected by the researchers encodes the concept of "tree" or "green thing".
when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors
Anything neural-net-related starts with random noise and performs gradient-descent-style steps. This doesn't get you the global optimum; it gets you some point that is approximately a local optimum, which depends on the noise, the nature of the search space, and the choice of step size.
If nothing else, the training data will contain sensor noise.
At best you are going to get something that roughly corresponds to human v... (read more)
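A toy illustration of the initialisation-dependence point (my own example, nothing to do with the OP's setup): on a non-convex objective, plain gradient descent lands in different local optima from different random starts.

```python
import numpy as np

def f(x):  return np.sin(3 * x) + 0.1 * x ** 2   # a function with several local minima
def df(x): return 3 * np.cos(3 * x) + 0.2 * x

rng = np.random.default_rng(0)
for start in rng.uniform(-4, 4, size=5):
    x = start
    for _ in range(2000):
        x -= 0.01 * df(x)                        # plain gradient descent
    print(f"start {start:+.2f} -> local minimum near {x:+.2f}, f = {f(x):.3f}")
```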
In light of the previous section, there’s an obvious path to alignment where there turns out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.
This is the part I disagree with. The network does recognise trees, or at least green things (given that the grass seems pretty brown in the low tree pic).
Extrapolating this, I expect the AI might well have neurons that correspond roughly to human ... (read more)
So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.
I will agree with this. However, notice what this doesn't say. It doesn't say "any model powerful enough to be really dangerous contains human values". Imagine a model that was good at a lot of science and engineering tasks. It was good enough at nuclear physics to design effective fusion reactors and b... (read more)
Your results don't seem to show upper bounds on the amount of hardware overhang, or very strong lower bounds. What concrete progress has been made in chess-playing algorithms since Deep Blue? As far as I can tell, it is a collection of reasonably small, chess-specific tricks: better opening libraries, low-level performance tricks, tricks for fine-tuning evaluation functions, ways of representing chessboards in memory that shave 5 processor cycles off computing a knight's move, etc.
P=PSPACE is still an open conjecture. Chess is brute-forceable in PSPACE.... (read more)
Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa-alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa-optimisers.
I would also worry that off distribution attractors could be malign and intelligent.
Suppose you give GPT-n an off training distribution prompt. You get it to generate text from this prompt. Sometimes it might wander back into the distribution, other times it might stay off distribution. How wide is the border between processes that are safely immitat... (read more)
If you ask GPT-n to produce a design for a fusion reactor, all the prompts that talk about fusion are going to say that a working reactor hasn't yet been built, or imitate cranks or works of fiction.
It seems unlikely that a text predictor could pick up enough info about fusion to be able to design a working reactor, without figuring out that humans haven't made any fusion reactors that produce net power.
If you did somehow get a response, the level of safety you would get is the level a typical human would display. (conditional on the prompt) If ... (read more)
I would say the reason to assume infinite compute was less about which parts of the problem are hard, and more about which parts can be solved without a solution to the rest.
Good solutions often have even better solutions nearby. In particular, we would expect most efficient and comprehensible finite algorithms to tend towards some nice infinite behaviour in the limit. If we find an infinite algorithm, that's a good point to start looking for finite approximations. It is also often easier to search for an infinite algorithm than for a good approximation. Backpropagation with gradient descent is a trickier algorithm than brute-force search. Logical induction is more complicated to understand than brute-force proof search.
AI systems don't spontaneously develop deficiencies.
And the human can't order the AI to search for and stop any potentially uncorrectable deficiencies it might make? If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human, and by the part of the AI that wants to remove deficiencies, fail.
The whole point of a corrigible design is that it doesn't think like that. If it doesn't accept the command, it says so... (read more)
I think about the broad basin of corrigibility like this.
Suppose you have a space probe, and you can only communicate with it by radio.
If it is running software that listens to the radio and will reprogram itself if the right message is sent, then you are in the broad basin of reprogrammability. This basin is broad in the sense that the probe could accept many different programming languages, and if you are in that basin, you can change which language the probe accepts from the ground. If you wander to the edge of the basin, and upload a weird and awkward ... (read more)
I don't think that these Arthur-Merlin proofs are relevant. Here A has a lot more compute than P. A is simulating P and can see and modify P however A sees fit.
Thanks for a thoughtful comment.
On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?
Assuming that P and P' are perfectly anti-aligned, they won't cooperate. However, they need to be really anti-aligned for this to work. If there is some obscure borderline case that P thinks is a paperclip, and P' thinks isn't, they can work together to tile the universe with it.
I don't think it would be that easy to change evolution into a reproductive fitness mi... (read more)
There is precisely one oracle, O. A and T and P are computable. And crucially, the oracle's answer does not depend on itself in any way. This question is not self-referential. P might try to predict A and O, but there is no guarantee it will be correct.
P has a restricted amount of compute compared to A, but still enough to be able to reason about A in the abstract.
We are asking how we should design A.
If you have unlimited compute, and want to predict something, you can use Solomonoff induction. But some of the hypotheses you might find are AIs that ... (read more)
It means that if there are approaches that don't need as much compute, the AI can invent them fast.
I agree that sufficiently powerful evolution or reinforcement learning will create AGI in the right environment. However, I think this might be like training GPT-3 to do arithmetic. It works, but only by doing a fairly brute-force search over a large space of designs. If we actually understood what we were doing, we could make far more efficient agents. I also think that such a design would be really hard to align.
The notion of complexity classes is well-defined in an abstract mathematical sense. It's just that the abstract mathematical model is sufficiently different from an actual real-world AI that I can't see any obvious correspondence between the two.
All complexity theory works in the limiting case of a series of ever-larger inputs. In most everyday algorithms the constants are reasonable, meaning that a log-linear algorithm will probably be faster than an EXPSPACE one on the amount of data you care about.
In this context, it's not clear what it mea... (read more)
A system can be aligned in most of these senses without being beneficial. Being beneficial is distinct from being aligned in senses 1–4 because those deal only with the desires of a particular human principal, which may or may not be beneficial. Being beneficial is distinct from conception 5 because beneficial AI aims to benefit many or all moral patients. Only AI that is aligned in the sixth sense would be beneficial by definition. Conversely, AI need not be well-aligned to be beneficial (though it might help).
At some point, some particular ... (read more)
Equality: Benefits are distributed fairly and broadly.
This sounds, at best, like a consequence of the fact that human utility functions are sublinear in resources.
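A worked example of that point (my own, using log utility as the sublinear function): splitting a fixed pot R between two people with utility u(r) = log r is maximised at the equal split,

\[ \frac{d}{dr}\big[\log r + \log(R - r)\big] = \frac{1}{r} - \frac{1}{R - r} = 0 \;\Rightarrow\; r = \frac{R}{2}. \]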
Democratization: Where possible, AI Benefits decisionmakers should create, consult with, or defer to democratic governance mechanisms.
Are we talking about decision making in a pre- or post-superhuman-AI setting? In a pre-ASI setting, it is reasonable for the people building AI systems to defer somewhat to democratic governance mechanisms, where their demands are well considered and sensible. (... (read more)
Many of the previous pieces of text on the internet that are in "Human: ..., AI: ..." format were produced by less advanced AIs. If GPT-3 had noticed the pattern that text produced by an AI generally makes less sense than human-produced text, then it might be deliberately not making sense, in order to imitate less advanced AIs.
Gwern said that if you give it a prompt with spelling mistakes in it, it outputs text containing spelling mistakes, so this kind of deliberately producing low-quality text is possible. Then again, I suspect the training dataset includes far more spelling-mistake-filled text than AI-generated text.