Sure, an AI that ignores what you ask, and implements some form of CEV or whatever isn't corrigible. Corrigibility is more following instructions than having your utility function.
You seem to believe that any plan involving what you call "godzilla strategies" is brittle. This is something I am not confidant in. Someone may find some strategy that can be shown to not be brittle.
Any network big enough to be interesting is big enough that the programmers don't have the time to write decorative labels. If you had some algorithm that magically produced a bays net with a billion intermediate nodes that accurately did some task, then it would also be an obvious black box. No one will have come up with a list of a billion decorative labels.
Retain or forget information and skills over long time scales, in a way that serves its goals. E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.
If memory is cheap, designing it to just remember everything may be a good idea. And there may be some architectural reason why choosing to forget things is hard.
Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence which goes on to strongly optimize every aspect of the world. This counts as a fairly small amount of optimization power, as the chance of me landing on those lines of code by pure chance isn't that small.
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
Flat out wrong. Its quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each have information) Its possible for there to be absolutely no mutual i... (read more)
This fails if there are closed timelike curves around.
There is of course a very general formalism, whereby inputs and outputs are combined into aputs. Physical laws of causality, and restrictions like running on a reversible computer are just restrictions on the subsets of aputs accepted.
Well covid was pretty much a massive obvious biorisk disaster. Did it lead to huge amounts of competence and resources being put into pandemic prevention?
My impression is not really.
I mean I also expect an AI accident that kills a similar number of people to be pretty unlikely. But https://www.lesswrong.com/posts/LNKh22Crr5ujT85YM/after-critical-event-w-happens-they-still-won-t-believe-you
I think we have a very long track record of embedding our values into law.
I mean you could say that if we haven't figured out how to do it well in the last 10,000 years, maybe don't plan on doing it in the next 10. That's kind of being mean though.
If you have a functioning arbitration process, can't you just say "don't do bad things" and leave everything down to the arbitration?
I also kind of feel that adding laws is going in the direction of more complexity. And we really want as simple as possible. (Ie the minimal AI that can sit in a MIRI basement... (read more)
First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary.
Sure, some of the failure modes mentioned at the bottom disappear when you do that.
I think, a stronger claim: that the set of laws worth having AI follow is so small or unimportant as to be not worth trying to follow. That seems unlikely.
If some law is so obviously a good idea in all possible circumstances, the AI will do it whether it ... (read more)
A(Y), with the objective of maximizing the value of Y's shares.
A sphere of self replicating robots, expanding at the speed of light. Turning all available atoms; including atoms that used to consist of judges, courts and shareholders; into endless stacks of banknotes. (With little "belongs to Y" notes pinned to them)
Yes the superintelligent AI could out-lawyer everyone else in the courtrooms if it wanted to. My background assumptions would more be that the AI develops nanotech, and can make every lawyer court and policeman in the world vanish the moment the AI sees fit. The human legal system can only effect the world to the extent that humans listen to it, and enforce it. With a superintelligent AI in play. This may well be king Canute commanding the sea to halt. (And a servant trying to bail the sea back with a teaspoon).
You don't really be considering the AI ... (read more)
Human written laws are written with our current tech level in mind. There are laws against spewing out radio noise on the wrong frequencies, laws about hacking and encryption, laws about the safe handling of radioactive material. There are vast swaths of safety regulations. This is stuff that has been written for current tech levels (How well would roman law work today?)
This kind of "law following AI" sounds like it will produce nonsensical results as it tries to build warp drives and dyson spheres to the letter of modern building codes. Followin... (read more)
Imagine a contract. "We the undersigned agree that AGI is a powerful, useful but also potentially dangerous technology. To help avoid needlessly taking the same risk twice we agree that upon development of the worlds first AI, we will stop all attempts to create our own AI on the request of the first AI or its creators. In return the creator of the first AI will be nice with it."
Then you aren't stopping all competitors. Your stopping the few people that can't cooperate.
That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)
A functioning Bayesian should have probably have updated to that position long before they actually have the A... (read more)
Suppose you develop the first AGI. It fooms. The AI tells you that it is capable of gaining total cosmic power by hacking physics in a millisecond. (Being an aligned AI, its waiting for your instructions before doing that.) It also tells you that the second AI project is only 1 day behind, and they have screwed up alignment.
In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have showed up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.
Lets consider the opposite. Imagine you are programming a self driving ca... (read more)
Consider the strategy "do whatever action you predict to maximize the electricity in this particular piece of wire in your reward circuitry". This is a very general cognitive pattern that would maximize reward in the training runs. Now there are many different cognitive patterns that maximize reward in the training runs. But this is one simple one, so its at least reasonably plausible it is used.
What I was thinking when I wrote it was more like. When someone proposes a fancy concrete and vacuum box, they are claiming that the fancy box is doing something. ... (read more)
Like 10s of atoms across. So you aren't scaling down that much. (Most of your performance gains are in being able to stack your chips or whatever.
One thing in the posts I found surprising was Eliezers assertion that you needed a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent.
However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any, it can't run a quantum computer factorizing large numbers better than factoring algorithms, but enough to design a nanomachine. (It has been trained to approxi... (read more)
I think that given good value learning, safety isn't that difficult. I think even a fairly halfharted attempt at the sort of Naive safety measures discussed will probably lead to non catastrophic outcomes.
Tell it about mindcrime from the start. Give it lots of hard disks, and tell it to store anything that might possibly resemble a human mind. It only needs to work well enough with a bunch of Miri people guiding it and answering its questions. Post singularity, a superintelligence can see if there are any human minds in the simulations it created when young and dumb. If there are, welcome those minds to the utopia.
I think you might be able to design advanced nanosystems without AI doing long term real world optimization.
Well a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.
Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.
Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently t... (read more)
Under the Eliezerian view, (the pessimistic view that is producing <10% chances of success). These approaches are basically doomed. (See logistic success curve)
Now I can't give overwhelming evidence for this position. Whisps of evidence maybe, but not an overwheming mountain of it.
Under these sort of assumptions, building a container for an arbitrary superintelligence such that it has only 80% chance of being immediately lethal, and a 5% chance of being marginally useful is an achievment.
(and all possible steelmannings, that's a huge space)
Lets say you use all these filtering tricks. I have no strong intuitions about whether these are actually sufficient to stop those kind of human manipulation attacks. (Of course, if your computer security isn't flawless, it can hack whatever computer system its on and bypass all these filters to show the humans arbitrary images and probably access the internet.)
But maybe you can at quite significant expense make a Faraday cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety. But Miri or whoever cou... (read more)
Firstly this would be AI's looking at their own version of the AI alignment problem. This is not random mutation or anything like it. Secondly I would expect there to only be a few rounds maximum of self modification that runs risk to goals. (Likely 0 rounds) Firstly damaging goals looses a lot of utility. You would only do it if its a small change in goals for a big increase in intelligence. And if you really need to be smarter and you can't make yourself smarter while preserving your goals.
You don't have millions of AI all with goals different from each other. The self upgrading step happens once before the AI starts to spread across star systems.
Error correction codes exist. They are low cost in terms of memory etc. Having a significant portion of your descendent mutate and do something you don't want is really bad.
If error correcting to the point where there is not a single mutation in the future only costs you 0.001% resources in extra hard drive, then <0.001% resources will be wasted due to mutations.
Evolution is kind of stupid compared to super-intelligences. Mutations are not going to be finding improvements. Because the superintelligence will be designing their own hardware and the hardwa... (read more)
Darwinian evolution as such isn't a thing amongst superintelligences. They can and will preserve terminal goals. This means the number of superintelligences running around is bounded by the number humans produce before the point the first ASI get powerful enough to stop any new rivals being created. Each AI will want to wipe out its rivals if it can. (unless they are managing to cooperate somewhat) I don't think superintelligences would have humans kind of partial cooperation. Either near perfect cooperation, or near total competition. So this is a scenario where a smallish number of ASI's that have all foomed in parallel expand as a squabbling mess.
I don't think this research, if done, would give you strong information about the field of AI as a whole.
I think that, of the many topics researched by AI researchers, chess playing is far from the typical case.
It's [chess] not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.
An unusually long history implies unusually slow progress. There are problems that computers couldn't do at all a few years ago that they can do fairly efficiently now. Are there pro... (read more)
In a game with any finite number of players, and any finite number of actions per player.
Let O=A1×A2×... the set of possible outcomes.
Player i implements policy Pi:P(O)→Ai . For each outcome in o∈O , each player searches for proofs (in PA) that the outcome is impossible. It then takes the set of outcomes it has proved impossible, and maps that set to an action.
There is always a unique action that is chosen. Whatsmore, given oracles for
Ie the set of actions you might take if you can pr... (read more)
The next task to fall to narrow AI is adversarial attacks against humans. Virulent memes and convincing ideologies become easy to generate on demand. A small number of people might see what is happening, and try to shield themselves off from dangerous ideas. They might even develop tools that auto-filter web content. Most of society becomes increasingly ideologized, with more decisions being made on political rather than practical grounds. Educational and research institutions become full of ideologues crowding out real research. There are some w... (read more)
I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely.
In particular, I think this is the sort of trick that produces usually fairly good results when you have a fixed environment, and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desire, to get something that kind of works. When the environment will be changing rapidly; when the relative capabilities of cognitive modules are highly uncertain and when self mod... (read more)
There are various ideas along the lines of "however much you tell the AI X it just forgets it". https://www.lesswrong.com/posts/BDXvRp8w9T8KkDw5A/policy-restrictions-and-secret-keeping-ai
I think that would be the direction to look in if you have a design tha'ts safe as long as it doesn't know X.
There may be predictable errors in the training data, such that instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).
If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).
Maybe you did. I find it hard to distinguish inventing and half remembering ideas.
If the training procedure either
Then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the english into network yourself, then all it needs to use is some error correction. Even without that, neural net struc... (read more)
I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non self modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment its turned on.
So are we talking about a program that could change from an X er to a Y er with a small change in the code written, or with a small amount of extra observation of the world?
There seems to be some technical problem with the link. It gives me a "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people).
We hope you had a really great time! :)" message. Edit: As of a few minutes after stated start time. It worked last week.
My picture of an X and only X er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all.
Getting the lexicographically first formal ZFC proof of say the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in compute... (read more)
On its face, this story contains some shaky arguments. In particular, Alpha is initially going to have 100x-1,000,000x more resources than Alice. Even if Alice grows its resources faster, the alignment tax would have to be very large for Alice to end up with control of a substantial fraction of the world’s resources.
This makes the hidden assumption that "resources" is a good abstraction in this scenario.
It is being assumed that the amount of resources an agent "has" is a well defined quantity. It assumes agent can only grow their resources slowly by ... (read more)
+1. Another way of putting it: This allegation of shaky arguments is itself super shaky, because it assumes that overcoming a 100x - 1,000,000x gap in "resources" implies a "very large" alignment tax. This just seems like a weird abstraction/framing to me that requires justification.
I wrote this Conquistadors post in part to argue against this abstraction/framing. These three conquistadors are something like a natural experiment in "how much conquering can the few do against the many, if they have various advantages?" (If I just selected a lone conqueror, ... (read more)
Firstly, why is the rest of the starting state random? In a universe where info can't be destroyed, like this one, random=max entropy. AI is only possible in this universe because the starting state is low entropy.
Secondly, reaching an arbitrary state can be impossible for reasons like conservation of mass energy momentum and charge. Any state close to an arbitrary state might be unreachable due to these conservation laws. Ie a state containing lots of negitive electric charges, and no positive charges being unreachable in our universe.
Well, q... (read more)
I think that it isn't clear what constitutes "fully understanding" an algorithm.
Say you pick something fairly simple, like a floating point squareroot algorithm. What does it take to fully understand that.
You have to know what a squareroot is. Do you have to understand the maths behind Newton raphson iteration if the algorithm uses that? All the mathematical derivations, or just taking it as a mathematical fact that it works. Do you have to understand all the proofs about convergence rates. Or can you just go "yeah, 5 iterations seems to be eno... (read more)
"These technologies are deployed sufficiently narrowly that they do not meaningfully accelerate GWP growth." I think this is fairly hard for me to imagine (since their lead would need to be very large to outcompete another country that did deploy the technology to broadly accelerate growth), perhaps 5%?
I think there is a reasonable way it could happen even without an enormous lead. You just need either,
For example, suppose it is obviou... (read more)
I don't think technological deployment is likely to take that long for AI's. With a physical device like a car or fridge, it takes time for people to set up the factories, and manufacture the devices. AI can be sent across the internet in moments. I don't know how long it takes google to go from say an algorithm that detects streets in satellite images to the results showing up in google maps, but its not anything like the decades it took those physical techs to roll out.
The slow roll-out scenario looks like this, AGI is developed using a technique that fu... (read more)
I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI." is very helpful.
It is currently fairly hard to tell what is good alignment work. A week from TAI, then either, good alignment work will be easier to recognise because of alignment progress not strongly correlated with capabilities, or good alignment research is just as hard to recognise. (More likely the latter) I can't think of any safety research that can be done on GPT3 that can't be done on GPT1.
In my picture, res... (read more)
, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
I'm not seeing quite what the bad but not existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this.
AI created in lab. Its a fairly skilled programmer and hacker. Ab... (read more)
In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have occilations in what is said within a policy fixed point.
If you want to prove things about fixed points of HCH in an iterated function setting, consider it a function from policies to policies. Let M be the set of messages (say ascii strings < 10kb.) Given a giant look up table T that maps M to M, we can create another giant look up table. For each m in M , give a human in a box the string m, and unlimited query access to T. Record their output.
The fixed points of this are the same as the fixed points of HCH. "Human with query access to" is a function on the space of policies.
Tim Dettmers whole approach seems to be assuming that there are no computational shortcuts. No tricks that programmers can use for speed where evolution brute forced it. For example, maybe a part of the brain is doing a convolution by the straight forward brute force algorithm. And programmers can use fast fourier transform based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.
Of course, all t... (read more)
Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, that is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box. (With you having no intention of pressing the button so long as the AI seems to be acting normally. ) We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut... (read more)
Here is a potential solution to stop button type problems, how does this go wrong?
Taking into account uncertainty, the algorithm is.
Calculate the X maximizing best action in a world where the stop button does nothing.
Calculate the X maximizing best action in a world where the stop button works.
If they are the same, do that. Otherwise shutdown.
rough stop button problem ideas.
You want an AI that believes its actions can't effect the button. You could use causal counterfactuals. An imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Wierd behaviour, not recomended) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed." if you can figure out logical counterfactuals.
Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that, otherwise it does nothing. (not clear how to generalize to uncertain AI)