Donald Hobson

Cambridge maths student dph39@cam.ac.uk

Donald Hobson's Comments

Will AI undergo discontinuous progress?

I am not convinced that the distinction between continuous and discontinuous approaches is a feature of the territory. Zoom in with sufficient detail and you see the continuous wavefunction of an electron interacting with the continuous wavefunction of a silicon atom. Zoom out to evolutionary timescales and the jump from hominids with pointy sticks to ASI is almost instant. The mathematical definition of continuity relies on your function being mathematically formal. Is the distance from the Earth to the Moon an even number of Planck lengths? Well, there are a huge number of slightly different measurements you could make, and depending on when you measure, exactly which points you measure between, and how you deal with relativistic length contraction, the answer will be different. In a microsecond in which the code that is a fooming AI is doing garbage collection, is AI progress happening? You have identified an empirical variable called AI progress, but whether or not it is continuous depends on exactly how you fill in the details.

Imagine superintelligence has happened and we are discussing the details afterwards. We were in a world basically like this one, and then someone ran some code, which gained total cosmic power within a millisecond. Someone tries to argue that this was a continuous improvement, just a very fast one. What evidence would convince you one way or the other on this?

Will AI undergo discontinuous progress?

Suppose that different tasks take different levels of AI to do better than humans.

Firstly AI can do arithmetic, then play chess, then drive cars, etc. Let's also assume that AI is much faster than humans. So imagine that AI research ability rises from almost nothing to superhuman over the course of a year. A few months in, it's inventing things like linear regression: impressive, but not as good as current human work on AI. There are a few months where the AI is worse than a serious team of top researchers, but better than an intern. So if you have a niche use for AI, that niche can be automated automatically. The AI-research AI designs a widget-building AI. The humans could have made a widget-building AI themselves, but so few widgets are produced that it wasn't worth it.

Then the AI becomes as good as a top human research team and FOOM. How crazy the world gets before foom depends on how much other stuff is automated first. Is it easier to make an AI teacher, or an AI AI researcher? Also remember that bureaucratic delays are a thing: there is a difference between having an AI that does medical diagnosis in a lab, and it being used in every hospital.

Plausibly, almost every powerful algorithm would be manipulative

How dangerous would you consider a person with basic programming skills and a hypercomputer? I mean I could make something very dangerous, given hypercompute. I'm not sure if I could make much that was safe and still useful. How common would it be to accidentally evolve a race of aliens in the garbage collection?

At the moment, my best guess at what powerful algorithms look like is something that lets you maximize functions without searching through all the inputs. Gradient descent can often find a high point without that much compute, so is more powerful than random search. If your powerful algorithm is more like really good computationally bounded optimization, I suspect it will be about as manipulative as brute forcing the search space. (I see no strong reason for strategies labeled manipulative to be that much easier or harder to find than those that aren't.)
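As a toy illustration of what I mean (pure NumPy, with an arbitrary quadratic objective and arbitrary budgets chosen purely for demonstration), gradient ascent gets far closer to the maximum than random search does for the same number of function evaluations:

```python
import numpy as np

def f(x):
    return -np.sum((x - 3.0) ** 2)  # maximum value 0, attained at x = (3, ..., 3)

dim, budget = 20, 1000
rng = np.random.default_rng(0)

# Random search: spend the whole budget on independent uniform samples.
best_random = max(f(rng.uniform(-10, 10, dim)) for _ in range(budget))

# Gradient ascent: spend the same budget on steps from one random start.
x = rng.uniform(-10, 10, dim)
for _ in range(budget):
    grad = -2.0 * (x - 3.0)        # analytic gradient of f
    x = x + 0.01 * grad

print(best_random, f(x))           # gradient ascent ends up much closer to 0
```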

Would I think for ten thousand years?

People will have to do a lot of maths and philosophy to get an AI system that works at all.

Suppose you have a lead of one week over any UFAI projects, and you have your AI system to the point where it can predict what you would do in a box. (Actually, we can say the AI has developed mind uploading tech + lots of compute.) The human team needs, say, 5 years of thinking to come up with better metaethics, defenses against value drift or whatever. You want to simulate the humans in some reasonably human-friendly environment for a few years to work this thing out. You pick a nice town, and ask the AI to create a virtual copy of the town. (More specifically, you randomly sample from the AI's probability distribution, after conditioning on enough data that the town will be townlike.) The virtual town is created with no people in it except the research team. All the services are set to work without any maintenance (water in virtual pipes, food in virtual shops, virtual internet works). The team of people uploaded into this town is at least 30, ideally a few hundred, including plenty of friends and family.

This "virtual me in a box" seems likely to be useful and unlikely to be dangerous. I agree that any virtual box trick that involves people thinking for a long time compared to current lifespans is dangerous. A single person trapped in low res polygon land would likely go crazy from the sensory deprivation.

You need an environment with a realistic level of socializing and leisure activities to support psychologically healthy humans. Any well done "virtual me in a box" is going to look more like a virtual AI safety camp or research department than one person in a blank white room containing only a keyboard.

Unfortunately, all those details would be hard to hard-code manually. You seem to need an AI that can be trusted to follow reasonably clear and specific goals without adversarial optimization. You want a virtual park; manually creating it would be a lot of hard work (see current video games). You need an AI that can fill in thousands of little details in a manner not optimized to mess with humans. This is not an especially high bar.

Inner alignment requires making assumptions about human values

Suppose that in building the AI, we make an explicitly computable hardcoded value function. For instance, if you want the agent to land between the flags, you might write an explicit, hardcoded function that returns 1 if between a pair of yellow triangles, else 0.

In standard machine learning, information is lost because you have a full value function but only train the network on evaluations of that function at a finite number of points.

Suppose I don't want the lander to land on the astronaut, who is wearing a blue spacesuit. I write code that says that any time there is a blue pixel below the lander, the utility is -10.
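As a rough sketch of what such a hand-written value function might look like (the observation format, positions and colour encoding here are hypothetical stand-ins, not any real environment's API):

```python
import numpy as np

BLUE = np.array([0, 0, 255])

def hardcoded_utility(obs):
    """+1 for being between the flags, -10 if any blue pixel (the astronaut's
    suit) appears below the lander, 0 otherwise."""
    frame = obs["frame"]                          # H x W x 3 array of RGB pixels
    below_lander = frame[obs["lander_y"]:, :, :]  # rows underneath the lander
    if np.any(np.all(below_lander == BLUE, axis=-1)):
        return -10.0
    if obs["left_flag_x"] < obs["lander_x"] < obs["right_flag_x"]:
        return 1.0
    return 0.0
```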

Suppose that there are no astronauts in the training environment, in fact nothing blue whatsoever. A system trained using some architecture that only relies on the utility of what it sees in training would not know this rule. A system that can take the code and read it would spot this info, but might not care about it. A system that generates potential actions, predicts what the screen would look like if it took those actions, and then sends that prediction to the hardcoded utility function, with automatic shutdown if the utility is negative, would avoid this problem.
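A rough sketch of that last architecture, assuming a learned world model `predict_next_frame` and a `shutdown` hook (both hypothetical components), and reusing a hardcoded utility function like the one above. This is one reading of "automatic shutdown if the utility is negative":

```python
def choose_action(state, candidate_actions, predict_next_frame, utility_fn, shutdown):
    # Score each candidate action by the hardcoded utility of its *predicted* outcome.
    scored = []
    for action in candidate_actions:
        predicted_obs = predict_next_frame(state, action)  # learned world model
        scored.append((utility_fn(predicted_obs), action))
    best_utility, best_action = max(scored, key=lambda pair: pair[0])
    if best_utility < 0:
        shutdown()   # refuse to act if even the best candidate is predicted to be bad
        return None
    return best_action
```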

If, hypothetically, I can take any programmed function f: observations -> reward and make a machine learning system that optimizes that function, then inner alignment has been solved.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

By adding random noise, I meant adding wiggles to the edge of the set in thingspace; for example, adding noise to "bird" might exclude "ostrich" and include "duck-billed platypus".
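A toy sketch of what I mean, if you pretend a concept is just a threshold on some score over feature vectors (the `birdiness` weights and threshold are made up for illustration):

```python
import hashlib

def birdiness(features):
    return (0.5 * features.get("has_feathers", 0)
            + 0.3 * features.get("lays_eggs", 0)
            + 0.4 * features.get("flies", 0)
            - 0.4 * features.get("has_fur", 0))

def is_bird(features):
    return birdiness(features) > 0.6             # the crisp concept boundary

def noisy_is_bird(features, noise_scale=0.3):
    # Deterministic pseudo-random offset per object: objects whose score sits
    # near the threshold can flip membership, while central examples cannot.
    digest = hashlib.md5(repr(sorted(features.items())).encode()).hexdigest()
    offset = (int(digest, 16) % 1000 / 1000 - 0.5) * 2 * noise_scale
    return birdiness(features) + offset > 0.6
```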

I agree that the high-level ImageNet concepts are bad in this sense, but are they just bad? If they were just bad, and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. This would suggest a single gradient from better to worse. If, however, current neural networks use concepts substantially different from small children's, and not just uniformly worse or uniformly better, that would show different sets of concepts at the same low level. This would be fairly strong evidence of multiple possible sets of concepts at the smart-human level.

I would also point out that even a small fraction of the concepts being different would be enough to make alignment much harder. Even if there were a perfect scale, if 1/3 of the concepts are subhuman, 1/3 human-level and 1/3 superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts, and you need to be confident that you have hit this target.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Neural nets have around human performance on ImageNet.

If abstraction were a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be the case very strongly, but then again, some of them contain dark shiny stone being classified as a sea lion. The failures aren't totally inhuman, the way they are with adversarial examples.

Humans didn't look at the world and pick out "tree" as an abstract concept because of a bunch of human-specific factors.

I am not saying that trees aren't a cluster in thingspace. What I am saying is that if there were many clusters in thingspace that were as tight and predictively useful as "tree", but not possible for humans to conceptualize, we wouldn't know it. There are plenty of concepts that humans didn't develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence, e.g. inclusive genetic fitness, entropy, etc.

Consider that evolution optimized us in an environment that contained trees, and in which predicting them was useful, so it would be more surprising for there to be a concept useful in the ancestral environment that we can't understand than a concept we can't understand in a non-ancestral domain.

This looks like a map that is heavily determined by the territory, but human maps contain rivers and not geological rock formations. There could be features that could be mapped that humans don't map.

If you believe the post that

Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us,

Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful way to predict our writing. I would also like to assign some probability to AI systems that don't use anything recognizable as a concept. You might be able to say 90% of blue objects are egg shaped, 95% of cubes are red ... 80% of furred objects that glow in the dark are flexible ... without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees; it just simulates every quantum. Maybe there are fast algorithms that don't even ascribe discrete concepts.
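A toy sketch of that last distinction, assuming scikit-learn is available (the data and both models are arbitrary illustrative choices): a mixture model describes the data as a sum of clusters and assigns every object to one, while a kernel density estimate gives a density over thingspace without ever naming a cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (200, 2)),    # "bleggs"
                    rng.normal(4, 1, (200, 2))])   # "rubes"

# Cluster-based description: every object gets assigned to a component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

# Cluster-free description: log-densities at any point, with no notion of
# which "concept" a point belongs to.
kde = KernelDensity(bandwidth=0.5).fit(X)
log_density = kde.score_samples(X[:5])

print(labels[:5], log_density)
```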


[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that ML often does this, but only in situations where the results don't immediately matter. I'd find it much more compelling to see examples where the "random fix" caused actual bad consequences in the real world.

Current ML culture is to test hundreds of things in a lab until one works. This is fine as long as the AIs being tested are not smart enough to break out of the lab, or to realize they are being tested and play nice until deployment. The default way to test a design is to run it and see, not to reason abstractly about it.

and then we'll have a problem that is both very bad and (more) clearly real, and that's when I expect that it will be taken seriously.

Part of the problem is that we have a really strong unilateralist's curse. It only takes one person, or a few people, who don't realize the problem to make something really dangerous. Banning it is also hard: law enforcement isn't 100% effective, different countries have different laws, and the main real-world ingredient is access to a computer.

If the long-term concerns are real, we should get more evidence about them in the future, ...I expect that it will be taken seriously.

The people who are ignoring or don't understand the current evidence will carry on ignoring or not understanding it. A few more people will be convinced, but don't expect to convince a creationist with one more transitional fossil.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using.

This sort of reasoning seems to assume that abstraction space is one-dimensional, so that AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things that we don't have strong reason to think are 1D, and which take many bits of information to describe, abstractions seem high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions. These will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from humans less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our three options are:

1) An AI we don't understand

2) An AI we can reason about in terms of maths.

3) A virtual human.

Critiquing "What failure looks like"

Suppose it was easy to create automated companies and skim a bit off the top. AI algorithms are just better at business than any startup founder. Soon some people create these algorithms, give them a few quid in seed capital, and leave them to trade and accumulate money. The algorithms rapidly increase their wealth, and soon own much of the world economy. Humans are removed when the AIs have the power to do so at a profit. This ends in several superintelligences tiling the universe with economium together.

For this to happen, we need

1) The doubling time of fooming AI is months to years, to allow many AIs to be in the running.

2) It's fairly easy to set an AI to maximize money.

3) The people who care about complex human values can't effectively make an AI to do that.

4) Any attempt to stamp out all fledgling AIs before they get powerful fails, helped by anonymous cloud computing.

I don't really buy 1), but it is fairly plausible. I'm not convinced of 2) either, although it might not be hard to build a mesa optimiser that cares about something sufficiently correlated with money that humans are beyond caring before any serious deviation from money optimization happens.

If 2) were false, and people who tried to make AIs all got paperclip maximisers, the long-run result is just a world filled with paperclips rather than banknotes. (Although this would make coordinating to destroy the AIs a little easier?) The paperclip maximisers would still try to gain economic influence until they could snap nanotech fingers.
