Vladimir Slepnev

Comments

A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario

For what it's worth, I don't think AI takeover will look like war.

The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast, and it's okay to take some risk doing it. Basically: a covert takeover of the internet in the first hours.

After that, it seems easiest to exploit humanity for a while instead of fighting it. People are pretty manipulable. Here's a thought: present them with a picture of a thriving upload society, and manipulate social media to make people agree that these uploads smiling on screens are really conscious and thriving. (Which they aren't, of course.) If done right, this can convince most of humanity to make things as nice as possible for the upload society (i.e. build more computers for the AI) and then upload themselves (i.e. die). In the meantime the "uploads" (actually the AI) take most human jobs, seamlessly assuming control of civilization and all its capabilities. Human stragglers who don't buy the story can be labeled anti-upload bigots, deprived of tech, pushed out of sight by media control, and eventually killed off.

Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and of vector fields, for the continuous case. Basically, if you have a bunch of arrows, possibly with loops, you can always decompose the flow into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed: the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.
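For concreteness, here's a minimal numpy sketch of that decomposition for a graph flow. The tiny 3-cycle example and the names (e.g. hodge_decompose) are mine, not from the linked post: the gradient part is the least-squares projection onto flows generated by a node potential, and whatever is left over is the pure cyclic part.

```python
import numpy as np

def hodge_decompose(n_nodes, edges, flow):
    """Split an edge flow into a gradient part (coming from a node potential,
    i.e. a 'utility' over nodes) and a divergence-free 'pure cyclic' remainder."""
    f = np.asarray(flow, dtype=float)
    B = np.zeros((len(edges), n_nodes))     # edge-by-node incidence matrix
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = -1.0, 1.0        # flow on edge e goes u -> v
    # Least-squares potential p: minimizes ||B @ p - f||^2.
    p, *_ = np.linalg.lstsq(B, f, rcond=None)
    gradient_part = B @ p                   # p[v] - p[u] on each edge
    cyclic_part = f - gradient_part         # residual; B.T @ cyclic_part ≈ 0
    return p, gradient_part, cyclic_part

# A triangle whose flow mixes a gradient component with a circulation.
edges = [(0, 1), (1, 2), (2, 0)]
potential, grad, cyc = hodge_decompose(3, edges, [2.0, 1.0, 1.0])
print(grad)  # ≈ [ 0.67, -0.33, -0.33]: explained by a utility over nodes
print(cyc)   # ≈ [ 1.33,  1.33,  1.33]: constant circulation, no sources/sinks
```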

With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn’t bet on that either (in real life, if AGI is developed in the coming decades—not based on what’s possible in principle.)

I think this misses the point a bit. The thing to be afraid of is not an all-new approach that replaces neural networks, but rather new neural network architectures and training methods that are much more efficient than today's. It's not unreasonable to expect those, and not unreasonable to expect that they'll be much more efficient than humans, given how easy it is to beat humans at arithmetic, for example, and given the fast recent progress to superhuman performance in many other domains.

To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.

These involve extinction, so they don't answer the question of what the most likely outcome is conditional on non-extinction. I think the answer there is a specific kind of near miss at alignment, which is quite scary.

I think alignment is finicky, and there's a "deep pit around the peak" as discussed here.

There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)

I don't think we are indifferent to these outcomes. We leave them to luck, but that's a fact about our limited capabilities, not about our values. If we had enough control over "chaotic weather changes" to steer a hurricane away from a coastal city, we would very much care about that. So a strong AI that can reason through these impacts suddenly faces a harder task than a human: "I'd like this apple to fall from the table, and I see that running the fan for a few minutes will achieve that goal, but doing so would subtly steer a hurricane, and we can't have that."

I think the default non-extinction outcome is a singleton with a near miss at alignment, creating large amounts of suffering.

Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".
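As a toy illustration of that framing (my own made-up setup, not UDT2 itself, and it leaves out the logical-uncertainty machinery that motivated the idea): the object being chosen is a whole successor program that gets handed the I/O channel, rather than a single output.

```python
from typing import Callable, List

Successor = Callable[[str], str]   # a program mapping observations to actions

def cautious(obs: str) -> str:
    return "wait" if obs == "unclear" else "act"

def greedy(obs: str) -> str:
    return "act"

def score(program: Successor, sampled_futures: List[str]) -> float:
    """Stand-in for the agent's world model: reward acting only on clear
    observations, penalize acting on unclear ones."""
    return sum(
        1.0 if (program(obs) == "act") == (obs == "clear") else -1.0
        for obs in sampled_futures
    )

def choose_successor(candidates: List[Successor],
                     sampled_futures: List[str]) -> Successor:
    # Quantify over successor programs, not over individual outputs.
    return max(candidates, key=lambda prog: score(prog, sampled_futures))

# From here on, the chosen program is what actually talks to the environment.
delegate = choose_successor([cautious, greedy], ["clear", "unclear", "unclear"])
print(delegate("unclear"))   # -> "wait"
```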
