Vladimir Slepnev


I think if AIs talk to each other using human language, they'll start encoding stuff into it that isn't apparent to a human reader, and this problem will get worse with more training.

It seems that, as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reserved". In other words, it's less like a simulation of some reality or set of realities, and more like a really fuzzy and hallucinatory search engine over the space of texts.

It is of course surprising that a search engine over the space of texts is able to write poems, take derivatives, and play chess. And it's plausible that a stronger version of the same could outsmart us in more dangerous ways. I'm not trying to downplay the risk here. Just saying that, well, thinking in terms of the space of texts (and capabilities latent in it) feels to me more right than thinking about simulation.

Thinking further along this path, it may be that we don't need to think much about AI architecture or training methods. What matters is the space of texts - the training dataset - and any additional structure on it that we provide to the AI (valuation, metric, etc). Maybe the solution to alignment, if it exists, could be described in terms of dataset alone, without reference to the AI's architecture at all.

As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.

I think there's a mistake here which kind of invalidates the whole post. Even if we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in a future world that looks quite unlike the training distribution, the AI will be able to find such an action. In the same way, ice cream wasn't in evolution's training distribution for us, but we found it anyway.

I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.

But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example, if you can make an AI imitate a conversation between humans, then maybe you can make an AI that makes real-world plans as fast as a committee of 10 smart humans conversing at 1000x speed. For extra fun, allow the imitated committee to send network packets and read responses; for extra extra fun, give them access to a workbench for improving their own AI. I'd say this gets awfully close to a world-optimizer that could plausibly defeat the rest of humanity, if the imitator it's running on is good enough (GPT-6 or something). And there's of course no law saying it'll be friendly: you could prompt the inner humans with "you want to destroy real humanity" and watch the fireworks.

We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties.

Doesn't that require understanding why humans have (or don't have) certain safety properties? That seems difficult.

A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario

For what it's worth, I don't think AI takeover will look like war.

The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast and it's ok to take some risk doing it. Basically, covert takeover of the internet in the first hours.

After that, it seems easiest to exploit humanity for a while instead of fighting it. People are pretty manipulable. Here's a thought: present to them a picture of a thriving upload society, and manipulate social media to make people agree that these uploads smiling on screens are really conscious and thriving. (Which they aren't, of course.) If done right, this can convince most of humanity to make things as nice as possible for the upload society (i.e. build more computers for the AI) and then upload themselves (i.e. die). In the meantime the "uploads" (actually the AI) take most human jobs, seamlessly assuming control of civilization and all its capabilities. Human stragglers who don't buy the story can be called anti-upload bigots, deprived of tech, pushed out of sight by media control, and eventually killed off.

Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows in the discrete case, and of vector fields in the continuous case. Basically, if you have a bunch of arrows, possibly loopy, you can always decompose them into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network is needed; the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.
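
To make the discrete case concrete, here is a minimal sketch of one way to compute such a decomposition with NumPy. It is not taken from the linked post. The function name `hodge_decompose`, the edge-list representation, and the sign convention (a positive flow on edge `(i, j)` means "utility goes up from `i` to `j`") are all assumptions made for illustration. The gradient part is found by least squares over node utilities, and the cyclic part is whatever is left over.

```python
import numpy as np

def hodge_decompose(n_nodes, edges, flow):
    """Split an edge flow into a gradient part and a divergence-free (cyclic) part.

    edges: list of (i, j) pairs; flow[e] is the signed flow along edge e from i to j.
    Returns (utility, gradient, cyclic) with flow == gradient + cyclic.
    Hypothetical helper for illustration only.
    """
    flow = np.asarray(flow, dtype=float)

    # Incidence matrix B: (B @ u)[e] = u[j] - u[i] for edge e = (i, j).
    B = np.zeros((len(edges), n_nodes))
    for e, (i, j) in enumerate(edges):
        B[e, i] = -1.0
        B[e, j] = 1.0

    # Least-squares fit of node utilities to the observed flow.
    u, *_ = np.linalg.lstsq(B, flow, rcond=None)

    gradient = B @ u          # the part explained by a utility function
    cyclic = flow - gradient  # the remainder; it has zero net flow at every node

    # Utilities are only defined up to an additive constant, so shift the minimum to 0.
    return u - u.min(), gradient, cyclic


# A triangle whose flow is a utility gradient (u = [0, 1, 3]) plus one unit going around the loop.
edges = [(0, 1), (1, 2), (2, 0)]
flow = [2.0, 3.0, -2.0]
utility, gradient, cyclic = hodge_decompose(3, edges, flow)
print(utility)   # approximately [0, 1, 3]
print(gradient)  # approximately [1, 2, -3]
print(cyclic)    # approximately [1, 1, 1]
```

In this sketch the two components are orthogonal, which is why the split is unique even though the utility itself is only pinned down up to a constant on each connected piece of the graph.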

With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn’t bet on that either (in real life, if AGI is developed in the coming decades—not based on what’s possible in principle.)

I think this misses the point a bit. The thing to be afraid of is not an all-new approach to replace neural networks, but rather new neural network architectures and training methods that are much more efficient than today's. It's not unreasonable to expect those, and not unreasonable to expect that they'll be much more efficient than humans, given how easy it is to beat humans at arithmetic for example, and given fast recent progress to superhuman performance in many other domains.

To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.
