The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion helps a lot, because instead of simulating a process that goes deep into the external timeline's future, you're simulating a "groundhog day" in which the researcher wakes up over and over at the same external time (more realistically, the restart time drifts forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive probability of unaligned AI takeover in the present (i.e. during the time interval of the simulated loop), but it's a milder problem. It can be further ameliorated if the AI has enough information about the external world to make confident predictions about the possibility of unaligned takeover during this period. The out-of-distribution problem is also less severe: the AI can occasionally query the real researcher to make sure its predictions are still on track.


Adam Jermyn (3y)

Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize.

I think it's useful to distinguish between the amount of optimization we ask the model to do versus the unlikelihood of the world we ask it to simulate.

For instance, I can condition on something trivial like "the weather was rainy on 8/14, sunny on 8/15, rainy on 8/16...". This specifies a very unlikely world, but so long as the pattern I specify is plausible it doesn't require much optimization on the part of the model or take me far out of distribution. There can be many, many plausible patterns like this because the weather is a chaotic system and so intrinsically has a lot of uncertainty, so there's actually a lot of room to play here.

That's a silly example, but there are more useful ones. Suppose I condition on a sequence of weather patterns (all locally plausible) that affect voter turnout in key districts such that politicians get elected who favor policies that shift the world towards super-tight regulatory regimes on AI. That lets me push down the probability that there's a malicious AI in the simulated world without requiring the model itself to perform crazy amounts of optimization.

Granted, when the model tries to figure out what this world looks like, there's a danger that it says "Huh, that's a strange pattern. I wonder if there's some master-AGI engineering the weather?" and simulates that world. That's possible, and the whole question is about whether the things you conditioned on pushed down P(bad AGI controls the world) faster than they made the world-writ-large unlikely.
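To make that trade-off concrete, here is a minimal sketch of the Bayes update involved. The function name and all the numbers are made up for illustration; the point is only that what matters is how likely the conditioned-on pattern is with a bad AGI relative to without one, not how unlikely the pattern is in absolute terms.

```python
def posterior_bad_agi(prior_agi, p_obs_given_agi, p_obs_given_no_agi):
    """P(bad AGI | observation) by Bayes' rule.

    prior_agi: P(bad AGI controls the world) before conditioning.
    p_obs_given_agi / p_obs_given_no_agi: probability of the conditioned-on
    pattern in worlds with / without a bad AGI. All values are hypothetical.
    """
    joint_agi = prior_agi * p_obs_given_agi
    joint_no_agi = (1.0 - prior_agi) * p_obs_given_no_agi
    return joint_agi / (joint_agi + joint_no_agi)


prior = 0.1

# A very unlikely but locally plausible pattern that is ten times LESS
# likely in bad-AGI worlds: the posterior drops below the prior.
plausible = posterior_bad_agi(prior, 1e-9, 1e-8)

# An equally unlikely but "strange" pattern that a weather-engineering AGI
# would produce far more readily than nature: the posterior rises.
strange = posterior_bad_agi(prior, 1e-8, 1e-12)

print(plausible < prior, strange > prior)
```

In both cases the world specified is extremely unlikely; only the ratio between the two likelihoods decides which way P(bad AGI) moves.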


Donald Hobson (3y)

This is great, but it also misses the loopiness. If GPT-12 looks at the future, surely most of that future is massively shaped by GPT-12. We are in fixed-point, self-fulfilling-prophecy land. (Or, if you somehow condition on its current output being nothing, then the future is shaped by the next slightly different attempt, with GPT-13.) If GPT-n doubles the chance of success, the only fixed point is success.
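A toy sketch of the doubling claim, under assumptions that are mine rather than anything stated above: suppose the predicted chance of success p feeds back into the world so that the actual chance becomes min(1, 2p), and call a prediction self-consistent when it equals the chance it induces. The function names are hypothetical.

```python
def boosted(p, factor=2.0):
    """Chance of success after the model's influence, capped at 1 (toy model)."""
    return min(1.0, factor * p)


def iterate_to_fixed_point(p0, update=boosted, max_steps=2000):
    """Apply `update` repeatedly until the value stops changing."""
    p = p0
    for _ in range(max_steps):
        nxt = update(p)
        if nxt == p:  # self-consistent prediction reached
            return nxt
        p = nxt
    return p


print(iterate_to_fixed_point(0.01))  # starts positive, climbs until capped
print(iterate_to_fixed_point(0.0))   # degenerate case: never leaves zero
```

In this toy model any positive starting chance is driven up to certain success, while a starting chance of exactly zero is also a fixed point, so "the only fixed point is success" holds as long as success was possible at all.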


johnswentworth (3y)

Yup, I intentionally didn't go into the whole stable fixed-point thing in this post; it's a complicated can of worms which applies in multiple different ways to multiple different schemes.

