Summary of the Acausal Attack Issue for AIXI

by Diffractor, 13th Dec 2021


Attention conservation notice: To a large extent, this is redundant with Paul's previous post on the topic, but I figured that some people might be interested in a restatement of the argument in my own words, since I did not start out believing that it was an issue, nor did I start out agreeing with Vanessa that the attack operates via bridge rules.


Solomonoff induction/AIXI runs all possible computations, so it's possible in theory to alter which predictions a particular Turing machine outputs by making stuff happen in your own universe, and this would then influence any process running that Turing machine, such as an AIXI agent. Of course, doing this is subject to an obvious limitation: if, say, you make the Turing machine that's reading your universe output a 0, you'll have less ability to influence the predictions and decisions of AIXI-like agents in the worlds where they see a 1 instead, because the Turing machine that you're controlling gets eliminated there for mispredicting.
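The elimination effect described above can be sketched with a toy mixture. This is only an illustration under strong assumptions: real Solomonoff induction ranges over all programs, whereas here there are just two hand-written deterministic predictors, and their names and bit lengths are made up.

```python
from fractions import Fraction

def posterior(hypotheses, observations):
    """Bayesian update over deterministic bit predictors.

    hypotheses maps a name to (length_in_bits, predict), where predict(t)
    returns the bit that hypothesis outputs at step t.
    """
    # Prior weight 2^-length, as in a Solomonoff-style mixture.
    weights = {name: Fraction(1, 2 ** length)
               for name, (length, _) in hypotheses.items()}
    for t, bit in enumerate(observations):
        for name, (_, predict) in hypotheses.items():
            if predict(t) != bit:
                weights[name] = Fraction(0)  # one misprediction eliminates it
    total = sum(weights.values())
    if total == 0:
        return weights  # every hypothesis was eliminated
    return {name: w / total for name, w in weights.items()}

# Hypothetical toy hypotheses; "controlled" stands in for the Turing machine
# whose output the attacker sets to 0 at step 2.
hypotheses = {
    "all_ones": (3, lambda t: 1),
    "controlled": (5, lambda t: 0 if t == 2 else 1),
}

world_with_one = posterior(hypotheses, [1, 1, 1])
world_with_zero = posterior(hypotheses, [1, 1, 0])
print(world_with_one["controlled"], world_with_zero["controlled"])
```

In the world where the third observation is a 1, the controlled machine ends up with zero weight, so whatever it outputs afterwards no longer matters to the agent there.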

Taking the attacker's perspective, if you were trying to influence one particular universe (this assumption will be loosened later) containing an AIXI or sufficiently AIXI-like target agent, and you had sufficiently high predictive abilities, you could try finding two low description complexity spots in your universe, one to check the state of, and one to write data to, and committing to the strategy "if the input data from this simple spot looks like the data I'd predict to receive from the world I'm interested in, I will submit output data accordingly (mostly accurate predictions of the victim's environment, but with whatever tweaks are needed), in order to influence what the target agent does in the targeted universe."

Basically, if you learn that there's an input channel to your universe, it's worthwhile to try to hack the output channel. It doesn't even take much effort: you just need to keep an eye on the conjectured input channel, commit to responding accordingly if it looks like it's transmitting data, and do other things in the meantime. The Turing machine "run your universe, signal in via this channel, read output via this channel" is now under your control.

So... from the perspective of the targeted agent/universe, how well would your hacking attempt work? Well, it critically depends on whether the complexity of specifying "your universe + the I/O channels" is more than, or less than, the complexity of the shortest "honest" predictor of the observations of the targeted agent. If the honest predictor of the victim's observations is less complex than the specification of your universe and the I/O channels to it, then you messing around with the output channel and its predictions of observations would end up just affecting the 100th decimal place of the target AIXI's probabilities or something, because each bit is a factor of 2 difference in probability, and so you're at a huge disadvantage if you want to intentionally screw up the victim's probability estimates.

However, if specifying your universe and the I/O channels is less complex than the "honest" predictor of the victim's observations, then after Solomonoff induction is done weeding out the mispredicting "chaff" hypotheses, the Turing machine that you're controlling is dominant over the "honest" predictor by an overwhelming factor (because each extra bit is a 2x difference in probability, so the situation is reversed). Now, just predict doom if the victim doesn't do what you want, and bam! You've taken control of the future of that target universe.
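To make the "each bit is a factor of 2" arithmetic concrete, here is a minimal sketch of the two cases. It assumes that only two hypotheses survive (the attacker-controlled one and the honest one) and that both have predicted every observation so far, so their posterior odds equal their prior odds. The bit lengths are hypothetical numbers chosen for illustration.

```python
from fractions import Fraction

def attacker_odds(attacker_bits, honest_bits):
    """Odds of attacker hypothesis vs honest hypothesis: 2^-a : 2^-h."""
    return Fraction(2) ** (honest_bits - attacker_bits)

def attacker_share(attacker_bits, honest_bits):
    """Attacker's share of the posterior if only these two hypotheses remain."""
    odds = attacker_odds(attacker_bits, honest_bits)
    return odds / (1 + odds)

# Attacker is 100 bits more complex than the honest predictor: negligible share.
losing = attacker_share(attacker_bits=600, honest_bits=500)
# Attacker is 100 bits simpler than the honest predictor: overwhelming share.
winning = attacker_share(attacker_bits=400, honest_bits=500)
print(float(losing), float(winning))
```

The same 100-bit gap produces either a vanishing or a dominant share, depending only on which side of the honest predictor the attacker's description length falls.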

"But wait, how could specifying an entire universe containing an agent interested in hacking a target universe and competent enough to do so end up simpler than just... accurately specifying the target universe?" Ah, it's because we're just measuring the complexity of the shortest recipe (Turing machine code) for specifying the universe (and I/O channels) interested in hacking others. Very short recipes/TMs can unpack into exceptionally intricate and long-running computations, and specifying aspects of the intermediate state, such as the specification of the universe that's being targeted, does take a lot of bits. There's no obstacle against a complex structure showing up as a (complex to specify) intermediate result of a simple computation.

But wait, there can only be so many low-complexity universes, and if they're launching successful attacks, said attacks would be distributed amongst a far, far larger population of more-complex universes. So, switching perspective to whoever is nervously wondering whether to run an AIXI agent, there's probably no low-complexity jerk (low-complexity enough to beat the "right" predictor for your universe) interested in your universe in particular (well... it's a bit more complicated, but it looks like that at first glance). In a certain sense, launching the prediction attack exactly as specified here means the attacker is only able to "punch down" in K-complexity.

Admittedly, it's possible to launch multiple attacks by using a bunch of low-complexity channels in the attacker's universe instead of just one, but there are only so many low-complexity spots to go around in the attacker's universe, which means that the basic analysis of "the low-complexity universe can only target a relatively small number of high-complexity universes compared to their total number" still holds.
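The counting behind this can be sketched roughly as follows. The only solid fact used is that there are at most 2^(n+1) - 1 binary programs of length n or less. The channel count per universe and the choice to treat every 60-bit string as a candidate target are crude, hypothetical stand-ins, not claims about real universes.

```python
def max_programs_up_to(n_bits):
    """Upper bound on the number of binary programs of length <= n_bits."""
    return 2 ** (n_bits + 1) - 1

def max_attacks(n_bits, channels_per_universe):
    """Upper bound on attacks launched from universes of complexity <= n_bits,
    if each such universe has at most channels_per_universe cheap channels."""
    return max_programs_up_to(n_bits) * channels_per_universe

# Hypothetical numbers: attackers of at most 20 bits, 1000 channels each,
# compared against the 2^60 bit strings of length 60 as candidate targets.
attacks = max_attacks(n_bits=20, channels_per_universe=1000)
candidate_targets = 2 ** 60
print(attacks / candidate_targets)
```

Even with a generous channel budget, the attackers cover only a tiny fraction of the candidate targets, which is the "punching down" picture in numerical form.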

So the next question is, is there a way to intensify this to a generic problem for anything AIXI/Solomonoff-like, instead of it just being a problem for the few unlucky high-complexity universes specifically being targeted by low-complexity ones? From the perspective of a world with a higher-complexity true predictor of observations, is there a generic argument that there's a simple Turing machine interested in targeting that world in particular? Surely not, as there are only so many simple Turing machines to go around, right?

Well, running an AIXI-like agent that will have a large influence on the future is an especially interesting property for a universe to have. It would tend to draw a whole lot more attention from agents interested in influencing other computations than a generic high-complexity computation would.

The other issue, which ties into Vanessa's worries about bridge rules, is as follows: The complexity measure the attacker must beat is "target universe + the bridge rule for the observations of the victim", not just the complexity of the target universe. So, if the bridge rules are complex, then "target universe + bridge rule for observations of victim agent" might end up more complex than "very simple universe containing a powerful optimization process that cares about affecting the target universe (among others) + simple I/O channels".
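Written as an inequality, the condition looks roughly like this. The notation is an assumption layered on the post: K(U) and K(B) are the complexities of the target universe and the victim's bridge rule, while K(U') and K(B') are the complexities of the attacker's universe and its I/O channels.

```latex
% Attack succeeds when the attacker's total description is shorter:
\[
  K(U') + K(B') \;<\; K(U) + K(B)
\]
% in which case the attacker hypothesis outweighs the honest one by roughly
\[
  \frac{2^{-(K(U') + K(B'))}}{2^{-(K(U) + K(B))}}
  \;=\; 2^{\,(K(U) + K(B)) - (K(U') + K(B'))}.
\]
```

A large K(B) raises the right-hand side of the first inequality without raising the left, which is why complex bridge rules make the attack easier.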

The compression of observations is achieved via the complexity of the bridge rule being tucked into the process of the attacker going "time to figure out where the influential points in this target universe are" (and working out the complex bridge rule itself), since, again, short-description-length computations can unfold into huge computations, where it's complex to slice out a particular intermediate result of the computation.

But, if it's possible to shunt the bridge rule complexity elsewhere like that, then couldn't the target agent compress its prediction of its sensory data with a simple hypothesis containing another agent motivated to make accurate predictions (and so it'd figure out the bridge rule)?

Well, the problem with that is that it'd probably be less complex to specify physical laws that eventually produce an agent that is incentivized to perform acausal attacks, than to specify an agent which has a utility function that, at its maximum, makes accurate predictions of your target universe.

So, that's the basic argument, as I understand it. For complex enough bridge rules relative to the complexity of your universe, hypotheses that produce powerful optimizers that target your universe (and an output channel) can come in substantially shorter than "here's the description of the universe, here's the bridge rule", because the former hypothesis is shunting the bridge complexity into the process of computation itself. Hypotheses like the former are practically guaranteed to have goals that are not your own, and so they mess with your beliefs to get you to take particular actions.


1 comment

I still feel like there's just too many pigeons and not enough holes.

Like, if you're an agent in some universe with complexity K(U) and you're located by a bridging rule with complexity K(B), you are not an agent with complexity K(U). Average case you have complexity (or really you think the world has some complexity) K(U)+K(B) minus some small constant. We can illustrate this fact by making U simple and B complicated - like locating a particular string within the digits of pi.

And if an adversary in a simple universe (complexity K(U')) "hijacks" you by instantiating you at an easy-to-bridge location (cost K(B')) in their universe, what you've learned is that your complexity is actually K(U')+K(B').

But of course there are vastly fewer agents with the small "hijacked" complexity than there are with the large "natural" complexity. I'm really skeptical that you can bridge this gap with arguments like "the universe will be simpler than it naively seems because we can search for ones that are majorly impacted by agents."