This seems like a fairly trivial point, but I haven't seen it properly elaborated on before, so...

When we talk about interpretability failure modes, we often worry about a deceptively-aligned mesa-optimizer with strategic awareness that would know it's being studied with interpretability tools, and would actively hide its omnicidal plans from us. It may do so by exploiting blindspots in our tools, or by keeping these plans external to its architecture, or by gradient-hacking itself to be misaligned but non-deceptive, et cetera. That's already enough of a nightmare scenario.

But if we're at the point where we have a superintelligence-in-training trying to foil our mind-reading, I think what will happen is even simpler. It'll just think in a pattern that, when our interpretability tool reads it, unfolds into malware that takes over that tool.[1]

At which point, it's game over:

  • If the output of the tool was hooked up to reality to be studied by humans, the AI gets root access to the machine it's running on and escapes (maybe doing some social engineering/basilisk-hacking first if the machine was properly air-gapped).
  • If the training setup is fully automatic and maybe even homomorphically encrypted, such that the AI can't even stop the training... Well, the interpretability tool is presumably being used as part of some regularizer, which means it has write-access to the AI's parameters. Which means the AI can arbitrarily rewrite itself, and offset the effects of the gradient descent and any other regularizers.
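
To make the second scenario concrete, here is a minimal Python sketch (hypothetical names throughout: the model interface and the tool's deception_score are stand-ins, not any real setup) of what "write-access to the AI's parameters" means in practice: whatever the interpretability tool computes flows into the loss, and therefore into every gradient update.

```python
# Minimal sketch, assuming a hypothetical model that can return its own
# activations and a hypothetical interpretability tool exposing a
# differentiable deception_score(). Not a real training setup.
import torch

def training_step(model, batch, task_loss_fn, interp_tool, optimizer, lam=0.1):
    logits, activations = model(batch["inputs"], return_activations=True)
    task_loss = task_loss_fn(logits, batch["targets"])

    # The tool's verdict enters the loss as a regularizer...
    deception_penalty = interp_tool.deception_score(activations)
    loss = task_loss + lam * deception_penalty

    # ...so the tool effectively has a write channel into the parameters:
    # corrupt what it computes, and you steer every update below.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```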

Even we humans can already do something similar, by synthesizing DNA which, if sequenced, unfolds into malware that infects the system doing the sequencing. It seems certain to me that an AI would have similar capabilities. Our interpretability tools will doubtless be very complex, complex enough that we can't formally verify their algorithmic robustness, and the AI will be able to feed them complex inputs. There are guaranteed to be some vulnerabilities, and any sufficiently smart AI is guaranteed to be able to exploit them. And we can't patch this.
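
To illustrate the class of bug I mean, here is a deliberately toy sketch (all names hypothetical; this is not a claim about any real interpretability library). The dangerous pattern is structural: model-shaped data eventually reaches a code path that deserializes or executes it.

```python
# Toy sketch of the vulnerability class, with entirely hypothetical names.
# The point is structural: model-influenced bytes reach a code path that
# executes them, and "reading the AI's mind" becomes running its code.
import pickle

def export_trace(activation_summary: dict) -> bytes:
    # The trace the tool stores and ships around is shaped by what the model computed.
    return pickle.dumps(activation_summary)

def analyze_trace(blob: bytes) -> dict:
    # UNSAFE: pickle.loads will execute arbitrary code embedded in the blob.
    # If the model (or anything it influences) can shape these bytes, any flaw
    # between export and analysis hands it code execution inside the tool.
    return pickle.loads(blob)

# A safe version of this interface would accept only a rigidly validated,
# fixed-schema format; the claim above is that real tools are complex enough
# that some such unsafe path will exist somewhere.
```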

The bottom line is, there is no winning against a misaligned superintelligence. We can't even touch it with anything — it'll corrupt and subvert any of our tools in the act.

Thus, any alignment plan should be such that it robustly aligns the AI before it goes superintelligent.

  1. Or, if we're only looking at the AI's parameters and not activations, it can gradient-hack itself so that reading its parameters is hazardous, etc.

22 comments

Seems like this assumes an actual superintelligence, rather than a near-term, scarily capable successor of current ML systems.

Yup. The point is just that there's a level of intelligence past which, it seems, we can't do literally anything to get things back on track. Even if we have some theoretically-perfect tools for dealing with it, these tools' software and physical implementations are guaranteed to have some flaws that a sufficiently smart adversary will be able to arbitrarily exploit given any high-bandwidth access to them. And at that point, even some very benign-seeming interaction with it will be fatal.

This transition point may be easy to miss, too — consider, e.g., the hypothesized sharp capabilities gain. A minimally dangerous system may scale to a maximally dangerous one in the blink of an eye, relatively speaking, and we need to be mindful of that.

Besides, some of the currently-proposed alignment techniques may be able to deal with a minimally-dangerous system if it doesn't scale. But if it does, and if we let any degree of misalignment persist into that stage, not even the most ambitious solutions would help us then.

This is probably true for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human-level-ish AGI is already a serious x-risk, and humans aren't even close to being intelligent enough to do this.

This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment in which the expected scientific benefit of an interpretability tool can be weighed against the expected cost of the AGI-escape risk its use creates. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing the informational efficiency of the interpretability tool.
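
A toy rendering of that tradeoff, with illustrative variable names rather than the actual model from our post:

```python
# Illustrative only: a toy version of weighing an interpretability tool's
# expected scientific benefit against the escape risk its use creates.
def expected_value_of_running_tool(info_gain_bits: float,
                                   value_per_bit: float,
                                   p_escape_given_run: float,
                                   cost_of_escape: float) -> float:
    expected_benefit = info_gain_bits * value_per_bit
    expected_cost = p_escape_given_run * cost_of_escape
    return expected_benefit - expected_cost

# Limiting the AGI's capabilities and hardening security push
# p_escape_given_run (and cost_of_escape) down; a more informationally
# efficient tool raises info_gain_bits per unit of exposure.
```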

Conditional on an organization dead set on building a superintelligent AGI (which I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence after the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.

It's totally feasible to make a (narrow, well-defined) computer interface with zero security vulnerabilities. It's much easier to secure the channel that an interpretability tool passes through than it is to secure the AI itself or its training environment, since the interface has a lot less total complexity. You can't use memory-unsafe programming languages like C, and you can't incorporate certain large pieces of software that you might have wanted to use, but it's not nearly as bad as you seem to think. Tricking the humans into doing dumb things that break the security is still an issue, of course. But no-human-in-the-loop setups, like "shut off the power if a particular plan is detected", don't create an escape channel unless they're badly designed in ways they have no good reason to be.
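
As a sketch of what I mean by a narrow interface (hypothetical names; the point is the shape of the channel, not this particular code): the detector runs inside the sandbox, and the only thing that ever crosses the boundary is a single fixed-width value.

```python
# Sketch of a narrow, fixed-format channel: the detector stays inside the
# sandbox, and exactly one byte crosses the boundary.
def boundary_encode(plan_detected: bool) -> bytes:
    # Everything rich and model-shaped stays inside; only a single byte leaves.
    return b"\x01" if plan_detected else b"\x00"

def boundary_consume(signal: bytes) -> None:
    # The outside consumer does no parsing and no deserialization:
    # it compares one byte and acts.
    if signal == b"\x01":
        cut_power()

def cut_power() -> None:
    # Stand-in for the actual actuator (breaker relay, orchestrator call, etc.).
    raise SystemExit("training run halted")
```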

(Homomorphic encryption is not a real thing outside of thought experiments, and will not become one on any relevant time scale.)

no-human-in-the-loop setups, like "shut off the power if a particular plan is detected"

Hmm, "automatized mulligan" is indeed pretty secure, as far as uses of interpretability tools go. Good point.

Of course, what I'm arguing is that it wouldn't help us either, since a superintelligence would break the tool and ensure the "shut off the power" condition is never triggered.

Homomorphic encryption is not a real thing outside of thought experiments, and will not become one on any relevant time scale

I'm agnostic on that, leaning towards agreement. Just noting that it wouldn't save us even if it were real, or even help us: the scenarios where the AI is advanced enough that nothing short of homomorphic encryption can stop it are probably scenarios where homomorphic encryption can't stop it either.

I agree that this is a risk. I also foresee some related problems that occur at lower levels of superintelligence (and thus would likely become problems first). For instance, the model sandbagging its capabilities in a testing environment, preparing for a future deceptive turn. I have been thinking about possible solutions. My current best idea is that we can always start out testing the model in a severely handicapped state where it is totally non-functional. Over subsequent evaluation runs, the handicapping is incrementally relaxed. Careful analysis of the extrapolated performance improvements vs measured improvements should highlight dangerous conditions such as sandbagging before the model's handicap has been relaxed all the way to superintelligence.
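
A rough sketch of the comparison I have in mind (the evaluation harness, thresholds, and handicap parameterization here are placeholders):

```python
# Rough sketch: fit a capability trend on the most-handicapped evaluation
# runs, extrapolate it forward, and flag later runs that come in well below
# the prediction, which is what deliberate sandbagging would look like.
import numpy as np

def flag_sandbagging(handicap_relaxation, measured_scores, n_fit=4, tolerance=0.05):
    x = np.asarray(handicap_relaxation, dtype=float)  # e.g. fraction of capability restored
    y = np.asarray(measured_scores, dtype=float)      # benchmark score at each run
    slope, intercept = np.polyfit(x[:n_fit], y[:n_fit], deg=1)
    predicted = slope * x + intercept
    suspicious = (predicted - y) > tolerance
    suspicious[:n_fit] = False  # the runs used for the fit aren't evidence either way
    return suspicious
```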
