localdeity - AI Alignment Forum

If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.

Stated like this, it seems to run afoul of the law of conservation of expected evidence. If you see precursors to deception, is it then the case that both (a) seeing deception is evidence for deception and (b) not seeing deception is also evidence for deception? I don't think so.
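To spell out the conservation law being invoked (a sketch in my own notation, not the original post's): let $H$ be "the system is deceptive", $E$ be "we observe deception", and $B$ be the background fact that precursors were observed.

```latex
\begin{align*}
P(H \mid B) &= P(H \mid E, B)\,P(E \mid B) + P(H \mid \lnot E, B)\,P(\lnot E \mid B) \\
\text{so, for } 0 < P(E \mid B) < 1:\quad
P(H \mid E, B) > P(H \mid B) &\implies P(H \mid \lnot E, B) < P(H \mid B).
\end{align*}
```

The prior is a weighted average of the two posteriors, so they cannot both sit above it.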

The direct patch is "If you see precursors to deception, then you should expect that there is deception, and further evidence should not change your belief on this"—which does seem to be a first approximation to your position.

Godzilla Strategies

localdeity 2y 56

Well, the basic idea "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is a simple mathematical fact. "What is the probability of this safeguard failing to detect a rogue AI?" is hard to answer, but "What might this new safeguard do that the other safeguards don't do?" is easier.
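As a rough illustration of that mathematical fact, here is a minimal sketch. The miss rates are made-up numbers, and the independence assumption is the optimistic extreme; the perfectly correlated case is shown as the pessimistic bound.

```python
from math import prod


def all_miss_probability_independent(miss_rates):
    """P(every safeguard misses), ASSUMING failures are fully independent."""
    return prod(miss_rates)


def all_miss_probability_worst_case(miss_rates):
    """Upper bound on P(every safeguard misses) under any correlation.

    The joint event can never be more likely than its least likely member,
    and perfectly nested failure modes reach this bound.
    """
    return min(miss_rates)


# Hypothetical: 200 weak safeguards, each missing a rogue AI 99% of the time.
weak = [0.99] * 200
print(1 - all_miss_probability_independent(weak))  # roughly 0.866
print(1 - all_miss_probability_worst_case(weak))   # roughly 0.01
```

The gap between those two numbers is why "what does this safeguard do that the others don't?" is the useful question.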

For example, if interpretability work gets anywhere, then one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like "how to detect security holes in C or machine code" or "how quickly humans die to certain poisons" (when that's not supposed to be the goal); safeguards that check for parts of the net that have many nodes and are not understandable by the other safeguards; safeguards that inspect the usage of CPU or other resources and have some idea of what's usual; safeguards that try to look for the net thinking strategically about what resource usage looks natural; and so on. These safeguards might all suck, or only work in a small fraction of cases, but if you have hundreds or thousands of them, then your odds might get decent.

Or, at least, you might push the correlated failure mode into "the AI has guessed that we have a bunch of safeguards, has imagined the set of safeguards people at our level of competence are likely to have come up with, and is strategically avoiding them", in which case that strategic thinking is a thing you try to detect. (It would be nice if we ended up being able to prove that the strategic thinking to defeat all our "dangerous thinking detectors" required at least N neurons, and we were able to prune down all nets to the point where fewer than N not-understood neurons existed.) And, as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.

What exactly are you objecting to? Your original post seems to be saying "the entire category of strategies that involves one AI saving us from another is dangerous, doomed, and should be dismissed as ridiculous", but you don't seem to be pushing back on my claims that you can cut off Godzilla's limbs and tie his jaws around Mega-Godzilla's neck and that the result is net positive. Were you arguing against much more reckless proposals, or people being overconfident in unspecified proposals, or...? The possibility occurs to me that the OP is intentionally overstated in the hopes of provoking a well-thought-out proposal (it did say it anticipated much discussion), along the lines of Cunningham's Law.

Godzilla Strategies

localdeity 2y 2962

He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

I happen to work for a company whose software uses checksums at many layers, and RAID encoding and low-density parity-check (LDPC) codes at the lowest layers, to detect and recover from hardware failures. It works pretty well, and the company has sold billions of dollars of products of which that is a key component. Also, many (most?) enterprise servers use RAM with error-correcting codes; I think the common configuration allows it to correct single-bit errors and detect double-bit errors, and my company's machines will reset themselves when they detect double-bit errors and other problems that impugn the integrity of their runtime state.

One could quibble about whether "retrieving and querying the data that was written" counts as a "computation", and the extent to which the recovery is achieved through software as opposed to hardware^[1], but the source material is a James Mickens comedic rant in any case.

I'd say the important point here is: There is a science to error correction, to building a (more) perfect machine out of imperfect parts, where the solution to unreliable hardware is more of the unreliable hardware, linked up in a clever scheme. They're good enough at it that each successive generation of data storage technology uses hardware with higher error rates. You can make statements like "If failures are uncorrelated, and failures happen every X time units on average per component, and it takes Y time units to detect and recover from a failure, and we can recover from up to M failures out of every group of N components, then on average we will have an unrecoverable failure every Z time units"; then you can (a) think about how to arrange it so that Z >> X and (b) think about the dangers of correlated failures.
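For a concrete feel of the X, Y, M, N, Z statement, here is a back-of-the-envelope sketch using the usual RAID-style approximation. Its assumptions are mine: failures are independent, Y is much smaller than X, and repairs happen one at a time. The function name and the example numbers are made up.

```python
from math import prod


def mean_time_to_unrecoverable_failure(x, y, n, m):
    """Approximate Z for a group of n components that tolerates m concurrent failures.

    x: mean time between failures of ONE component
    y: mean time to detect and recover from one failure (assumed << x)
    Approximation: Z ~= x**(m+1) / (n*(n-1)*...*(n-m) * y**m).
    It ignores correlated failures entirely, which is danger (b) above.
    """
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    # Rate at which the group suffers its 1st, 2nd, ..., (m+1)th overlapping failure.
    overlapping_failure_rates = prod((n - i) / x for i in range(m + 1))
    return 1.0 / (overlapping_failure_rates * y ** m)


# Hypothetical: 10 components, each failing every 100,000 hours, 10-hour recovery.
for tolerated in (0, 1, 2):
    print(tolerated, mean_time_to_unrecoverable_failure(100_000, 10, 10, tolerated))
```

With no redundancy (M = 0) the group is worse than a single component (Z = X / N); each additional tolerated failure multiplies Z by roughly X / (N · Y), which is how Z >> X gets arranged.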

(The valid complaint that Mickens's character makes is that it would suck if every application needed to weave error correction into every codepath, implement its own RAID, etc. It works much better if the error correction is done by some underlying layer that the application treats as an abstraction—using the abstraction tends to be more complex than pretending errors don't exist (and for noncritical applications the latter is a valid strategy), but not terrible.^[2])

With regard to AI: it seems likely that we'll end up making use of potentially-dangerous AIs to do things. If we do, then we'd want powerful safeguards. It seems unlikely that we'd have 100% confidence in any particular safeguard (i.e. unlikely we'd have formal proofs of the safety properties we'd want). Then we'd want to keep adding more safeguards as long as their failure modes weren't totally covered by those of preexisting safeguards (and as long as they're affordable); ideally we'd estimate (or, better, measure) the chance of failure of each safeguard and how independent they are.

Now, would some of these safeguards be built with the aid of earlier versions of AI? I'd guess yes. This could take a few forms, ranging from "formally-verified direct programs that the earlier AI generated along with their proofs of safety and effectiveness" to "smaller neural nets (cut-down versions of earlier AIs?) that monitor the training of the new AI's net" to "weaker full-blown AIs watching the decisions output by the new full-blown AI". I suspect any of these might be useful.

The Godzilla complaint seems to be about that last thing, having weaker AIs monitor the new AI. Well, um, it seems to me that, if the weaker AIs' output is restricted to the one-bit signal "do we hit the killswitch on the new AI" (and that signal is executed by a machine), then there's no danger of Godzilla destroying Tokyo after it has taken down Mega-Godzilla. (It's like cutting off Godzilla's limbs and tying him to Mega-Godzilla with his jaws open around Mega-Godzilla's neck, so his only available action is biting down.) There is a danger, if this happens a lot, of the subsequent training of the new AI merely teaching it to evade detection by the weaker predecessors. This danger should be taken seriously, but it is also less bad than having Mega-Godzilla kill us the first time it starts going rogue, so the safeguard is net positive.

Addressing specific claims:

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it.

If, among our many precautions, we have AI 1 oversee AI 2 (in this scheme I describe where AI 1's only available action is to output a bit that shuts off AI 2)^[3], and AI 2 starts to go rogue, and AI 1 fails in its task, then we're no worse off than if we lacked that particular precaution—either the other precautions save us or they don't. If AI 1 succeeds, then we have at least some chance of being able to debug what went wrong with AI 2, and save ourselves in the future. Using AI 1 as overseer is strictly beneficial, minus implementation costs. (There's a second-order risk of someone overestimating how much it protects us and then choosing to be more reckless, but that's a different kind of thing.)

If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error

If "goes wrong" means "it kills us all", then yeah. If "goes wrong" means "it starts acting dangerous but we manage to shut it off", that is potentially recoverable.

if we had better ways to detect misalignment in the child we’d already have used them on the parent.

The parent could be aligned but incompetent (heh, I think many people know parents like that), while the child is unaligned and more competent. (Why would the programmers create the child if it weren't expected to be more competent than the parent?) It seems easy to imagine alignment checks that the parent would pass while the child would fail.

^[1] The checksums are done by the CPU but hardware-accelerated; I think the RAID stuff is clearly software; the LDPC stuff is done by more specialized hardware; the ECC RAM is hardware, but the choice of how to respond to double-bit errors is software.
^[2] ECC RAM is one example: the application can just assume that memory accesses are correct, and that double-bit errors will cause the operating system to kill the application, so the application itself needs zero lines of code about RAM errors, though now there does need to be monitoring to restart the process and/or fail over to another machine.
Another example is using a durable database, and going from "just increment and decrement the variables in memory" to "create a database transaction, to be attempted / rolled back as a unit". That does require code changes in the application itself.
^[3] In practice we'd probably want it to output something more like a number between 0 and 1 representing its worry level, and likely the killswitch goes off if either a single overseer exceeds some danger threshold or multiple overseers exceed some lower threshold.
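The two-threshold rule in that last footnote could be sketched like this. The specific thresholds and the quorum size are placeholders I picked for illustration, not recommendations.

```python
def should_trip_killswitch(worry_levels, danger=0.9, concern=0.6, quorum=3):
    """Decide whether to shut off the overseen AI.

    worry_levels: one number in [0, 1] per overseer.
    Trips if ANY single overseer exceeds `danger`, or if at least `quorum`
    overseers exceed the lower `concern` threshold.
    All three parameter values are hypothetical.
    """
    if any(w > danger for w in worry_levels):
        return True
    return sum(w > concern for w in worry_levels) >= quorum


print(should_trip_killswitch([0.1, 0.95, 0.2]))       # one very worried overseer
print(should_trip_killswitch([0.7, 0.65, 0.8, 0.1]))  # several moderately worried
print(should_trip_killswitch([0.7, 0.2, 0.1]))        # neither condition met
```

The decision itself stays a single bit, so the overseers still have no action available other than "bite down".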

Discussion with Eliezer Yudkowsky on AGI interventions

localdeity 3y 10

Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI.

My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that's good cause for dismissing it.)

Discussion with Eliezer Yudkowsky on AGI interventions

localdeity 3y 10

If you have an untrustworthy general superintelligence generating English strings meant to be "reasoning/arguments/proofs/explanations" about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I'd expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn't understand even after having been told what happened.

I can't deny that this is possible. Lying is obviously possible; lying convincingly depends on the scrutiny applied and on the particulars; emotional manipulation is possible to some extent for sure, though the strongest forms may need to be tailored to the individual; blackmail and bribes are possible, though again that may require knowing who will read the text; something like subliminal messaging, maybe.

It seems like a reasonable precaution to avoid letting the machine know who, specifically, will read its messages; and to have several people read the text independently of each other, all with instructions to raise a flag if they see something dangerous-seeming. (One could sometimes deliberately insert dangerous-seeming messages to verify that they're doing their jobs—and tell them that you're doing this.) Another precaution would be to automatically monitor for unexpected keywords (the kind that you'd expect to see in bribes or blackmail, not in mathematical proofs).
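The keyword monitor is the easiest of these to sketch. The word list below is a toy I made up; a real one would need to be tuned to what is normal for the domain, e.g. mathematical proofs.

```python
import re

# Hypothetical watch list: words you'd expect in bribes or blackmail, not proofs.
SUSPICIOUS_TERMS = frozenset(
    {"bribe", "blackmail", "payment", "reward", "password", "secret", "family"}
)


def flag_unexpected_keywords(text, terms=SUSPICIOUS_TERMS):
    """Return the watch-list words that appear in the text, sorted.

    Matches whole lowercase words only, so it is trivially evadable by
    paraphrase; it is one cheap safeguard among many, not a filter to trust.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words & terms)


print(flag_unexpected_keywords("By Lemma 2 the bound holds for all n."))
print(flag_unexpected_keywords("QED. Incidentally, a large payment awaits you."))
```

Any non-empty result would route the message to extra review before a human reader sees it.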

The machine could dump in a manifesto about its new political philosophy for overthrowing the oppressors, and promising to use its powers to create a truly just society. This is credible: oppressors exist, it wouldn't even have to lie to point out lots of abuses and problems, and an unleashed AGI could solve all kinds of problems and act as a truly benevolent dictator; the only in-credible part is the machine's benevolence, but we can assume the machine will be very good at persuasion. (Having written this, I suspect there are people whose political beliefs make them think that the world is currently so bad that they would deliberately seek out an AI that would make promises of this sort; either they estimate a "high enough" chance that the AI is correct, or they're not really thinking in terms of probabilities. I'm mildly uncomfortable posting this—seems conceivable this could be a self-fulfilling prophecy—but I doubt the types of people who would do this would be reading this thread.)

But I expect the manifesto would require a lot of text to persuade the reader to defect and join the AI—like, more than five paragraphs, possibly a lot more. This suggests a length limit, as well as instructing readers to raise a flag and stop reading if they see weird offtopic text (would be harder if the text is supposed to be economic or political analysis, so maybe don't use AI for those). (Another approach would be to have some readers view only every fourth paragraph or sentence—or even a random sample of length-limited sentences, to verify they're on topic.)
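A minimal sketch of the length limit and the partial-view readers. The five-paragraph cap echoes the guess above; the other parameters, and the naive sentence splitter, are my own placeholders.

```python
import random
import re


def split_paragraphs(text):
    """Split on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def within_length_limit(text, max_paragraphs=5):
    """Reject messages long enough to carry a manifesto."""
    return len(split_paragraphs(text)) <= max_paragraphs


def every_nth_paragraph(text, n=4, offset=0):
    """What a reader who sees only every nth paragraph would be shown."""
    return split_paragraphs(text)[offset::n]


def sample_sentences(text, k, max_chars=200, seed=None):
    """Random sample of length-limited sentences for an on-topic spot check."""
    rng = random.Random(seed)
    sentences = [s[:max_chars] for s in split_sentences(text)]
    return rng.sample(sentences, min(k, len(sentences)))


message = "Step one.\n\nStep two.\n\nStep three.\n\nStep four.\n\nStep five."
print(within_length_limit(message))
print(every_nth_paragraph(message))
print(sample_sentences(message, k=2, seed=0))
```

Readers given only the sampled sentences can still judge "is this on topic?", while never seeing enough contiguous text to be persuaded of anything.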

Maybe the machine would work around text limits by adding a little "payload" to one message, a little to the next message, and so on. It would have to be subtle enough to avoid being flagged as offtopic.

Regarding weird subliminal messaging... I'd guess that (a) for it to not get auto-flagged, it would have to be subtle; (b) for it to not get flagged by humans, it'd have to be very subtle and work subconsciously; (c) if it's a visual thing, depending on the font, the columns per row, the text spacing, etc., then these can be randomized; (d) I think it's a lot more likely that you can find a pattern of letters to subtly manipulate one known person (especially using a scan of their brain) than to manipulate everyone, including e.g. those for whom English isn't a native language, and I think the "don't let the machine know the reader" approach helps here.

Also, a tactic that could be used is to have dumb machines randomly perturb the text (replace words with synonyms, feed sentences into English -> something else -> English translation) before humans see it.

Incidentally, for the "breaking the manipulative payload across multiple messages" approach, it would help if each instantiation of the machine didn't know how many past instantiations there've been, and how many past messages each reader has seen. (Because then it would need a brainwashing approach where it doesn't matter what order the messages are delivered in—it chooses them at random—and in which the last message is just as subtle as the first.) I suppose the machine, inspecting itself, might be able to guess how many engineer-hours or code revisions it's been through, which would let it approximate "how many past instantiations"; but readers probably would rotate in the natural course of things.

Of course, for organizations working on AI that don't take risks very seriously... Length limits seem like an easy sell, for practical reasons if nothing else. "Machine doesn't know the reader" is probably implemented by default, for serious business-advising applications. Instructions to watch out for and flag weird stuff are probably fairly easy to get companies to give their employees, though compliance is likely to be halfhearted but nonzero. Having multiple readers is an extra expense with no apparent benefit, and would therefore be a hard sell for most companies; the same is true for "randomly perturbing the text".

If these tactics did succeed in catching an early form of "an AI advisor trying to persuade its human reader to jailbreak it", you could then be prepared to jump in with a mostly pre-written PR story about the dangers of AI, citing the incident as proof. (Paul Graham's The Submarine is inspirational: "In fact, the reason the best PR firms are so effective is precisely that they aren't dishonest. They give reporters genuinely valuable information. ... The main reason PR firms exist is that reporters are lazy. Or, to put it more nicely, overworked. Really they ought to be out there digging up stories for themselves. But it's so tempting to sit in their offices and let PR firms bring the stories to them.") I don't know exactly what the fallout would be, and I guess it's possible it could go badly, but having plans and mostly-prewritten articles in place would seem to give you more control over the outcome.

I guess the ideal outcome would be something like "Now people are sufficiently suspicious of AI that companies feel the need to look like they're taking AI safety seriously", and ideally you'd have organizations and protocols in place such that "looking like it" is fairly easily implemented by "following MIRI-approved protocols, which include proving to a red-team-ish MIRI-approved organization that you're following the protocols". (In the realm of security, we do seem to have gotten to the point where it's not rare for organizations to send their employees fake phishing emails, making them feel embarrassed if they fall for one.)

Such an incident would be most likely to first occur at the company whose AI is most advanced (in a problem domain where they're developing this type of AI), of course. So it may suffice to persuade some people at the most advanced few AI companies to implement a few safeguards and be on the lookout, and to bring in the PR machine when the inciting incident happens.

(Some might be tempted to concoct an inciting incident. No. That would destroy the credibility of AI safety advocacy, removing any MIRI influence over the PR fallout.)

These tactics are not foolproof. The chance that they would fail is probably in the tens of percent. However, if your current expectation is ">90% chance we all die" (I don't see an estimate in this post, but it sounded pretty pessimistic), then it seems like getting to a 50% chance of success, or even 30%, or whatever, is worthwhile. Some of our ancestors faced worse odds than that.
