JBlack - AI Alignment Forum

The assumption here is that we can implement a system that deterministically computes mathematical functions with no side effects. We can get much closer for this than we can for leaking information outward, but we still fail to do this perfectly. There are real-world exploits that have broken this abstraction to cause undesired behaviour, and which could be used to gather information about some properties of the real world.

For example, cosmic rays cause bit errors and we can already write software that observes them with high probability of not crashing. We can harden our hardware such as by adding error correction, but this reduces the error rate without eliminating it. There are also CPU bugs, heat-related errors, and DRAM charge levels that have been exploited in real applications, and no doubt many other potential vectors that we haven't discovered yet.

It would certainly be extremely difficult to discern much about the real world purely via such channels, but it only takes one exploitable bug that opens a completely unintended inward channel for the game to be over.

Are limited-horizon agents a good heuristic for the off-switch problem?

JBlack3y00

It seems to me that there are some serious practical problems in trying to train this sort of behaviour. After all, a successful execution shuts the system off and it never updates on the training signal. You could train it for something like "when the date from the clock input exceeds the date on input SDD, output high on output SDN (which in the live system will feed to a shutdown switch)", but that's a distant proxy. It seems unlikely to generalize correctly to what you really want, which is much fuzzier.

For example, what you really want is more along the lines of determining the actual date (by unaltered human standards) and comparing with the actual human-desired shutdown date (without manipulating what the humans want), and actually shut down (by means that don't harm any humans or anything else they value). Except that this messy statement isn't nearly tight enough, and a superintelligent system would eat the world in a billion possible ways even assuming that the training was done in a way that the system actually tried to meet this objective.

How are we going to train a system to generalize to this sort of objective without it already being Friendly AGI?

Agency and the unreliable autonomous car

JBlack3y10

Yes, I'm referring to 5a/5b in conjunction with self-reflection as a deduction rule, since the agent is described as being able to use these to derive new propositions.

There is also a serious problem with 5a itself: the agent needs to try to prove that some new proposition P is "consistent with its model of the world". That is, for its existing axioms and derivation rules T, prove that T+P does not derive a contradiction.

If T is consistent, then this is impossible by Gödel's second incompleteness theorem. Hence for any P, Step 5a will always exhaust all possible proofs of up to whatever bound and return without finding such a proof. Otherwise T is inconsistent and it may be able to find such a proof, but it is also obvious that its proofs can't be trusted and not at all surprising that it will take the wrong route.

Agency and the unreliable autonomous car

JBlack3y10

This looks full of straightforward errors.

Firstly, the agent's logical deduction system is unsound. It includes something comparable with Peano arithmetic (or else Löb's theorem can't be applied), and then adds a deduction rule "if P can be proved consistent with the system-so-far then assume P is true". But we already know that for any consistent extension T of Peano arithmetic there is at least one proposition G for which both G and ~G are consistent with T. So both of these are deducible using the rule. Now, the agent might not find the contradiction, because...

This system can only find proofs up to a length of a million characters. The requirements of Löb's theorem include a formal system that is an extension of PA and is closed under deduction: that is, derives proofs of unbounded length. So we can't apply Löb's theorem to this system anyway.

AI ALIGNMENT FORUM
AF

Posts

Wiki Contributions

Comments