James Payor

Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.

It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:

My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.

We can instantiate this like:

The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant steps more to prove that we cooperate. So it doesn't open us up to exploitation to assume that our own cooperation is provable in steps.

The way in which this works at all is by cutting the loop at the point where the opponent is thinking about our own behaviour. This bot cuts it rather aggressively: it assumes that no matter the context, when thinks about whether cooperates, it's provable that does cooperate. (I think this isn't great and can be improved to a weaker assumption that would lead to more potential cooperation.)

If you construct similarly, I claim that and mutually cooperate if and are large enough, and mutually defect otherwise.

Similarly, I claim can mutually cooperate with other bots like .

A point of confusion: is it enough to prove that ? What about ? I'm not sure I can say this well, but here goes:

We might not be able to prove in the theory, or even that (which would mean "there are no proofs of inconsistency"). But if we believe our theory is sound, we believe that it can't show that a copy of itself proves something false.

So tells us that if is true, the theory would show that a copy of itself proves is true. And this is enough to convince *us* that we can't simultaneously have true and false.

Also, here's a proof that a bot is never exploited. It only cooperates when its partner provably cooperates.

First, note that , i.e. if cooperates it provably cooperates. (Proof sketch: .)

Now we show that (i.e. if chooses to cooperate, its partner is provably cooperating):

- We get by distributing.
- We get by applying internal necessitation to .
- By (1) and (2), .

(PS: we can strengthen this to , by noticing that .)

If I follow what you mean, we can derive:

So there's a Löbian proof, in which the provability is self-fulfilling. But this isn't sufficient to avoid this kind of proof.

(Aside on why I don't like the Löbian method: I moreso want the agents to be doing "correct" counterfactual reasoning about how their actions affect their opponent, and to cooperate because they see that mutual cooperation is possible and then choose it. The Löbian proof style isn't a good model of that, imo.)

(In fact, I think "assume they will think I cooperate" turns out to be too strong, and leads to unnecessary defection. I'm still working through the details.)

Thanks! I also wrote up my proof in another comment, which should help triangulate intuition.

In case this helps folks' intuition, my generating idea was something like: "look at what my opponent is thinking, but assume that whenever they check if I cooperate, they think the answer is yes". This is how we break the loop.

This results in a proof of the lemma like thus:

- (given)
- (unrolling one step)
- is straightforward intuitively, and we can get there by applying box-distributivity to

(EDIT: This is basically the same proof as in the post, but less simple. Maybe the interesting part is the "unroll once then break the loop" intuition.)

Relatedly, with two agents and , we can have:

- (given)
- (given)
- (substituting into )
- is straightforward as before

Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.

(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)

As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:

In english: