## AI ALIGNMENT FORUMAF

Benya Fallenstein

# Comments

Waterfall Truth Predicates

We should be more careful, though, about what we mean by saying that only depends on for , though, since this cannot be a purely syntactic criterion if we allow quantification over the subscript (as I did here). I'm pretty sure that something can be worked out, but I'll leave it for the moment.

Waterfall Truth Predicates

I would suggest changing this system by defining to mean that no is the Gödel number of a proof of an inconsistency in ZFC (instead of just asserting that isn't). The purpose of this is to make it so that if ZFC were inconsistent, then we only end up talking about a finite number of levels of truth predicate. More specifically, I'd define to be PA plus the axiom schema

Then, it seems that Jacob Hilton's proof that the waterfalls are consistent goes through for this waterfall:

Work in ZFC and assume that ZFC is inconsistent. Let be the lowest Gödel number of a proof of an inconsistency. Let be the following model of our language: Start with the standard model of PA; it remains to give interpretations of the truth predicates. If , then is false for all . If , then is true iff is the Gödel number of a true formula involving only for . Then, it's clear that , and hence all (since is the strongest of the systems) is sound on , and therefore consistent.

Thus, we have proven in ZFC that if ZFC is inconsistent, then is consistent; or equivalently, that if is inconsistent, then ZFC is consistent. Stepping out of ZFC, we can see that if is inconsistent, then ZFC proves this, and therefore in this case ZFC proves its own consistency, implying that it is inconsistent. Hence, if ZFC is consistent, then so is .

(Moreover, we can formalize this reasoning in ZFC. Hence, we can prove in ZFC (i) that if ZFC is inconsistent, then is consistent, and (ii) that if ZFC is consistent, then is consistent. By the law of the excluded middle, ZFC proves that is consistent.)

No Good Logical Conditional Probability

Hm; we could add an uninterpreted predicate symbol to the language of arithmetic, and let and . Then, it seems like the only barrier to recursive enumerability of is that 's opinions about aren't computable; this seems worrying in practice, since it seems certain that we would like logical uncertainty to be able to reason about the values of computations that use more resources than we use to compute our own probability estimates. But on the other hand, all of this makes this sound like an issue of self-reference, which is its own can of worms (once we have a computable process assigning probabilities to the value of computations, we can consider the sentence saying "I'm assigned probability " etc.).

Agents that can predict their Newcomb predictor

The other direction follows from the fact that the algorithm is bounded, and PA can simply show the execution trace of in steps.

Unimportant technical point: I think the length of the PA proof grows faster than this. (More precisely, the length in symbols, rather than the length in number of statements; we're almost always interested in the former, since it determines how quickly a proof can be checked or be found by exhaustive search.) The obvious way of showing in PA is to successively show for higher and higher that "after ticks, the Turing machine computing is in the following state, and its tapes have the following content". But even to write out the contents of the tape of a Turing machine that has run for ticks will in general take symbols. So the obvious way of doing the proof takes at least symbols. Moreover, I'm not sure whether we can get from "at time , the configuration is such-and-such" to "at time , the configuration is such-and-such" in a constant number of proof steps. I suspect we don't need more than symbols overall, but I'm not extremely confident. (I'd be quite surprised if the time turned out not to be polynomial, though!)

An Informal Conjecture on Proof Length and Logical Counterfactuals

Next, we consider the case that PA is consistent and work through the agent’s decision. PA can’t prove , since we used the chicken rule, so since the sentence is easily provable, the sentence (ie. the first sentence that the agents checks for proofs of) must be unprovable.

It seems like this argument needs soundness of PA, not just consistency of PA. Do you see a way to prove in PA that if , then PA is inconsistent?

[edited to add:] However, your idea reminds me of my post on the odd counterfactuals of playing chicken, and I think the example I gave there makes your idea go through:

The scenario is that you get 10 if you take action 1 and it's not provable that you don't take action 1; you get 5 if you take action 2; and you get 0 if you take action 1 and it's provable that you don't. Clearly you should take action 1, but I prove that modal UDT actually takes action 2. To do so, I show that PA proves . (Then from the outside, follows from the outside by soundness of PA.)

This seems to make your argument go through if we can also show that PA doesn't show . But if it did, then modal UDT would take action 1 because this comes first in its proof search, contradiction.

Thus, PA proves (because this follows from ), and also PA doesn't prove . As in your argument, then, the trolljecture implies that we should think "if the agent takes action 1, it gets utility 0" is a good counterfactual, and we don't think that's true.

Still interested in whether you can make your argument go through in your case as well, especially if you can use the chicken step in a way I'm not seeing yet. Like Patrick, I'd encourage you to develop this into a post.

Reflective probabilistic logic cannot assign positive probability to its own coherence and an inner reflection principle

If you replace the inner "" by "", then the literal thing you wrote follows from the reflection principle: Suppose that the outer probability is . Then

Now, implies , which by the converse of the outer reflection principle yields , whence . Now, by the forward direction of the outer reflection principle, we have

which, by the outer reflection principle again, implies

a contradiction to the assumption that had outer probability .

However, what we'd really like is an inner reflection principle that assigns probability one to the statement *quantified over all , , and Gödel numbers . I think Paul Christiano has a proof that this is impossible for small enough , but I don't remember how the details worked.

Paraconsistent Tiling Agents (Very Early Draft)

This is very interesting!

My main question, which I didn't see an answer to on my first read, is: If the agent proves that action leads to the goal being achieved, are there any conditions under which this implies that the goal actually is achieved? The procrastination paradox shows that this isn't true in general. Is there a class of sentences (e.g., all sentences, though I don't expect that to be true in this case) such that if then ? In other words, do we have some guarantee that we do better at actually achieving than an agent which uses the system ?

A more technical point: Given that you claim that Theorem 3.2 (reflectively coherent quantified belief) follows "trivially" from Theorem 2.4, you presumably mean Theorem 2.4 to state that , rather than . (Even if you consider to implicitly contain an occurrence of , it is essential that in the free variable is replaced by the numeral corresponding to the outer value of .)

Forum Digest: Corrigibility, utility indifference, & related control ideas

Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I've moved "resource gathering agent" to the odds & ends.

Identity and quining in UDT

Want to echo Nate's points!

One particular thing that I wanted to emphasize is that I think you can see as a thread on this forum (in particular, the modal UDT work is relevant) is that it's useful to make formal toy models where the math is fully specified, so that you can prove theorems about what exactly an agent would do (or, sometimes, write a program that figures it out for you). When you write out things that explicitly, then, for example, it becomes clearer that you need to assume that a decision problem is "fair" (extensional) to get certain results, as Nate points out (or if you don't assume it, someone else can look at your result and point out that it's not true as stated). In your post, you're using "logical expectations" that condition on something being true, without defining exactly what all of this means, and as a result you can argue about what these agents will do, but not actually prove it; that's certainly a reasonable part of the research process, but I'd like to encourage you to turn your work into models that are fully specified, so that you can actually prove theorems about them.

Meta- the goals of this forum

Some additions about how the initial system is going to work:

• Non-member contributions (comments and links) are going to become publicly visible when they have received 2 likes (from members---only members can like things).

• However, members will be able to reply to a contribution as soon as it has received 1 like. This means that if you think someone's made a useful contribution, and you want to reply to them about it, you don't have to wait for a second person to Like it before you can write that reply. (It won't be publicly visible until the contribution has two Likes, though.)

Load More