# All of Benya_Fallenstein's Comments + Replies

We should be more careful, though, about what we mean by saying that only depends on for , though, since this cannot be a purely syntactic criterion if we allow quantification over the subscript (as I did here). I'm pretty sure that something can be worked out, but I'll leave it for the moment.

I would suggest changing this system by defining to mean that no is the Gödel number of a proof of an inconsistency in ZFC (instead of just asserting that isn't). The purpose of this is to make it so that if ZFC were inconsistent, then we only end up talking about a finite number of levels of truth predicate. More specifically, I'd define to be PA plus the axiom schema

Then, it seems that Jacob Hilton's proof that the waterfalls are consistent goes through for this waterfall:

Work in ZFC and assume that ZFC is i

...
0Benya Fallenstein9y
We should be more careful, though, about what we mean by saying that φ(x) only depends on Trm for m>n, though, since this cannot be a purely syntactic criterion if we allow quantification over the subscript (as I did here). I'm pretty sure that something can be worked out, but I'll leave it for the moment.

Hm; we could add an uninterpreted predicate symbol to the language of arithmetic, and let and . Then, it seems like the only barrier to recursive enumerability of is that 's opinions about aren't computable; this seems worrying in practice, since it seems certain that we would like logical uncertainty to be able to reason about the values of computations that use more resources than we use to compute our own probability estimates. But on the other hand, all of this makes this sound like an issue of self-reference,

...

The other direction follows from the fact that the algorithm is bounded, and PA can simply show the execution trace of in steps.

Unimportant technical point: I think the length of the PA proof grows faster than this. (More precisely, the length in symbols, rather than the length in number of statements; we're almost always interested in the former, since it determines how quickly a proof can be checked or be found by exhaustive search.) The obvious way of showing in PA is to successively show for higher and higher that "after ticks, the T

...

Next, we consider the case that PA is consistent and work through the agent’s decision. PA can’t prove , since we used the chicken rule, so since the sentence is easily provable, the sentence (ie. the first sentence that the agents checks for proofs of) must be unprovable.

It seems like this argument needs soundness of PA, not just consistency of PA. Do you see a way to prove in PA that if , then PA is inconsistent?

[edited to add:] However, your idea reminds me of my post on the odd counterfactuals of playing chicken

...
2Sam Eisenstat9y
The argument that I had in mind was that if PA⊢A()≠1, then PA⊢□┌A()≠1┐, so PA⊢A()=1 since PA knows how the chicken rule works. This gives us PA⊢⊥, so PA can prove that if PA⊢A()≠1, then PA is inconsistent. I'll include this argument in my post, since you're right that this was too big a jump. Edit: We also need to use this argument to show that the modal UDT agent gets to the part where it iterates over utilities, rather than taking an action at the chicken rule step. I didn't mention this explicitly, since I felt like I had seen it before often enough, but now I realize it is nontrivial enough to point out.

If you replace the inner "" by "", then the literal thing you wrote follows from the reflection principle: Suppose that the outer probability is . Then

Now, implies , which by the converse of the outer reflection principle yields , whence . Now, by the forward direction of the outer reflection principle, we have

which, by the outer reflection principle again, impl

...
1Paul Christiano9y
Here is the basic problem. I think that you can construct an appropriate liar's sentence by using a Lipschitz function without an approximate fixed point. But someone might want to check that more carefully and write it up, to make sure and to see where the possible loopholes are. I think that it may not have ruled out this particular principle, just something slightly stronger (but the two were equivalent for the kinds of proof techniqeus we were considering).

This is very interesting!

My main question, which I didn't see an answer to on my first read, is: If the agent proves that action leads to the goal being achieved, are there any conditions under which this implies that the goal actually is achieved? The procrastination paradox shows that this isn't true in general. Is there a class of sentences (e.g., all sentences, though I don't expect that to be true in this case) such that if then ? In other words, do we have some guarantee that we do better at actually achieving than an agent which us

...

Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I've moved "resource gathering agent" to the odds & ends.

Want to echo Nate's points!

One particular thing that I wanted to emphasize is that I think you can see as a thread on this forum (in particular, the modal UDT work is relevant) is that it's useful to make formal toy models where the math is fully specified, so that you can prove theorems about what exactly an agent would do (or, sometimes, write a program that figures it out for you). When you write out things that explicitly, then, for example, it becomes clearer that you need to assume that a decision problem is "fair" (extensional) to get certain result

...
0Vanessa Kosoy9y
Hi Benja, thx for commenting! I agree that it's best to work on fully specified models. Hopefully, soon I will write about my own approach to logical uncertainty via complexity theory.

Some additions about how the initial system is going to work:

• Non-member contributions (comments and links) are going to become publicly visible when they have received 2 likes (from members---only members can like things).

• However, members will be able to reply to a contribution as soon as it has received 1 like. This means that if you think someone's made a useful contribution, and you want to reply to them about it, you don't have to wait for a second person to Like it before you can write that reply. (It won't be publicly visible until the contribution has two Likes, though.)

Yeah, that sounds good! Of course, by the Kripke levels argument, it's sufficient to consider 's of the form . And we might want to have a separate notion of " leads to at level ", which we can actually implement in a finite modal formula. This seems to suggest a version of modal UDT that tries to prove things in PA, then if that has ambiguous counterfactuals (i.e., it can't prove for any ) we try PA+1 and so on up to some finite ; then we can hope these versions of UDT approximate optimality according to your revised version of " leads t

...

Fixed---thanks, Patrick!

Regarding the example, I earlier defined "action leads to outcome " to mean the conjunction of and ; i.e., we check for spurious counterfactuals before believing that tells us something about what action leads to, and we only consider ourselves "fully informed" in this sense if we have non-spurious information for each . (Of course, my follow-up post is about how that's still unsatisfactory; the reason to define this notion of "fully informative" so explicitly was really to be able to say more clearly i

...
1orthonormal9y
Ah! I'd failed to propagate that somehow. Given that we're using PA+n to defeat evil problems, the true modal definition of "action i leads to outcome j" might be something like "there exists a closed formula ϕ such that N⊨ϕ, GL ⊢ϕ→(Ai→Uj), and GL ⊬ϕ→¬Ai". But that's an unnecessary complication for this post.

(Thanks for the feedback!)

In our chat about this, Eliezer said that, aside from the difficulty of making a human-level mathematical philosophy engine aligned with our goals, an additional significant relevant disagreement with Paul is that Eliezer thinks it's likely that we'll use low-level self-improvement on the road to human-level AI; he used the analogies of programmers using compilers instead of writing machine code directly, and of EURISKO helping Lenat. (Again, hoping I'm not misrepresenting.)

This seems like a plausible scenario to me, but I'm not convinced it argues for the

...
0Paul Christiano9y
I agree that many parts of AI research are already automated, and by the time we have broadly human-level AI, many more will be. I would not be surprised if the great majority of tasks in current AI research are automated long before we have broadly human-level AI. I wouldn't normally describe this as "low-level self-improvement," but that seems like a semantic issue. I am skeptical of the Eurisko example (and of claims about Eurisko in general) but I don't know if it's relevant to this disagreement.

...though although I'm guessing the variant of the reflective oracle discussed in the comment thread may be approximable, it seems less likely that a version of AIXI can be defined based on it that would be approximable.

Sorry for the delayed reply! I like this post, and agree with much of what you're saying. I guess I disagree with the particular line you draw, though; I think there's an interesting line, but it's at "is there a Turing machine which will halt and give the answer to this problem" rather than "is there a Turing machine which will spit out the correct answer for a large enough input (but will spit out wrong answers for smaller inputs, and you don't know what number is large enough)". The latter of these doesn't seem that qualitatively different to me from "i

...
0Benya Fallenstein9y
...though although I'm guessing the variant of the reflective oracle discussed in the comment thread may be approximable, it seems less likely that a version of AIXI can be defined based on it that would be approximable.

I think the main disagreement is about whether it's possible to get an initial system which is powerful in the ways needed for your proposal and which is knowably aligned with our goals; some more about this in my reply to your post, which I've finally posted, though there I mostly discuss my own position rather than Eliezer's.

Actually, drnickbone's original LessWrong post introducing evil problems also gives an extension to the case you are considering: The evil decision problem gives the agent three or more options, and rewards the one that the "victim" decision theory assigns the least probability to (breaking ties lexicographically). Then, no decision theory can put probability on the action that is rewarded in its probabilistic evil problem.

0Paul Christiano9y
This setup plays a computational trick, and as a result I don't think it violates the optimality standard I proposed. In order to decide what it should do, the CDT agent needs to think strictly longer than the UDT agent. But if the CDT agent thinks longer than the UDT agent, it's totally unsurprising that it does better! (Basically, the problem just consists of a computational question which is chosen to be slightly too complex for the UDT agent. But the CDT agent is allowed to think as long as it likes. This entire family of problems appears to be predicated on the lack of computational limits for our agents.) As a result, if the UDT agent is told what the CDT agent decides, then it can get the same performance as the CDT agent. This seems to illustrate that the CDT agent isn't doing better by being wiser, just by knowing something the UDT agent doesn't. (I wasn't actually thinking about this case when I introduced the weakened criterion; the weakening is obviously necessary for UDT with 10 years of time to compete with CDT with 11 years of time, and I included it for that reason.) Does this seem right? If so, is there a way to set up the problem that violates my weakened standard? Incidentally, this problem involves a discontinuous dependence on UDT's decision(both by the competitor and by the environment). I wonder if this discontinuous dependence is necessary?

True. This looks to me like an effect of EDT not being stable under self-modification, although here the issue is handicapping itself through external means rather than self-modification---like, if you offer a CDT agent a potion that will make it unable to lift more than one box before it enters Newcomb's problem (i.e., before Omega makes its observation of the agent), then it'll cheerfully take it and pay you for the privilege.

Thread for proofs of results claimed above (not limited to stuff not found in Lindström); contributions appreciated. Stuff not in Lindström includes the uniqueness of arithmetic fixed points in PA (found in the modal agents paper), and I think the version of the fixed point theorem with more than one (the latter should be provable by iterated application of the fixed point theorem for a single ).

2Jaime Sevilla Molina8y
Generalized fixed point theorem: Suppose that Ai(p1,...,pn) are n modal sentences such that Ai is modalized in pn (possibly containing sentence letters other than pjs). Then there exists H1,...,Hn in which no pj appears such that GL⊢∧i≤n{⊡(pi↔Ai(p1,...,pn)}↔∧i≤n{⊡(pi↔Hi)}. ---------------------------------------- We will prove it by induction. For the base step, we know by the fixed point theorem that there is H such that GL⊢⊡(p1↔Ai(p1,...,pn))↔⊡(p1↔H(p2,...,pn)) Now suppose that for j we have H1,...,Hj such that GL⊢∧i≤j{⊡(pi↔Ai(p1,...,pn)}↔∧i≤j{⊡(pi↔Hi(pj+1,...,pn))}. By the second substitution theorem, GL⊢⊡(A↔B)→[F(A)↔F(B)]. Therefore we have that GL⊢⊡(pi↔Hi(pj+1,...,pn)→[⊡(pj+1↔Aj+1(p1,...,pn))↔⊡(pj+1↔Aj+1(p1,...,pi−1,Hi(pj+1,...,pn),pi+1,...,pn))]. If we iterate the replacements, we finally end up with GL⊢∧i≤j{⊡(pi↔Ai(p1,...,pn)}→⊡(pj+1↔Aj+1(H1,...,Hj,pj+1,...,pn)). Again by the fixed point theorem, there is H′j+1 such that GL⊢⊡(pj+1↔Aj+1(H1,...,Hj,pj+1,...,pn))↔⊡[pj+1↔H′j+1(pj+2,...,pn)]. But as before, by the second substitution theorem, GL⊢⊡[pj+1↔H′j+1(pj+2,...,pn)]→[⊡(pi↔Hi(pj+1,...,pn))↔⊡(pi↔Hi(H′j+1,...,pn)). Let H′i stand for Hi(H′j+1,...,pn), and by combining the previous lines we find that GL⊢∧i≤j+1{⊡(pi↔Ai(p1,...,pn)}→∧i≤j+1{⊡(pi↔H′i(pj+2,...,pn))}. By Goldfarb's lemma, we do not need to check the other direction, so GL⊢∧i≤j+1{⊡(pi↔Ai(p1,...,pn)}↔∧i≤j+1{⊡(pi↔H′i(pj+2,...,pn))} and the proof is finished □ ---------------------------------------- An immediate consequence of the theorem is that for those fixed points Hi and every Ai, GL⊢Hi↔Ai(H1,...,Hn). Indeed, since GL is closed under substitution, we can make the change pi for Hi in the theorem to get that GL⊢∧i≤n{⊡(Hi↔Ai(H1,...,Hn)}↔∧i≤n{⊡(Hi↔Hi)}. Since the righthand side is trivially a theorem of GL, we get the desired result. ---------------------------------------- One remark: the proof is wholly constructive. You can iterate the construction of fixed point following the procedur
2Jaime Sevilla Molina8y
Uniqueness of arithmetic fixed points: Notation: ⊡A=□A∧A Let H be a fixed point on p of ϕ(p); that is, GL⊢⊡(p↔ϕ(p))↔(p↔H). Suppose I is such that GL⊢H↔I. Then by the first substitution theorem, GL⊢F(I)↔F(H) for every formula F(q). If F(q)=⊡(p↔q), then GL⊢⊡(p↔H)↔⊡(p↔I), from which it follows that GL⊢⊡(p↔ϕ(p))↔(p↔I). Conversely, if H and I are fixed points, then GL⊢⊡(p↔H)↔⊡(p↔I), so since GL is closed under substitution, GL⊢⊡(H↔H)↔⊡(H↔I). Since GL⊢⊡(H↔H), it follows that GL⊢(H↔I). (Taken from The Logic of Provability, by G. Boolos.)

Thanks for expanding on your construction! I hadn't thought of the recursive construction, that's really neat.

I'm not that worried about the application to AIXI: unless I'm missing something, we can just additionally give our machines access to a double halting oracle (for ordinary Turing machines), and recover the same power. I considered doing that and didn't go with it because the stuff in my post seemed slightly more elegant, but if there's a reason to prefer the other version, it seems fine to use it.

I'm not clear on what you mean by "

...

Thanks! I didn't really think at all about whether or not "money-pump" was the appropriate word (I'm not sure what the exact definition is); have now changed "way to money-pump EDT agents" into "way to get EDT agents to pay you for managing the news for them".

0Daniel Dewey9y
Hm, I don't know what the definition is either. In my head, it means "can get an arbitrary amount of money from", e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.

About determining whether something's a query: A formulation of the statement we want to test is, (A) "for every (infinite) input on the oracle tape, and every , there is a such that with probability , the program halts in steps."

I think this is equivalent to, (B) "For every , there is a such that, for any input on the oracle tape, the program halts within timesteps with probability ."

Reason why I think this is equivalent: Clearly (B) implies (A). Suppose that (B) is false; then

...
0Benya Fallenstein9y
In the other direction, suppose that φ(m,n) is a primitive recursive predicate, and consider the following probabilistic Turing machine: First, randomly choose an m∈N (placing positive probability on every natural number). Then, search exhaustively for an n∈N such that φ(m,n) is true. If you find one, halt and output 1. This machine is a query if and only if ∃m.∀n.φ(m,n). Together with the parent comment, this shows that having an oracle that tells you whether or not something is a query is equivalent to having an oracle for Π2 statements (i.e., a double halting oracle).

In the other direction, suppose that is a primitive recursive predicate, and consider the following probabilistic Turing machine: First, randomly choose an (placing positive probability on every natural number). Then, search exhaustively for an such that is true. If you find one, halt and output . This machine is a query if and only if . Together with the parent comment, this shows that having an oracle that tells you whether or not something is a query is equivalent to having an oracle for statements (i.e., a double h

...

Aagh... I'm conflicted. I really like the simplicity of having players correspond 1:1 to calls to the oracle, and their mixed strategies to the probability that the oracle returns "true", but the implication that the oracle must return the same answer if it's called twice on the same arguments during the execution of a single program interacts very badly with the intuitive picture of the applications I have in mind.

The intuitive picture is that the agents under consideration are living in a world whose laws of physics are probabilistic and allow for the co

...
0Jessica Taylor9y
Hmm...this seems like less of a problem to me. The thing we need for the Nash equilibrium equivalence is that, within an execution of a single program, equal calls to the oracle return equal results (or equivalently, you can't give the same call to the oracle twice). But we don't need to give the same call to the oracle twice in order to model the matching pennies game, because you just need one call (to determine whether your opponent plays heads with greater than 50% probability), and the two players are represented as different programs. The thing we're basing the Nash equilibrium equivalence on is the question "given fixed return values of calls, what is the probability that this program returns 1?" in order to write in utility values for our normal form game. Since we're basing everything on this process, it follows that within an execution of a program, it only make sense to make a given call once. But nothing about this rules out having different programs reason about the same call, and having the oracle give them different answers.

Fixed the typo, thanks!

I considered describing the probabilities of the oracle returning "true" by a different function , but it seemed too pedantic to have a different letter. Maybe that's wrong, but it still feels too pedantic. If I do things that way I probably shouldn't be writing " returns 'true' if...", though...

Thanks! I've been also thinking about making a finite version; my main motivation was that I've been hoping to take my still-incompletely-specified AIXI variant and find an analog variant of AIXItl, the computable version of AIXI that considers only hypotheses which terminate in time and whose source length is .

I think that would in fact be a pretty reasonable finite set of programs to use in what you're suggesting, since "we want it to terminate in time " provides a motivation for why we would require programs to terminate surely, as opposed to al

...
1Jessica Taylor9y
I thought about it some more and realized that this is equivalent to finding a Nash equilibrium. What we do is, we create a player for every call (M,p). This player has 2 actions (call them action 0 and action 1). The player's utilities are set up so that they always get p utility for choosing action 0. Also, given fixed actions for the other players, their utility for action 1 is set to be the probability that M[O′] returns 1 given that O′ is an oracle that returns the action of a different player corresponding to the call it receives. Now, the player's expected utility is p for action 0, and for action 1 it will be M[O′′]'s probability of returning 1 where O′′ is an oracle that returns answers to calls under the same distribution as the other players' mixed strategies. So, it will be allowed to choose action 0 iff P(M,O)<=p, action 1 iff P(M,O)>=p, and will be allowed to mix iff P(M,O)=p, as desired. It should be noted that within a program, repeated calls to the oracle must return the same value, so they should be cached. Going the other direction is easy: to find the Nash equilibrium for a game where each player has 2 actions, set up a program for each player that queries the oracle for the actions of each other player and then returns the action that maximizes expected utility. For games with more than 2 actions (say, each player has actions 1...k), we can create a different set of programs for each player: program i where i∈{1..k−1} returns 1 iff some maximum expected utility action's index is less that or equal to i. It seems to follow that the full problem is equivalent to a game with countably infinite players. I was thinking of how you would solve the full problem using some kind of oracle machine. It would at least be necessary to determine whether a program halts for every infinite input you give it, because this is necessary to detect queries. This interacts weirdly with the requirement that queries almost surely halt; it may require something analogo

Hm, I think it would be expectedly helpful to explain things more, but it would also take more time, and aiming much lower would take much more time. (That's part of why it's taking me so long to turn these things into research papers, although in that case the length limitations make the constraint even more difficult.) At the moment, I'm trying to get the things I'm working on out there at all, especially with something like this where I haven't yet worked through all the details of whether I can actually do my intended application based on it. After all

...

A model of UDT with a halting oracle searches only for one utility value for each action. I'm guessing the other formulation just wasn't obvious at the time? (I don't remember realizing the possibility of playing chicken implicitly before Will Sawin advertised it to me, though I think he attributed it to you.)

Thanks! :-)

The reflective assignments correspond very directly to Nash equilibria (albeit in an -to- manner, because given any finite game, a reflective assignment contains information about that game but also information about many other things). So I wouldn't quite say that they correspond to methods of equilibrium selection---e.g., if two methods of equilibrium selection give you the same Nash equilibrium, they're not distinguishable on the level of reflective assignments. But yeah, the question of what equilibrium gets played gets outsourced to the o

...
Now I'm wondering how this compares to Paul's probabilistic reflection. Are there some nice "consistency" axioms that are satisfied by one approach but not the other, or vice versa?

However, the approach is designed so that these "spurious" logical implications are unprovable, so they don’t interfere with decision-making. The proof of that is left as an easy exercise.

I don't think this is technically true as stated; it seems to be possible that the agent proves some spurious counterfactuals as long as the outcome it does in fact obtain is the best possible one. (This is of course harmless!) Say the agent has two possible actions, and , leading to outcomes and , respectively. The latter is preferred, and thes

...
0orthonormal9y
Ooh, nice: we don't need to eliminate all spurious counterfactuals, only the malignant ones!