
An Approach to Logically Updateless Decisions

2Sam Eisenstat

2Abram Demski

0Sam Eisenstat

0Abram Demski


In counterfactual mugging with a logical coin, AsDT always uses a logical inductor’s best-estimate of the utility it would get right now, so it sees the coin as already determined, and sees no benefit from giving Omega money in the cases where Omega asks for money.

The way I would think about what's going on is that if the coin is already known at the time that the expectations are evaluated, then the problem isn't convergent in the sense of AsDT. The agent that pays up whenever asked has a constant action, but it doesn't receive a constant expected utility. You can think of the averaging as introducing artificial logical uncertainty to make more things convergent, which is why it's more updateless. (My understanding is that this is pretty close to how you're thinking of it already.)

I think AsDT has a limited notion of convergent problem. It can only handle situations where the optimal strategy is to make the same move each time. Tail-dependence opens this up, partly by looking at the limit of average payoff rather than the limit of raw payoff. This allows us to deal with problems where the optimal strategy is complicated (and even somewhat dependent on what's done in earlier instances in the sequence).

I wasn't thinking of it as introducing artificial logical uncertainty, but I can see it that way.

Yeah, I like tail dependence.

There's this question of whether for logical uncertainty we should think of it more as trying to "un-update" from a more logically informed perspective rather than trying to use some logical prior that exists at the beginning of time. Maybe you've heard such ideas from Scott? I'm not sure if that's the right perspective, but it's what I'm alluding to when I say you're introducing artificial logical uncertainty.

I don't think it's much like un-updating. Un-updating takes a specific fact we'd like to pretend we don't know. Plus, the idea there is to back up the inductor. Here I'm looking at average performance *as estimated by the latest stage of the inductor*. The artificial uncertainty is more like pretending you don't know which problem in the sequence you're at.

Scott describes two obstacles for logical inductor decision theory: reasoning sanely about untaken actions, and getting a working notion of logical updatelessness. Here, I consider the question: if we could solve the first problem, could we solve the second? I provide a somewhat plausible substitute for updatelessness which relies on taking a logical counterfactual.

## Motivation

Suppose we have a notion of logical counterfactual, □→, represented by a hypothetical axiomatic system LC which is also rich enough to talk about computations. We trust LC; specifically, if it is a theorem of LC that an agent *would have* achieved a particular utility *had it* followed an alternative strategy, we put as much or more faith in this than in our own intuition about what could have been. The 5-and-10 problem is solved by decree: if an agent reasoning with LC thinks taking the $10 would be worse, then it thinks so because it's true.

I take it that this does not solve problems which involve updateless-like reasoning about logical uncertainty. In counterfactual mugging with a logical coin, the agent can see that counterfactually giving Omega $100 doesn't get it anything in the real mathematical universe; it doesn't seem like the correct hypothetical would be anything like "If I give Omega $100 having seen that the 1000th digit of pi is even, then the digit might turn out to be odd rather than even, so I might get $10000." This is the realm of updateless reasoning rather than counterfactuals; but we have no satisfactory account of logical updatelessness.

So, we can't directly say that it would be better to give Omega $100. However, in a sequence of scenarios using consecutive digits of pi, we can see that the strategy which always gives the $100 when Omega asks for it will do better in the average case than the strategy which refuses; half the time, we get $10K, if we're the sort of person who gives up $100 the other half of the time.
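The averaging argument can be checked numerically. The following is a minimal sketch, not part of the original proposal: it assumes the convention from the quoted hypothetical (an even digit means Omega asks for $100, an odd digit means Omega pays $10000 to an agent that would have paid), and the names `mugging_payoff` and `average_payoff` are hypothetical helpers introduced for illustration.

```python
from fractions import Fraction

# First 50 decimal digits of pi (after "3."), hard-coded to keep the sketch self-contained.
PI_DIGITS = "14159265358979323846264338327950288419716939937510"


def mugging_payoff(digit: int, pays_when_asked: bool) -> int:
    """Payoff of one counterfactual-mugging instance keyed to a single digit.

    Assumed convention: even digit -> Omega asks for $100;
    odd digit -> Omega pays $10000 iff the agent is the sort that pays when asked.
    """
    if digit % 2 == 0:
        return -100 if pays_when_asked else 0
    return 10_000 if pays_when_asked else 0


def average_payoff(pays_when_asked: bool, n: int) -> Fraction:
    """Average payoff over the first n instances for a fixed strategy."""
    digits = [int(c) for c in PI_DIGITS[:n]]
    return Fraction(sum(mugging_payoff(d, pays_when_asked) for d in digits), n)


pay_avg = average_payoff(True, len(PI_DIGITS))
refuse_avg = average_payoff(False, len(PI_DIGITS))
print(float(pay_avg), float(refuse_avg))
```

The refusing strategy averages exactly zero, while the paying strategy's average is positive as soon as any odd digit appears, which is the sense in which paying "does better in the average case".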

## Average DT

I'll represent a decision problem $U$ as a computation which takes no arguments and yields a rational number. The evaluation of $U$ will be written $\mathrm{Eval}[U]$. A sequence of decision problems, $U_n$, can be defined by a function from natural numbers to computations. The average utility up to $n$ is $\bar{U}_n = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Eval}[U_i]$. An agent $A(n)$ will be a function of $n$. We'll also consider the utility $\bar{U}^{B}_n$ which might have been achieved had agent $A$ acted like agent $B$, which is the value $u$ such that $(A := B) \mathbin{\Box\!\!\to} (\bar{U}_n = u)$. ($:=$ is supposed to be a directional assignment from the language of LC, which says that $A(n)$ outputs what $B(n)$ normally would for all $n$. Hopefully LC can define such a concept.) I make the assumption that $\bar{U}^{B}_n$ is computable.

**Definition: Average DT.** Take a sequence of decision problems $U_n$, an accuracy $\epsilon > 0$, and a finite set of alternative strategies $B_m(n)$. The average-DT agent $A(n)$ computes the strategy which $A(n-1)$ selected; call it $B_x$. It then computes $\bar{U}^{B_m}_{n-1}$ for all $B_m$. If $\bar{U}^{B_y}_{n-1} > \bar{U}^{B_x}_{n-1} + \epsilon$ for any $B_y$, then $A(n)$ selects $B_y$ to determine the output of this round; $A(n) = B_y(n)$. Otherwise, $A(n) = B_x(n)$.

The idea is that a good way to achieve high utility for large $n$ is to choose outputs according to the strategy which does best on average in the limit. We can eventually do that if we keep switching to whatever strategy currently does best on average (but with $\epsilon$ "stickiness" to prevent us from oscillating between strategies forever). This only works under some conditions, though.
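The selection rule can be sketched in code. This is an illustration under loudly stated assumptions rather than the definition itself: the problems are taken to be independent, so the counterfactual average for a strategy is just the running mean of the payoffs its own actions would earn (standing in for the □→ computation); the initial strategy and the tie-break (switch to the best-scoring candidate) are not specified in the definition and are chosen here arbitrarily. `average_dt` and `mugging_utility` are hypothetical names.

```python
from fractions import Fraction
from typing import Callable, List, Sequence

Strategy = Callable[[int], str]           # B_m: instance index n -> action
Utility = Callable[[int, str], Fraction]  # assumed payoff of an action at instance n


def average_dt(strategies: Sequence[Strategy], utility: Utility,
               epsilon: Fraction, rounds: int, initial: int = 0) -> List[int]:
    """Return the index of the strategy selected at each round 1..rounds."""
    totals = [Fraction(0)] * len(strategies)  # counterfactual payoff sums through n-1
    current = initial                         # assumption: arbitrary starting strategy
    chosen = []
    for n in range(1, rounds + 1):
        if n > 1:
            averages = [t / (n - 1) for t in totals]
            best = max(range(len(strategies)), key=lambda m: averages[m])
            # Switch only if some strategy beats the current one by more than epsilon.
            if averages[best] > averages[current] + epsilon:
                current = best
        chosen.append(current)
        # Record what each strategy would have earned on instance n.
        for m, b in enumerate(strategies):
            totals[m] += utility(n, b(n))
    return chosen


def mugging_utility(n: int, action: str) -> Fraction:
    """Toy coin: even n -> Omega asks for $100; odd n -> Omega rewards payers."""
    if n % 2 == 0:
        return Fraction(-100) if action == "pay" else Fraction(0)
    return Fraction(10_000) if action == "pay" else Fraction(0)


strategies = [lambda n: "refuse", lambda n: "pay"]
chosen = average_dt(strategies, mugging_utility, Fraction(1, 10), rounds=20)
print(chosen)
```

Starting from the refusing strategy, this toy agent switches to the paying strategy at the second round and stays there, since the paying strategy's running average remains above zero.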

**Definition: Independent.** A sequence of decision problems $U_n$ is *independent* if $m \neq n$ implies $U_n = u \rightarrow \big( (A(m) := a) \mathbin{\Box\!\!\to} U_n = u \big)$. That is, hypothetically changing the output of the agent on $m$ has no effect on the utility at $n$.

**Definition: Tail-dependent.** Call a pair of strategies $B, C$ convergent if there exists an $n$ such that for all $m > n$, $B(m) = C(m)$. A sequence of decision problems $U_n$ is *tail-dependent* if, for a convergent pair $B, C$, if $\lim_{n\to\infty} \bar{U}^{B}_n$ exists, then $\lim_{n\to\infty} \bar{U}^{C}_n$ also exists, and $\lim_{n\to\infty} \bar{U}^{B}_n = \lim_{n\to\infty} \bar{U}^{C}_n$.

Tail-dependence is a more general condition than independence. Whereas independence says that results at $n$ only depend on actions at $n$, tail-dependence allows $\mathrm{Eval}[U_n]$ to depend on $A(n-1)$, $A(n-2)$, and so on, as long as the effects of any initial mistakes can eventually be washed out by following an optimal strategy in the limit. Independence is only defined here because I think of it as the more typical case. The subscript $n$ does not represent time; each instance is thought of as a decision problem of its own, and the aggregation of the utility which average-DT does is only in service of getting high utility on individual instances. If there is some dependence between instances, it is to be thought of as interaction with a separate version of oneself.

**Theorem.** For an average-decision-theory agent $A$, if $U_n$ is tail-dependent, and the limits $\lim_{n\to\infty} \bar{U}^{B_m}_n$ exist for all $B_m$ in the set of strategies, then the limit $\lim_{n\to\infty} \bar{U}^{A}_n$ exists, and furthermore $\lim_{n\to\infty} \bar{U}^{A}_n \geq \lim_{n\to\infty} \bar{U}^{B_m}_n - \epsilon$ for all $B_m$.

**Proof.** In the tail-dependent case, since the limits of the $\bar{U}^{B_m}_n$ exist, eventually $\bar{U}^{B_m}_n$ will vary by less than $\epsilon/2$ for all $m$. At that point, the agent will never switch strategies again, because the running average for the strategy currently in use, $B_x$, will never fall more than $\epsilon$ below the running average for any other. This means $A$ is convergent with $B_x$. Since $\lim_{n\to\infty} \bar{U}^{B_x}_n$ exists, $\lim_{n\to\infty} \bar{U}^{A}_n$ does as well, and the two are equal. $\Box$

(If we knew that only one of the $B_m$ is optimal, we could allow $\epsilon = 0$, since the ordering of the running-average utility of strategies can only oscillate forever when the limits are the same.)
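The $\epsilon$ bound in the theorem follows from the switching rule once the agent has settled. As a sketch, write $N$ for a round after which $A$ no longer switches away from $B_x$ ($N$ is notation introduced here, not in the original):

```latex
\begin{align*}
\bar{U}^{B_m}_{n-1} &\le \bar{U}^{B_x}_{n-1} + \epsilon
  && \text{for all } n > N \text{ and all } m \text{ (otherwise } A \text{ would switch)} \\
\lim_{n\to\infty} \bar{U}^{B_m}_{n} &\le \lim_{n\to\infty} \bar{U}^{B_x}_{n} + \epsilon
  && \text{(both limits exist by assumption)} \\
\lim_{n\to\infty} \bar{U}^{A}_{n} = \lim_{n\to\infty} \bar{U}^{B_x}_{n}
  &\ge \lim_{n\to\infty} \bar{U}^{B_m}_{n} - \epsilon
  && \text{(tail-dependence, since } A \text{ and } B_x \text{ are convergent)}
\end{align*}
```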

Note that the definition of $A(n)$ computes $\bar{U}^{B_m}_{n-1}$ rather than $\bar{U}^{B_m}_n$. This was because, at least in the independent case, $\bar{U}_n$ will typically involve $A(n)$ itself; it seems potentially unrealistic to assume that this case can be computed by $A(n)$ (although the fact that we're only evaluating it under the hypothetical case that we act like $B_m$ may make this feasible). It also costs us nothing in terms of the optimality guarantee. However, it may be that even $U_{n-1}$ is too intractable for $A(n)$ to realistically evaluate. We can solve this in theory by using logical induction to approximate $\bar{U}^{B_m}_n$ for policy selection. No matter how slowly the $\bar{U}^{B_m}_n$ become computable, $\lim_{n\to\infty} \bar{U}^{B_m}_n = \lim_{n\to\infty} \mathbb{E}_n[\bar{U}^{B_m}_n]$, and the same optimality proof applies.

## Discussion

This is very similar in flavor to asymptotic decision theory. I'll use the abbreviations AsDT and AvDT. Like AsDT, AvDT has an $\epsilon$ accuracy parameter, has a finite set of strategies which we choose from, treats the method of taking counterfactuals as given, deals with infinite sequences of decisions, and cares only about properties in the limit. Unlike AsDT, AvDT chooses the highest average reward over all rounds so far rather than the highest expected reward on this round. AsDT is also an epsilon-exploration approach, whereas AvDT relies on □→ to give the correct expected utility that alternate actions would have yielded.

These differences make AvDT behave in a more updateless manner than AsDT. In counterfactual mugging with a logical coin, AsDT always uses a logical inductor's best estimate of the utility it would get *right now*, so it sees the coin as already determined, and sees no benefit from giving Omega money in the cases where Omega asks for money. AvDT uses logical induction (or simple evaluation) to determine the *average* utility which a policy gets, so it sees giving Omega money as beneficial.

I doubt that an approach to learning □→ analogous to AsDT's optimistic learning of counterfactuals will work very well here, but that is mostly because I don't think it works very well for AsDT in the first place. (It does not learn the counterfactual structure we would like it to when it plays prisoner's dilemma against a copy of itself, for example.)

A big limitation of this approach is the need to represent decision problems as part of some infinite sequence. Different infinite sequences will yield different 'optimal' approaches for the same decision problem. For example, AvDT's strategy for counterfactual mugging would be much different if $U_n$ were the subsequence of cases where the digit of pi is even. On the one hand, this is like saying that Omega only goes through the whole procedure in cases where it will make Omega money. But on the other hand, we're facing the exact same situation in the world -- it's just the representation as part of a sequence of problems which has changed. It seems the sequence $U_n$ plays a role similar to the prior in UDT, telling us what spread of possibilities to prepare for.
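The sensitivity to the choice of sequence can be made concrete with a small sketch. It assumes the same convention as the mugging example (even digit: Omega asks for $100; odd digit: Omega pays $10000 to a would-be payer); `payoff` and `average` are hypothetical helper names, not anything defined in the proposal.

```python
from fractions import Fraction

# First 40 decimal digits of pi (after "3."), hard-coded for self-containment.
PI_DIGITS = "1415926535897932384626433832795028841971"


def payoff(digit: int, pays_when_asked: bool) -> int:
    """Assumed convention: even digit -> asked for $100; odd digit -> rewarded if a payer."""
    if digit % 2 == 0:
        return -100 if pays_when_asked else 0
    return 10_000 if pays_when_asked else 0


def average(digits, pays_when_asked: bool) -> Fraction:
    """Average payoff of a fixed strategy over the given sequence of instances."""
    return Fraction(sum(payoff(d, pays_when_asked) for d in digits), len(digits))


all_digits = [int(c) for c in PI_DIGITS]
even_only = [d for d in all_digits if d % 2 == 0]  # the subsequence where Omega always asks

print("full sequence:", float(average(all_digits, True)), float(average(all_digits, False)))
print("even-only:    ", float(average(even_only, True)), float(average(even_only, False)))
```

On the full sequence the paying strategy has the higher average, while on the even-only subsequence it averages exactly -100 against 0 for refusing, so an average-based agent would favor opposite strategies depending on which sequence the same instance is embedded in.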

Perhaps there are some conditions under which we can guarantee good asymptotic behavior for not only one sequence of decision problems, but many. If none of the sequences are subsequences of each other, then it seems possible to guarantee eventual optimality for all of them. However, that's a very limiting condition, as illustrated by the counterfactual mugging example in the previous paragraph; we need to know which sequence to pick when one is a subsequence of the other.

But my uneasiness here may mainly be due to the fact that I am accustomed to an agent requiring an arbitrary prior, but unaccustomed to an agent requiring an arbitrary problem sequence. As with the problem of selecting a prior, it seems a good rule of thumb will be to pick the broader one, in absence of contrary information. I'm not holding my breath for a good notion of "universal problem sequence" here, but it would be nice if there could be such a concept.

The desirability of the approach will also depend on whether the dependence on ϵ and the restriction to finite sets of strategies can be eliminated. I'm skeptical it will be possible to get rid of either one.

Oh, and another problem: the behavior breaks down quickly outside the optimality conditions. If the decision at even one early instance $A(n)$ matters forever, the agent will ignore this in its choices. This seems problematic.

Despite all these issues, my current intuition is that this approach has "the right flavor" in some sense: an optimality condition showing that we converge to the best strategy in the limit seems about as good a property as we can hope for in a decision theory compatible with logical uncertainty.