Diffractor

The Many Faces of Infra-Beliefs

Looks good.

Re: the dispute over normal bayesianism: For me, "environment" denotes "thingy that can freely interact with any policy in order to produce a probability distribution over histories". This is a different type signature than a probability distribution over histories, which doesn't have a degree of freedom corresponding to which policy you pick.

But for infra-bayes, we can associate a classical environment with the *set* of probability distributions over histories (for various possible choices of policy), and then the two distinct notions become the same sort of thing (set of probability distributions over histories, some of which can be made to be inconsistent by how you act), so you can compare them.

The Many Faces of Infra-Beliefs

I'd say this is mostly accurate, but I'd amend number 3. There's still a sort of non-causal influence going on in pseudocausal problems, you can easily formalize counterfactual mugging and XOR blackmail as pseudocausal problems (you need acausal specifically for *transparent newcomb*, not vanilla newcomb). But it's specifically a sort of influence that's like "reality will adjust itself so contradictions don't happen, and there may be correlations between what happened in the past, or other branches, and what your action is now, so you can exploit this by acting to make bad outcomes inconsistent". It's purely action-based, in a way that manages to capture some but not all weird decision-theoretic scenarios.

In normal bayesianism, you do *not* have a pseudocausal-causal equivalence. Every ordinary environment is straight-up causal.

Stuart_Armstrong's Shortform

Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution)

Given some , we can consider the (nonempty) set of probability distributions equal to where is defined. This set is convex (clearly, a mixture of two probability distributions which agree with about the probability of an event will also agree with about the probability of an event).

Convex (compact) sets of probability distributions = crisp infradistributions.

Introduction To The Infra-Bayesianism Sequence

You're completely right that hypotheses with unconstrained Murphy get ignored because you're doomed no matter what you do, so you might as well optimize for just the other hypotheses where what you do matters. Your "-1,000,000 vs -999,999 is the same sort of problem as 0 vs 1" reasoning is good.

Again, you are making the serious mistake of trying to think about Murphy verbally, rather than thinking of Murphy as the personification of the "inf" part of the definition of expected value, and writing actual equations. is the available set of possibilities for a hypothesis. If you really want to, you can think of this as constraints on Murphy, and Murphy picking from available options, but it's highly encouraged to just work with the math.

For mixing hypotheses (several different sets of possibilities) according to a prior distribution , you can write it as an expectation functional via (mix the expectation functionals of the component hypotheses according to your prior on hypotheses), or as a set via (the available possibilities for the mix of hypotheses are all of the form "pick a possibility from each hypothesis, mix them together according to your prior on hypotheses")

This is what I meant by "a constraint on Murphy is picked according to this probability distribution/prior, then Murphy chooses from the available options of the hypothesis they picked", that set (your mixture of hypotheses according to a prior) corresponds to selecting one of the sets according to your prior , and then Murphy picking freely from the set .

Using (and considering our choice of what to do affecting the choice of , we're trying to pick the best function ) we can see that if the prior is composed of a bunch of "do this sequence of actions or bad things happen" hypotheses, the details of what you do sensitively depend on the probability distribution over hypotheses. Just like with AIXI, really.

Informal proof: if and (assuming ), then we can see that

and so, the best sequence of actions to do would be the one associated with the "you're doomed if you don't do blahblah action sequence" hypothesis with the highest prior. Much like AIXI does.

Using the same sort of thing, we can also see that if there's a maximally adversarial hypothesis in there somewhere that's just like "you get 0 reward, screw you" no matter what you do (let's say this is psi_0), then we have

And so, that hypothesis drops out of the process of calculating the expected value, for all possible functions/actions. Just do a scale-and-shift, and you might as well be dealing with the prior , which a-priori assumes you aren't in the "screw you, you lose" environment.

Hm, what about if you've just got two hypotheses, one where you're like "my knightian uncertainty scales with the amount of energy in the universe so if there's lots of energy available, things could e really bad, while if there's little energy available, Murphy can't make things bad" () and one where reality behaves pretty much as you'd expect it to(? And your two possible options would be "burn energy freely so Murphy can't use it" (the choice , attaining a worst-case expected utility of in and in ), and "just try to make things good and don't worry about the environment being adversarial" (the choice , attaining 0 utility in , 1 utility in ).

The expected utility of (burn energy) would be

And the expected utility of (act normally) would be

So "act normally" wins if , which can be rearranged as . Ie, you'll act normally if the probability of "things are normal" times the loss from burning energy when things are normal exceeds the probability of "Murphy's malice scales with amount of available energy" times the gain from burning energy in that universe.

So, assuming you assign a high enough probability to "things are normal" in your prior, you'll just act normally. Or, making the simplifying assumption that "burn energy" has similar expected utilities in both cases (ie, ), then it would come down to questions like "is the utility of burning energy closer to the worst-case where Murphy has free reign, or the best-case where I can freely optimize?"

And this is assuming there's just two options, the actual strategy selected would probably be something like "act normally, if it looks like things are going to shit, start burning energy so it can't be used to optimize against me"

Note that, in particular, the hypothesis where the level of attainable badness scales with available energy is very different from the "screw you, you lose" hypothesis, since there are actions you can take that do better and worse in the "level of attainable badness scales with energy in the universe" hypothesis, while the "screw you, you lose" hypothesis just makes you lose. And both of these are very different from a "you lose if you don't take this exact sequence of actions" hypothesis. *Murphy is not a physical being, it's a personification of an equation, thinking verbally about an actual Murphy doesn't help because you start confusing very different hypotheses, think purely about what the actual set of probability distributions ** corresponding to hypothesis ** looks like*. I can't stress this enough.

Also, remember, the goal is to maximize worst-case * expected* value, not worst-case value.

Introduction To The Infra-Bayesianism Sequence

There's actually an upcoming post going into more detail on what the deal is with pseudocausal and acausal belief functions, among several other things, I can send you a draft if you want. "Belief Functions and Decision Theory" is a post that hasn't held up nearly as well to time as "Basic Inframeasure Theory".

Introduction To The Infra-Bayesianism Sequence

If you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea.

For the concave expectation functionals: Well, there's another constraint or two, like monotonicity, but yeah, LF duality basically says that you can turn any (monotone) concave expectation functional into an inframeasure. Ie, all risk aversion can be interpreted as having radical uncertainty over some aspects of how the environment works and assuming you get worst-case outcomes from the parts you can't predict.

For your concrete example, that's why you have multiple hypotheses that are learnable. Sure, one of your hypotheses might have complete knightian uncertainty over the odd bits, but another hypothesis might not. Betting on the odd bits is advised by a more-informative hypothesis, for sufficiently good bets. And the policy selected by the agent would probably be something like "bet on the odd bits occasionally, and if I keep losing those bets, stop betting", as this wins in the hypothesis where some of the odd bits are predictable, and doesn't lose too much in the hypothesis where the odd bits are completely unpredictable and out to make you lose.

Introduction To The Infra-Bayesianism Sequence

Maximin, actually. You're maximizing your worst-case result.

It's probably worth mentioning that "Murphy" isn't an actual foe where it makes sense to talk about destroying resources lest Murphy use them, it's just a personification of the fact that we have a set of options, any of which could be picked, and we want to get the highest lower bound on utility we can for that set of options, so we assume we're playing against an adversary with perfectly opposite utility function for intuition. For that last paragraph, translating it back out from the "Murphy" talk, it's "wouldn't it be good to use resources in order to guard against worst-case outcomes within the available set of possibilities?" and this is just ordinary risk aversion.

For that equation , B can be *any* old set of probabilistic environments you want. You're not spending any resources or effort, a hypothesis just **is** a set of constraints/possibilities for what reality will do, a guess of the form "Murphy's operating under these constraints/must pick an option from this set."

You're completely right that for constraints like "environment must be a valid chess board", that's too loose of a constraint to produce interesting behavior, because Murphy is always capable of screwing you there.

This isn't too big of an issue in practice, because it's possible to mix together several infradistributions with a prior, which is like "a constraint on Murphy is picked according to this probability distribution/prior, then Murphy chooses from the available options of the hypothesis they picked". And as it turns out, you'll end up completely ignoring hypotheses where Murphy can screw you over no matter what you do. You'll choose your policy to do well in the hypotheses/scenarios where Murphy is more tightly constrained, and write the "you automatically lose" hypotheses off because it doesn't matter *what* you pick, you'll lose in those.

But there *is* a big unstudied problem of "what sorts of hypotheses are nicely behaved enough that you can converge to optimal behavior in them", that's on our agenda.

An example that might be an intuition pump, is that there's a very big difference between the hypothesis that is "Murphy can pick a coin of unknown bias at the start, and I have to win by predicting the coinflips accurately" and the hypothesis "Murphy can bias each coinflip individually, and I have to win by predicting the coinflips accurately". The important difference between those seems to be that past performance is indicative of future behavior in the first hypothesis and not in the second. For the first hypothesis, betting according to Laplace's law of succession would do well in the long run no matter *what* weighted coin Murphy picks, because you'll catch on pretty fast. For the second hypothesis, no strategy you can do can possibly help in that situation, because past performance isn't indicative of future behavior.

Belief Functions And Decision Theory

So, first off, I should probably say that a lot of the formalism overhead involved in *this post in particular* feels like the sort of thing that will get a whole lot more elegant as we work more things out, but "Basic inframeasure theory" still looks pretty good at this point and worth reading, and the basic results (ability to translate from pseudocausal to causal, dynamic consistency, capturing most of UDT, definition of learning) will still hold up.

Yes, your current understanding is correct, it's rebuilding probability theory in more generality to be suitable for RL in nonrealizable environments, and capturing a much broader range of decision-theoretic problems, as well as whatever spin-off applications may come from having the basic theory worked out, like our infradistribution logic stuff.

It copes with unrealizability because its hypotheses are not probability distributions, but sets of probability distributions (actually more general than that, but it's a good mental starting point), corresponding to properties that reality may have, without fully specifying everything. In particular, if an agent learns a class of belief functions (read: properties the environment may fulfill) is learned, this implies that for all properties within that class that the true environment fulfills (you don't know the true environment exactly), the infrabayes agent will match or exceed the expected utility lower bound that can be guaranteed if you know reality has that property (in the low-time-discount limit)

There's another key consideration which Vanessa was telling me to put in which I'll post in another comment once I fully work it out again.

Also, thank you for noticing that it took a lot of work to write all this up, the proofs took a while. n_n

Less Basic Inframeasure Theory

So, we've also got an analogue of KL-divergence for crisp infradistributions.

We'll be using and for crisp infradistributions, and and for probability distributions associated with them. will be used for the KL-divergence of infradistributions, and will be used for the KL-divergence of probability distributions. For crisp infradistributions, the KL-divergence is defined as

I'm not entirely sure why it's like this, but it has the basic properties you would expect of the KL-divergence, like concavity in both arguments and interacting well with continuous pushforwards and semidirect product.

Straight off the bat, we have:

**Proposition 1:**

Proof: KL-divergence between probability distributions is always nonnegative, by Gibb's inequality.

**Proposition 2:**

And now, because KL-divergence between probability distributions is 0 only when they're equal, we have:

**Proposition 3:** *If ** is the uniform distribution on **, then *

And the cross-entropy of any distribution with the uniform distribution is always , so:

**Proposition 4:** *is a concave function over* .

Proof: Let's use as our number in in order to talk about mixtures. Then,

Then we apply concavity of the KL-divergence for probability distributions to get:

**Proposition 5: **

At this point we can abbreviate the KL-divergence, and observe that we have a multiplication by 1, to get:

And then pack up the expectation

Then, with the choice of and fixed, we can move the choice of the all the way inside, to get:

Now, there's something else we can notice. When choosing , it doesn't matter what is selected, you want to take every and maximize the quantity inside the expectation, that consideration selects your . So, then we can get:

And pack up the KL-divergence to get:

And distribute the min to get:

And then, we can pull out that fixed quantity and get:

And pack up the KL-divergence to get:

**Proposition 6:**

To do this, we'll go through the proof of proposition 5 to the first place where we have an inequality. The last step before inequality was:

Now, for a direct product, it's like semidirect product but all the and are the same infradistribution, so we have:

Now, this is a constant, so we can pull it out of the expectation to get:

**Proposition 7:**

For this, we'll need to use the Disintegration Theorem (the classical version for probability distributions), and adapt some results from Proposition 5. Let's show as much as we can before showing this.

Now, hypothetically, if we had

then we could use that result to get

and we'd be done. So, our task is to show

for any pair of probability distributions and . Now, here's what we'll do. The and gives us probability distributions over , and the and are probability distributions over . So, let's take the joint distribution over given by selecting a point from according to the relevant distribution and applying . By the classical version of the disintegration theorem, we can write it either way as starting with the marginal distribution over and a semidirect product to , or by starting with the marginal distribution over and you take a semidirect product with some markov kernel to to get the joint distribution. So, we have:

for some Markov kernels . Why? Well, the joint distribution over is given by or respectively (you have a starting distribution, and lets you take an input in and get an output in ). But, breaking it down the other way, we start with the marginal distribution of those joint distributions on (the pushforward w.r.t. ), and can write the joint distribution as semidirect product going the other way. Basically, it's just two different ways of writing the same distributions, so that's why KL-divergence doesn't vary at all.

Now, it is also a fact that, for semidirect products (sorry, we're gonna let be arbitrary here and unconnected to the fixed ones we were looking at earlier, this is just a general property of semidirect products), we have:

To see this, run through the proof of Proposition 5, because probability distributions are special cases of infradistributions. Running up to right up before the inequality, we had

But when we're dealing with probability distributions, there's only one possible choice of probability distribution to select, so we just have

Applying this, we have:

The first equality is our expansion of semidirect product for probability distributions, second equality is the probability distributions being equal, and third equality is, again, expansion of semidirect product for probability distributions. Contracting the two sides of this, we have:

Now, the KL-divergence between a distribution and itself is 0, so the expectation on the left-hand side is 0, and we have

And bam, we have which is what we needed to carry the proof through.

I don't know, we're hunting for it, relaxations of dynamic consistency would be extremely interesting if found, and I'll let you know if we turn up with anything nifty.