samshap


Phew! Thanks for de-gaslighting me.


I definitely missed a few things on the first read-through; thanks for repeating the ratio argument in your response.

I'm still confused about this statement:

Virtual evidence requires probability functions to take arguments which aren't part of the event space.

Why can't virtual evidence messages be part of the event space? Is it because they are continuously valued?

As to *why* one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.

A second answer you hint at here:

The second seems more practical for the working Bayesian.

As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, there is no good alternative in what you have provided.

Then we have to ask *why not* (steelmanned) classical Bayesianism? I think you have two arguments, one of which I buy and one of which I don't.

The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff.

This is the weak argument. Computing P(A*|X), "the likelihood I recall seeing A given X", is not fundamentally different from modeling P(A|X), "the likelihood signal A happened given X". You have to model an extra channel effect or two, but that's just a difference of degree.

Immediately after, though, you have the better argument:

As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.

If I were to paraphrase: classical Bayesianism can fail entirely when the world state does not fit into one of its nonzero-probability hypotheses, which must of necessity be limited in any realizable implementation.

I find this pretty convincing. In my experience this is a problem that crops up quite frequently, and it requires the meta-Bayesian methods you mentioned, like calibration (to notice you are confused) and generation of novel hypotheses.

(Although Bayesianism is not completely dead here. If you reformulate your estimation problem to be over the hypothesis space and model space jointly, then Bayesian updates can get you the sort of probability shifts discussed in Pascal's Muggle. Of course, you still run into the 'limited compute' problem, but in many cases it might be easier than attempting to cover the entire hypothesis space. Probably worth a whole other post by itself.)


Why is a dogmatic Bayesian not allowed to update on virtual evidence? It seems like you (and Jeffrey?) have overly constrained the types of observations that a classical Bayesian is allowed to use, to essentially sensory stimuli. It seems like you are attacking a strawman, given that by your definition, Pearl isn't a classical Bayesian.

I also want to push back on this particular bit:

Richard Jeffrey (RJ): Tell me one piece of information you're absolutely certain of in such a situation.

DP: I'm certain I had that experience, of looking at the cloth.

RJ: Surely you aren't 100% sure you were looking at cloth. It's merely very probable.

DP: Fine then. The experience of looking at ... what I was looking at.

I'm pretty sure we can do better. How about:

**DP:** *Fine then. I'm certain I remember believing that I had seen that cloth.*

For an artificial dogmatic probabilist, the equivalent might be:

**ADP:** *Fine then. I'm certain of evidence A\*: that my probability inference algorithm received a message with information about an observation A.*

Essentially, we update on A* instead of A. When we compute the likelihood P(A*|X), we can attempt to account for all the problems with our senses, neurons, memory, etc. that result in P(A*|~A) > 0.
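Spelled out, and assuming (my simplification, not something established above) that the recollection A* depends on X only through whether A actually happened, the update would look roughly like this, writing ¬A for ~A:

$$
\begin{aligned}
P(A^* \mid X) &= P(A^* \mid A)\,P(A \mid X) + P(A^* \mid \lnot A)\,P(\lnot A \mid X) \\
P(X \mid A^*) &= \frac{P(A^* \mid X)\,P(X)}{P(A^* \mid X)\,P(X) + P(A^* \mid \lnot X)\,P(\lnot X)}
\end{aligned}
$$

The P(A*|~A) term is where the unreliability of senses, neurons, and memory enters. Setting it to 0 and P(A*|A) to 1 recovers an ordinary update on A.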

RJ still has a counterpoint here:

RJ: Again I doubt it. You're engaging in inner-outer hocus pocus.* There is no clean dividing line before which a signal is external, and after which that signal has been "observed". The optic nerve is a noisy channel, warping the signal. And the output of the optic nerve itself gets processed at V1, so the rest of your visual processing doesn't get direct access to it, but rather a processed version of the information. And all this processing is noisy. Nowhere is anything certain. Everything is a guess. If, anywhere in the brain, there were a sharp 100% observation, then the nerves carrying that signal to other parts of the brain would rapidly turn it into a 99% observation, or a 90% observation...

But I don't find this compelling. At some point there is a boundary to the machinery that's performing the Bayesian update itself. If the message is being degraded *after* this point, then that means we're no longer talking about a Bayesian updater.

That's not quite what I had in mind, but I can see how my 'continuously valued' comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:

- b: my house was burgled
- a: my alarm went off
- z: my neighbor calls to tell me the alarm went off

Pearl's method is to take what would be uncertain information about *a* (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I'm saying is that you can just treat *z* as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas a traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.

The slightly more complex case (and why I mentioned continuous values) is in section 5 where the message includes probability data, such as a likelihood ratio. Note that the continuous value *is not the amount you update* (at least not generally), because it's not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she's 99% sure the alarm went off. This doesn't mean I have to treat P(z99|b):P(z99|~b) as 99:1; I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.

Definitely the second one, as optimal update policy. Responding to your specific objections:

As you'll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
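To make the burglary example concrete, here is a minimal Python sketch comparing the two routes. Every number in it (the prior, the alarm probabilities, the 4:1 ratio) is made up for illustration and not taken from the paper, and the function names are hypothetical. It also assumes z depends on b only through a.

```python
def posterior_b_direct(prior_b, p_a_given_b, p_a_given_not_b,
                       p_z_given_a, p_z_given_not_a):
    """Treat z as an ordinary event and condition on it with certainty."""
    # P(z|b) = P(z|a)P(a|b) + P(z|~a)P(~a|b), and likewise for ~b.
    p_z_b = p_z_given_a * p_a_given_b + p_z_given_not_a * (1 - p_a_given_b)
    p_z_not_b = (p_z_given_a * p_a_given_not_b
                 + p_z_given_not_a * (1 - p_a_given_not_b))
    joint_b = prior_b * p_z_b
    return joint_b / (joint_b + (1 - prior_b) * p_z_not_b)


def posterior_b_virtual(prior_b, p_a_given_b, p_a_given_not_b, ratio):
    """Virtual-evidence style: only the ratio P(z|a):P(z|~a) is supplied."""
    # Weight the a-branch by `ratio` and the ~a-branch by 1, then normalize.
    w_b = prior_b * (ratio * p_a_given_b + (1 - p_a_given_b))
    w_not_b = (1 - prior_b) * (ratio * p_a_given_not_b
                               + (1 - p_a_given_not_b))
    return w_b / (w_b + w_not_b)


# Illustrative (assumed) numbers only.
PRIOR_B = 1e-4
P_A_GIVEN_B = 0.95
P_A_GIVEN_NOT_B = 0.01

direct_1 = posterior_b_direct(PRIOR_B, P_A_GIVEN_B, P_A_GIVEN_NOT_B, 0.8, 0.2)
direct_2 = posterior_b_direct(PRIOR_B, P_A_GIVEN_B, P_A_GIVEN_NOT_B, 0.4, 0.1)
virtual = posterior_b_virtual(PRIOR_B, P_A_GIVEN_B, P_A_GIVEN_NOT_B, 4.0)

print(direct_1, direct_2, virtual)
```

Both pairs of "actual values" (0.8, 0.2) and (0.4, 0.1) share the 4:1 ratio, so all three printed posteriors should agree up to floating-point error.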

I believe I previously conceded this point - the true hypothesis (or at least a 'good enough' one) must have a nonzero probability, which we can't guarantee.

Re: calibration - I still believe that this can be included if you are jointly estimating your model and your hypothesis.

Re: convergence - how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
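One rough way to probe that question is sketched below. I am assuming a setup that may not match your example: two hypothetical coin hypotheses, H1 with P(heads) = 0.6 and H2 with P(heads) = 0.4, and i.i.d. flips with some true probability. The log-odds between H1 and H2 then perform a random walk, and the sketch computes its expected step (drift) and a back-of-envelope number of observations before the drift outweighs the noise.

```python
import math


def log_odds_drift(true_p, q1, q2):
    """Expected change in log(P(H1)/P(H2)) per flip, for i.i.d. flips."""
    heads_step = math.log(q1 / q2)
    tails_step = math.log((1 - q1) / (1 - q2))
    return true_p * heads_step + (1 - true_p) * tails_step


def log_odds_step_std(true_p, q1, q2):
    """Standard deviation of the per-flip change in log-odds."""
    heads_step = math.log(q1 / q2)
    tails_step = math.log((1 - q1) / (1 - q2))
    mean = true_p * heads_step + (1 - true_p) * tails_step
    var = (true_p * (heads_step - mean) ** 2
           + (1 - true_p) * (tails_step - mean) ** 2)
    return math.sqrt(var)


def flips_until_drift_dominates(true_p, q1, q2, tol=1e-12):
    """Rough n where accumulated drift n*|mu| matches noise sqrt(n)*sigma."""
    mu = log_odds_drift(true_p, q1, q2)
    if abs(mu) < tol:  # treat a numerically zero drift as "never"
        return math.inf
    sigma = log_odds_step_std(true_p, q1, q2)
    return (sigma / mu) ** 2


# Hypothetical hypotheses, chosen to be equally wrong when true_p = 0.5.
Q1, Q2 = 0.6, 0.4

for p in (0.5, 0.500001):
    print(p, log_odds_drift(p, Q1, Q2), flips_until_drift_dominates(p, Q1, Q2))
```

Under these assumptions the drift is zero at exactly 0.5, so the posterior keeps wandering, while at 0.500001 there is a tiny positive drift toward H1. Convergence would then happen in principle, but only on a scale of very many flips, which is part of why I'm asking how much the exact-tie case matters in practice.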

Likewise! This has certainly been educational, especially in light of this:

The solution is too large to fit in the margins, eh? j/k, I know there's a real paper. Should I go break my brain trying to read it, or wait for your explanation?