You're right, you could have an event in the event space which is just "the virtual-evidence update [such-and-such]". I'm actually going to pull out this trick in a future follow-up post.
I note that that's not how Pearl or Jeffrey understand these updates. And it's a peculiar thing to do -- something happens to make you update a particular amount, but you're just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you're updating on.
That's not quite what I had in mind, but I can see how my 'continuously valued' comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b - my house was burgled
a - my alarm went off
z - my neighbor calls to tell me the alarm went off
Pearl's method is to take what would be uncertain information about a (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I'm saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. (this assumes z depends on b only through a). This will give you the exact same posterior as Pearl. Really, the only difference between these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas a traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
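To make the equivalence concrete, here is a minimal sketch in Python. The numbers are made up for illustration (they are not taken from the paper), the function names are my own, and it assumes z depends on b only through a. It computes the posterior on b once by treating z as an ordinary event, and once using only the ratio P(z|a):P(z|~a) as virtual evidence on a.

```python
import math

def posterior_event_update(prior_b, p_a_b, p_a_nb, p_z_a, p_z_na):
    """P(b|z), treating z as an ordinary event.

    Assumes z depends on b only through a.
    """
    p_z_b = p_z_a * p_a_b + p_z_na * (1 - p_a_b)     # P(z|b)
    p_z_nb = p_z_a * p_a_nb + p_z_na * (1 - p_a_nb)  # P(z|~b)
    joint_b = p_z_b * prior_b
    return joint_b / (joint_b + p_z_nb * (1 - prior_b))

def posterior_virtual_evidence(prior_b, p_a_b, p_a_nb, ratio):
    """P(b | virtual evidence on a), using only ratio = P(z|a) / P(z|~a)."""
    weight_b = prior_b * (ratio * p_a_b + (1 - p_a_b))
    weight_nb = (1 - prior_b) * (ratio * p_a_nb + (1 - p_a_nb))
    return weight_b / (weight_b + weight_nb)

# Illustrative numbers only (assumptions, not from the paper).
prior_b, p_a_b, p_a_nb = 1e-4, 0.95, 0.01

# Two different sets of "actual values" that share the same 4:1 ratio.
p_event = posterior_event_update(prior_b, p_a_b, p_a_nb, p_z_a=0.8, p_z_na=0.2)
p_scaled = posterior_event_update(prior_b, p_a_b, p_a_nb, p_z_a=0.4, p_z_na=0.1)

# The ratio-only (virtual evidence) version.
p_virtual = posterior_virtual_evidence(prior_b, p_a_b, p_a_nb, ratio=4.0)

print(math.isclose(p_event, p_virtual, rel_tol=1e-9))
print(math.isclose(p_event, p_scaled, rel_tol=1e-9))
```

Both checks should agree up to floating-point error, since the common scale factor on P(z|a) and P(z|~a) cancels in the normalization.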
The slightly more complex case (and why I mentioned continuous values) is in section 5 where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because it's not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she's 99% sure the alarm went off. This doesn't mean I have to treat P(z99|b):P(z99|~b) as 99:1 - I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
In what sense? What technical claim about Bayesian updates are you trying to refer to?
Definitely the second one, as optimal update policy. Responding to your specific objections:
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.
As you'll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
This optimality property only makes sense if we believe something like grain-of-truth.
I believe I previously conceded this point - the true hypothesis (or at least a 'good enough' one) must have a nonzero probability, which we can't guarantee.
But properties such as calibration and convergence also have intuitive appeal
Re: calibration - I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence - how real a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
(By the way, I really appreciate your in-depth engagement with my position.)
Likewise! This has certainly been educational, especially in light of this:
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don't yet know a good way to present it all as a nice, practical, intuitively appealing package.
The solution is too large to fit in the margins, eh? j/k, I know there's a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Phew! Thanks for de-gaslighting me.
I definitely missed a few things on the first read-through - thanks for repeating the ratio argument in your response.
I'm still confused about this statement:
Virtual evidence requires probability functions to take arguments which aren't part of the event space.
Why can't virtual evidence messages be part of the event space? Is it because they are continuously valued?
As to why one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.
A second answer you hint at here:
The second seems more practical for the working Bayesian.
As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, you haven't provided a good alternative.
Then we have to ask why not (steelmanned) classical Bayesianism? I think you have two arguments, one of which I buy, the other I don't.
The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff.
This is the weak argument. Computing P(A*|X) "the likelihood I recall seeing A given X" is not a fundamentally different thing than modeling P(A|X) "the likelihood signal A happened given X". You have to model an extra channel effect or two, but that's just a difference of degree.
Immediately after, though, you have the better argument:
As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.
If I were to paraphrase: classical Bayesianism can fail entirely when the world state does not fit into one of its nonzero-probability hypotheses, which must of necessity be limited in any realizable implementation.
I find this pretty convincing. In my experience this is a problem that crops up quite frequently, and requires the meta-Bayesian methods you mentioned, like calibration (to notice you are confused) and generation of novel hypotheses.
(Although Bayesianism is not completely dead here. If you reformulate your estimation problem to be over the hypothesis space and model space jointly, then Bayesian updates can get you the sort of probability shifts discussed in Pascal's Muggle. Of course, you still run into the 'limited compute' problem, but in many cases it might be easier than attempting to cover the entire hypothesis space. Probably worth a whole other post by itself.)
Why is a dogmatic Bayesian not allowed to update on virtual evidence? It seems like you (and Jeffrey?) have overly constrained the types of observations that a classical Bayesian is allowed to use, to essentially sensory stimuli. It seems like you are attacking a strawman, given that by your definition, Pearl isn't a classical Bayesian.
I also want to push back on this particular bit:
Richard Jeffrey (RJ): Tell me one piece of information you're absolutely certain of in such a situation.
DP: I'm certain I had that experience, of looking at the cloth.
RJ: Surely you aren't 100% sure you were looking at cloth. It's merely very probable.
DP: Fine then. The experience of looking at ... what I was looking at.
I'm pretty sure we can do better. How about:
DP: Fine then. I'm certain I remember believing that I had seen that cloth.
For an artificial dogmatic probabilist, the equivalent might be:
ADP: Fine then. I'm certain of evidence A* : that my probability inference algorithm received a message with information about an observation A.
Essentially, we update on A* instead of A. When we compute the likelihood P(A*|X), we can attempt to account for all the problems with our senses, neurons, memory, etc. that result in P(A*|~A) > 0.
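As a rough illustration, here is a small Python sketch of updating on A* instead of A. The channel probabilities and function names are invented for the example, and it assumes A* depends on X only through A, so that P(A*|X) = P(A*|A)P(A|X) + P(A*|~A)P(~A|X).

```python
def posterior_x(prior_x, p_obs_x, p_obs_nx):
    """Bayes' rule for a binary hypothesis X, given the two likelihoods."""
    joint = prior_x * p_obs_x
    return joint / (joint + (1 - prior_x) * p_obs_nx)

def channel_likelihood(p_a_given_h, p_astar_a, p_astar_na):
    """P(A*|h) = P(A*|A)P(A|h) + P(A*|~A)P(~A|h).

    Assumes A* depends on the hypothesis h only through A.
    """
    return p_astar_a * p_a_given_h + p_astar_na * (1 - p_a_given_h)

# Illustrative numbers only (assumptions).
prior_x = 0.3
p_a_x, p_a_nx = 0.9, 0.2            # P(A|X), P(A|~X)
p_astar_a, p_astar_na = 0.98, 0.05  # P(A*|A), P(A*|~A): noisy senses/memory

# Naive update that treats A itself as observed with certainty.
hard = posterior_x(prior_x, p_a_x, p_a_nx)

# Update on A*, the thing the updater actually received.
soft = posterior_x(
    prior_x,
    channel_likelihood(p_a_x, p_astar_a, p_astar_na),
    channel_likelihood(p_a_nx, p_astar_a, p_astar_na),
)

# With a perfect channel, updating on A* reduces to updating on A.
perfect = posterior_x(
    prior_x,
    channel_likelihood(p_a_x, 1.0, 0.0),
    channel_likelihood(p_a_nx, 1.0, 0.0),
)

print(hard, soft, perfect)
```

With these numbers the update on A* moves the posterior in the same direction as the update on A, just less far, which is the sense in which the channel noise gets absorbed into the likelihood.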
RJ still has a counterpoint here:
RJ: Again I doubt it. You're engaging in inner-outer hocus pocus.* There is no clean dividing line before which a signal is external, and after which that signal has been "observed". The optic nerve is a noisy channel, warping the signal. And the output of the optic nerve itself gets processed at V1, so the rest of your visual processing doesn't get direct access to it, but rather a processed version of the information. And all this processing is noisy. Nowhere is anything certain. Everything is a guess. If, anywhere in the brain, there were a sharp 100% observation, then the nerves carrying that signal to other parts of the brain would rapidly turn it into a 99% observation, or a 90% observation...
But I don't find this compelling. At some point there is a boundary to the machinery that's performing the Bayesian update itself. If the message is being degraded after this point, then that means we're no longer talking about a Bayesian updater.