Thinking About Filtered Evidence Is (Very!) Hard

[-]Vanessa Kosoy6y90

From my perspective, the trouble here comes from the honesty condition. This condition hides an unbounded quantifier: "if the speaker will ever say something, then it is true". So it's no surprise we run into computational complexity and even computability issues.

Consider the following setting. The agent Alice repeatedly interacts with two other entities: Bob and Carol. When Alice interacts with Bob, Bob asks Alice a yes/no question, Alice answers it and receives either +1 or -1 reward depending on whether the answer is correct. When Alice interacts with Carol, Carol tells Alice some question and the answer to that question.

Suppose that Alice starts with some low-information prior and learns over time about both Bob and Carol. The honesty condition becomes "if Carol will ever say $(X, Y)$ and Bob asks the question $X$, then the correct answer is $Y$". But this condition might be computationally intractable, so it is not in the prior and cannot be learned. However, weaker versions of this condition might be tractable, for example "if Carol says $(X, Y)$ at a time step between $0$ and $t + 1000$, and Bob asks $X$ at time $t$, then the correct answer is $Y$". Since simulating Bob is still intractable, this condition cannot be expressed as a vanilla Bayesian hypothesis. However, it can be expressed as an incomplete hypothesis. We can also have an incomplete hypothesis that is the conjunction of this weak honesty condition with a full simulation of Carol. Once Alice has learned this incomplete hypothesis, ey answer correctly at least those questions which Carol has already taught em or will teach em within 1000 time steps.
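
To make the weak honesty condition concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `carol` is a toy, cheaply simulable Carol who states a parity fact at every even time step, and `alice_answer` is one way Alice might act on the conjunction "weak honesty + full simulation of Carol". It says nothing about Bob or about questions Carol does not cover within the horizon, which is what makes the hypothesis incomplete.

```python
from typing import Optional, Tuple

Statement = Tuple[str, bool]  # (question X, answer Y)


def carol(t: int) -> Optional[Statement]:
    """Hypothetical, fully simulable Carol (an assumption for illustration).

    At even time steps she states whether the integer t // 2 is even;
    at odd time steps she says nothing.
    """
    if t % 2 == 0:
        n = t // 2
        return (f"is {n} even?", n % 2 == 0)
    return None


def alice_answer(question: str, t: int, horizon: int = 1000) -> Optional[bool]:
    """Answer a question asked at time t using only weak honesty.

    Alice trusts whatever Carol says at any time step in [0, t + horizon]
    and simulates Carol over that window. If Carol does not address the
    question in the window, the hypothesis is silent and None is returned.
    """
    for s in range(t + horizon + 1):
        said = carol(s)
        if said is not None and said[0] == question:
            return said[1]
    return None
```

For example, `alice_answer("is 3 even?", t=0)` finds Carol's statement at time step 6 and returns `False`, while `alice_answer("is 5000 even?", t=0)` returns `None` because this toy Carol only gets to that question at time step 10000, outside the window.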

[-]abramdemski6y30

I like your example, because "Carol's answers are correct" seems like something very simple, and also impossible for a (bounded) Bayesian to represent. It's a variation of calculator or notepad problems -- that is, the problem of trying to represent a reasoner who has (and needs) computational/informational resources which are outside of their mind. (Calculator/notepad problems aren't something I've written about anywhere iirc, just something that's sometimes on my mind when thinking about logical uncertainty.)

I do want to note that weakening honesty seems like a pretty radical departure from the standard Bayesian treatment of filtered evidence, in any case (for better or worse!). Distinguishing between observing X and X itself, it is normally assumed that observing X implies X. So while our thinking on this does seem to differ, we are agreeing that there are significant points against the standard view.

From outside, the solution you propose looks like "doing the best you can to represent the honesty hypothesis in a computationally tractable way" -- but from inside, the agent doesn't think of it that way. It simply can't conceive of perfect honesty. This kind of thing feels both philosophically unsatisfying and potentially concerning for alignment. It would be more satisfying if the agent could explicitly suspect perfect honesty, but also use tractable approximations to reason about it. (Of course, one cannot always get everything one wants.)

We could modify the scenario to also include questions about Carol's honesty -- perhaps when the pseudo-Bayesian gets a question wrong, it is asked to place a conditional bet about what Carol would say if Carol eventually gets around to speaking on that question. Or other variations along similar lines.

[-]Vanessa Kosoy6y70

Here's another perspective. Suppose that now Bob and Carol have symmetrical roles: each one asks a question, allows Alice to answer, and then reveals the right answer. Alice gets a reward when ey answer correctly. We can now see that perfect honesty actually is tractable. It corresponds to an incomplete hypothesis. If Alice learns this hypothesis, ey answer correctly any question ey already heard before (no matter who asks now and who asked before). We can also consider a different incomplete hypothesis that allows real-time simulation of Carol. If Alice learns this hypothesis, ey answer correctly any question asked by Carol. However, the conjunction of both hypotheses is already intractable. There's no impediment for Alice to learn both hypotheses: ey can both memorize previous answers and answer all questions by Carol. But, this doesn't automatically imply learning the conjunction.
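
As a rough illustration of the first incomplete hypothesis in the symmetric setting (pure memorization), here is a small Python sketch. The class `MemorizingAlice`, the function `play`, the default guess of `False`, and the "+1 per correct answer" scoring are all assumptions made for the example. Note that the asker's identity is never used, which matches "no matter who asks now and who asked before".

```python
from typing import Dict, List, Optional, Tuple

Round = Tuple[str, str, bool]  # (asker, question, correct answer)


class MemorizingAlice:
    """Hypothetical Alice who only remembers answers revealed so far."""

    def __init__(self) -> None:
        self.known: Dict[str, bool] = {}

    def answer(self, question: str) -> Optional[bool]:
        # None means the memorization hypothesis is silent on this question.
        return self.known.get(question)

    def observe(self, question: str, correct: bool) -> None:
        self.known[question] = correct


def play(alice: MemorizingAlice, rounds: List[Round]) -> int:
    """Run the question/answer/reveal loop and count correct answers."""
    reward = 0
    for _asker, question, correct in rounds:
        guess = alice.answer(question)
        if guess is None:
            guess = False  # arbitrary fallback guess, an assumption
        if guess == correct:
            reward += 1
        alice.observe(question, correct)  # the asker reveals the right answer
    return reward


rounds = [
    ("Bob", "q1", True),
    ("Carol", "q1", True),  # heard before from Bob: answered correctly
    ("Carol", "q2", True),
    ("Bob", "q2", True),    # heard before from Carol: answered correctly
]
total = play(MemorizingAlice(), rounds)
```

The sketch covers only the memorization hypothesis. It does not attempt the conjunction with a real-time simulation of Carol, which is the part described above as intractable.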

[-]abramdemski6y40

It's absurd (in a good way) how much you are getting out of incomplete hypotheses. :)

[-]Martín Soto3y80

Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations.

I was confused about why this is the case. I now think I've got an answer (please anyone confirm):
The description length of the Turing Machine enumerating theorems of PA is constant. The description length of any Turing Machine that enumerates theorems of PA up until time-step n and then does something else grows with n (for big enough n). Since any probability prior over Turing Machines has an implicit simplicity bias, no matter what prior we have, for big enough n the latter Turing Machines will (jointly) get arbitrarily low probability relative to the first one. Thus, after enough time-steps, given all observations are PA theorems, our listener will assign arbitrarily higher probability to the first one than all the rest, and thus the first one will be over 50%.

Edit: Okay, I now saw you mention the "getting over 50%" problem further down:

I don't know if the argument works out exactly as I sketched; it's possible that the rich hypothesis assumption needs to be "and also positive weight on a particular enumeration". Given that, we can argue: take one such enumeration; as we continue getting observations consistent with that observation, the hypothesis which predicts it loses no weight, and hypotheses which (eventually) predict other things must (eventually) lose weight; so, the updated probability eventually believes that particular enumeration will continue with probability > 1/2.

But I think the argument goes through already with the rich hypothesis assumption as initially stated. If the listener has non-zero prior probability on the speaker enumerating theorems of PA, it must have non-zero probability on it doing so in a particular enumeration. (unless our specification of the listener structure doesn't even consider different enumerations? but I was just thinking of their hypothesis space as different Turing Machines the whole time) And then my argument above goes through, which I think is just your argument + explicitly mentioning the additional required detail about the simplicity prior.
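
A toy calculation may help make the "above 50% after finitely many observations" step concrete. The prior below is purely an assumption chosen so the sums are easy: the true enumerator gets weight `w_true`, and the hypothesis that first deviates at step k gets weight (1 - w_true) / (k(k+1)). All hypotheses are treated as deterministic, so a deviator is eliminated outright once its deviation step has passed.

```python
from fractions import Fraction


def posterior_true(n: int, w_true: Fraction = Fraction(1, 10)) -> Fraction:
    """Posterior on the true enumerator after n consistent observations.

    Toy prior (an assumption): the mass 1 - w_true is spread over
    "deviators", where the one that first deviates at step k has weight
    (1 - w_true) / (k * (k + 1)) for k = 1, 2, ...
    """
    # Deviators with k <= n are refuted. The survivors (k > n) have total
    # weight (1 - w_true) * sum_{k > n} 1 / (k (k + 1)) = (1 - w_true) / (n + 1).
    surviving = (1 - w_true) / (n + 1)
    return w_true / (w_true + surviving)


def first_step_above_half(w_true: Fraction = Fraction(1, 10)) -> int:
    """Smallest n at which the posterior on the true enumerator exceeds 1/2."""
    n = 0
    while posterior_true(n, w_true) <= Fraction(1, 2):
        n += 1
    return n
```

With these made-up numbers the posterior starts at 1/10, sits at exactly 1/2 after 8 observations, and reaches 10/19 after 9. The particular decay rate is not important here; what the sketch relies on is that the surviving deviators' total weight shrinks toward zero while the true enumerator's weight stays fixed.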

[-]abramdemski3y30

Sounds right to me.

[-]Davidmanheim6y50

This seems related to my speculations about multi-agent alignment. In short, for embedded agents, having a tractable complexity of building models of other decision processes either requires a reflexively consistent view of their reactions to modeling my reactions to their reactions, etc. - or it requires simplification that clearly precludes ideal Bayesian agents. I made the argument much less formally, and haven't followed the math in the post above (I hope to have time to go through more slowly at some point.)

To lay it out here, the basic argument in the paper is that even assuming complete algorithmic transparency, in any reasonably rich action space, even games as simple as poker become completely intractable to solve. Each agent needs to simulate a huge space of possibilities for the decision of all other agents in order to make a decision about what the probability is that the agent is in each potential position. For instance, what is the probability that they are holding a hand much better than mine and betting this way, versus that they are bluffing, versus that they have a roughly comparable strength hand and are attempting to find my reaction, etc. But evaluating this requires evaluating the probability that they assign to me reacting in a given way in each condition, etc. The regress may not be infinite, because the space of states is finite, as is the computation time, but even in such a simple world it grows too quickly to allow fully Bayesian agents within the computational capacity of, say, the physical universe.

[-]Rohin Shah6y40

Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations. At that point, since the listener expects each theorem of PA to eventually be listed, with probability > 50%, and the listener believes the speaker, the listener must assign > 50% probability to each theorem of PA!

I don't see how this follows. At the point where the confidence in PA rises above 50%, why can't the agent be mistaken about what the theorems of PA are? For example, let T be a theorem of PA that hasn't been claimed yet. Why can't the agent believe P(claims-T) = 0.01 and P(claims-not-T) = 0.99? It doesn't seem like this violates any of your assumptions. I suspect you wanted to use Assumption 2 here:

A listener believes a speaker to be honest if the listener distinguishes between "X" and "the speaker claims X at time t" (aka "claims_t-X"), and also has beliefs such that P(X | claims_t-X) = 1 when P(claims-X) > 0.

But as far as I can tell the scenario I gave is compatible with that assumption.

[-]Vanessa Kosoy6y80

I think there is some confusion here coming from the unclear notion of a Bayesian agent with beliefs about theorems of PA. The reformulation I gave with Alice, Bob and Carol makes the problem clearer, I think.

[-]Rohin Shah6y40

Yeah, I did find that reformulation clearer, but it also then seems to not be about filtered evidence?

Like, it seems like you need two conditions to get the impossibility result, now using English instead of math:

1. Alice believes Carol is always honest (at least with probability > 50%)

2. For any statement s: [if Carol will ever say s, Alice currently believes that Carol will eventually say s (at least with probability > 50%)]

It really seems like the difficulty here is with condition 2, not with condition 1, so I don't see how this theorem has anything to do with filtered evidence.

Maybe the point is just "you can't perfectly update on X and Carol-said-X, because you can't have a perfect model of them, because you aren't bigger than they are"?

(Probably you agree with this, given your comment.)

[-]Vanessa Kosoy6y50

The problem is not in one of the conditions separately but in their conjunction: see my follow-up comment. You could argue that learning an exact model of Carol doesn't really imply condition 2 since, although the model does imply everything Carol is ever going to say, Alice is not capable of extracting this information from the model. But then it becomes a philosophical question of what does it mean to "believe" something. I think there is value in the "behaviorist" interpretation that "believing X" means "behaving optimally given X". In this sense, Alice can separately believe the two facts described by conditions 1 and 2, but cannot believe their conjunction.

[-]Rohin Shah6y20

I still don't get it but probably not worth digging further. My current confusion is that even under the behaviorist interpretation, it seems like just believing condition 2 implies knowing all the things Carol would ever say (or Alice has a mistaken belief). Probably this is a confusion that would go away with enough formalization / math, but it doesn't seem worth doing that.

[-]abramdemski6y60

I'm not sure exactly what the source of your confusion is, but:

I don't see how this follows. At the point where the confidence in PA rises above 50%, why can't the agent be mistaken about what the theorems of PA are?

The confidence in PA as a hypothesis about what the speaker is saying is what rises above 50%. Specifically, an efficiently computable hypothesis eventually enumerating all and only the theorems of PA rises above 50%.

For example, let T be a theorem of PA that hasn't been claimed yet. Why can't the agent believe P(claims-T) = 0.01 and P(claims-not-T) = 0.99? It doesn't seem like this violates any of your assumptions.

This violates the assumption of honesty that you quote, because the agent simultaneously has P(H) > 0.5 for a hypothesis H such that P(obs_n-T | H) = 1, for some (possibly very large) n, and yet also believes P(T) < 0.5. This is impossible since it must be that P(obs_n-T) > 0.5, due to P(H) > 0.5, and therefore must be that P(T) > 0.5, by honesty.
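
Spelled out as a chain of inequalities (a sketch using only the quantities named above, with $H$ the hypothesis and $\text{obs}_n\text{-}T$ the event that $T$ is observed at step $n$):

$$P(\text{obs}_n\text{-}T) \;\ge\; P(\text{obs}_n\text{-}T \mid H)\,P(H) \;=\; 1 \cdot P(H) \;>\; 0.5$$

$$P(T) \;\ge\; P(T \mid \text{obs}_n\text{-}T)\,P(\text{obs}_n\text{-}T) \;=\; 1 \cdot P(\text{obs}_n\text{-}T) \;>\; 0.5$$

The first line uses $P(\text{obs}_n\text{-}T \mid H) = 1$; the second uses honesty, $P(T \mid \text{obs}_n\text{-}T) = 1$, which applies because $P(\text{obs}_n\text{-}T) > 0$.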

[-]Rohin Shah6y30

Yeah, I feel like while honesty is needed to prove the impossibility result, the problem arose with the assumption that the agent could effectively reason now about all the outputs of a recursively enumerable process (regardless of honesty). Like, the way I would phrase this point is "you can't perfectly update on X and Carol-said-X, because you can't have a perfect model of Carol"; this applies whether or not Carol is honest. (See also this comment.)

[-]abramdemski6y80

I agree with your first sentence, but I worry you may still be missing my point here, namely that the Bayesian notion of belief doesn't allow us to make the distinction you are pointing to. If a hypothesis implies something, it implies it "now"; there is no "the conditional probability is 1 but that isn't accessible to me yet".

I also think this result has nothing to do with "you can't have a perfect model of Carol". Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.

[-]Rohin Shah6y20

the Bayesian notion of belief doesn't allow us to make the distinction you are pointing to

Sure, that seems reasonable. I guess I saw this as the point of a lot of MIRI's past work, and was expecting this to be about honesty / filtered evidence somehow.

I also think this result has nothing to do with "you can't have a perfect model of Carol". Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.

I think we mean different things by "perfect model". What if I instead say "you can't perfectly update on X and Carol-said-X, because you can't know why Carol said X, because that could in the worst case require you to know everything that Carol will say in the future"?

[-]abramdemski6y40

Sure, that seems reasonable. I guess I saw this as the point of a lot of MIRI’s past work, and was expecting this to be about honesty / filtered evidence somehow.

Yeah, ok. This post as written is really less the kind of thing somebody who has followed all the MIRI thinking needs to hear and more the kind of thing one might bug an orthodox Bayesian with. I framed it in terms of filtered evidence because I came up with it by thinking about some confusion I was having about filtered evidence. And it does problematize the Bayesian treatment. But in terms of actual research progress it would be better framed as a negative result about whether Sam's untrollable prior can be modified to have richer learning.

I think we mean different things by “perfect model”. What if [...]

Yep, I agree with everything you say here.

[-]Gordon Seidoh Worley6y10

Assumption 3. A listener is said to have minimally consistent beliefs if each proposition X has a negation X*, and P(X)+P(X*)≤1.

One thing that's interesting to me is that this assumption is frequently not satisfied in real life due to underspecification, e.g. P(I'm happy) + P(I'm not happy) > 1 because "happy" may be underspecified. I can't think of a really strong minimal example, but I feel like this pops up in a lot of discussions on complex issues where a dialectic develops because neither thesis nor antithesis captures everything and so both are underspecified in ways that make their naive union exceed the available probability mass.

[-]orthonormal6y10

If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".

Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in 78% of such cases X eventually names it.
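
For reference, the 78% figure is what you get by dropping the two 5% cases (PA inconsistent; X not naming all and only theorems) and renormalizing over the remaining two, assuming the four cases are meant to be mutually exclusive:

$$\frac{0.70}{0.70 + 0.20} \;=\; \frac{7}{9} \;\approx\; 0.78$$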

But you can't use the listener's current probabilities on [X saying it] to sort out theorems from non-theorems in a way that breaks computability!

What am I missing?

[-]Stuart_Armstrong6y10

Is there any meaningful distinction between filtered evidence and lying? I know that in toy models these can be quite different, but in the expansive setting here, where the speaker can select the most misleading technically true fact, is there any major difference?

And how would the results here look if we extended it to allow the speaker to lie?

[-]abramdemski6y30

Here's one way to extend a result like this to lying. Rather than assume honesty, we could assume observations carry sufficiently much information about the truth. This is like saying that sensory perception may be fooled, but in the long run, bears a strong enough connection to reality for us to infer a great deal. Something like this should imply the same computational difficulties.

I'm not sure exactly how this assumption should be spelled out, though.

[-]habryka6y10

Promoted to curated: This post is a bit more technical than the usual posts we curate, but I think it is still quite valuable to read for a lot of people, since it's about a topic that has already received some less formal treatment on LessWrong.

I am also very broadly excited about trying to move beyond a naive Bayesianism paradigm, and felt like this post helped me significantly in understanding what that would look like.
