Abram Demski

Abram Demski's Comments

Two Alternatives to Logical Counterfactuals

I too have recently updated (somewhat) away from counterfactual non-realism. I have a lot of stuff I need to work out and write about it.

I seem to have a lot of disagreements with your post.

Given this uncertainty, you may consider material conditionals: if I take action X, will consequence Q necessarily follow? An action may be selected on the basis of these conditionals, such as by determining which action results in the highest guaranteed expected utility if that action is taken.

I don't think material conditionals are the best way to cash out counterfactual non-realism.

  • The basic reason I think it's bad is the happy dance problem. This makes it seem clear that the sentence to condition on should not be .
    • If the action can be viewed as a function of observations, conditioning on makes sense. But this is sort of like already having counterfactuals, or at least, being realist that there are counterfactuals about whan would do if the agent observed different things. So this response can be seen as abandoning counterfactual non-realism.
    • A different approach is to consider conditional beliefs rather than material implications. I think this is more true to counterfactual non-realism. In the simplest form, this means you just condition on actions (rather than trying to condition on something like or ). However, in order to reason updatelessly, you need something like conditioning on conditionals, which complicates matters.
  • Another reason to think it's bad is Troll Bridge.
    • Again if the agent thinks there are basic counterfactual facts, (required to respect but little else -- ie entirely determined by subjective beliefs), then the agent can escape Troll Bridge by disagreeing with the relevant inference. But this, of course, rejects the kind of counterfactual non-realism you intend.
    • To be more in line with counterfactual non-realism, we would like to use conditional probabilities instead. However, conditional probability behaves too much like material implication to block the Troll Bridge argument. However, I believe that there is an account of conditional probability which avoids this by rejecting the ratio analysis of conditional probability -- ie Bayes' definition -- and instead regards conditional probability as a basic entity. (Along the lines of what Alan Hájek goes on and on about.) Thus an EDT-like procedure can be immune to both 5-and-10 and Troll Bridge. (I claim.)

As for policy-dependent source code, I find myself quite unsympathetic to this view.

  • If the agent is updateful, this is just saying that in counterfactuals where the agent does something else, it might have different source code. Which seems fine, but does it really solve anything? Why is this much better than counterfactuals which keep the source code fixed but imagine the execution trace being different? This seems to only push the rough spots further back -- there can still be contradictions, e.g. between the source code and the process by which programmers wrote the source code. Do you imagine it is possible to entirely remove such rough spots from the counterfactuals?
  • So it seems you intend the agent to be updateless instead. But then we have all the usual issues with logical updatelessness. If the agent is logically updateless, there is absolutely no reason to think that its beliefs about the connections between source code and actual policy behavior is any good. Making those connections requires actual reasoning, not simply a good enough prior -- which means being logically updateful. So it's unclear what to do.
    • Perhaps logically-updateful policy-dependent-source-code is the most reasonable version of the idea. But then we are faced with the usual questions about spurious counterfactuals, chicken rule, exploration, and Troll Bridge. So we still have to make choices about those things.
Thinking About Filtered Evidence Is (Very!) Hard

I agree with your first sentence, but I worry you may still be missing my point here, namely that the Bayesian notion of belief doesn't allow us to make the distinction you are pointing to. If a hypothesis implies something, it implies it "now"; there is no "the conditional probability is 1 but that isn't accessible to me yet".

I also think this result has nothing to do with "you can't have a perfect model of Carol". Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.

Thinking About Filtered Evidence Is (Very!) Hard

I'm not sure exactly what the source of your confusion is, but:

I don't see how this follows. At the point where the confidence in PA rises above 50%, why can't the agent be mistaken about what the theorems of PA are?

The confidence in PA as a hypothesis about what the speaker is saying is what rises above 50%. Specifically, an efficiently computable hypothesis eventually enumerating all and only the theorems of PA rises above 50%.

For example, let T be a theorem of PA that hasn't been claimed yet. Why can't the agent believe P(claims-T) = 0.01 and P(claims-not-T) = 0.99? It doesn't seem like this violates any of your assumptions.

This violates the assumption of honesty that you quote, because the agent simultaneously has P(H) > 0.5 for a hypothesis H such that P(obs_n-T | H) = 1, for some (possibly very large) n, and yet also believes P(T) < 0.5. This is impossible since it must be that P(obs_n-T) > 0.5, due to P(H) > 0.5, and therefore must be that P(T) > 0.5, by honesty.

Thinking About Filtered Evidence Is (Very!) Hard

Here's one way to extend a result like this to lying. Rather than assume honesty, we could assume observations carry sufficiently much information about the truth. This is like saying that sensory perception may be fooled, but in the long run, bears a strong enough connection to reality for us to infer a great deal. Something like this should imply the same computational difficulties.

I'm not sure exactly how this assumption should be spelled out, though.

Thinking About Filtered Evidence Is (Very!) Hard

It's absurd (in a good way) how much you are getting out of incomplete hypotheses. :)

Thinking About Filtered Evidence Is (Very!) Hard

I like your example, because "Carol's answers are correct" seems like something very simple, and also impossible for a (bounded) Bayesian to represent. It's a variation of calculator or notepad problems -- that is, the problem of trying to represent a reasoner who has (and needs) computational/informational resources which are outside of their mind. (Calculator/notepad problems aren't something I've written about anywhere iirc, just something that's sometimes on my mind when thinking about logical uncertainty.)

I do want to note that weakening honesty seems like a pretty radical departure from the standard Bayesian treatment of filtered evidence, in any case (for better or worse!). Distinguishing between observing X and X itself, it is normally assumed that observing X implies X. So while our thinking on this does seem to differ, we are agreeing that there are significant points against the standard view.

From outside, the solution you propose looks like "doing the best you can to represent the honesty hypothesis in a computationally tractable way" -- but from inside, the agent doesn't think of it that way. It simply can't conceive of perfect honesty. This kind of thing feels both philosophically unsatisfying and potentially concerning for alignment. It would be more satisfying if the agent could explicitly suspect perfect honesty, but also use tractable approximations to reason about it. (Of course, one cannot always get everything one wants.)

We could modify the scenario to also include questions about Carol's honesty -- perhaps when the pseudo-Bayesian gets a question wrong, it is asked to place a conditional bet about what Carol would say if Carol eventually gets around to speaking on that question. Or other variations along similar lines.

Zoom In: An Introduction to Circuits

The "Zoom In" work is aimed at understanding what's going on in neural networks as a scientific question, not directly tackling mesa-optimization. This work is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.

As the original article put it:

And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Or, as I put it in Embedded Curiosities:

One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern.

A better understanding of 'circuits' in the sense of Zoom In could yield unexpected fruits in terms of safety. But to name an expected direction: understanding the algorithms expressed by 95% of a neural network, one could re-implement those independently. This would yield a totally transparent algorithm. Obviously a further question to ask is, how much of a performance hit do we take by discarding the 5% we don't understand? (If it's too large, this is also a significant point against the idea that the 'circuits' methodology is really providing much understanding of the deep NN from a scientific point of view.)

I'm not claiming that doing that would eliminate all safety concerns with the resulting reimplementation, of course. Only that it would address the specific concern you mention.

Bayesian Evolving-to-Extinction
Or just bad implementations do this - predict-o-matic as described sounds like a bad idea, and like it doesn't contain hypotheses, so much as "players"*. (And the reason there'd be a "side channel" is to understand theories - the point of which is transparency, which, if accomplished, would likely prevent manipulation.)

You can think of the side-channel as a "bad implementation" issue, but do you really want to say that we have to forego diagnostic logs in order to have a good implementation of "hypotheses" instead of "players"? Going to the extreme, every brain has side-channels such as EEG.

But more importantly, as Daniel K pointed out, you don't need the side-channel. If the predictions are being used in a complicated way to make decisions, the hypotheses/players have an incentive to fight each other through the consequences of those decisions.

So, the interesting question is, what's necessary for a *good* implementation of this?

This seems a strange thing to imagine - how can fighting occur, especially on a training set?

If the training set doesn't provide any opportunity for manipulation/corruption, then I agree that my argument isn't relevant for the training set. It's most directly relevant for online learning. However, keep in mind also that deep learning might be pushing in the direction of learning to learn. Something like a Memory Network is trained to "keep learning" in a significant sense. So you then have to ask if its learned learning strategy has these same issues, because that will be used on-line.

(I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it's not clear where the "tickets" are.)

Simplifying the picture greatly, imagine that the second-back layer of neurons is one-neuron-per-ticket. Gradient descent can choose which of these to pay the most attention to, but little else; according to the lottery ticket hypothesis, the gradients passing through the 'tickets' themselves aren't doing that much for learning, besides reinforcing good tickets and weakening bad.

So imagine that there is one ticket which is actually malign, and has a sophisticated manipulative strategy. Sometimes it passes on bad input in service of its manipulations, but overall it is the best of the lottery tickets so while the gradient descent punishes it on those rounds, it is more than made up for in other cases. Furthermore, the manipulations of the malign ticket ensure that competing tickets are kept down, by manipulating situations to be those which the other tickets don't predict very well.

*I don't have a link to the claim, but it's been said before that 'the math behind Bayes' theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain limited.'

This remark makes me think you're thinking something about logical-induction style traders which only trade on a part of the data vs bayesian-style hypotheses which have to make predictions everywhere. I'm not sure how that relates to my post -- there are things to say about it, but, I don't think I said any of them. In particular the lottery-ticket hypothesis isn't about this; a "lottery ticket" is a small part of the deep NN, but, is effectively a hypothesis about the whole data.

Bayesian Evolving-to-Extinction

Ah right! I meant to address this. I think the results are more muddy (and thus don't serve as clear illustrations so well), but, you do get the same thing even without a side-channel.

Bayesian Evolving-to-Extinction

Yeah, in probability theory you don't have to worry about how everything is implemented. But for implementations of Bayesian modeling with a rich hypothesis class, each hypothesis could be something like a blob of code which actually does a variety of things.

As for "want", sorry for using that without unpacking it. What it specifically means is that hypotheses like that will have a tendency to get more probability weight in the system, so if we look at the weighty (and thus influential) hypotheses, they are more likely to implement strategies which achieve those ends.

Load More