Congrats to the winner TailCalled with their post Some thoughts on "The Nature of Counterfactuals". See the winner announcement post.

I've previously argued that the concept of counterfactuals can only be understood from within the counterfactual perspective.

I will be awarding a $1000 prize for the best post that engages with the idea that counterfactuals may be circular in this sense. The winning entry may be one of the following (these categories aren't intended to be exclusive):

a) A post that attempts to draw out the consequences of this principle for decision theory

b) A post that attempts to evaluate the arguments for and against adopting the principle that counterfactuals only make sense from within the counterfactual perspective

c) A review of relevant literature in philosophy or decision theory

d) A post that restates already existing ideas in a clearer or more accessible manner (I don't think this topic has been explored much on LW, but it may have been explored in the literature on decision theory or philosophy)

Feel free to ask me for clarification about what would be on or off-topic. Probably the main thing I'd like to see is substantial engagement with this principle. The bounty is for posts that engage with the notion that counterfactuals might only make sense from within a counterfactual perspective. I have written on this topic, but the competition isn't limited to posts that engage with my views on this topic. It's perfectly fine to engage with other arguments for this proposition if, for example, you find someone arguing in favour of this in the philosophical/mathematical literature or Less Wrong.

If someone submits a high-quality post that only touches on this issue tangentially, but someone else submits an only okayish post that tries to deeply engage with this issue, then I would likely award it to the latter as I'm trying to incentivise more engagement with this issue rather than just high-quality posts generally. If the bounty is awarded to an unexpected submission, I expect this to be the main cause.

I will be awarding an additional $100 for the best short-form post on this topic. This may be a LW Shortform post, a public Facebook post, a Twitter thread, etc. (I'm not going to include Discord/Slack messages as they aren't accessible).

Why do I believe in this principle?

Roughly, my reasons are as follows:

  1. Rejecting David Lewis' Counterfactual Realism as absurd and therefore concluding that counterfactuals must be at least partially a human construction: either a) in the sense of them being an inevitable and essential part of how we make sense of the world by our very nature or b) in the sense of being a semi-arbitrary and contingent system that we've adopted in order to navigate the world
  2. Insofar as counterfactuals are inherently a part of how we interpret the world, the only way that we can understand them is to "look out through them", notice what we see, and attempt to characterise this as precisely as possible
  3. Insofar as counterfactuals are a somewhat arbitrary and contingent system constructed in order to navigate the world, the way that the system is justified is by imagining adopting various mental frameworks and noticing that a particular framework seems like it would be useful over a wide variety of circumstances. However, we've just invoked counterfactuals twice: a) by imagining adopting different mental frameworks b) by imagining different circumstances over which to evaluate these frameworks[1].
  4. In either case, we seem to be unable to characterise counterfactuals without depending on already having the concept of counterfactuals. Or at least, I find this argument persuasive.

Why do I believe this is important?

I've argued for the importance of agent meta-foundations before. Roughly, there seems to be a lot of confusion about what counterfactuals are and how to construct them. I believe that much of this confusion would be cleared up if we can sort out some of these foundational issues. And the claim that counterfactuals can only be understood from an interior perspective is one such issue.

Why am I posting this bounty?

I believe in this idea, but:

  1. I haven't been able to dedicate nearly as much time to exploring this as I would like in between all of my other commitments
  2. Working on this approach just by myself is kind of lonely and extremely challenging (for example, it's hard to get good quality feedback)
  3. I suspect that more people would be persuaded that this was a fruitful approach if this principle was presented to them in a different light.

How do I submit my entry?

Make a post on LW or the Alignment forum, then add a link in the comments below. I guess I'm also open to private submissions. Ideally, you should mention that you're submitting your post for the bounty just to make sure that I'm aware of it.

When do I need to submit by?

I'm currently planning to set the submission window to 3 months from the date of this post (that would be the 1st of April, but let's make it April 2nd so people don't think this competition is some kind of prank). Submissions after this date may be refused.

How will this be judged?

I've written on this topic myself, so this probably biases me in some ways, but $1000 is a small enough amount of money that it's probably not worthwhile looking for external judges.

Some Background Info

I guess I started to believe that counterfactuals were circular when I started to ask questions like, "What actually are these things we call counterfactuals?". I noticed that they didn't seem to exist in a literal sense, but that we also seem to be unable to do without them.

Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.

MIRI has proposed Functional Decision Theory (FDT) as an alternative, but this theory is dependent on logical counterfactuals and they haven't figured out exactly how to construct these. While I don't exactly agree with the logical counterfactual framing, I agree that these kinds of exotic decision theory problems require us to create a new notion of counterfactuals. And this naturally leads to questions about what counterfactuals really are which I see as further leading to the conclusion that they are circular.

I can see why many people are sufficiently skeptical of the notion of counterfactuals being circular that they dismiss it out of hand. It's entirely possible that I could be mistaken about this thesis, but for these people, I'd suggest reading Eliezer's post Where Recursive Justification Hits Bottom which argues for a circular epistemology since if you are persuaded by this post, counterfactuals being circular may then be less of a jump.

Fine Print

I'll award the prize assuming that there's at least one semi-decent submission (according to the standards of posts on Less Wrong). If this isn't the case, then I'll donate the money to an AI Safety organization instead. I'd be open to having this money be held in escrow.

I'm intending to award the prize to the top entry, but there's a chance that I split it if I can't make a decision.

  1. ^

    Counterpoint: requiring counterfactuals to justify their own use isn't the same as counterfactuals only making sense from within themselves. Response: It's possible to engage in the appropriate symbol manipulation without a concept of counterfactuals, but we can't have a semantic understanding of what we're doing. We can't even describe this process without being able to say things like "if given string of symbols s, do y". Similarly, counterfactuals aren't just justified by imagining the consequences of applying different mental frameworks over different circumstances; in this case, they are a system for performing well over a variety of circumstances.

I previously wrote a post about reconciling free will with determinism. The metaphysics implicit in Pearlian causality is free will (In Drescher's words: "Pearl's formalism models free will rather than mechanical choice."). The challenge is reconciling this metaphysics with the belief that one is physically embodied. That is what the post attempts to do; these perspectives aren't inherently irreconcilable, we just have to be really careful about e.g. distinguishing "my action" vs "the action of the computer embodying me" in the Bayes net and distinguishing the interventions on them.

I wrote another post about two alternatives to logical counterfactuals: one says counterfactuals don't exist, one says that your choice of policy should affect your anticipation of your own source code. (I notice you already commented on this post, just noting it for completeness)

And a third post, similar to the first, reconciling free will with determinism using linear logic.

I'm interested in what you think of these posts and what feels unclear/unresolved, I might write a new explanation of the theoretical perspective or improve/extend/modify it in response.

You've linked me to three different posts, so I'll address them in separate comments. 

Two Alternatives to Logical Counterfactuals

I actually really liked this post - enough that I changed my original upvote to a strong upvote. I also disagree with the notion that logical counterfactuals make sense when taken literally, so I really appreciated you making this point persuasively. I agreed with your criticisms of the material conditional approach and I think policy-dependent source code could be potentially promising. I guess this naturally leads to the question of how to justify this approach. This results in questions like, "What exactly is a counterfactual?" and "Why exactly do we want such a notion?" and I believe that following this path leads to the discovery that counterfactuals are circular.

I'm more open to saying that I adopt Counterfactual Non-Realism than I was when I originally commented although I don't see theories based on material conditionals as the only approach within this category. I guess I'm also more enthusiastic about thinking in terms of policies rather than action mainly because of the lesson I drew from the Counterfactual Prisoner's Dilemma. I don't really know why I didn't make this connection at the time, since I had written that post a few months prior, but I appear to have missed this.

I still feel that introducing the term "free will" is too loaded to be helpful here, regardless of whether you are or aren't using it in a non-standard fashion. Like I'd encourage you to structure your posts to try to separate:

a) This is how we handle counterfactuals
b) These are the implications of this for the free will debate

A large part of this is because I suspect many people on Less Wrong are simply allergic to this term.

Thoughts on Modeling Naturalized Logic Decision Theory Problems in Linear Logic

I hadn't heard of linear logic before - it seems like a cool formalisation - although I tend to believe that formalisations are overrated as, unless they are used very carefully, they can obscure more than they reveal.

I believe that spurious counterfactuals are only an issue with the 5 and 10 problem because of an attempt to hack logical-if to substitute for counterfactual-if in such a way that we can reuse proof-based systems. It's extremely cool that we can do as much as we can working in that fashion, but there's no reason why we should be surprised that it runs into limits.

So I don't see inventing alternative formalisations that avoid the 5 and 10 problem as particularly hard as the bug is really quite specific to systems that try to utilise this kind of hack. I'd expect that almost any other system in design space will avoid this. So if, as I claim, attempts at formalisation will avoid this issue by default, the fact that any one formalisation avoids this problem shouldn't give us too much confidence in it being a good system for representing counterfactuals in general.

Instead, I think it's much more persuasive to ground any proposed system with philosophical arguments (such as your first post was focusing on), rather than mostly just posting a system and observing it has a few nice properties. I mean, your approach in this article is certainly a valuable thing to do, but I don't see it as getting all the way to the heart of the issue.

One way is by asserting that the logic is about the territory, while the proof system is about the map; so, counterfactuals are represented in the map, even though the map itself asserts that there is only a singular territory.

Interestingly enough, this mirrors my position in Why 1-boxing doesn't imply backwards causation where I distinguish between Raw Reality (the territory) and Augmented Reality (the territory augmented by counterfactuals). I guess I put more emphasis on delving into the philosophical reasons for such a view and I think that's what this post is a bit short on.

 

Thanks for reading all the posts!

I'm not sure where you got the idea that this was to solve the spurious counterfactuals problem, that was in the appendix because I anticipated that a MIRI-adjacent person would want to know how it solves that problem.

The core problem it's solving is that it's a well-defined mathematical framework in which (a) there are, in some sense, choices, and (b) it is believed that these choices correspond to the results of a particular Turing machine. It goes back to the free will vs determinism paradox, and shows that there's a formalism that has some properties of "free will" and some properties of "determinism".

A way that EDT fails to solve 5 and 10 is that it could believe with 100% certainty that it takes $5 so its expected value for $10 is undefined. (I wrote previously about a modification of EDT to avoid this problem.)
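To make the "undefined" part concrete, here is a minimal sketch of the conditional expectation involved. The function name, the action labels and the toy belief tables are all assumptions made up for illustration; they are not taken from the linked post. The point is only that E[utility | action] divides by P(action), so it has no value when the agent assigns that action probability zero.

```python
from fractions import Fraction

def conditional_expected_utility(joint, action):
    """Return E[utility | action], or None when P(action) = 0.

    `joint` maps (action, utility) pairs to probabilities.
    """
    p_action = sum(p for (a, _), p in joint.items() if a == action)
    if p_action == 0:
        # Conditioning on a zero-probability event would be 0/0.
        return None
    weighted = sum(p * u for (a, u), p in joint.items() if a == action)
    return weighted / p_action

# Hypothetical beliefs: the agent is 100% certain that it takes the $5.
certain_of_five = {("take_5", 5): Fraction(1), ("take_10", 10): Fraction(0)}

# Same agent with a little uncertainty about its own action.
slightly_unsure = {("take_5", 5): Fraction(99, 100), ("take_10", 10): Fraction(1, 100)}

value_of_ten_when_certain = conditional_expected_utility(certain_of_five, "take_10")
value_of_ten_when_unsure = conditional_expected_utility(slightly_unsure, "take_10")
```

With any nonzero probability on taking the $10, the conditional is well defined again, which is roughly why fixes in this area tend to involve keeping some uncertainty about one's own action.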

CDT solves it by constructing physically impossible counterfactuals which has other problems, e.g. suppose there's a Laplace's demon that searches for violations of physics and destroys the universe if physics is violated; this theoretically shouldn't make a difference but it messes up the CDT counterfactuals.

It does look like your post overall agrees with the view I presented. I would tend to call augmented reality "metaphysics" in that it is a piece of ontology that goes beyond physics. I wrote about metaphysical free will a while ago and didn't post it on LW because I anticipated people would be allergic to the non-physicalist philosophical language.

I'm not sure where you got the idea that this was to solve the spurious counterfactuals problem, that was in the appendix because I anticipated that a MIRI-adjacent person would want to know how it solves that problem.


Thanks for that clarification.

A way that EDT fails to solve 5 and 10 is that it could believe with 100% certainty that it takes $5 so its expected value for $10 is undefined

I suppose that demonstrates that the 5 and 10 problem is a broader problem than I realised. I still think that it's only a hard problem within particular systems that have a vulnerability to it.

It does look like your post overall agrees with the view I presented. I would tend to call augmented reality "metaphysics" in that it is a piece of ontology that goes beyond physics

Yeah, we have significant agreement, but I'm more conservative in my interpretations. I guess this is a result of me being, at least in my opinion, more skeptical of language. Like I'm very conscious of arguments where someone says, "X could be described by phrase Y" and then later they rely on connotations of Y that weren't proven.


For example, you write, "From the AI’s perspective, it has a choice among multiple actions, hence in a sense “believing in metaphysical free will”". I would suggest it would be more accurate to write: "The AI models the situation as though it had free will", which leaves open the possibility that it might be just a pragmatic model, rather than the AI necessarily endorsing itself as possessing free will.

Another way of framing this: there's an additional step in between observing that an agent acts or models a situation as if it believes in free will and concluding that it actually believes in free will. For example, I might round all numbers in a calculation to integers in order to make it easier for me, but that doesn't mean that I believe that the values are integers.

Comments on A critical agential account of free will, causation, and physics

Consider the statement: "I will take action A". An agent believing this statement may falsify it by taking any action B not equal to A. Therefore, this statement does not hold as a law. It may be falsified at will.

We can imagine a situation where there is a box containing an apple or a pear. Suppose we believe that it contains a pear, but it actually contains an apple. If we look in the box (and we have good reason to believe looking doesn't change the contents), then we'll falsify our pear hypothesis. Similarly, if we're told by an oracle that if we looked we would see an apple, then there'd be no need for us to actually look; we'd have heard enough to falsify our pear hypothesis.

However, the situation you've identified isn't the same. Here you aren't just deciding whether to make an observation or not, but what the value of that observation would be. So in this case, the fact that if you took action B you'd observe the action you took was B doesn't say anything about the case where you don't take action B, unlike the box case, where knowing that you'd see an apple if you looked provides you with information even if you don't look in the box. It simply isn't relevant unless you actually take B.

Interestingly, falsificationism takes agency (in terms of observations, computation, and action) as more basic than physics. For a thing to be falsifiable, it must be able to be falsified by some agent, seeing some observation. And the word able implies freedom.

I think it's reasonable to suggest starting from falsification as our most basic assumption. I guess where you lose me is when you claim that this implies agency. I guess my position is as follows:

  • It seems like agents in a deterministic universe can falsify theories in at least some sense. Like they take two different weights, drop them and see they land at the same time, falsifying the claim that heavier objects fall faster
  • On the other hand, something like agency or counterfactuals seems necessary for talking about falsifiability in the abstract, as this involves saying that we could falsify a theory if we ran an experiment that we didn't.

In the second case, I would suggest that what we need is counterfactuals not agency. That is, we need to be able to say things like, "If I ran this experiment and obtained this result, then theory X would be falsified", not "I could have run this experiment and if I did and we obtained this result, then theory X would be falsified".

In other words, I think that there is something behind the intuition which I'm guessing led you to these views, but am in favour of developing it in a different direction than you.

I didn't read past this point, not because I thought it was uninteresting, but because it already took me a while to figure out how to articulate my objections to the article up to this point and I still have to look at one of your posts. But let me know if there's anything further down more directly related to whether counterfactuals are circular.

It seems like agents in a deterministic universe can falsify theories in at least some sense. Like they take two different weights, drop them and see they land at the same time, falsifying the claim that heavier objects fall faster

The main problem is that it isn't meaningful for their theories to make counterfactual predictions about a single situation; they can create multiple situations (across time and space) and assume symmetry and get falsification that way, but it requires extra assumptions. Basically you can't say different theories really disagree unless there's some possible world / counterfactual / whatever in which they disagree; finding a "crux" experiment between two theories (e.g. if one theory says all swans are white and another says there are black swans in a specific lake, the cruxy experiment looks in that lake) involves making choices to optimize disagreement.
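The swan example can be made concrete with a small sketch. Everything here is hypothetical (the lake names, the two toy theories and the helper `crux_experiments`); it only illustrates the idea that a crux is a situation on which the theories' predictions differ.

```python
def all_swans_white(lake):
    """Toy theory A: every swan, in every lake, is white."""
    return {"white"}

def black_swans_in_lake_x(lake):
    """Toy theory B: like A, except that lake_x also has black swans."""
    return {"white", "black"} if lake == "lake_x" else {"white"}

def crux_experiments(theory_a, theory_b, situations):
    """Return the situations on which the two theories make different predictions."""
    return [s for s in situations if theory_a(s) != theory_b(s)]

lakes = ["lake_w", "lake_x", "lake_y"]  # hypothetical candidate observations
cruxes = crux_experiments(all_swans_white, black_swans_in_lake_x, lakes)
```

If the list comes back empty for every situation we can describe, then in this toy setting the two theories don't really disagree. Note that enumerating `situations` already means considering observations that may never actually be made.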

In the second case, I would suggest that what we need is counterfactuals not agency. That is, we need to be able to say things like, "If I ran this experiment and obtained this result, then theory X would be falsified", not "I could have run this experiment and if I did and we obtained this result, then theory X would be falsified".

Those seem pretty much equivalent? Maybe by agency you mean utility function optimization, which I didn't mean to imply was required.

The part I thought was relevant was the part where you can believe yourself to have multiple options and yet be implemented by a specific computer.

Basically you can't say different theories really disagree unless there's some possible world / counterfactual / whatever in which they disagree;


Agreed, this is yet another argument for considering counterfactuals to be so fundamental that they don't make sense outside of themselves. I just don't see this as incompatible with determinism, b/c I'm grounding using counterfactuals rather than agency.

Those seem pretty much equivalent? Maybe by agency you mean utility function optimization, which I didn't mean to imply was required.

I don't mean utility function optimization, so let me clarify what I see as the distinction. I guess I see my version as compatible with the determinist claim that you couldn't have run the experiment because the path of the universe was always determined from the start. I'm referring to a purely hypothetical running with no reference to whether you could or couldn't have actually run it.

Hopefully, my comments here have made it clear where we diverge and this provides a target if you want to make a submission (that said, the contest is about the potential circular dependency of counterfactuals and not just my views. So it's perfectly valid for people to focus on other arguments for this hypothesis, rather than my specific arguments).

How much are you interested in a positive vs normative theory of counterfactuals? For example, do you feel like you understand how humans do counterfactual reasoning, and how and why it works for them (insofar as it works for them)? If not, is such an understanding what you're looking for? Or do you think humans are not perfect at counterfactual reasoning (e.g. maybe because people disagree with each other about Newcomb's problem etc.) and there's some deep notion of "correct counterfactual reasoning" that humans are merely approximating, and the deeper "correct" thing is what you really care about?

(For my part I'm somewhat skeptical that there is a notion of counterfactuals that is fundamentally different from and better than what humans do.)

Update: I should further clarify that even though I provided a rough indication of how important I consider various approaches, this is off-the-cuff and I could be persuaded an approach was more valuable than I think, particularly if I saw good quality work.

I guess my ultimate interest is normative as the whole point of investigating this area is to figure out what we should do.

However, I am interested in descriptive theories insofar as they can contribute to this investigation (and not insofar as the details aren't useful for normative theories). For example, when I say that counterfactuals only make sense from within the counterfactual perspective and further that counterfactuals are ultimately grounded as an evolutionary adaptation, I'm making descriptive statements. The latter seems to be more of a positive statement, while the former doesn't seem to be (it seems to be justified by philosophical reasoning more than empirical investigation). In any case, it feels like there is more work to be done in taking these high-level abstract statements and making them more precise.

For example, do you feel like you understand how humans do counterfactual reasoning, and how and why it works for them (insofar as it works for them)?

I think that further investigation here could be useful - although not in the sense that 40% use this style of reasoning and 60% use this style - exact percentages aren't the relevant things here - at least not at this early stage. I'd also lean towards saying that how experts operate is more important than average humans and that the behavior of especially stupid humans is probably of limited importance.

I guess I see the behaviour of normal humans mattering for two reasons:

a) Firstly because I see making use of counterfactuals as evolutionarily grounded (in a more primitive form than the highly cognitive and mathematically influenced versions that we tend to use on LW)

b) Secondly because the experts are more likely to discard intuitions that don't agree with their theories. And I think we need to use our reasoning to produce a consistent theory from our intuitions at some point, but this may be less than ideal if we're simply trying to collect various intuitions as raw data to later turn into a theory.

I should clarify: in the above discussion, I'm commenting on what I'm interested in, rather than what's in scope. The scope of the prize is the proposition that counterfactuals only make sense within themselves. And I guess part of what I was trying to clarify above is that empirical investigation can be relevant when carefully chosen. Happy to provide additional clarification if you were planning to submit a post covering something specific.

There's some deep notion of "correct counterfactual reasoning" that humans are merely approximating, and the deeper "correct" thing is what you really care about?

I guess my position on this is complex as I believe that counterfactuals only make sense in terms of themselves. So I don't think there is a "true" notion of counterfactuals that exists within the ontology; rather, I see them as a heuristic ultimately grounded by evolution. That said, our instinct to systematise and use logic to make things more coherent is also grounded in evolution.

(For my part I'm somewhat skeptical that there is a notion of counterfactuals that is fundamentally different from and better than what humans do.)

People often hold vastly different perspectives on what counts as "fundamentally different" from something else. That said, I believe we should one-box on Newcomb's problem (do you?) and I guess that seems fundamentally different from how humans who are trained on traditional decision theory/classical physics think. On the other hand, it may not be fundamentally different from how more untutored and instinctual individuals would behave. I guess I'd be curious where you stand here.
 

I think brains build a generative world-model, and that world-model is a certain kind of data structure, and "counterfactual reasoning" is a class of operations that can be performed on that data structure. (See here.) I think that counterfactual reasoning relates to reality only insofar as the world-model relates to reality. (In map-territory terminology: I think counterfactual reasoning is a set of things that you can do with the map, and those things are related to the territory only insofar as the map is related to the territory.)
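A toy sketch of the "operations on a data structure" framing might look like the following. The variables, the single prediction rule and the function names are all invented for illustration and say nothing about how brains actually implement this; the sketch only shows a counterfactual query as "copy the map, overwrite some entries, re-run the same prediction step".

```python
def predict(facts):
    """Derive consequences from the basic facts (one toy rule, purely illustrative)."""
    derived = dict(facts)
    derived["ground_wet"] = facts["raining"] or facts["sprinkler_on"]
    return derived

def counterfactual(facts, **overrides):
    """Copy the map, overwrite some entries, and re-run the same prediction rule."""
    imagined = {**facts, **overrides}
    return predict(imagined)

# Hypothetical map of the current situation.
world_model = {"raining": False, "sprinkler_on": True}

actual = predict(world_model)
imagined = counterfactual(world_model, sprinkler_on=False)
```

Nothing in `counterfactual` touches the territory: the answer it gives is only as good as the rule inside `predict`, which is the map-territory point made above.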

I also think that there are lots of specific operations that are all "counterfactual reasoning" (just as there are lots of specific operations that are all "paying attention"—paying attention to what?), and once we do a counterfactual reasoning operation, there are also a lot of things that we can do with the result of the operation. I think that, over our lifetimes, we learn metacognitive heuristics that guide these decisions (i.e. exactly what "counterfactual reasoning"-type operations to do and when, and what to do with the result of the operation), and some people's learned metacognitive heuristics are better than others (from the perspective of achieving such-and-such goal).

Analogy: If you show me a particular trained ConvNet that misclassifies a particular dog picture as a cat, I wouldn't say that this reveals some deep truth about the nature of image classification, and I wouldn't conclude that there is necessarily such a thing as a philosophically-better type of image classifier that fundamentally doesn't ever make mistakes like that. (The brain image classifier makes mistakes too, albeit different mistakes than ConvNets make, but that's beside the point.) Instead I would be more inclined to look for a very complicated explanation of the mistake, related to details of its training data and so on.

By the same token: if someone makes a poor decision on Newcomb's problem, I don't think that reveals some deep truth about the nature of counterfactual reasoning, and I wouldn't conclude that there is necessarily such a thing as a philosophically-better type of counterfactual reasoning that fundamentally doesn't ever make mistakes like that. Instead I would be more inclined to look for a very complicated explanation of the mistake, related to the person's life history, exactly how Newcomb's problem was explained to them, exactly what their learned world-model looks like, etc.

And if I wanted to build an AGI that performed well on Newcomb's problem, I would build the AGI first, and then have the AGI read Eliezer's essays or whatever, same as if I wanted my (human) friend to perform well on Newcomb's problem. :-)

I also think that there are lots of specific operations that are all "counterfactual reasoning"


Agreed. This is definitely something that I would like further clarity on

Instead I would be more inclined to look for a very complicated explanation of the mistake, related to details of its training data and so on.

I guess the real-world reasons for a mistake are sometimes not very philosophically insightful (i.e. Bob was high when reading the post, James comes from a Spanish-speaking background and they use their equivalent of a word differently than English-speakers, Sarah has a terrible memory and misremembered it).

I'm guessing your position might be that there are just mistakes and there aren't mistakes that are more philosophically fruitful or less fruitful? There's just mistakes. Is that correct? Or were you just responding to my specific claim that it might be useful to know how the average person responds to problems because we are evolved creatures? If so, then I definitely agree that we'd have to delve into the details and not just remain on the level of averages.

Update: Actually, I'll add an analogy that might be helpful. Let's suppose you didn't know what a dog was. Actually, that's kind of the case: once you start diving into any definition you end up running into fuzzy cases, such as does a robotic dog count as a dog? Then if humans had built a bunch of different classifiers and you didn't have access to the humans (say they went extinct), then you might want to analyse the different classifiers to try to figure out how humans defined the term dog, even though much of the behaviour might only tell you about the flaws of the classifiers rather than about the human concept.

Similarly, we don't have exact access to our evolutionary history, but examining human intuitions about counterfactuals might provide insights about which heuristics have worked well, whilst also recognising that it's hard, arguably impossible, to even talk about "working well" without embracing the notion of counterfactuals. And I agree that there are probably different ways we could emphasise various heuristics rather than a unique, principled solution.

I'm not claiming the situation is precisely this - in fact I'm not sure exactly how useful this analogy is - but I think it's worth sharing anyway in case it lands.

I also think that there are lots of specific operations that are all "counterfactual reasoning"

Agreed. This is definitely something that I would like further clarity on.

Hmm, my hunch is that you're misunderstanding me here. There are a lot of specific operations that are all "making a fist". I can clench my fingers quickly or slowly, strongly or weakly, left hand or right hand, etc. By the same token, if I say to you "imagine a rainbow-colored tree; are its leaves green?", there are a lot of different specific mental models that you might be invoking. (It could have horizontal rainbow stripes on the trunk, or it could have vertical rainbow stripes on its branches, etc.) All those different possibilities involve constructing a counterfactual mental model and querying it, in the same nuts-and-bolts way. I just meant, there are many possible counterfactual mental models that one can construct.

I'm guessing like your position might be that there are just mistakes and there aren't mistakes that are more philosophically fruitful or less fruitful? There's just mistakes. Is that correct?

Suppose I ask "There's a rainbow-colored tree somewhere in the world; are its leaves green?" You think for a second. What's happening under the surface when you think about this? Inside your head are various different models pushing in different directions. Maybe there's a model that says something like "rainbow-colored things tend to be rainbow-colored in all respects". So maybe you're visualizing a rainbow-colored tree, and querying the color of the leaves in that model, and this model is pushing on your visualized tree and trying to make it have a color scheme that's compatible with the kinds of things you usually see, e.g. in cartoons, which would be rainbow-colored leaves. But there's also a botany model that says "tree leaves tend to be green, because that's the most effective for photosynthesis, although there are some exceptions like Japanese maples and autumn colors". In scientifically-educated people, probably there will also be some metacognitive knowledge that principles of biology and photosynthesis are profound deep regularities in the world that are very likely to generalize, whereas color-scheme knowledge comes from cartoons etc. and is less likely to generalize.

So what's at play is not "the nature of counterfactuals", but the relative strengths of these three specific mental models (and many more besides) that are pushing in different directions. The way it shakes out will depend on the particular person and their life experience (and in particular, how much of a track-record of successful predictions these models have built up in similar contexts).

By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using. But the mental model of free will is different in different people, and the mental model of Omega is different in different people, etc.

Hmm, maybe we're talking past each other a bit because of the learning-algorithm-vs-trained-model division. Understanding the learning algorithm is like being able to read and understand the source code for a particular ML paper (and the PyTorch source code that it calls in turn). Understanding the trained model is like OpenAI Microscope.

(It's really "learning algorithm & inference algorithm"—the first changes the parameters, the second chooses what to do right now. I'm just calling it "learning algorithm" for short.)
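To make the division concrete, here is a minimal toy sketch (my own illustration, not anything from a real codebase): a one-parameter model where `learn` plays the role of the learning algorithm, `infer` plays the role of the inference algorithm, and the final value of `weight` is the "trained model". The function names, the learning rate, and the training target are all assumptions chosen purely for illustration.

```python
# Toy illustration only: a one-parameter linear model trained by gradient descent.
# `infer`, `learn`, the learning rate and the target value are hypothetical choices.

def infer(weight: float, x: float) -> float:
    """Inference algorithm: use the current parameters to decide what to output now."""
    return weight * x


def learn(weight: float, x: float, target: float, lr: float = 0.1) -> float:
    """Learning algorithm: return updated parameters after seeing one example."""
    error = infer(weight, x) - target
    gradient = 2 * error * x  # derivative of the squared error with respect to weight
    return weight - lr * gradient


weight = 0.0  # untrained model
for _ in range(50):
    weight = learn(weight, x=1.0, target=3.0)

# The "trained model" is just the resulting parameter value; inspecting it tells you
# what was learned, while reading `learn` and `infer` tells you how learning works.
prediction = infer(weight, 2.0)
```

Reading `learn` and `infer` is the analogue of reading the source code; staring at the final `weight` is the analogue of inspecting the trained model.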

I usually take the perspective that "the main event" is to understand the learning algorithm, because that's what you need to build AGI, and that's what the genome needs to build humans (thanks to within-lifetime learning), whereas understanding the trained model is "a sideshow", unnecessary for building AGI, but still worth talking about for safety and whatnot.

On the "learning algorithm" side, I put "the basic capability to do counterfactual reasoning operations". On the "trained model" side, I put all the learned heuristics about how reliable counterfactual reasoning is under what circumstance, and also all the learned concepts that go into a particular "counterfactual reasoning" operation (e.g. botany concepts, free will concepts, etc.)

Then when I brashly declare "I basically understand counterfactual reasoning", I'm just talking about the stuff on the "learning algorithm" side. Whereas it seems that you feel like your project is to understand stuff on both sides—not only what a "counterfactual reasoning" operation is at a nuts-and-bolts level, but also all the other things that go into Newcomb's problem, like whether there's a "free will" concept in the world-model and what other concepts it's connected to and how strongly (all of which can impact the results of a "counterfactual reasoning" operation). Then that research program seems to me to be more about normative decision theory and epistemology (e.g. "what to do in Newcomb's problem"), rather than about the nature of counterfactual reasoning per se. Or I guess perhaps what you're going for is closer to "practical advice that helps adult humans use counterfactual reasoning to reach correct conclusions"? In that case I'd be a bit surprised if there was much generically useful advice like that; I would expect that the main useful thing is object-level stuff like teaching better intuitions about the nature of free will etc.

I just meant, there are many possible counterfactual mental models that one can construct.

I agree that there isn't a single uniquely correct notion of a counterfactual. I'd say that we want different things from this notion and there are different ways to handle the trade-offs.

By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using.

I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well.
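As a rough illustration of the difference I have in mind, here is a toy sketch of Newcomb's problem. The payoffs, the 0.99 predictor accuracy, and the function names are assumptions made for the example, and the "FDT-style" evaluation is a deliberately simplified caricature (in this toy setup it simply lets the prediction vary with the action), not a faithful implementation of FDT.

```python
# Toy Newcomb's problem. All numbers and names are illustrative assumptions.
BIG, SMALL = 1_000_000, 1_000
ACCURACY = 0.99  # assumed probability that the predictor is correct


def payoff(action: str, box_full: bool) -> int:
    """Money received for an action given the contents of the opaque box."""
    total = BIG if box_full else 0
    if action == "two-box":
        total += SMALL
    return total


def cdt_value(action: str, p_full: float) -> float:
    """CDT-style counterfactual: the box contents are held fixed (projecting forward only)."""
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def fdt_style_value(action: str) -> float:
    """FDT-style counterfactual (simplified): the past prediction varies with the action."""
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


# Under the CDT-style evaluation, two-boxing is better by SMALL for any fixed p_full.
cdt_gap = cdt_value("two-box", 0.5) - cdt_value("one-box", 0.5)

# Under the FDT-style evaluation, one-boxing comes out ahead.
fdt_prefers_one_box = fdt_style_value("one-box") > fdt_style_value("two-box")
```

The two evaluation functions share the same payoff table and differ only in what they hold fixed when the action is varied.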

I usually take the perspective that "the main event" is to understand the learning algorithm, because that's what you need to build AGI, and that's what the genome needs to build humans

Well, we need the information encoded in our DNA rather than what is actually implemented in humans (clarification: what is implemented in humans is significantly influenced by society), though we aren't at the level where we can access that by analysing the DNA directly, or people's brain structure for that matter, so we have to reverse engineer it from behaviour.

Or I guess perhaps what you're going for is closer to "practical advice that helps adult humans use counterfactual reasoning to reach correct conclusions"?

I've very much focused on trying to understand how to solve these problems in theory rather than how can we correct any cognitive flaws in humans or on how to adapt decision theory to be easier or more convenient to use.

In so far as I'm interested in how average humans reason counterfactually, it's mostly about trying to understand the various heuristics that are the basis of counterfactuals. I guess I believe that we need counterfactuals to understand and evaluate these heuristics, but I guess I'm hoping that we can construct something reflexively consistent.

By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using.

I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well.

I think there is "machinery that underlies counterfactual reasoning" (which incidentally happens to be the same as "the machinery that underlies imagination"). My quote above was saying that every human deploys this machinery when you ask them a question about pretty much any topic.

I was initially assuming (by default) that if you're trying to understand counterfactuals, you're mainly trying to understand how this machinery works. But I'm increasingly confident that I was wrong, and that's not in fact what you're interested in. Instead it seems that your interests are more like "how would an AI, equipped with this kind of machinery, reach correct conclusions about the world?" (After all, the machinery by itself can lead to both correct and incorrect conclusions—just as "thinking / reasoning in general" can lead to correct or incorrect conclusions.)

Given what (I think) you're trying to do above, I'm somewhat skeptical that you'll make progress by thinking about the philosophical nature of counterfactuals in general. I don't think there's a clean separation between "good counterfactual reasoning" and "good reasoning in general". If I say some counterfactual nonsense like "If the Earth were a flat disk, then the north pole would be in the center," I think the reason it's nonsense lives at the object-level, i.e. the detailed content of the thought in the context of everything else we know about the world. I don't think the problem with that nonsense thought can be diagnosed at the meta-level, i.e. by examining structural properties of its construction as a counterfactual or whatever.

So by the same token, I think that "what counterfactuals make sense in the context of decision-making" is a decision theory question, not a counterfactuals question, and I expect a good answer to look like explicit discussions of decision theory as opposed to looking like a more general discussion of the philosophical nature of counterfactuals. (That said, the conclusion of that decision theory discussion could certainly look like a prescription on the content of counterfactual reasoning in a certain context, e.g. maybe the decision theory discussion concludes with "...Therefore, when making decisions, use FDT-type counterfactuals" or whatever.)

I think there is "machinery that underlies counterfactual reasoning"

I agree that counterfactual reasoning is contingent on certain brain structures, but I would say the same about logic as well and it's clear that the logic of a kindergartener is very different from that of a logic professor - although perhaps we're getting into a semantic debate - and what you mean is that the fundamental machinery is more or less the same.

I was initially assuming (by default) that if you're trying to understand counterfactuals, you're mainly trying to understand how this machinery works. But I'm increasingly confident that I was wrong, and that's not in fact what you're interested in. Instead it seems that your interests are more like "how would an AI, equipped with this kind of machinery, reach correct conclusions about the world?"

Yeah, this seems accurate. I see understanding the machinery as the first step towards the goal of learning to counterfactually reason well. As an analogy, suppose you're trying to learn how to reason well. It might make sense to figure out how humans reason, but if you want to build a better reasoning machine and not just duplicate human performance, you'd want to be able to identify some of these processes as good reasoning and some as biases.

I don't think there's a clean separation between "good counterfactual reasoning" and "good reasoning in general"

I guess I don't see why there would need to be a separation in order for the research direction I've suggested to be insightful. In fact, if there isn't a separation, this direction could even be more fruitful as it could lead to rather general results.

If I say some counterfactual nonsense like "If the Earth were a flat disk, then the north pole would be in the center," I think the reason it's nonsense lives at the object-level, i.e. the detailed content of the thought in the context of everything else we know about the world

I would say (as a slight simplification) that our goal in studying counterfactual reasoning should be to get counterfactuals to a point where we can answer questions about them using our normal reasoning.

I think that "what counterfactuals make sense in the context of decision-making" is a decision theory question, not a counterfactuals question, and I expect a good answer to look like explicit discussions of decision theory as opposed to looking like a more general discussion of the philosophical nature of counterfactuals

That post certainly seems to contain an awful lot of philosophy to me. And I guess even though this post and my post On the Nature of Counterfactuals don't make any reference to decision theory, that doesn't mean that it isn't in the background influencing what I write. I've written a lot of posts here, many of which discuss specific decision theory questions.

I guess I would still consider Joe Carlsmith's post a high-quality post if it had focused exclusively on the more philosophical aspects. And I guess philosophical arguments are harder to evaluate than mathematical ones and it can be disconcerting for some people, especially those used to the certainty of mathematics, but I believe it's possible to get to the level where you can avoid formalising things a lot of the time because you have enough experience to know how things will shake out.

Although I suppose in this case my reason for avoiding formalisation is that I see premature formalisation as a critical error. Once someone has produced a formal theory they will feel psychologically compelled to defend it, especially if it is mathematically beautiful, so I believe it's important to be very careful about making sure the assumptions are right before attempting to formalise anything.

Counterfactuals (in the potential outcome sense used in statistics) and Pearl's structural equation causality semantics are equivalent.
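A minimal sketch of what this equivalence looks like in a toy case, assuming a hypothetical structural equation Y := 2X + U_Y (the equation, function names, and numbers are all made up for illustration). The potential outcome Y(x) for a unit is obtained by evaluating the structural equation under do(X = x) with that unit's noise term held fixed, which matches Pearl's abduction, action, prediction recipe.

```python
# Toy structural causal model, assumed for illustration: Y := 2 * X + U_Y.

def f_y(x: float, u_y: float) -> float:
    """Structural equation for Y."""
    return 2.0 * x + u_y


def potential_outcome(x: float, u_y: float) -> float:
    """Potential outcome Y(x) for the unit identified by its noise term u_y."""
    return f_y(x, u_y)  # evaluate the equation under do(X = x)


def counterfactual(observed_x: float, observed_y: float, new_x: float) -> float:
    """Pearl-style counterfactual: abduction, then action, then prediction."""
    u_y = observed_y - 2.0 * observed_x  # abduction: recover the unit's noise term
    return f_y(new_x, u_y)               # action do(X = new_x), then prediction


# A unit observed with X = 1 and Y = 2.5 has U_Y = 0.5.
y_if_x_were_0 = counterfactual(observed_x=1.0, observed_y=2.5, new_x=0.0)
same_via_potential_outcome = potential_outcome(0.0, 0.5)
```

In this toy model the two routes agree by construction, since both hold the unit's noise term fixed while changing X.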

What are your thoughts on Newcomb's, etc.?

I gave a talk at FHI ages ago on how to use causal graphs to solve Newcomb type problems.  It wasn't even an original idea: Spohn had something similar in 2012.

I don't think any of this stuff is interesting, or relevant for AI safety.  There's a pretty big literature on model robustness and algorithmic fairness that uses causal ideas.

If you want to worry about the end of the world, we have climate change, pandemics, and the rise of fascism.

Why did you give a talk on causal graphs if you didn't think this kind of work was interesting or relevant? Maybe I'm misunderstanding what you're saying isn't interesting or relevant.

I mostly agree with Zack_M_Davis that this is a solved problem, although rather than talking about a formalization of causality I'd say this is a special case of epistemic circularity and thus an instance of the problem of the criterion. There's nothing unusual going on with counterfactuals other than that people sometimes get confused about what propositions are (e.g. they believe propositions have some sort of absolute truth beyond causality because they fail to realize epistemology is grounded in purpose rather than something eternal and external to the physical world) and then go on to get mixed up into thinking that something special must be going on with counterfactuals due to their confusion about propositions in general.

I don't know if I'll personally get around to explaining this in more detail, but I think this is low hanging fruit since it falls out so readily from understanding the contingency of epistemology caused by the problem of the criterion.

Which part are you claiming is a solved problem? Is it:

a) That counterfactuals can only be understood within the counterfactual perspective OR
b) The implications of this for decision theory OR
c) Both

I think A is solved, though I wouldn't exactly phrase it like that, more like counterfactuals make sense because they are what they are and knowledge works the way it does.

Zack seems to be making a claim to B, but I'm not expert enough in decision theory to say much about it.

Sorry, when you say A is solved, you're claiming that the circularity is known to be true, right?

Zack seems to be claiming that Bayesian Networks both draw out the implications and show that the circularity is false.

So unless I'm misunderstanding you, your answer seems to be at odds with Zack.

I don't think they're really at odds. Zack's analysis cuts off at a point where the circularity exists below it. There's still the standard epistemic circularity that exists whenever you try to ground out any proposition, counterfactual or not, but there's a level of abstraction where you can remove the seeming circularity by shoving it lower or deeper into the reduction of the proposition towards grounding out in some experience.

Another way to put this is that we can choose what to be pragmatic about. Zack's analysis chooses to be pragmatic about counterfactuals at the level of making decisions, and this allows removing the circularity up to the purpose of making a decision. If we want to be pragmatic about, say, accurately predicting what we will observe about the world, then there's still some weird circularity in counterfactuals to be addressed if we try to ask questions like "why these counterfactuals rather than others?" or "why can we formulate counterfactuals at all?".

Also I guess I should be clear that there's no circularity outside the map. Circularity is entirely a feature of our models of reality rather than reality itself. That's why, for example, the analysis on epistemic circularity I offer is that we can ground things out in purpose and thus the circularity was actually an illusion of trying to ground truth in itself rather than experience.

I'm not sure I've made this point very clearly elsewhere before, so sorry if that's a bit confusing. The point is that circularity is a feature of the relative rather than the absolute, so circularity exists in the map but not the territory. We only get circularity by introducing abstractions that can allow things in the map to depend on each other rather than the territory.

I'd say this is a special case of epistemic circularity


I wouldn't be surprised if other concepts such as probability were circular in the same way as counterfactuals, although I feel that this is more than just a special case of epistemic circularity. Like I agree that we can only reason starting from where we are - rather than from the view from nowhere - but counterfactuals feel different because they are such a fundamental concept that appears everywhere. As an example, our understanding of chairs doesn't seem circular in quite the same sense. That said, I'd love to see someone explore this line of thought.

Zack's analysis cuts off at a point where the circularity exists below it

I could be wrong, but I suspect Zack would disagree with the notion that there is a circularity below it involving counterfactuals. I wouldn't be surprised though if Zack acknowledged a circularity not involving counterfactuals.

Also I guess I should be clear that there's no circularity outside the map. Circularity is entirely a feature of our models of reality rather than reality itself

Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.

Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map.

Yes, I think there is something interesting going on where human brains seem to operate in a way that makes counterfactuals natural. I actually don't think there's anything special about counterfactuals, though, just that the human brain is designed such that thoughts are not strongly tethered to sensory input vs. "memory" (internally generated experience), but that's perhaps only subtly different than saying counterfactuals rather than something powering them is a fundamental feature of how our minds work.

Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.

I'll make a counter-claim and say that most people on LW in fact have rejected the use of Newcomb's Problem as a test that will say something useful about decision theories.

That being said, there is definitely a sub-community which believes deeply in the relevance of Newcomb's Problem as a test. This sub-community has historically created, and is still creating, a lot of traffic on this forum. This is to be expected: the people who reject Newcomb's Problem do not tend to post about it that much.

Personally, I reject Newcomb's Problem as a test.

I am also among the crowd who have posted explanations of Pearl Causality and Counterfactuals. My explanation here highlights the 'using a different world model' interpretation of Pearl's counterfactual math, so it may in fact touch on your reframing:

Or reframing this, counterfactuals only make sense from a cognitive frame.

I guess I'd roughly describe [a cognitive frame] as something that forms models of the world.

Overall, reading the post and the comment section, I feel that, if I reject Newcomb's Problem as a test, I can only ever write things that will not meet your prize criterion of usefully engaging with 'circular dependency'.

I have a sense that with 'circular dependency' you are also pointing to a broader class of philosophical problems of 'what does it mean for something to be true or correctly inferred'. If these were spelled out in detail, I also believe that I would end up rejecting the notion that we need to solve all these open problems definitively, the notion that these problems represent gaps in an agent foundations framework that still need to be filled, if the framework is to support AGI safety/alignment.

Overall, reading the post and the comment section, I feel that, if I reject Newcomb's Problem as a test, I can only ever write things that will not meet your prize criterion of usefully engaging with 'circular dependency'.


Firstly, I don't see why that would interfere with evaluating possible arguments for and against circular dependency. It's possible for an article to be "here's why these 3 reasons why we might think counterfactuals are circular are all false" (not stating that an article would necessarily have to engage with 3 different arguments to win).

Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think merely not mentioning counterfactuals means that there isn't a dependence on them. So there likely needs to be some part of such an article discussing why things that look counterfactual really aren't.

I briefly skimmed your article and I'm sure if I read it further I'd learn something interesting, but merely as is it wouldn't be on scope.

It's possible for an article to be here's why these 3 reasons why we might think counterfactuals are circular are all false

OK, so if I understand you correctly, you posit that there is something called 'circular epistemology'. You said in the earlier post you link to at the top:

You might think that the circularity is a problem, but circular epistemology turns out to be viable (see Eliezer's Where Recursive Justification Hits Bottom). And while circular reasoning is less than ideal, if the comparative is eventually hitting a point where we can provide no justification at all, then circular justification might not seem so bad after all.

You further suspect that circular epistemology might have something useful to say about counterfactuals, in terms of offering a justification for them without 'hitting a point where we can provide no justification at all'. And you have a bounty for people writing more about this.

Am I understanding you correctly?

Yeah, I believe epistemology to be inherently circular. I think it has some relation to counterfactuals being circular, but I don't see it as quite the same, as counterfactuals seem a lot harder to avoid using than most other concepts. The point of mentioning circular epistemology was to persuade people that my theory isn't as absurd as it sounds at first.

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.

Mind you, I see 'hitting a point where we provide no justification at all' as a positive thing in a mathematical system, a physical theory, or an entire epistemology, as long as these points are clearly identified.

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

If you're referring to the Wittgensteinian quote, I was merely quoting him, not endorsing his views.

Not aware of which part would be a Wittgensteinian quote. It's a long time since I read Wittgenstein, and I read him in German. In any case, I remain confused about what you mean by 'circular'.

Hmm... Oh, I think that was elsewhere on this thread. Probably not to you. Eliezer's Where Recursive Justification Hits Bottom seems to embrace a circular epistemology despite its title.

Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think

??? I don't follow. You meant to write "use system X instead of using system Y which calls itself a definition of counterfactuals"?

What I mean is that some people seem to think that if they can describe a system that explains counterfactuals without mentioning counterfactuals in the explanation, then they've avoided a circular dependence. But of course we can't just take things at face value; we have to dig deeper than that.

OK thanks for explaining. See my other recent reply for more thoughts about this.

So, this post only deals with agent counterfactuals (not environmental counterfactuals), but I believe I have solved the technical issue you mention about the construction of logical counterfactuals as it concerns TDT. See: https://www.alignmentforum.org/posts/TnkDtTAqCGetvLsgr/a-possible-resolution-to-spurious-counterfactuals

I have fewer thoughts about environmental counterfactuals but think a similar approach could be used to make statements along those lines, i.e. construct alternate agents receiving a different observation about the world. I'm not sure any very specific technical problem exists with that, though -- the TDT paper already talks about world model surgery.

I added a comment on the post directly, but I will add: we seem to roughly agree on counterfactuals existing in the imagination in a broad sense (I highlighted two ways this can go above - with counterfactuals being an intrinsic part of how we interact with the world or a pragmatic response to navigating the world). However, I think that following this through and asking why we care about them if they're just in our imagination ends up taking us down a path where counterfactuals being circular seems plausible. On the other hand, you seem to think that this path takes us somewhere where there isn't any circularity. Anyway, that's the difference in our positions as far as I can tell from having just skimmed your link.

I was attempting to solve a relatively specific technical problem related to self-proofs using counterfactuals. So I suppose I do think counterfactuals (at least non-circular ones) are useful. But I'm not sure I'd commit to any broader philosophical statement about counterfactuals beyond "they can be used in a specific formal way to help functions prove statements about their own output in a way that avoids Löb's Theorem issues". That being said, that's a pretty good use, if that's the type of thing you want to do? It's also not totally clear if you're imagining counterfactuals the same way I am. I am using the English term because it matches the specific thing I'm describing decently well, but the term has a broad meaning, and without having an extremely specific imagining, it's hard to make any more statements about what can be done with them.