# All of Chris_Leong's Comments + Replies

Which counterfactuals should an AI follow?

I believe that we need to take a Conceptual Engineering approach here. That is, I don't see counterfactuals as intrinsically part of the world, but rather as something we construct. The question to answer is: what purpose are we constructing them for? Once we've answered this question, we'll be 90% of the way towards constructing them.

As far as I can see, the answer is that we imagine a set of possible worlds and we notice that agents that use certain notions of counterfactuals tend to perform better than agents that don't. Of course, this raises the question of ... (read more)

The Counterfactual Prisoner's Dilemma

"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.

The Counterfactual Prisoner's Dilemma

Hmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say: well, I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F a... (read more)

Richard Ngo (1mo): The problem is that principle F elides the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker, my decision not to pay after being picked up logically causes me not to be picked up. The result of that decision would be a counterpossible world: a world in which the same decision algorithm outputs one thing at one point, and a different thing at another point. But in counterfactual mugging, if you choose not to pay, then this doesn't result in a counterpossible world. The whole point of functional decision theory is that it's very unlikely for these two policies to differ. For example, consider the Twin Prisoner's Dilemma, but where the walls of one room are green, and the walls of the other are blue. This shouldn't make any difference to the outcome: we should still expect both agents to cooperate, or both agents to defect. But the same is true for heads vs tails in Counterfactual Prisoner's Dilemma - they're specific details which distinguish you from your counterfactual self, but don't actually influence any decisions.
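The green-vs-blue-walls point can be put in toy-code form. A minimal sketch (my own illustration, not from the thread; the payoff numbers are assumptions): the wall colour is an input that the decision algorithm never consults, so two instances with different distinguishing details are guaranteed to produce the same output.

```python
# Illustrative sketch (assumed payoffs): the wall colour is passed in
# but never consulted, so the twins' choices are logically correlated
# regardless of the details that distinguish them.

def twin_decision(wall_colour):
    # Payoffs for the Twin Prisoner's Dilemma, assuming the twin runs
    # the same algorithm and therefore mirrors my choice.
    value_if_cooperate = 3  # (cooperate, cooperate)
    value_if_defect = 1     # (defect, defect)
    return 'cooperate' if value_if_cooperate > value_if_defect else 'defect'

# Different distinguishing details, same decision.
assert twin_decision('green') == twin_decision('blue') == 'cooperate'
```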
How do we prepare for final crunch time?

One of the biggest considerations would be the process for activating "crunch time". In what situations should crunch time be declared? Who decides? How far out would we want to activate and would there be different levels? Are there any downsides of such a process including unwanted attention?

If these aren't discussed in advance, then I imagine that far too much of the available time could be taken up by whether to activate crunch time protocols or not.

PS. I actually proposed here that we might be able to get a superintelligence to solve most of the probl... (read more)

The Counterfactual Prisoner's Dilemma

You're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging.

However, it does appear to provide a counter-example to the claim that we ought to adopt the principle of making decisions by only considering the branches of reality that are consistent with our knowledge as this would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result.

(Interestingly enough, blackmail problems seem to demonstrate that this principle is flawed as well).

Richard Ngo (1mo): I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me 10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money. So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account when reasoning about logical causation", then Counterfactual Prisoner's Dilemma doesn't give us anything new. If by "considering [other branches of reality]" you instead mean "acting to benefit my counterfactual self", then I deny that this is what is happening in CPD. You're acting to benefit your current self, via logical causation, just like in the Twin Prisoner's Dilemma. You don't need to care about your counterfactual self at all. So it's disanalogous to Counterfactual Mugging, where the only reason to pay is to help your counterfactual self.

Why 1-boxing doesn't imply backwards causation

I think the best way to explain this is to characterise the two views as slightly different functions, both of which return sets. Of course, the exact type representations aren't the point; the types are just there to illustrate the difference between two slightly different concepts. possible_world_pure() returns {x} where x is either <study & pass> or <beach & fail>, but we don't know which one it will be. possible_world_augmented() returns {<study & pass>, <beach & fail>}. Once we've defined possible worl... (read more)

Why 1-boxing doesn't imply backwards causation

Excellent question. Maybe I haven't framed this well enough. We need a way of talking about the fact that both your outcome and your action are fixed by the past.
We also need a way of talking about the fact that we can augment the world with counterfactuals (of course, since we don't have complete knowledge of the world, we typically won't know which is the factual and which are the counterfactuals), and that these are two distinct ways of looking at the world. I'll try to think about a cleaner way of framing this, but do you have any suggestions? (For the rec... (read more)

Chris_Leong (2mo): I think the best way to explain this is to characterise the two views as slightly different functions, both of which return sets. Of course, the exact type representations aren't the point; the types are just there to illustrate the difference between two slightly different concepts.

possible_world_pure() returns {x} where x is either <study & pass> or <beach & fail>, but we don't know which one it will be

possible_world_augmented() returns {<study & pass>, <beach & fail>}

Once we've defined possible worlds, this naturally provides us with a definition of possible actions and possible outcomes that matches what we expect. So for example:

size(possible_world_pure()) = size(possible_action_pure()) = size(possible_outcome_pure()) = 1

size(possible_world_augmented()) = size(possible_action_augmented()) = size(possible_outcome_augmented()) = 2

And if we have a decide function that iterates over all the counterfactuals in the set and returns the highest one, we need to call it on possible_world_augmented() rather than possible_world_pure().

Note that they aren't always this similar. For example, for Transparent Newcomb they are:

possible_world_pure() returns {<1-box, million>}

possible_world_augmented() returns {<1-box, million>, <2-box, thousand>}

The point is that if we remain conscious of the type differences then we can avoid certain errors. For example, possible_outcome_pure() = {"PASS"} doesn't mean that possible_outcome_augmented() = {"PASS"}.
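The two functions can be sketched concretely. This is a hedged illustration under my own assumptions (the utility numbers and the choice of which world is "actual" are invented for the example), not part of the original comment:

```python
# Sketch of the two views using the student example.
# Worlds are (action, outcome) pairs; utilities are illustrative.

def possible_world_pure():
    # Reality contains exactly one world, fixed by the past,
    # though the agent may not know which one it is.
    actually_studies = True  # unknown to the agent; fixed here for illustration
    return {('study', 'pass')} if actually_studies else {('beach', 'fail')}

def possible_world_augmented():
    # The augmented view includes the counterfactual world as well.
    return {('study', 'pass'), ('beach', 'fail')}

def decide(worlds, utility):
    # Iterate over the counterfactuals and return the best action.
    return max(worlds, key=lambda w: utility[w])[0]

utility = {('study', 'pass'): 1, ('beach', 'fail'): 0}

assert len(possible_world_pure()) == 1
assert len(possible_world_augmented()) == 2
# decide must be called on the augmented set; on the pure set it could
# only ever return the single thing that actually happens.
assert decide(possible_world_augmented(), utility) == 'study'
```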
It's the latter which would imply it doesn't matter what the student does, not the former.

Alignment By Default

Also, I have another strange idea that might increase the probability of this working. If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"? I don't think it's likely to work, but thought I'd share anyway.

Alignment By Default

Thanks! Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?

johnswentworth (2mo): Yup, this is basically where that probability came from. It still feels about right.

Alignment By Default

I guess the main issue I have with this argument is that an AI system that is extremely good at prediction is unlikely to just have a high-level concept corresponding to human values (if it does contain such a concept). Instead, it's likely to also include a high-level concept corresponding to what people say about values - or rather, several corresponding to what various different groups would say about human values. If your proxy is based on what people say, then these concepts which correspond to what people say will match much better - and the probability of at least one of these concepts being the best match is increased by the large number of them. So I don't put a very high weight on this scenario at all.

johnswentworth (2mo): This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
Applying the Counterfactual Prisoner's Dilemma to Logical Uncertainty

I'm curious, do you find this argument for paying in Logical Counterfactual Mugging persuasive? What about the Counterfactual Prisoner's Dilemma argument for the basic Counterfactual Mugging?

"Another approach is to change the example to remove the objection" - Interesting point about the poker game version. It's still a one-shot game, so there's no real reason to hide a 0 unless you think they're a pretty powerful predictor, but it is always predicting something coherent.

"I don't see how you're applying CPD to LU" - The claim is that you should pay in the Logical Co... (read more)

Decision Theory is multifaceted

To be honest, this thread has gone on long enough that I think we should end it here. It seems to me that you are quite confused about this whole issue, though I guess from your perspective it seems like I am the one who is confused. I considered asking a third person to try looking at this thread, but I decided it wasn't worth calling in a favour. I made a slight edit to my description of Counterfactual Prisoner's Dilemma, but I don't think this will really help you understand: Omega, a perfect predictor, flips a coin and tells you how it came up. If it come ... (read more)

Michele Campolo (8mo): Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.

Decision Theory is multifaceted

I am considering the second intuition. Acting according to it results in you receiving $0 in Counterfactual Prisoner's Dilemma, instead of losing $100. This is because if you act updatefully when it comes up heads, you have to also act updatefully when it comes up tails. If this still doesn't make sense, I'd encourage you to reread the post.

Michele Campolo (8mo): Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"
That formulation is analogous to standard counterfactual mugging, stated in this way: Omega flips a coin. If it comes up heads, Omega will give you 10000 in case you would pay 100 when tails. If it comes up tails, Omega will ask you to pay 100. What do you do? According to these two formulations, the correct answer seems to be the one corresponding to the first intuition.

Now consider instead this formulation of counterfactual PD: Omega, a perfect predictor, tells you that it has flipped a coin, and it has come up heads. Omega asks you to pay 100 (here and now) and gives you 10000 (here and now) if you would pay in case the coin landed tails. Omega also explains that, if the coin had come up tails—but note that it hasn't—Omega would tell you such and such (symmetrical situation). What do you do?

The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000. And this formulation of counterfactual PD is analogous to this [https://wiki.lesswrong.com/wiki/Counterfactual_mugging] formulation of counterfactual mugging, where the second intuition refuses to pay. Is your opinion that this is false/not admissible/impossible? Or are you saying something else entirely? In any case, if you could motivate your opinion, whatever that is, you would help me understand. Thanks!

Decision Theory is multifaceted

If you pre-commit to that strategy (heads don't pay, tails pay) it provides 10000, but it only works half the time. If you decide, after you see the coin, not to pay in that case, then this will lead to the strategy (not pay, not pay) which provides 0.

Michele Campolo (8mo): It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900. On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?"
And here the second intuition applies: the DM can decide not to pay (in this case) and to pay when heads. Omega recognises the intent of the DM, and gives 10000. Maybe you are not even considering the second intuition because you take for granted that the agent has to decide one policy "at the beginning" and stick to it, or, as you wrote, "pre-commit". One of the points of the post is that it is unclear where this assumption comes from, and what it exactly means. It's possible that my reasoning in the post was not clear, but I think that if you reread the analysis you will see the situation from both viewpoints.

Decision Theory is multifaceted

"After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition)" - No you can't. The only way to get 10,000 is to be predicted to pay if the coin had come up the opposite way to how it actually came up. And that's only a 50/50 chance.

Michele Campolo (8mo): If the DM knows the outcome is heads, why can't he not pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?

Decision Theory is multifaceted

Well, you can only predict conditional on what you write, you can't predict unconditionally. However, once you've fixed what you'll write in order to make a prediction, you can't then change what you'll write in response to that prediction.

Actually, it isn't about utility in expectation. If you are the kind of person who pays you gain $9900, if you aren't you gain $100. This is guaranteed utility, not expected utility.

Michele Campolo (8mo): The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).
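The disagreement in this thread is easier to see with the payoffs written out. A minimal sketch, under my own assumptions (the usual $100/$10,000 figures, a perfect predictor, and a policy encoded as "do I pay on each coin result"); Omega rewards you based on what you would do on the other branch, so each policy's payoff is fixed regardless of the flip:

```python
# Hedged illustration (not from the thread): payoffs in the
# Counterfactual Prisoner's Dilemma under a perfect predictor.
# A policy maps each coin result to True (pay $100) or False.
# Omega pays $10,000 when it predicts you would have paid had
# the coin landed the other way.

def payoff(policy, coin):
    other = 'tails' if coin == 'heads' else 'heads'
    reward = 10_000 if policy[other] else 0  # perfect counterfactual prediction
    cost = 100 if policy[coin] else 0
    return reward - cost

always_pay = {'heads': True, 'tails': True}
never_pay = {'heads': False, 'tails': False}   # refusing whenever actually asked
heads_refuse = {'heads': False, 'tails': True}  # the "second intuition" policy

for coin in ('heads', 'tails'):
    assert payoff(always_pay, coin) == 9_900  # guaranteed, not just expected
    assert payoff(never_pay, coin) == 0

# The asymmetric policy only "works" on one of the two branches:
assert payoff(heads_refuse, 'heads') == 10_000
assert payoff(heads_refuse, 'tails') == -100
```

Under this encoding, (pay, pay) gets 9900 whichever way the coin lands, refusing whenever asked gets 0, and the asymmetric policy gets 10000 or loses 100 depending on the flip, which is the "only works half the time" point above.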
Decision Theory is multifaceted

Hey Michele, I agree that it is important to look very closely at problems like Counterfactual Mugging and not accept solutions that involve handwaving.

Suppose the predictor knows that if it writes M on the paper you'll choose N, and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem, since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.

I was quite skeptical of paying in Co... (read more)

Michele Campolo (8mo): Hi Chris! My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction—the one that you wrote about—that ceases to be correct after it is communicated. And later in the post I discuss whether this problematic feature of prediction extends to other scenarios, leaving the question open. What did you want to say exactly? I've read the problem and the analysis I did for (standard) counterfactual mugging applies to your version as well. The first intuition is that, before knowing the toss outcome, the DM wants to pay in both cases, because that gives the highest utility (9900) in expectation. The second intuition is that, after the DM knows (wlog) the outcome is heads, he doesn't want to pay anymore in that case—and wants to be someone who pays when tails is the outcome, thus getting 10000.

Arguments about fast takeoff

A few thoughts:

• Even if we could theoretically double output for a product, it doesn't mean that there will be sufficient demand for it to be doubled.
This potential depends on how much of the population already has thing X.

• Even if we could effectively double our workforce, if we are mostly replacing low-value jobs, then our economy wouldn't double.

• Even if we could, say, halve the cost of producing robot workers, that might simply result in extra profits for a company instead of increasing the size of the economy.

• Even if we have a technology that could ... (read more)

What makes counterfactuals comparable?

Yeah, sorry, that's a typo, fixed now.

What makes counterfactuals comparable?

Hey Vojta, thanks so much for your thoughts.

"I feel slightly worried about going too deep into discussions along the lines of 'Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met' :D" - Fair enough. Especially since this post isn't so much about the way people currently frame their arguments, but an attempt to persuade people to reframe the discussion around comparability.

"My take on how to do counterfactuals correctly is that this is not ... (read more)

Motivating Abstraction-First Decision Theory

Ah, I think I now get where you are coming from.

Motivating Abstraction-First Decision Theory

I guess what is confusing me is that you seem to have provided a reason why we shouldn't just care about high-level functional behaviour (because this might miss correlations between the low-level components), then in the next sentence you're acting as though this is irrelevant?

johnswentworth (1y): Yeah, I worried that would be confusing when writing the OP. Glad you left a comment, I'm sure other people are confused by it too. The reason why we shouldn't just care about high-level functional behavior is that the abstraction leaks. Problem is, it's not clear that the abstraction leaks in a way which is relevant to the outcome.
And if the leak isn't relevant to the outcome, then it's not clear why we need to care about it - or (more importantly) why the agents themselves would care about it. Please keep asking for clarification if that doesn't help explain it; I want to make this make sense for everyone else too.

Motivating Abstraction-First Decision Theory

"First and foremost: why do we care about validity of queries on correlations between the low-level internal structures of the two agent-instances? Isn't the functional behavior all that's relevant to the outcome? Why care about anything irrelevant to the outcome?" - I don't follow what you are saying here.

johnswentworth (1y): The idea in the first part of the post is that, if we try to abstract just one of the two agent-instances, then the abstraction fails because low-level details of the calculations in the two instances are correlated; the point of abstraction is (roughly) that low-level details should be uncorrelated between different abstract components. But then we ask why that's a problem; in what situations does the one-instance agent abstraction actually return incorrect predictions? What queries does it not answer correctly? Answer: queries which involve correlations between the low-level calculations within the two agent-instances. But do we actually care about those particular queries? Like, think about it from the agent's point of view. It only cares about maximizing the outcome, and the outcome doesn't depend on the low-level details of the two agents' calculations; the outcome only depends on the functional form of the agents' computations (i.e. their input-output behavior). Does that make sense?

Two Alternatives to Logical Counterfactuals

Yes (see thread with Abram Demski). Hmm, yeah this could be a viable theory.
Anyway, to summarise the argument I make in Is Backwards Causation Necessarily Absurd?: I point out that since physics is pretty much reversible, instead of A causing B, it seems as though we could also imagine B causing A and time going backwards. In this view, it would be reasonable to say that one-boxing (backwards-)caused the box to be full in Newcomb's. I only sketched the theory because I don't have enough physics knowledge to evaluate it. But the point is that we can give... (read more)

Two Alternatives to Logical Counterfactuals

"So my usage (of free will) seems pretty standard" - Not quite. The way you are using it doesn't necessarily imply real control; it may be imaginary control.

"All word definitions are determined in large part by social convention" - True. Maybe I should clarify what I'm suggesting. My current theory is that there are multiple reasonable definitions of counterfactual, and it comes down to social norms as to what we accept as a valid counterfactual. However, it is still very much a work in progress, so I wouldn't be able to provide more than vague detail... (read more)

Jessica Taylor (1y): I'm discussing a hypothetical agent who believes itself to have control. So its beliefs include "I have free will". Its belief isn't "I believe that I have free will".

Yes, that makes sense.

Yes (see thread with Abram Demski).

Already factorized as an agent interacting with an environment.

Two Alternatives to Logical Counterfactuals

I found parts of your framing quite original and I'm still trying to understand all the consequences. Firstly, I'm also opposed to characterising the problem in terms of logical counterfactuals. I've argued before that Counterfactuals are an Answer Not a Question, although maybe it would have been clearer to say that they are a Tool Not a Question instead. If we're talking strictly, it doesn't make sense to ask what maths would be like if 1+1=3, as it doesn't, but we can construct a paraconsistent logic where it makes sense to...
(read more)

Jessica Taylor (1y): Wikipedia [https://en.wikipedia.org/wiki/Free_will] says "Free will is the ability to choose between different possible courses of action unimpeded." SEP [https://plato.stanford.edu/entries/freewill/] says "The term “free will” has emerged over the past two millennia as the canonical designator for a significant kind of control over one's actions." So my usage seems pretty standard.

All word definitions are determined in large part by social convention. The question is whether the social convention corresponds to a definition (e.g. with truth conditions) or not. If it does, then the social convention is realist; if not, it's nonrealist (perhaps emotivist, etc).

Not necessarily. An agent may be uncertain over its own action, and thus have uncertainty about material conditionals involving its action. The "possible worlds" represented by this uncertainty may be logically inconsistent, in ways the agent can't determine before making the decision.

I don't understand this? I thought it searched for proofs of the form "if I take this action, then I get at least this much utility", which is a material conditional.

Policy-dependent source code does this; one's source code depends on one's policy.

I think UDT makes sense in "dualistic" decision problems that are already factorized as "this policy leads to these consequences". Extending it to a nondualist case brings up difficulties, including the free will / determinism issue. Policy-dependent source code is a way of interpreting UDT in a setting with deterministic, knowable physics.

[Meta] Do you want AIS Webinars?

I would be keen to run a webinar on Logical Counterfactuals.

Linda Linsefors (1y): Let's do it! If you pick a time and date and write up an abstract, then I will sort out the logistics. Worst case it's just you and me having a conversation, but most likely some more people will show up.
Reference Post: Trivial Decision Problem

"I think the next place to go is to put this in the context of methods of choosing decision theories - the big ones being reflective modification and evolutionary/population level change. Pretty generally it seems like the trivial perspective is unstable under these, but there are some circumstances where it's not." - Sorry, I'm not following what you're saying here.

Charlie Steiner (1y): Reflective modification flow: Suppose we have an EDT agent that can take an action to modify its decision theory. It will try to choose based on the average outcome conditioned on taking the different decision. In some circumstances, EDT agents are doing well, so it will expect to do well by not changing; in other circumstances, maybe it expects to do better conditional on self-modifying to use the Counterfactual Perspective more.

Evolutionary flow: If you put a mixture of EDT and FDT agents in an evolutionary competition where they're playing some iterated game and high scorers get to reproduce, what does the population look like at large times, for different games and starting populations?

The Counterfactual Prisoner's Dilemma

We can assume that the coin is flipped out of your sight.

Vanessa Kosoy's Shortform

Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.

Vanessa Kosoy's Shortform

"The key point is, 'applying the counterfactual belief that the predictor is always right' is not really well-defined" - What do you mean here? I'm curious whether you're referring to the same as, or similar to, the issue I was referencing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit's Hitchhiker to end up in town if the predictor is perfect, so that it wouldn't actually be well-defined what the predictor was predicting. A...
(read more)

Vanessa Kosoy (1y): It is not a mere "concern", it's the crux of the problem really. What people in the AI alignment community have been trying to do is start with some factual and "objective" description of the universe (such as a program or a mathematical formula) and derive counterfactuals. The way it's supposed to work is, the agent needs to locate all copies of itself or things "logically correlated" with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision-theoretic scenarios was never found. Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).

Transparent Newcomb's Problem and the limitations of the Erasure framing

Some people want to act as though a simulation of you is automatically you, and my argument is that it is bad practice to assume this. I'm much more open to the idea that some simulations might be you in some sense than the claim that all are. This seems compatible with a fuzzy cut-off.

Transparent Newcomb's Problem and the limitations of the Erasure framing

"I actually don't think that there is a general procedure to tell what is you, and what is a simulation of you" - Let's suppose I promise to sell you an autographed Michael Jackson CD. But then it turns out that the CD wasn't signed by Michael, but by me. Now I'm really good at forgeries, so good in fact that my signature matches his atom to atom. Haven't I still lied?

Donald Hobson (1y): Imagine sitting outside the universe, and being given an exact description of everything that happened within the universe.
From this perspective you can see who signed what. You can also see whether your thoughts are happening in biology or silicon or whatever. My point isn't "you can't tell whether or not you're in a simulation so there is no difference"; my point is that there is no sharp cut-off point between simulation and not-simulation. We have a "know it when you see it" definition with ambiguous edge cases. Decision theory can't have different rules for dealing with dogs and not-dogs because some things are on the ambiguous edge of dogginess. Likewise, decision theory can't have different rules for you, copies of you, and simulations of you, as there is no sharp cut-off. If you want to propose a continuous "simulatedness" parameter, and explain where that gets added to decision theory, go ahead. (Or propose some sharp cutoff.)

Transparent Newcomb's Problem and the limitations of the Erasure framing

Not at all. Your comments helped me realise that I needed to make some edits to my post.

Transparent Newcomb's Problem and the limitations of the Erasure framing

In other words, the claim isn't that your program is incorrect; it's that it requires more justification than you might think in order to persuasively show that it correctly represents Newcomb's problem. Maybe you think understanding this isn't particularly important, but I think knowing exactly what is going on is key to understanding how to construct logical counterfactuals in general.

Transparent Newcomb's Problem and the limitations of the Erasure framing

I actually don't know Haskell, but I'll take a stab at decoding it tonight or tomorrow.

Open-box Newcomb's is normally stated as "you see a full box", not "you or a simulation of you sees a full box". I agree with this reinterpretation, but I disagree with glossing it over. My point was that if we take the problem description super-literally, as you seeing the box and not a simulation of you, then you must one-box.
Of course, since this provides a trivial decision problem, we'll want to reinterpret it in some way, and that's what I'm providing a justification for.

Vladimir Slepnev (1y): I see, thanks, that makes it clearer. There's no disagreement, you're trying to justify the approach that people are already using. Sorry about the noise.

Transparent Newcomb's Problem and the limitations of the Erasure framing

Okay, I have to admit that that's kind of cool; but on the other hand, it also completely misses the point. I think we need to backtrack. A maths proof can be valid, but the conclusion false if at least one premise is false, right? So unless a problem has already been formally defined, it's not enough to just throw down a maths proof; you also have to justify that you've formalised it correctly.

Vladimir Slepnev (1y): Well, the program is my formalization. All the premises are right there. You should be able to point out where you disagree.

Transparent Newcomb's Problem and the limitations of the Erasure framing

I've already addressed this in the article above, but my understanding is as follows: this is one of those circumstances where it is important to differentiate between you being in a situation and a simulation of you being in a situation. I really should write a post about this - but in order for a simulation to be accurate, it simply has to make the same decisions in decision theory problems. It doesn't have to have anything else the same - in fact, it could be an anti-rational agent with the opposite utility function. Note that I'm not clai... (read more)

Donald Hobson (1y): "These two people might look the same, they might be identical on a quantum level, but one of them is a largely rational agent, and the other is an anti-rational agent with the opposite utility function." I think that calling something an anti-rational agent with the opposite utility function is a weird description that doesn't cut reality at its joints. There is a simple notion of a perfect sphere.
There is also a simple notion of a perfect optimizer. Real-world objects aren't perfect spheres, but some are pretty close. Thus "sphere" is a useful approximation, and "sphere + error term" is a useful description. Real agents aren't perfect optimisers (ignoring contrived goals like "1 for doing whatever you were going to do anyway, 0 else"), but some are pretty close, hence "utility function + biases" is a useful description. This makes the notion of an anti-rational agent with the opposite utility function like an inside-out sphere with its surface offset inwards by twice the radius. It's a cack-handed description of a simple object in terms of a totally different simple object and a huge error term. I actually don't think that there is a general procedure to tell what is you, and what is a simulation of you. Standard argument about slowly replacing neurons with nanomachines, slowly porting it to software, slowly abstracting and proving theorems about it rather than running it directly. It is an entirely meaningful utility function to only care about copies of your algorithm that are running on certain kinds of hardware. That makes you a "biochemical brains running this algorithm" maximizer. The paperclip maximizer doesn't care about any copy of its algorithm. Humans worrying about whether the predictor's simulation is detailed enough to really suffer is due to specific features of human morality. From the perspective of the paperclip maximizer doing decision theory, what we care about is logical correlation.

1Vladimir Slepnev1yI couldn't understand your comment, so I wrote a small Haskell program [https://repl.it/repls/OrangeVirtuousOffices] to show that two-boxing in the transparent Newcomb problem is a consistent outcome. What parts of it do you disagree with?

The Rocket Alignment Problem

This is a very important post. It provides a justification for why agent foundations research might be important, which was always unclear to me.
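For readers following the transparent Newcomb exchange above: the style of formalization Slepnev describes (his actual program is the Haskell one at the repl.it link; what follows is only a hypothetical Python sketch, assuming the predictor works by running the agent's own decision function on the "full box" observation) might look like this:

```python
# Hypothetical sketch of transparent (open-box) Newcomb, assuming the
# predictor simulates the agent's decision function. Not Slepnev's program.

def predictor(agent):
    # Big box is filled iff the agent, on seeing a full big box, one-boxes.
    return agent(big_box_full=True) == "one-box"

def play(agent):
    big_box_full = predictor(agent)
    choice = agent(big_box_full=big_box_full)
    if choice == "one-box":
        return 1_000_000 if big_box_full else 0
    # two-box: big box contents plus the transparent $1,000
    return (1_000_000 if big_box_full else 0) + 1_000

def one_boxer(big_box_full):
    return "one-box"

def two_boxer(big_box_full):
    return "two-box"

print(play(one_boxer))  # 1000000
print(play(two_boxer))  # 1000
```

Note that in this formalization a two-boxer only ever "sees" a full box inside the predictor's simulation, never in the real play-out - which is exactly the you-versus-simulation-of-you distinction being argued over in the thread.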
“embedded self-justification,” or something like that

Just a quick note: Sometimes there is a way out of this kind of infinite regress by implementing an algorithm that approximates the limit. Of course, you can also be put back into an infinite regress by asking if there is a better approximation.

A Critique of Functional Decision Theory

I feel the bomb problem could be better defined. What is the predictor predicting? Is it always predicting what you'll do when you see the note saying it will predict right? What about if you don't see this note because it predicts you'll go left? Then there's the issue that if it makes a prediction by a) trying to predict whether you'll see such a note or not, then b) predicting what the agent does in this case, then it'd already have to predict the agent's choice in order to make the prediction in stage a). In other words, a depends on b and b depends on a; the situation is circular. (edited since my previous comment was incorrect)

Counterfactuals are an Answer, Not a Question

Hopefully I tie up my old job soon so that I can dive deeper into Agent Foundations, including your sequence on CDT=EDT. Anyway, I'm slightly confused by your comment, because I get the impression you think there is more divergence between our ideas than I think exists. When you talk about it being constructed rather than real, it's very similar to what I meant when I (briefly) noted that some definitions are more natural than others (https://www.lesswrong.com/posts/peCFP4zGowfe7Xccz/natural-structures-and-definitions). It's on this basis that I argue raw co ...
(read more)

Probability as Minimal Map

I hope that I'm not becoming that annoying person who redirects every conversation towards their own pet theory even when it's of minimal relevance, but I think there are some very strong links here with my theory of logical counterfactuals as forgetting or erasing information (https://www.lesswrong.com/posts/BRuWm4GxcTNPn4XDX/deconfusing-logical-counterfactuals). In fact, I wasn't quite aware of how strong these similarities were before I read this post. Both probabilities* and logical counterfactuals don't intrinsically exist in the universe; instead they ... (read more)

1johnswentworth2yThat is highly relevant, and is basically where I was planning to go with my next post. In particular, see the dice problem in this comment [https://www.lesswrong.com/posts/hLFD6qSN9MmQxKjG5/embedded-agency-via-abstraction?commentId=G66awwNsqtMpThHZS#G66awwNsqtMpThHZS] - sometimes throwing away information requires randomizing our probability distribution. I suspect that this idea can be used to rederive Nash equilibria in a somewhat-more-embedded-looking scenario, e.g. our opponent making its decision by running a copy of us to see what we do. Thanks for pointing out the relevance of your work, I'll probably use some of it.

Troll Bridge

I'm finding some of the text in the comic slightly hard to read.

2Abram Demski2yYep, sorry. The illustrations were not actually originally meant for publication; they're from my personal notes. I did it this way (1) because the pictures are kind of nice, (2) because I was frustrated that no one had written a good summary post on Troll Bridge yet, (3) because I was in a hurry. Ideally I'll edit the images to be more suitable for the post, although adding the omitted content is a higher priority.

Formalising decision theory is hard

I'm pretty confident in my forgetting approach to logical counterfactuals, though I've sadly been too busy the last few months to pick it up again.
I haven't quite formalised it yet - I'm planning to try to collect all the toy problems, and I think it'll likely drop out fairly quickly that there are a few distinct but closely related notions of logical counterfactual. Anyway, I think I've finally figured out how to more precisely state my claim that we ought to build upon consistent counterfactuals.

Formalising decision theory is hard

Utilising a model that assumes we can decouple is different from assuming we can decouple. For example, calculus assumes that space is infinitely divisible, but using a formula derived from calculus to calculate the volume of a sphere doesn't require you to assume space is infinitely divisible. It just has to work as an approximation.

AI Alignment Open Thread August 2019

I've just been invited to this forum. How do I decide whether to put a post on the Alignment Forum vs. Less Wrong?

3Matthew "Vaniver" Graves2yBasically, whether you think it's primarily related to alignment vs. rationality. (Everything on the AF is also on LW, but the reverse isn't true.) The feedback loop if you're posting too much or stuff that isn't polished enough is downvotes (or insufficient upvotes).

Contest: $1,000 for good questions to ask to an Oracle AI

What if another AI would have counterfactually written some of those posts to manipulate us?

3Wei Dai2yIf that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.
The Game Theory of Blackmail

"If I have an action that I can take that would help me but hurt you, and I ask you for some compensation for refraining from taking this action, then this is more like a value trade than a blackmail" - Maybe. What about if an action gives you 1 utility, but costs me 100 and you demand 90? That sounds a lot like blackmail!

1Linda Linsefors2yI would decompose that into a value trade + a blackmail. The default for me would be to take the action that gives me 1 utility. But you can offer me a trade where you give me something better in return for me not taking that action. This would be a value trade. Let's now take me agreeing to your proposition as the default. If I then choose to threaten to call the deal off unless you pay me an even higher amount, then this is blackmail. I don't think that these parts (the value trade and the blackmail) should be viewed as sequential. I wrote it that way for illustrative purposes. However, I do think that any value trade has a Game of Chicken component, where each player can threaten to call off the trade if they don't get the more favorable deal.
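The decomposition above can be checked directly with the numbers from this exchange (the action gives one party 1 utility and costs the other 100, with 90 demanded; the trade price of 2 below is an arbitrary illustrative choice):

```python
# Toy payoff check for the value-trade vs. blackmail decomposition,
# using the thread's numbers. "You" can take an action; "me" bears the cost.

action_gain_you = 1    # your benefit from taking the action
action_cost_me = 100   # my loss if you take it
demand = 90            # what you ask me to pay you to refrain

# Default: you take the action.
default_you, default_me = action_gain_you, -action_cost_me

# Value trade: I pay you slightly more than your gain (say 2) to refrain;
# both of us do better than the default.
trade_price = 2
trade_you, trade_me = trade_price, -trade_price

# "Blackmail"-flavoured deal: you demand 90. I still prefer paying
# (-90) to eating the full cost (-100), but the surplus split is extreme.
blackmail_you, blackmail_me = demand, -demand

print(default_me, trade_me, blackmail_me)  # -100 -2 -90
```

The arithmetic shows why the 90-demand "sounds a lot like blackmail" even though I technically still benefit from accepting it: the compensation demanded (90) vastly exceeds the refrainer's own benefit from the action (1).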
An environment for studying counterfactuals

"The agent receives an observation O as input, from which it can infer whether exploration will occur." - I'm confused here. What is this observation? Does it purely relate to whether it will explore or does it also provide data about the universe? And is it merely a correlation or a binary yes or no?

1Nisan2yThe observation can provide all sorts of information about the universe, including whether exploration occurs. The exact set of possible observations depends on the decision problem. E and O can have any relationship, but the most interesting case is when one can infer E from O with certainty.
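A toy version of the setup Nisan describes (hypothetical; the real environment is specified in the linked post), where the observation O bundles world data with an exploration flag E that the agent can read off with certainty:

```python
import random

def make_observation(rng):
    # O bundles ordinary world data with an exploration flag E;
    # in this toy version E is literally a field of O.
    explore = rng.random() < 0.1          # exploration occurs 10% of the time
    world_data = rng.randint(0, 9)        # stand-in for data about the universe
    return {"explore": explore, "world": world_data}

def agent(observation):
    # The agent infers E from O with certainty and randomizes
    # only on exploration steps.
    if observation["explore"]:
        return random.choice(["left", "right"])
    return "left"  # default (exploited) policy

rng = random.Random(0)
obs = make_observation(rng)
print(agent(obs))
```

This is the "most interesting case" from the reply: E is recoverable from O exactly, rather than merely correlated with it.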