All of jessicata's Comments + Replies

Do you think of counterfactuals as a speedup on evolution? Could this be operationalized by designing AIs that quantilize on some animal population, therefore not being far from the population distribution, but still surviving/reproducing better than average?

Speedup on evolution? Maybe? Might work okayish, but doubt the best solution is that speculative.

Note the preceding

Let's first, within a critical agential ontology, disprove some very basic forms of determinism.

I'm assuming use of a metaphysics in which you, the agent, can make choices. Without this metaphysics there isn't an obvious motivation for a theory of decisions. As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion.

Maybe this metaphysics leads to contradictions. In the rest of the post I argue that it doesn't contradict belief in physical causality including as applied to the self.

I've noticed that issue as well. Counterfactuals are more a convenient model/story than something to be taken literally. You've grounded decision by taking counterfactuals to exist a priori. I ground them by noting that our desire to construct counterfactuals is ultimately based on evolved instincts and/or behaviours so these stories aren't just arbitrary stories but a way in which we can leverage the lessons that have been instilled in us by evolution. I'm curious, given this explanation, why do we still need choices to be actual?

AFAIK the best known way of reconciling physical causality with "free will" like choice is constructor theory, which someone pointed out was similar to my critical agential approach.

I commented directly on your post.

AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold, AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that hum

... (read more)

My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.

“myopia” (not sure who correctly named this as a corrigibility principle),

I think this is from Paul Christiano, e.g. this discussion.

I assumed EER did account for that based on:

All portable air conditioner’s energy efficiency is measured using an EER score. The EER rating is the ratio between the useful cooling effect (measured in BTU) to electrical power (in W). It’s for this reason that it is hard to give a generalized answer to this question, but typically, portable air conditioners are less efficient than permanent window units due to their size.

3Oliver Habryka8mo
This article explains the difference: [] EER measures performance in BTUs, which are simply measuring how much work the AC performs, without taking into account any backflow of cold air back into the AC, or infiltration issues.

Regarding the back-and-forth on air conditioners, I tried Google searching to find a precedent for this sort of analysis; the first Google result was "air conditioner single vs. dual hose" was this blog post, which acknowledges the inefficiency johnswentworth points out, overall recommends dual-hose air conditioners, but still recommends single-hose air conditioners under some conditions, and claims the efficiency difference is only about 12%.


In general, a single-hose portable air conditioner is best suited for smaller rooms. The reason being

... (read more)
2Oliver Habryka8mo
EER does not account for heat infiltration issues, so this seems confused. CEER does, and that does suggest something in the 20% range, but I am pretty sure you can't use EER to compare a single-hose and a dual-hose system.

Btw, there is some amount of philosophical convergence between this and some recent work I did on critical agential physics; both are trying to understand physics as laws that partially (not fully) predict sense-data starting from the perspective of a particular agent.

It seems like "infra-Bayesianism" may be broadly compatible with frequentism; extending Popper's falsifiability condition to falsify probabilistic (as opposed to deterministic) laws yields frequentist null hypothesis significance testing, e.g. Neyman Pearson; similarly, frequentism also attem... (read more)

3Vanessa Kosoy8mo
Thanks, I'll look at that! Yes! In frequentism, we define probability distributions as limits of frequencies. One problem with this is, what to do if there's no convergence? In the real world, there won't be convergence unless you have an infinite sequence of truly identical experiments, which you never have. At best, you have a long sequence of similar experiments. Arguably, infrabayesianism solves it by replacing the limit with the convex hull of all limit points. But, I view infrabayesianism more as a synthesis between bayesianism and frequentism. Like in frequentism, you can get asymptotic guarantees. But, like in bayesiansim, it makes sense to talk of priors (and even updates), and measure the performance of your policy regardless of the particular decomposition of the prior into hypotheses (as opposed to regret which does depend on the decomposition). In particular, you can define the optimal infrabayesian policy even for a prior which is not learnable and hence doesn't admit frequentism-style guarantees.

Thanks for reading all the posts!

I'm not sure where you got the idea that this was to solve the spurious counterfactuals problem, that was in the appendix because I anticipated that a MIRI-adjacent person would want to know how it solves that problem.

The core problem it's solving is that it's a well-defined mathematical framework in which (a) there are, in some sense, choices, and (b) it is believed that these choices correspond to the results of a particular Turing machine. It goes back to the free will vs determinism paradox, and shows that there's a fo... (read more)

Thanks for that clarification. I suppose that demonstrates that the 5 and 10 problem is a broader problem than I realised. I still think that it's only a hard problem within particular systems that have a vulnerability to it. Yeah, we have significant agreement, but I'm more conservative in my interpretations. I guess this is a result of me being, at least in my opinion, more skeptical of language. Like I'm very conscious of arguments where someone says, "X could be described by phrase Y" and then later they rely on connations of Y that weren't proven. For example, you write, "From the AI’s perspective, it has a choice among multiple actions, hence in a sense “believing in metaphysical free will”. I would suggest it would be more accurate to write: "The AI models the situation as though it had free will" which leaves open the possibility that it is might be just a pragmatic model, rather than the AI necessarily endorsing itself as possessing free will. Another way of framing this: there's an additional step in between observing that an agent acts or models a situation as it believes in freewill and concluding that it actually believes in freewill. For example, I might round all numbers in a calculation to integers in order to make it easier for me, but that doesn't mean that I believe that the values are integers.

It seems like agents in a deterministic universe can falsify theories in at least some sense. Like they take two different weights drop them and see they land at the same time falsifying the fact that heavier objects fall faster

The main problem is that it isn't meaningful for their theories to make counterfactual predictions about a single situation; they can create multiple situations (across time and space) and assume symmetry and get falsification that way, but it requires extra assumptions. Basically you can't say different theories really disagree... (read more)

Agreed, this is yet another argument for considering counterfactuals to be so fundamental that they don't make sense outside of themselves. I just don't see this as incompatible with determinism, b/c I'm grounding using counterfactuals rather than agency. I don't mean utility function optimization, so let me clarify what as I see as the distinction. I guess I see my version as compatible with the determinist claim that you couldn't have run the experiment because the path of the universe was always determined from the start. I'm referring to a purely hypothetical running with no reference to whether you could or couldn't have actually run it. Hopefully, my comments here have made it clear where we diverge and this provides a target if you want to make a submission (that said, the contest is about the potential circular dependency of counterfactuals and not just my views. So it's perfectly valid for people to focus on other arguments for this hypothesis, rather than my specific arguments).

I previously wrote a post about reconciling free will with determinism. The metaphysics implicit in Pearlian causality is free will (In Drescher's words: "Pearl's formalism models free will rather than mechanical choice."). The challenge is reconciling this metaphysics with the belief that one is physically embodied. That is what the post attempts to do; these perspectives aren't inherently irreconcilable, we just have to be really careful about e.g. distinguishing "my action" vs "the action of the computer embodying me" in a the Bayes net and distingu... (read more)

Thoughts on Modeling Naturalized Logic Decision Theory Problems in Linear Logic I hadn't heard of linear logic before - it seems like a cool formalisation - although I tend to believe that formalisations are overrated as unless they are used very carefully they can obscure more than they reveal. I believe that spurious counterfactuals are only an issue with the 5 and 10 problem because of an attempt to hack logical-if to substitute for counterfactual-if in such a way that we can reuse proof-based systems. It's extremely cool that we can do as much as we can working in that fashion, but there's no reason why we should be surprised that it runs into limits. So I don't see inventing alternative formalisations that avoid the 5 and 10 problem as particularly hard as the bug is really quite specific to systems that try to utilise this kind of hack. I'd expect that almost any other system in design space will avoid this. So if, as I claim, attempts at formalisation will avoid this issue by default, the fact that any one formalisation avoids this problem shouldn't give us too much confidence in it being a good system for representing counterfactuals in general. Instead, I think it's much more persuasive to ground any proposed system with philosophical arguments (such as your first post was focusing on), rather than mostly just posting a system and observing it has a few nice properties. I mean, your approach in this article certainly a valuable thing to do, but I don't see it as getting all the way to the heart of the issue. Interestingly enough, this mirrors my position in Why 1-boxing doesn't imply backwards causation [] where I distinguish between Raw Reality (the territory) and Augmented Reality (the territory augmented by counterfactuals). I guess I put more emphasis on delving into the philosophical reasons for such a view and I think that's what this post is a bit sh
Comments on A critical agential account of free will, causation, and physics We can imagine a situation where there is a box containing an apple or a pear. Suppose we believe that it contains a pear, but we believe it contains an apple. If we look in the box (and we have good reason to believe looking doesn't change the contents), then we'll falsfy our pear hypothesis. Similarly, if we're told by an oracle that if we looked we would see a pear, then there'd be no need for us to actually look, we'd have heard enough to falsify our pear hypothesis. However, the situation you've identified isn't the same. Here you aren't just deciding whether to make an observation or not, but what the value of that observation would be. So in this case, the fact that if you took action B you'd observe the action you took was B doesn't say anything about the case where you don't take action B, unlike knowing that if you looked in the box you'd see you an apple provides you information even if you don't look in the box. It simply isn't relevant unless you actually take B. I think it's reasonable to suggest starting from falsification as our most basic assumption. I guess where you lose me is when you claim that this implies agency. I guess my position is as follows: * It seems like agents in a deterministic universe can falsify theories in at least some sense. Like they take two different weights drop them and see they land at the same time falsifying the fact that heavier objects fall faster * On the other hand, some like agency or counterfactuals seems necessary for talking about falsfiability in the abstract as this involves saying that we could falsify a theory if we ran an experiment that we didn't. In the second case, I would suggest that what we need is counterfactuals not agency. That is, we need to be able to say things like, "If I ran this experiment and obtained this result, then theory X would be falsified", not "I could have run this experiment and if I d
You've linked me to three different posts, so I'll address them in separate comments. Two Alternatives to Logical Counterfactuals I actually really liked this post - enough that I changed my original upvote to a strong upvote. I also disagree with the notion that logical counterfactuals make sense when taken literally so I really appreciated you making this point persuasively. I agreed with your criticisms of the material condition approach and I think policy-dependent source code could be potentially promising. I guess this naturally leads to the question of how to justify this approach. This results in questions like, "What exactly is a counterfactual?" and "Why exactly do we want such a notion?" and I believe that following this path leads to the discovery that counterfactuals are circular. I'm more open to saying that I adopt Counterfactual Non-Realism than I was when I originally commented although I don't see theories based on material conditionals as the only approach within this category. I guess I'm also more enthusiastic about thinking in terms of policies rather than action mainly because of the lesson I drew from the Counterfactual Prisoner's Dilemma [] . I don't really know why I didn't make this connection at the time, since I had written that post a few months prior, but I appear to have missed this. I still feel that introducing the term "free will" is too loaded to be helpful here, regardless of whether you are or aren't using it in a non-standard fashion. Like I'd encourage you to structure your posts to try to separate: a) This is how we handle counterfactuals b) This is the implications of this for the free will debate A large part of this is because I suspect many people on Less Wrong are simply allergic to this term.

How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).

So we are taking one of the outputs that current AIs seem to have learned best to design

... (read more)
1John Maxwell1y
Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that story text, then the story is being generated conditional on it. Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail [] in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)

This section seemed like an instance of you and Eliezer talking past each other in a way that wasn't locating a mathematical model containing the features you both believed were important (e.g. things could go "whoosh" while still being continuous):


Even if we just assume that your AI needs to go off in the corner and not interact with humans, there’s still a question of why the self-contained AI civilization is making ~0 progress and then all of a sudden very rapid progress


unfortunately a lot of what you are saying, fro... (read more)

My claim is that the timescale of AI self-improvement, at the point it takes over from humans, is the same as the previous timescale of human-driven AI improvement. If it was a lot faster, you would have seen a takeover earlier instead. 

This claim is true in your model. It also seems true to me about hominids, that is I think that cultural evolution took over roughly when its timescale was comparable to the timescale for biological improvements, though Eliezer disagrees

I thought Eliezer's comment "there is a sufficiently high level where things go who... (read more)

A bunch of this was frustrating to read because it seemed like Paul was yelling "we should model continuous changes!" and Eliezer was yelling "we should model discrete events!" and these were treated as counter-arguments to each other.

It seems obvious from having read about dynamical systems that continuous models still have discrete phase changes. E.g. consider boiling water. As you put in energy the temperature increases until it gets to the boiling point, at which point more energy put in doesn't increase the temperature further (for a while), it conv... (read more)

(I'm interested in which of my claims seem to dismiss or not adequately account for the possibility that continuous systems have phase changes.)

I don’t really feel like anything you are saying undermines my position here, or defends the part of Eliezer’s picture I’m objecting to.

(ETA: but I agree with you that it's the right kind of model to be talking about and is good to bring up explicitly in discussion. I think my failure to do so is mostly a failure of communication.)

I usually think about models that show the same kind of phase transition you discuss, though usually significantly more sophisticated models and moving from exponential to hyperbolic growth (you only get an exponential in your mo... (read more)

5Matthew Barnett1y
+1 on using dynamical systems models to try to formalize the frameworks in this debate. I also give Eliezer points for trying to do something similar in Intelligence Explosion Microeconomics [] (and to people who have looked at this from the macro perspective [] ).

This is quite good concrete AI forecasting compared to what I've seen elsewhere, thanks for doing it! It seems really plasusible based on how fast AI progress has been going over the past decade and which problems are most tractable.

CDT and EDT have known problems on 5 and 10. TDT/UDT are insufficiently formalized, and seem like they might rely on known-to-be-unfomalizable logical counterfactuals.

So 5 and 10 isn't trivial even without spurious counterfactuals.

What does this add over modal UDT?

  • No requirement to do infinite proof search
  • More elegant handling of multi-step decision problems
  • Also works on problems where the agent doesn't know its source code (of course, this prevents logical dependencies due to source code from being taken into account)

Philosophically, it works as a

... (read more)

Reals are still defined as sets of (a, b) rational intervals. The locale contains countable unions of these, but all these are determined by which (a, b) intervals contain the real number.

Good point; I've changed the wording to make it clear that the rational-delimited open intervals are the basis, not all the locale elements. Luckily, points can be defined as sets of basis elements containing them, since all other properties follow. (Making the locale itself countable requires weakening the definition by making the sets to form unions over countable, e.g. by requiring them to be recursively enumerable)

2Adele Lopez3y
Another way to make it countable would be to instead go to the category of posets, Then the rational interval basis is a poset with a countable number of elements, and by the Alexandroff construction [] corresponds to the real line (or at least something very similar). But, this construction gives a full and faithful embedding of the category of posets to the category of spaces (which basically means you get all and only continuous maps from monotonic function). I guess the ontology version in this case would be the category of prosets. (Personally, I'm not sure that ontology of the universe isn't a type error).
1Vladimir Slepnev3y
I see. In that case does the procedure for defining points stay the same, or do you need to use recursively enumerable sets of opens, giving you only countably many reals?

I've also been thinking about the application of agency abstractions to decision theory, from a somewhat different angle.

It seems like what you're doing is considering relations between high-level third-person abstractions and low-level third-person abstractions. In contrast, I'm primarily considering relations between high-level first-person abstractions and low-level first-person abstractions.

The VNM abstraction itself assumes that "you" are deciding between different options, each of which has different (stochastic) consequences; thus, it is inherently

... (read more)
This comment made a bunch of your other writing click for me. I think I see what you're aiming for now; it's a beautiful vision. In retrospect, this is largely what I've been trying to get rid of, in particular by looking for a third-person interpretation of probability []. Obviously frequentism satisfies that criterion, but the strict form is too narrow for most applications and the less-strict form (i.e. "imagine we repeated this one-shot experiment many times...") isn't actually third-person. I've also started thinking about a third-person grounding of utility maximization and the like via selection processes; that's likely to be a whole months-long project in itself in the not-too-distant future.

Looking back on this, it does seem quite similar to EDT. I'm actually, at this point, not clear on how EDT and TDT differ, except in that EDT has potential problems in cases where it's sure about its own action. I'll change the text so it notes the similarity to EDT.

On XOR blackmail, SIDT will indeed pay up.

Yes, it's about no backwards assumption. Linear has lots of meanings, I'm not concerned about this getting confused with linear algebra, but you can suggest a better term if you have one.

Basically, the assumption that you're participating in a POMDP. The idea is that there's some hidden state that your actions interact with in a temporally linear fashion (i.e. action 1 affects state 2), such that your late actions can't affect early states/observations.

1David Scott Krueger3y
OK, so no "backwards causation" ? (not sure if that's a technical term and/or if I'm using it right...) Is there a word we could use instead of "linear", which to an ML person sounds like "as in linear algebra"?

The way you are using it doesn’t necessarily imply real control, it may be imaginary control.

I'm discussing a hypothetical agent who believes itself to have control. So its beliefs include "I have free will". Its belief isn't "I believe that I have free will".

It’s a “para-consistent material conditional” by which I mean the algorithm is limited in such a way as to prevent this explosion.

Yes, that makes sense.

However, were you flowing this all the way back in time?

Yes (see thread with Abram Demski).

What do you mean by dualistic?

Already fact

... (read more)
Hmm, yeah this could be a viable theory. Anyway to summarise the argument I make in Is Backwards Causation Necessarily Absurd? [] , I point out that since physics is pretty much reversible, instead of A causing B, it seems as though we could also imagine B causing A and time going backwards. In this view, it would be reasonable to say that one-boxing (backwards-)caused the box to be full in Newcombs. I only sketched the theory because I don't have enough physics knowledge to evaluate it. But the point is that we can give justification for a non-standard model of causality.

Secondly, “free will” is such a loaded word that using it in a non-standard fashion simply obscures and confuses the discussion.

Wikipedia says "Free will is the ability to choose between different possible courses of action unimpeded." SEP says "The term “free will” has emerged over the past two millennia as the canonical designator for a significant kind of control over one’s actions." So my usage seems pretty standard.

For example, recently I’ve been arguing in favour of what counts as a valid counterfactual being at least partially a matter of soc

... (read more)
Not quite. The way you are using it doesn't necessarily imply real control, it may be imaginary control. True. Maybe I should clarify what I'm suggesting. My current theory is that there are multiple reasonable definitions of counterfactual and it comes down to social norms as to what we accept as a valid counterfactual. However, it is still very much a work in progress, so I wouldn't be able to provide more than vague details. I guess my point was that this notion of counterfactual isn't strictly a material conditional due to the principle of explosion []. It's a "para-consistent material conditional" by which I mean the algorithm is limited in such a way as to prevent this explosion. Hmm... good point. However, were you flowing this all the way back in time? Such as if you change someone's source code, you'd also have to change the person who programmed them. What do you mean by dualistic?

I think it's worth examining more closely what it means to be "not a pure optimizer". Formally, a VNM utility function is a rationalization of a coherent policy. Say that you have some idea about what your utility function is, U. Suppose you then decide to follow a policy that does not maximize U. Logically, it follows that U is not really your utility function; either your policy doesn't coherently maximize any utility function, or it maximizes some other utility function. (Because the utility function is, by definition, a rationalization of the poli

... (read more)
3Abram Demski3y
OK, all of that made sense to me. I find the direction more plausible than when I first read your post, although it still seems like it'll fall to the problem I sketched. I both like and hate that it treats logical uncertainty in a radically different way from empirical uncertainty -- like, because we have so far failed to find any way to treat the two uniformly (besides being entirely updateful that is); and hate, because it still feels so wrong for the two to be very different.

It seems the approaches we're using are similar, in that they both are starting from observation/action history with posited falsifiable laws, with the agent's source code not known a priori, and the agent considering different policies.

Learning "my source code is A" is quite similar to learning "Omega predicts my action is equal to A()", so these would lead to similar results.

Policy-dependent source code, then, corresponds to Omega making different predictions depending on the agent's intended policy, such that when comparing policies, the agent has to imagine Omega predicting differently (as it would imagine learning different source code under policy-dependent source code).

1Vanessa Kosoy3y
Well, in quasi-Bayesianism for each policy you have to consider the worst-case environment in your belief set, which depends on the policy. I guess that in this sense it is analogous.

I agree this is a problem, but isn't this a problem for logical counterfactual approaches as well? Isn't it also weird for a known fixed optimizer source code to produce a different result on this decision where it's obvious that 'left' is the best decision?

If you assume that the agent chose 'right', it's more reasonable to think it's because it's not a pure optimizer than that a pure optimizer would have chosen 'right', in my view.

If you form the intent to, as a policy, go 'right' on the 100th turn, you should anticipate learning that your source code is not the code of a pure optimizer.

3Abram Demski3y
I'm left with the feeling that you don't see the problem I'm pointing at. My concern is that the most plausible world where you aren't a pure optimizer might look very very different, and whether this very very different world looks better or worse than the normal-looking world does not seem very relevant to the current decision. Consider the "special exception selves" you mention -- the Nth exception-self has a hard-coded exception "go right if it's beet at least N turns and you've gone right at most 1/N of the time". Now let's suppose that the worlds which give rise to exception-selves are a bit wild. That is to say, the rewards in those worlds have pretty high variance. So a significant fraction of them have quite high reward -- let's just say 10% of them have value much higher than is achievable in the real world. So we expect that by around N=10, there will be an exception-self living in a world that looks really good. This suggests to me that the policy-dependent-source agent cannot learn to go left > 90% of the time, because once it crosses that threshhold, the exception-self in the really good looking world is ready to trigger its exception -- so going right starts to appear really good. The agent goes right until it is under the threshhold again. If that's true, then it seems to me rather bad: the agent ends up repeatedly going right in a situation where it should be able to learn to go left easily. Its reason for repeatedly going right? There is one enticing world, which looks much like the real world, except that in that world the agent definitely goes right. Because that agent is a lucky agent who gets a lot of utility, the actual agent has decided to copy its behavior exactly -- anything else would prove the real agent unlucky, which would be sad. Of course, this outcome is far from obvious; I'm playing fast and loose with how this sort of agent might reason.

This indeed makes sense when "obs" is itself a logical fact. If obs is a sensory input, though, 'A(obs) = act' is a logical fact, not a logical counterfactual. (I'm not trying to avoid causal interpretations of source code interpreters here, just logical counterfactuals)

2Abram Demski3y
Ahhh ok.

In the happy dance problem, when the agent is considering doing a happy dance, the agent should have already updated on M. This is more like timeless decision theory than updateless decision theory.

Conditioning on 'A(obs) = act' is still a conditional, not a counterfactual. The difference between conditionals and counterfactuals is the difference between "If Oswald didn't kill Kennedy, then someone else did" and "If Oswald didn't kill Kennedy, then someone else would have".

Indeed, troll bridge will present a problem for "playing chicken" approaches, whic

... (read more)
4Abram Demski3y
I'm not sure how you are thinking about this. It seems to me like this will imply really radical changes to the universe. Suppose the agent is choosing between a left path and a right path. Its actual programming will go left. It has to come up with alternate programming which would make it go right, in order to consider that scenario. The most probable universe in which its programming would make it go right is potentially really different from our own. In particular, it is a universe where it would go right despite everything it has observed, a lifetime of (updateless) learning, which in the real universe, has taught it that it should go left in situations like this. EG, perhaps it has faced an iterated 5&10 problem, where left always yields 10. It has to consider alternate selves who, faced with that history, go right. It just seems implausible that thinking about universes like that will result in systematically good decisions. In the iterated 5&10 example, perhaps universes where its programming fails iterated 5&10 are universes where iterated 5&10 is an exceedingly unlikely situation; so in fact, the reward for going right is quite unlikely to be 5, and very likely to be 100. Then the AI would choose to go right. Obviously, this is not necessarily how you are thinking about it at all -- as you said, you haven't given an actual decision procedure. But the idea of considering only really consistent counterfactual worlds seems quite problematic.
2Abram Demski3y
I still disagree. We need a counterfactual structure in order to consider the agent as a function A(obs). EG, if the agent is a computer program, the function A() would contain all the counterfactual information about what the agent would do if it observed different things. Hence, considering the agent's computer program as such a function leverages an ontological commitment to those counterfactuals. To illustrate this, consider counterfactual mugging [] where we already see that the coin is heads -- so, there is nothing we can do, we are at the mercy of our counterfactual partner. But suppose we haven't yet observed whether Omega gives us the money. A "real counterfactual" is one which can be true or false independently of whether its condition is met. In this case, if we believe in real counterfactuals, we believe that there is a fact of the matter about what we do in the coin=tails case, even though the coin came up heads. If we don't believe in real counterfactuals, we instead think only that there is a fact of how Omega is computing "what I would have done if the coin had been tails" -- but we do not believe there is any "correct" way for Omega to compute that. The obs→act representation and the P(act|obs) representation both appear to satisfy this test of non-realism. The first is always true if the observation is false, so, lacks the ability to vary independently of the observation. The second is undefined when the observation is false, which is perhaps even more appealing for the non-realist. Now consider the A(obs)=act representation. A(tails)=pay can still vary even when we know coin=heads. So, it fails this test -- it is a realist representation! Putting something into functional form imputes a causal/counterfactual structure.
2Abram Demski3y
I agree that this gets around the problem, but to me the happy dance problem is still suggestive -- it looks like the material conditional is the wrong representation of the thing we want to condition on. Also -- if the agent has already updated on observations, then updating on obs→a ct is just the same as updating on act. So this difference only matters in the updateless case, where it seems to cause us trouble.

Yes, this is a specific way of doing policy-dependent source code, which minimizes how much the source code has to change to handle the counterfactual.

Haven't looked deeply into the paper yet but the basic idea seems sound.

The most quintessentially human intellectual accomplishments (e.g. proving theorems, composing symphonies, going into space) were only made possible by culture post-agricultural revolution.

I'm guessing you mean the beginning of agriculture and not the Agricultural Revolution (18th century), which came much later than math and after Baroque music. But the wording is ambiguous.

5Issa Rice3y
It seems like "agricultural revolution []" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").

It's a subjectivist approach similar to Bayesianism, starting from the perspective of a given subject. Unlike in idealism, there is no assertion that everything is mental.

In hyper-Solomonoff induction, indeed the direct hypercomputation hypothesis is probably more likely than the arbitration-oracle-emulating-hypercomputation hypothesis. But only by a constant factor. So this isn't really falsification so much as a shift in Bayesian evidence.

I do think it's theoretically cleaner to distinguish this Bayesian reweighting from Popperian logical falsification, and from Neyman-Pearson null hypothesis significance testing (frequentist falsification), both of which in principle require producing an unbounded number of bits of evidence, although in practice rely on unfalsifiable assumptions to avoid radical skepticism e.g. of memory.

This is really important and I missed this, thanks. I've added a note at the top of the post.

Indeed, a constructive halting oracle can be thought of as a black-box that takes a PA statement, chooses whether to play Verifier or Falsifier, and then plays that, letting the user play the other part. Thanks for making this connection.

The recommendation here is for AI designers (and future-designers in general) to decide what is right at some meta level, including details of which extrapolation procedures would be best.

Of course there are constraints on this given by objective reason (hence the utility of investigation), but these constraints do not fully constrain the set of possibilities. Better to say "I am making this arbitrary choice for this psychological reason" than to refuse to make arbitrary choices.

The problem you're running into is that the goals of:

  1. being totally constrained by a system of rules determined by some process outside yourself that doesn't share your values (e.g. value-independent objective reason)
  2. attaining those things that you intrinsically value

are incompatible. It's easy to see once these are written out. If you want to get what you want, on purpose rather than accidentally, you must make choices. Those choices must be determined in part by things in you, not only by things outside you (such as value-independent objective rea

... (read more)
3Charlie Steiner3y
You know, this isn't why I usually get called a tool :P I think I'm saying something pretty different from Nietzsche here. The problem with "Just decide for yourself" as an approach to dealing with moral decisions in novel contexts (like what to do with the whole galaxy) is that, though it may help you choose actions rather than worrying about what's right, it's not much help in building an AI. We certainly can't tell the AI "Just decide for yourself," that's trying to order around the nonexistent ghost in the machine. And while I could say "Do exactly what Charlie would do," even I wouldn't want the AI to do that, let alone other people. Nor can we fall back on "Well, designing an AI is an action, therefore I should just pick whatever AI design I feel like, because God is dead and I should just pick actions how I will," because how I feel like designing an AI has some very exacting requirements - it contains the whole problem in itself.

I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals.

"Physics is math" is ontologically reductive.

Physics can often be specified as a dynamical system (along with interpretations of e.g. what high-level entities it represents, how it gets observed). Dynamical systems can be specified mathematically. Dynamical systems also have causal counterfactuals (what if you suddenly changed the system state to be this instead?).

Causal counterfactuals d

... (read more)
3Abram Demski3y
Yeah, agreed, I no longer endorse the argument I was making there - one has to say more than "physics is math" to establish the importance of dealing with logical counterfactuals.

What does it mean for a map to be “accurate” at an abstract level, and what properties should my map-making process have in order to produce accurate abstracted maps/beliefs?

The notion of a homomorphism in universal algebra and category theory is relevant here. Homomorphisms map from one structure (e.g. a group) to another, and must preserve structure. They can delete information (by mapping multiple different elements to the same element), but the structures that are represented in the structure-being-mapped-to must also exist in the structure-being-

... (read more)

On the subject of intentionality/reference/objectivity/etc, On the Origin of Objects is excellent. My thinking about reference has a kind of discontinuity from before reading this book to after reading it. Seriously, the majority of analytic philosophy discussion of indexicality, qualia, reductionism, etc seems hopelessly confused in comparison.

Reading this now, thanks.

More over, I am skeptical that going on meta-level simplifies the problem to the level that it will be solvable by humans (the same about meta-ethics and theory of human values).

This is also my reason for being pessimistic about solving metaphilosophy before a good number of object-level philosophical problems have been solved (e.g. in decision theory, ontology/metaphysics, and epistemology). If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great de

... (read more)
2Wei Dai4y
I think our positions on this are pretty close, but I may put a bit more weight on other "plausible stories" for solving metaphilosophy relative to your "plausible story". (I'm not sure if overall I'm more or less optimistic than you are.) It seems quite possible that understanding the general class of problems that includes Y is easier than understanding Y itself, and that allows us to find a computation X that would solve Y without much understanding of Y itself. As an analogy, suppose Y is some complex decision problem that we have little understanding of, and X is an AI that is programmed with a good decision theory. This does not seem like a very strong argument for your position. My suggestion in the OP is that humans already know the equivalent of "walking" (i.e., doing philosophy), we're just doing it very slowly. Given this, your analogies don't seem very conclusive about the difficulty of solving metaphilosophy or whether we have to make a bunch more progress on object-level philosophical problems before we can solve metaphilosophy.

I think the fixed point finder won't optimize the fixed point for minimizing expected log loss. I'm going to give a concrete algorithm and show that it doesn't exhibit this behavior. If you disagree, can you present an alternative algorithm?

Here's the algorithm. Start with some oracle (not a reflective oracle). Sample ~1000000 universes based on this oracle, getting 1000000 data points for what the reflective oracle outputs. Move the oracle 1% of the way from its current position towards the oracle that would answer queries correctly given the distrib

... (read more)
Reflective Oracles are a bit of a weird case case because their 'loss' is more like a 0/1 loss than a log loss, so all of the minima are exactly the same(If we take a sample of 100000 universes to score them, the difference is merely incredibly small instead of 0). I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points for similar reasons, but am not sure. Regardless, these models will, I believe, favour fixed points whose distributions are easy to compute(But not fixed points with low entropy, that is they will punish logical uncertainty but not intrinsic uncertainy). I'm planning to run some experiments with VAEs and post the results later.

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed-points. It’s quite likely that these fixed-points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability(predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed-point on which it scores the best.

Reflective oracles won't automatically do this. They won't minimize log loss or any oth

... (read more)
The gradient descent is not being done over the reflective oracles, it's being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimized for log loss just like everything else in the model. That is, the global optimization of the model is distinct from whatever internal optimization the fixed-point-finder uses to choose the reflective oracle. The global optimization will favor internal optimizers that produce fixed-points with good score. So while fixed-point-finders in general won't optimize for anything in particular, the one this model uses will.

Ok, this seems usefully specific. A few concerns:

  1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined) along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take differ

... (read more)
2Stuart Armstrong4y
Thanks! I'm not sure I fully get all your concerns, but I'll try and answer to the best of my understanding. 1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model "If someone is drowning, I should help them (if it's an easy thing to do)". Then "someone", "downing", "I", and "help them" are vague labels for complex categories (as re most of there rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having a strong preferences in the central example of the categories, which become much weaker as we move to the edges (the meta-preferences will start to become very relevant in the edge cases). I expect that "I should help them" further decomposes into "they should be helped" and "I should get the credit for helping them". Therefore, it seems to me, that an AI should be able to establish that if someone is drowning, it should try and enable me to save them, and if it can't do that, then it should save them itself (using nanotechnology or anything else). It doesn't seem that it would be seeing the issue from my narrow perspective, because I don't see the issue just from my narrow perspective. 5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; that is mainly the "science fiction superpowers" end of the spectrum for the moment. What I'm hoping, with this technique, is that if we end up using indirect normativity or stated preferences, that my keeping in mind this model of what proto-preferences are, we can better automate the limitations of these techniques (eg when we expect lying), rather than putting them in by hand. 6: Currently I don't see reflexes as embodying values

I'm pretty confused by what you mean by proto-preferences. I thought by proto-preferences you meant something like "preferences in the moment, not subject to reflection etc." But you also said there's a definition. What's the definition? (The concept is pre-formal, I don't think you'll be able to provide a satisfactory definition).

You have written a paper about how preferences are not identifiable. Why, then, do you say that proto-preferences are identifiable, if they are just preferences in the moment? The impossibility results apply word-for-word t

... (read more)
5Stuart Armstrong4y
Oh, I don't claim to have a full definition yet, but I believe it's better than pre-formal. Here would be my current definition: * Humans are partially model-based agents. We often generate models (or at least partial models) of situations (real or hypothetical), and, within those models, label certain actions/outcomes/possibilities as better or worse than others (or sometimes just generically "good" or "bad"). This model, along with the label, is what I'd call a proto-preference (or pre-preference). That's why neuroscience is relevant, for identifying the mental model human use. The "previous Alice post" I mentioned is here [] . and was a toy version of this, in the case of an algorithm rather than a human. The reason these get around the No Free Lunch theorem is that they look inside the algorithm (so different algorithms with the same policy can be seen to have different preferences, which breaks NFL), and is making the "normative assumption" that these modelled proto-preferences correspond, (modulo preference synthesis) to the agent's actual preferences. Note that that definition puts preferences and meta-preferences into the same type, the only difference being the sort of model being considered.

The overall approach of finding proto-preferences and meta-preferences, resolving them somehow, then extrapolating from there, seems like a reasonable thing to do.

But, suppose you're going to do this. Then you're going to run into a problem: proto-preferences aren't identifiable.

I interpreted you as trying to fix this problem by looking at how humans infer each other's preferences rather than their (proto-)preferences themselves. You could try learning people's proto-preference-learning-algorithms instead of their proto-preferences.

But, this is not an ea

... (read more)
1Stuart Armstrong4y
The proto-preferences are a definition of the components that make up preferences. Methods of figuring them out - be they stated preferences, revealed preferences, FMRI machines, how other people infer each other's preferences... - are just methods. The advantage of having a definition is that this guides us explicitly as to when a specific method for figuring them out, ceases to be applicable. And I'd argue that proto-preferences are identifiable. We're talking about figuring out how humans model their own situations, and the better-worse judgements they assign in their internal models. This is not unidentifiable, and neuroscience already has some things to say on it. The previous Alice post showed how you could do it a toy model (with my posts on semantics [] and symbol grounding [] , relevant to applying this approach to humans). That second sentence of mine is somewhat poorly phrased, but I agree that "extracting the normative assumptions humans make is no easier than extracting proto-preferences" - I just don't see that second one as being insoluble.

I don't know, but a pseudo-definition that works sometimes is "upon having a lot of time to reflect, information, etc, I would conclude that you have Y values"; of course I can't use this definition when I am doing the reflection, though! "Values" is at the moment a pre-formal concept (utility theory doesn't directly apply to humans), so it has some representation in people's brains that is hard to extract/formalize.

In any case, I reject any AI design that concludes that it ought to act as if you have X values just because my current models imply that you

... (read more)
1Rohin Shah4y
You could imagine examining a human brain and seeing how it models other humans. This would let you get some normative assumptions out that could inform a value learning technique. I would think of this as extracting an algorithm that could infer human preferences out of a human brain. You could run this algorithm for a long time, in which case it would eventually output Y values, even if you would currently judge the person as having X values.
3Stuart Armstrong4y
We're getting close to something important here, so I'll try and sort things out carefully. In my current approach, I'm doing two things: 1. Finding some components of preferences or proto-preferences within the human brain. 2. Synthesising them together in a way that also respects (proto-)meta-preferences. The first step is needed because of the No Free Lunch in preference learning result. We need to have some definition of preferences that isn't behavioural. And the stated-values-after-reflection approach has some specific problems that I listed here [] . Then it took an initial stab at how one could sythesise the preferences in this post [] . If I'm reading you correctly, your main fear is that by focusing on the proto-preferences of the moment, we might end up in a terrible place, foreclosing moral improvements. I share that fear! That's why the process of synthesising values in accordance both with meta-preferences and "far" preferences ("I want everyone to live happy worthwhile lives" is a perfectly valid proto-preference). Where we might differ the most, is that I'm very reluctant to throw away any proto-preferences, even if our meta-preferences would typically overrule it. I would prefer to keep it around, with a very low weight. Once we get in the habit of ditching proto-preferences, there's no telling where that process might end up [] .

Because once we have these parameters, we can learn the values of any given human.

This doesn't make the problem easier, you have to start somewhere. I agree this could reduce the total computational work required but it doesn't seem any easier conceptually.

Whereas “learn what humans model each other’s values (and rationality) to be” is something that makes sense in the world.

This has the same problem as value learning. If I think you have X values but you actually have Y values (and I would think you have Y values upon further reflection etc) then

... (read more)
1Stuart Armstrong4y
What do you mean by "you actually have Y values"? What are you defining values to be?

Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process

Okay, but how do you do that if you don't already have a value learning algorithm? Why is it easier to learn the algorithms/parameters humans use in inferring each other's values, than to just learn their values?

2Stuart Armstrong4y
Because once we have these parameters, we can learn the values of any given human. In contrast, it we learn the values of a given human, we don't get to learn the values of any other one. I'd argue further: these parameters form part of a definition of human values. We can't just "learn human values", as these don't exist in the world. Whereas "learn what humans model each other's values (and rationality) to be" is something that makes sense in the world.

It can maximize the utility function: if I take the twitch action in time step otherwise. In a standard POMDP setting this always takes the twitch action.

Load More