All of jessicata's Comments + Replies

What 2026 looks like (Daniel's Median Future)

This is quite good concrete AI forecasting compared to what I've seen elsewhere, thanks for doing it! It seems really plausible based on how fast AI progress has been going over the past decade and which problems are most tractable.

Modeling naturalized decision problems in linear logic

CDT and EDT have known problems on 5 and 10. TDT/UDT are insufficiently formalized, and seem like they might rely on known-to-be-unformalizable logical counterfactuals.

So 5 and 10 isn't trivial even without spurious counterfactuals.

What does this add over modal UDT?

  • No requirement to do infinite proof search
  • More elegant handling of multi-step decision problems
  • Also works on problems where the agent doesn't know its source code (of course, this prevents logical dependencies due to source code from being taken into account)

Philosophically, it works as a

... (read more)
Topological metaphysics: relating point-set topology and locale theory

Reals are still defined as sets of (a, b) rational intervals. The locale contains countable unions of these, but all these are determined by which (a, b) intervals contain the real number.
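
As a sketch of the correspondence just described (notation added here for illustration, not from the original post): a real x is pinned down by the set of rational intervals containing it, and membership in any countable union of basic opens is determined by membership in the basics.

```latex
I(x) = \{ (a,b) \in \mathbb{Q} \times \mathbb{Q} : a < x < b \},
\qquad
x \in \bigcup_i (a_i, b_i) \iff \exists i.\; (a_i, b_i) \in I(x).
```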

Topological metaphysics: relating point-set topology and locale theory

Good point; I've changed the wording to make it clear that the rational-delimited open intervals are the basis, not all the locale elements. Luckily, points can be defined as sets of basis elements containing them, since all other properties follow. (Making the locale itself countable requires weakening the definition by making the sets over which unions are formed countable, e.g. by requiring them to be recursively enumerable.)

2Adele Lopez1yAnother way to make it countable would be to instead go to the category of posets. Then the rational interval basis is a poset with a countable number of elements, and by the Alexandroff construction [https://ncatlab.org/nlab/show/specialization+topology] corresponds to the real line (or at least something very similar). But, this construction gives a full and faithful embedding of the category of posets to the category of spaces (which basically means you get all and only continuous maps from monotonic functions). I guess the ontology version in this case would be the category of prosets. (Personally, I'm not sure that ontology of the universe isn't a type error).
1Vladimir Slepnev1yI see. In that case does the procedure for defining points stay the same, or do you need to use recursively enumerable sets of opens, giving you only countably many reals?
Motivating Abstraction-First Decision Theory

I've also been thinking about the application of agency abstractions to decision theory, from a somewhat different angle.

It seems like what you're doing is considering relations between high-level third-person abstractions and low-level third-person abstractions. In contrast, I'm primarily considering relations between high-level first-person abstractions and low-level first-person abstractions.

The VNM abstraction itself assumes that "you" are deciding between different options, each of which has different (stochastic) consequences; thus, it is inherently

... (read more)
3johnswentworth1yThis comment made a bunch of your other writing click for me. I think I see what you're aiming for now; it's a beautiful vision. In retrospect, this is largely what I've been trying to get rid of, in particular by looking for a third-person interpretation of probability [https://www.lesswrong.com/posts/Lz2nCYnBeaZyS68Xb/probability-as-minimal-map]. Obviously frequentism satisfies that criterion, but the strict form is too narrow for most applications and the less-strict form (i.e. "imagine we repeated this one-shot experiment many times...") isn't actually third-person. I've also started thinking about a third-person grounding of utility maximization and the like via selection processes; that's likely to be a whole months-long project in itself in the not-too-distant future.
Subjective implication decision theory in critical agentialism

Looking back on this, it does seem quite similar to EDT. I'm actually, at this point, not clear on how EDT and TDT differ, except in that EDT has potential problems in cases where it's sure about its own action. I'll change the text so it notes the similarity to EDT.

On XOR blackmail, SIDT will indeed pay up.

Two Alternatives to Logical Counterfactuals

Yes, it's about no backwards assumption. Linear has lots of meanings, I'm not concerned about this getting confused with linear algebra, but you can suggest a better term if you have one.

Two Alternatives to Logical Counterfactuals

Basically, the assumption that you're participating in a POMDP. The idea is that there's some hidden state that your actions interact with in a temporally linear fashion (i.e. action 1 affects state 2), such that your late actions can't affect early states/observations.
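
A minimal sketch of the "temporally linear" structure being assumed (hypothetical toy code, with names and dynamics invented for illustration): the hidden state at step t+1 depends only on the state and action at step t, so by construction a later action cannot affect an earlier state or observation.

```python
import random

def rollout(policy, transition, observe, s0, horizon):
    """Generic POMDP-style rollout: action a_t can influence s_{t+1}, s_{t+2}, ...
    but never the already-realized s_1, ..., s_t (the "no backwards" assumption)."""
    s, history = s0, []
    for _ in range(horizon):
        o = observe(s)            # observation of the current hidden state
        a = policy(history, o)    # action may depend on all past observations
        history.append((o, a))
        s = transition(s, a)      # only future states depend on this action
    return history

# Toy instantiation, purely illustrative.
rollout(
    policy=lambda history, o: random.choice(["left", "right"]),
    transition=lambda s, a: s + (1 if a == "right" else -1),
    observe=lambda s: s > 0,
    s0=0,
    horizon=5,
)
```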

1David Krueger2yOK, so no "backwards causation" ? (not sure if that's a technical term and/or if I'm using it right...) Is there a word we could use instead of "linear", which to an ML person sounds like "as in linear algebra"?
Two Alternatives to Logical Counterfactuals

The way you are using it doesn’t necessarily imply real control, it may be imaginary control.

I'm discussing a hypothetical agent who believes itself to have control. So its beliefs include "I have free will". Its belief isn't "I believe that I have free will".

It’s a “para-consistent material conditional” by which I mean the algorithm is limited in such a way as to prevent this explosion.

Yes, that makes sense.

However, were you flowing this all the way back in time?

Yes (see thread with Abram Demski).

What do you mean by dualistic?

Already fact

... (read more)
1Chris_Leong2yHmm, yeah this could be a viable theory. Anyway, to summarise the argument I make in Is Backwards Causation Necessarily Absurd? [https://www.lesswrong.com/posts/pa7mvEmEgt336gBSf/is-backwards-causation-necessarily-absurd], I point out that since physics is pretty much reversible, instead of A causing B, it seems as though we could also imagine B causing A and time going backwards. In this view, it would be reasonable to say that one-boxing (backwards-)caused the box to be full in Newcomb's problem. I only sketched the theory because I don't have enough physics knowledge to evaluate it. But the point is that we can give justification for a non-standard model of causality.
Two Alternatives to Logical Counterfactuals

Secondly, “free will” is such a loaded word that using it in a non-standard fashion simply obscures and confuses the discussion.

Wikipedia says "Free will is the ability to choose between different possible courses of action unimpeded." SEP says "The term “free will” has emerged over the past two millennia as the canonical designator for a significant kind of control over one’s actions." So my usage seems pretty standard.

For example, recently I’ve been arguing in favour of what counts as a valid counterfactual being at least partially a matter of soc

... (read more)
1Chris_Leong2yNot quite. The way you are using it doesn't necessarily imply real control, it may be imaginary control. True. Maybe I should clarify what I'm suggesting. My current theory is that there are multiple reasonable definitions of counterfactual and it comes down to social norms as to what we accept as a valid counterfactual. However, it is still very much a work in progress, so I wouldn't be able to provide more than vague details. I guess my point was that this notion of counterfactual isn't strictly a material conditional due to the principle of explosion [https://www.wikiwand.com/en/Principle_of_explosion]. It's a "para-consistent material conditional" by which I mean the algorithm is limited in such a way as to prevent this explosion. Hmm... good point. However, were you flowing this all the way back in time? Such as if you change someone's source code, you'd also have to change the person who programmed them. What do you mean by dualistic?
Two Alternatives to Logical Counterfactuals

I think it's worth examining more closely what it means to be "not a pure optimizer". Formally, a VNM utility function is a rationalization of a coherent policy. Say that you have some idea about what your utility function is, U. Suppose you then decide to follow a policy that does not maximize U. Logically, it follows that U is not really your utility function; either your policy doesn't coherently maximize any utility function, or it maximizes some other utility function. (Because the utility function is, by definition, a rationalization of the poli

... (read more)
3Abram Demski2yOK, all of that made sense to me. I find the direction more plausible than when I first read your post, although it still seems like it'll fall to the problem I sketched. I both like and hate that it treats logical uncertainty in a radically different way from empirical uncertainty -- like, because we have so far failed to find any way to treat the two uniformly (besides being entirely updateful that is); and hate, because it still feels so wrong for the two to be very different.
Two Alternatives to Logical Counterfactuals

It seems the approaches we're using are similar, in that they both are starting from observation/action history with posited falsifiable laws, with the agent's source code not known a priori, and the agent considering different policies.

Learning "my source code is A" is quite similar to learning "Omega predicts my action is equal to A()", so these would lead to similar results.

Policy-dependent source code, then, corresponds to Omega making different predictions depending on the agent's intended policy, such that when comparing policies, the agent has to imagine Omega predicting differently (as it would imagine learning different source code under policy-dependent source code).

1Vanessa Kosoy2yWell, in quasi-Bayesianism for each policy you have to consider the worst-case environment in your belief set, which depends on the policy. I guess that in this sense it is analogous.
Two Alternatives to Logical Counterfactuals

I agree this is a problem, but isn't this a problem for logical counterfactual approaches as well? Isn't it also weird for a known fixed optimizer source code to produce a different result on this decision where it's obvious that 'left' is the best decision?

If you assume that the agent chose 'right', it's more reasonable to think it's because it's not a pure optimizer than that a pure optimizer would have chosen 'right', in my view.

If you form the intent to, as a policy, go 'right' on the 100th turn, you should anticipate learning that your source code is not the code of a pure optimizer.

3Abram Demski2yI'm left with the feeling that you don't see the problem I'm pointing at. My concern is that the most plausible world where you aren't a pure optimizer might look very very different, and whether this very very different world looks better or worse than the normal-looking world does not seem very relevant to the current decision. Consider the "special exception selves" you mention -- the Nth exception-self has a hard-coded exception "go right if it's been at least N turns and you've gone right at most 1/N of the time". Now let's suppose that the worlds which give rise to exception-selves are a bit wild. That is to say, the rewards in those worlds have pretty high variance. So a significant fraction of them have quite high reward -- let's just say 10% of them have value much higher than is achievable in the real world. So we expect that by around N=10, there will be an exception-self living in a world that looks really good. This suggests to me that the policy-dependent-source agent cannot learn to go left > 90% of the time, because once it crosses that threshold, the exception-self in the really good looking world is ready to trigger its exception -- so going right starts to appear really good. The agent goes right until it is under the threshold again. If that's true, then it seems to me rather bad: the agent ends up repeatedly going right in a situation where it should be able to learn to go left easily. Its reason for repeatedly going right? There is one enticing world, which looks much like the real world, except that in that world the agent definitely goes right. Because that agent is a lucky agent who gets a lot of utility, the actual agent has decided to copy its behavior exactly -- anything else would prove the real agent unlucky, which would be sad. Of course, this outcome is far from obvious; I'm playing fast and loose with how this sort of agent might reason.
Two Alternatives to Logical Counterfactuals

This indeed makes sense when "obs" is itself a logical fact. If obs is a sensory input, though, 'A(obs) = act' is a logical fact, not a logical counterfactual. (I'm not trying to avoid causal interpretations of source code interpreters here, just logical counterfactuals)

2Abram Demski2yAhhh ok.
Two Alternatives to Logical Counterfactuals

In the happy dance problem, when the agent is considering doing a happy dance, the agent should have already updated on M. This is more like timeless decision theory than updateless decision theory.

Conditioning on 'A(obs) = act' is still a conditional, not a counterfactual. The difference between conditionals and counterfactuals is the difference between "If Oswald didn't kill Kennedy, then someone else did" and "If Oswald didn't kill Kennedy, then someone else would have".
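
One standard way to write the contrast (notation added for illustration): the indicative conditional behaves like an ordinary conditional probability, while the counterfactual needs a separate operator, and the two come apart on exactly this example.

```latex
P(\text{someone else killed Kennedy} \mid \neg\,\text{Oswald did}) \approx 1,
\qquad
P(\neg\,\text{Oswald did} \;\Box\!\!\rightarrow\; \text{someone else would have}) \approx 0,
```

where □→ is the counterfactual conditional.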

Indeed, troll bridge will present a problem for "playing chicken" approaches, whic

... (read more)
4Abram Demski2yI'm not sure how you are thinking about this. It seems to me like this will imply really radical changes to the universe. Suppose the agent is choosing between a left path and a right path. Its actual programming will go left. It has to come up with alternate programming which would make it go right, in order to consider that scenario. The most probable universe in which its programming would make it go right is potentially really different from our own. In particular, it is a universe where it would go right despite everything it has observed, a lifetime of (updateless) learning, which in the real universe, has taught it that it should go left in situations like this. EG, perhaps it has faced an iterated 5&10 problem, where left always yields 10. It has to consider alternate selves who, faced with that history, go right. It just seems implausible that thinking about universes like that will result in systematically good decisions. In the iterated 5&10 example, perhaps universes where its programming fails iterated 5&10 are universes where iterated 5&10 is an exceedingly unlikely situation; so in fact, the reward for going right is quite unlikely to be 5, and very likely to be 100. Then the AI would choose to go right. Obviously, this is not necessarily how you are thinking about it at all -- as you said, you haven't given an actual decision procedure. But the idea of considering only really consistent counterfactual worlds seems quite problematic.
2Abram Demski2yI still disagree. We need a counterfactual structure in order to consider the agent as a function A(obs). EG, if the agent is a computer program, the function A() would contain all the counterfactual information about what the agent would do if it observed different things. Hence, considering the agent's computer program as such a function leverages an ontological commitment to those counterfactuals. To illustrate this, consider counterfactual mugging [https://wiki.lesswrong.com/wiki/Counterfactual_mugging] where we already see that the coin is heads -- so, there is nothing we can do, we are at the mercy of our counterfactual partner. But suppose we haven't yet observed whether Omega gives us the money. A "real counterfactual" is one which can be true or false independently of whether its condition is met. In this case, if we believe in real counterfactuals, we believe that there is a fact of the matter about what we do in the coin=tails case, even though the coin came up heads. If we don't believe in real counterfactuals, we instead think only that there is a fact of how Omega is computing "what I would have done if the coin had been tails" -- but we do not believe there is any "correct" way for Omega to compute that. The obs→act representation and the P(act|obs) representation both appear to satisfy this test of non-realism. The first is always true if the observation is false, so, lacks the ability to vary independently of the observation. The second is undefined when the observation is false, which is perhaps even more appealing for the non-realist. Now consider the A(obs)=act representation. A(tails)=pay can still vary even when we know coin=heads. So, it fails this test -- it is a realist representation! Putting something into functional form imputes a causal/counterfactual structure.
2Abram Demski2yI agree that this gets around the problem, but to me the happy dance problem is still suggestive -- it looks like the material conditional is the wrong representation of the thing we want to condition on. Also -- if the agent has already updated on observations, then updating on obs→act is just the same as updating on act. So this difference only matters in the updateless case, where it seems to cause us trouble.
Two Alternatives to Logical Counterfactuals

Yes, this is a specific way of doing policy-dependent source code, which minimizes how much the source code has to change to handle the counterfactual.

Haven't looked deeply into the paper yet but the basic idea seems sound.

How special are human brains among animal brains?

The most quintessentially human intellectual accomplishments (e.g. proving theorems, composing symphonies, going into space) were only made possible by culture post-agricultural revolution.

I'm guessing you mean the beginning of agriculture and not the Agricultural Revolution (18th century), which came much later than math and after Baroque music. But the wording is ambiguous.

5Issa Rice2yIt seems like "agricultural revolution [https://en.wikipedia.org/wiki/Agricultural_revolution]" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").
A critical agential account of free will, causation, and physics

It's a subjectivist approach similar to Bayesianism, starting from the perspective of a given subject. Unlike in idealism, there is no assertion that everything is mental.

On the falsifiability of hypercomputation, part 2: finite input streams

In hyper-Solomonoff induction, indeed the direct hypercomputation hypothesis is probably more likely than the arbitration-oracle-emulating-hypercomputation hypothesis. But only by a constant factor. So this isn't really falsification so much as a shift in Bayesian evidence.

I do think it's theoretically cleaner to distinguish this Bayesian reweighting from Popperian logical falsification, and from Neyman-Pearson null hypothesis significance testing (frequentist falsification), both of which in principle require producing an unbounded number of bits of evidence, although in practice rely on unfalsifiable assumptions to avoid radical skepticism e.g. of memory.
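
To make the "constant factor" point concrete (a sketch under the assumption that the emulation hypothesis reproduces the direct hypothesis's predictions exactly, with k the fixed extra description length of the emulation): the posterior odds never move from the prior odds.

```latex
\frac{P(H_{\text{emul}} \mid D)}{P(H_{\text{direct}} \mid D)}
= \frac{P(D \mid H_{\text{emul}})}{P(D \mid H_{\text{direct}})}
  \cdot \frac{P(H_{\text{emul}})}{P(H_{\text{direct}})}
= 1 \cdot \frac{P(H_{\text{emul}})}{P(H_{\text{direct}})}
\approx 2^{-k},
```

so no finite amount of data drives the emulation hypothesis to zero, in contrast to a Popperian falsification or an unboundedly significant frequentist test.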

On the falsifiability of hypercomputation

This is really important and I missed this, thanks. I've added a note at the top of the post.

On the falsifiability of hypercomputation

Indeed, a constructive halting oracle can be thought of as a black-box that takes a PA statement, chooses whether to play Verifier or Falsifier, and then plays that, letting the user play the other part. Thanks for making this connection.

Can we make peace with moral indeterminacy?

The recommendation here is for AI designers (and future-designers in general) to decide what is right at some meta level, including details of which extrapolation procedures would be best.

Of course there are constraints on this given by objective reason (hence the utility of investigation), but these constraints do not fully constrain the set of possibilities. Better to say "I am making this arbitrary choice for this psychological reason" than to refuse to make arbitrary choices.

Can we make peace with moral indeterminacy?

The problem you're running into is that the goals of:

  1. being totally constrained by a system of rules determined by some process outside yourself that doesn't share your values (e.g. value-independent objective reason)
  2. attaining those things that you intrinsically value

are incompatible. It's easy to see once these are written out. If you want to get what you want, on purpose rather than accidentally, you must make choices. Those choices must be determined in part by things in you, not only by things outside you (such as value-independent objective rea

... (read more)
3Charlie Steiner2yYou know, this isn't why I usually get called a tool :P I think I'm saying something pretty different from Nietzsche here. The problem with "Just decide for yourself" as an approach to dealing with moral decisions in novel contexts (like what to do with the whole galaxy) is that, though it may help you choose actions rather than worrying about what's right, it's not much help in building an AI. We certainly can't tell the AI "Just decide for yourself," that's trying to order around the nonexistent ghost in the machine. And while I could say "Do exactly what Charlie would do," even I wouldn't want the AI to do that, let alone other people. Nor can we fall back on "Well, designing an AI is an action, therefore I should just pick whatever AI design I feel like, because God is dead and I should just pick actions how I will," because how I feel like designing an AI has some very exacting requirements - it contains the whole problem in itself.
A Critique of Functional Decision Theory

I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals.

"Physics is math" is ontologically reductive.

Physics can often be specified as a dynamical system (along with interpretations of e.g. what high-level entities it represents, how it gets observed). Dynamical systems can be specified mathematically. Dynamical systems also have causal counterfactuals (what if you suddenly changed the system state to be this instead?).
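
A toy illustration of the kind of counterfactual meant here (hypothetical code, invented for illustration): run a deterministic dynamical system, then surgically replace the state at some time and re-run from there.

```python
def step(state):
    """Toy deterministic dynamics: position drifts by a decaying velocity."""
    x, v = state
    return (x + v, 0.9 * v)

def trajectory(state, n):
    """Roll the dynamics forward n steps, returning all visited states."""
    out = [state]
    for _ in range(n):
        state = step(state)
        out.append(state)
    return out

factual = trajectory((0.0, 1.0), 10)

# Causal counterfactual: "what if the state at t=5 had been this instead?"
intervened = (100.0, factual[5][1])          # replace the position component at t=5
counterfactual = factual[:5] + trajectory(intervened, 5)
```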

Causal counterfactuals d

... (read more)
3Abram Demski2yYeah, agreed, I no longer endorse the argument I was making there - one has to say more than "physics is math" to establish the importance of dealing with logical counterfactuals.
The Missing Math of Map-Making

What does it mean for a map to be “accurate” at an abstract level, and what properties should my map-making process have in order to produce accurate abstracted maps/beliefs?

The notion of a homomorphism in universal algebra and category theory is relevant here. Homomorphisms map from one structure (e.g. a group) to another, and must preserve structure. They can delete information (by mapping multiple different elements to the same element), but the structures that are represented in the structure-being-mapped-to must also exist in the structure-being-

... (read more)
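
A standard example of the kind of structure-preserving but information-deleting map described above (added for illustration): reduction mod 2 on the integers.

```latex
\varphi : (\mathbb{Z}, +) \to (\mathbb{Z}/2\mathbb{Z}, +), \qquad
\varphi(n) = n \bmod 2, \qquad
\varphi(a + b) = \varphi(a) + \varphi(b).
```

Many integers collapse to a single element (information is deleted), but addition in the image always reflects addition in the source.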
Towards an Intentional Research Agenda

On the subject of intentionality/reference/objectivity/etc, On the Origin of Objects is excellent. My thinking about reference has a kind of discontinuity from before reading this book to after reading it. Seriously, the majority of analytic philosophy discussion of indexicality, qualia, reductionism, etc seems hopelessly confused in comparison.

2romeostevensit2yReading this now, thanks.
Some Thoughts on Metaphilosophy

Moreover, I am skeptical that going to the meta level simplifies the problem to the level that it will be solvable by humans (the same about meta-ethics and theory of human values).

This is also my reason for being pessimistic about solving metaphilosophy before a good number of object-level philosophical problems have been solved (e.g. in decision theory, ontology/metaphysics, and epistemology). If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great de

... (read more)
2Wei Dai3yI think our positions on this are pretty close, but I may put a bit more weight on other "plausible stories" for solving metaphilosophy relative to your "plausible story". (I'm not sure if overall I'm more or less optimistic than you are.) It seems quite possible that understanding the general class of problems that includes Y is easier than understanding Y itself, and that allows us to find a computation X that would solve Y without much understanding of Y itself. As an analogy, suppose Y is some complex decision problem that we have little understanding of, and X is an AI that is programmed with a good decision theory. This does not seem like a very strong argument for your position. My suggestion in the OP is that humans already know the equivalent of "walking" (i.e., doing philosophy), we're just doing it very slowly. Given this, your analogies don't seem very conclusive about the difficulty of solving metaphilosophy or whether we have to make a bunch more progress on object-level philosophical problems before we can solve metaphilosophy.
Predictors as Agents

I think the fixed point finder won't optimize the fixed point for minimizing expected log loss. I'm going to give a concrete algorithm and show that it doesn't exhibit this behavior. If you disagree, can you present an alternative algorithm?

Here's the algorithm. Start with some oracle (not a reflective oracle). Sample ~1000000 universes based on this oracle, getting 1000000 data points for what the reflective oracle outputs. Move the oracle 1% of the way from its current position towards the oracle that would answer queries correctly given the distrib

... (read more)
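
A sketch of the iteration just described (illustrative code; the universe-sampling and oracle representation are stand-ins, and the vectors are assumed to support arithmetic, e.g. numpy arrays):

```python
def improve(oracle, sample_universes, best_response, lr=0.01, steps=1000):
    """Damped iteration toward an (approximately) reflective oracle.

    oracle: vector of probabilities the oracle currently reports.
    sample_universes: oracle -> empirical outcome distribution, e.g. estimated
        from ~1,000,000 sampled universes that consult this oracle.
    best_response: empirical distribution -> the oracle that would answer
        queries correctly given that distribution.
    """
    for _ in range(steps):
        empirical = sample_universes(oracle)      # what actually happens under this oracle
        target = best_response(empirical)         # correct answers for that distribution
        oracle = oracle + lr * (target - oracle)  # move 1% of the way toward the target
    return oracle
```

Nothing in the update selects among fixed points by expected log loss; it just moves toward consistency.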
2interstice3yReflective Oracles are a bit of a weird case because their 'loss' is more like a 0/1 loss than a log loss, so all of the minima are exactly the same (If we take a sample of 100000 universes to score them, the difference is merely incredibly small instead of 0). I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points for similar reasons, but am not sure. Regardless, these models will, I believe, favour fixed points whose distributions are easy to compute (But not fixed points with low entropy, that is they will punish logical uncertainty but not intrinsic uncertainty). I'm planning to run some experiments with VAEs and post the results later.
Predictors as Agents

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed-points. It's quite likely that these fixed-points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability (predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed-point on which it scores the best.

Reflective oracles won't automatically do this. They won't minimize log loss or any oth

... (read more)
1interstice3yThe gradient descent is not being done over the reflective oracles, it's being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimized for log loss just like everything else in the model. That is, the global optimization of the model is distinct from whatever internal optimization the fixed-point-finder uses to choose the reflective oracle. The global optimization will favor internal optimizers that produce fixed-points with good score. So while fixed-point-finders in general won't optimize for anything in particular, the one this model uses will.
Figuring out what Alice wants: non-human Alice

Ok, this seems usefully specific. A few concerns:

  1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined) along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take differ

... (read more)
2Stuart Armstrong3yThanks! I'm not sure I fully get all your concerns, but I'll try and answer to the best of my understanding. 1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model "If someone is drowning, I should help them (if it's an easy thing to do)". Then "someone", "drowning", "I", and "help them" are vague labels for complex categories (as are most of the rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having strong preferences in the central example of the categories, which become much weaker as we move to the edges (the meta-preferences will start to become very relevant in the edge cases). I expect that "I should help them" further decomposes into "they should be helped" and "I should get the credit for helping them". Therefore, it seems to me, that an AI should be able to establish that if someone is drowning, it should try and enable me to save them, and if it can't do that, then it should save them itself (using nanotechnology or anything else). It doesn't seem that it would be seeing the issue from my narrow perspective, because I don't see the issue just from my narrow perspective. 5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; that is mainly the "science fiction superpowers" end of the spectrum for the moment. What I'm hoping, with this technique, is that if we end up using indirect normativity or stated preferences, that by keeping in mind this model of what proto-preferences are, we can better automate the limitations of these techniques (eg when we expect lying), rather than putting them in by hand. 6: Currently I don't see reflexes as embodying values
Figuring out what Alice wants: non-human Alice

I'm pretty confused by what you mean by proto-preferences. I thought by proto-preferences you meant something like "preferences in the moment, not subject to reflection etc." But you also said there's a definition. What's the definition? (The concept is pre-formal, I don't think you'll be able to provide a satisfactory definition).

You have written a paper about how preferences are not identifiable. Why, then, do you say that proto-preferences are identifiable, if they are just preferences in the moment? The impossibility results apply word-for-word t

... (read more)
5Stuart Armstrong3yOh, I don't claim to have a full definition yet, but I believe it's better than pre-formal. Here would be my current definition: * Humans are partially model-based agents. We often generate models (or at least partial models) of situations (real or hypothetical), and, within those models, label certain actions/outcomes/possibilities as better or worse than others (or sometimes just generically "good" or "bad"). This model, along with the label, is what I'd call a proto-preference (or pre-preference). That's why neuroscience is relevant, for identifying the mental models humans use. The "previous Alice post" I mentioned is here [https://www.lesswrong.com/posts/rcXaY3FgoobMkH2jc/figuring-out-what-alice-wants-part-ii], and was a toy version of this, in the case of an algorithm rather than a human. The reason these get around the No Free Lunch theorem is that they look inside the algorithm (so different algorithms with the same policy can be seen to have different preferences, which breaks NFL), and is making the "normative assumption" that these modelled proto-preferences correspond (modulo preference synthesis) to the agent's actual preferences. Note that that definition puts preferences and meta-preferences into the same type, the only difference being the sort of model being considered.
Figuring out what Alice wants: non-human Alice

The overall approach of finding proto-preferences and meta-preferences, resolving them somehow, then extrapolating from there, seems like a reasonable thing to do.

But, suppose you're going to do this. Then you're going to run into a problem: proto-preferences aren't identifiable.

I interpreted you as trying to fix this problem by looking at how humans infer each other's preferences rather than their (proto-)preferences themselves. You could try learning people's proto-preference-learning-algorithms instead of their proto-preferences.

But, this is not an ea

... (read more)
1Stuart Armstrong3yThe proto-preferences are a definition of the components that make up preferences. Methods of figuring them out - be they stated preferences, revealed preferences, FMRI machines, how other people infer each other's preferences... - are just methods. The advantage of having a definition is that this guides us explicitly as to when a specific method for figuring them out ceases to be applicable. And I'd argue that proto-preferences are identifiable. We're talking about figuring out how humans model their own situations, and the better-worse judgements they assign in their internal models. This is not unidentifiable, and neuroscience already has some things to say on it. The previous Alice post showed how you could do it in a toy model (with my posts on semantics [https://www.lesswrong.com/posts/XApNuXPckPxwp5ZcW/bridging-syntax-and-semantics-with-quine-s-gavagai] and symbol grounding [https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically] , relevant to applying this approach to humans). That second sentence of mine is somewhat poorly phrased, but I agree that "extracting the normative assumptions humans make is no easier than extracting proto-preferences" - I just don't see that second one as being insoluble.
Figuring out what Alice wants: non-human Alice

I don't know, but a pseudo-definition that works sometimes is "upon having a lot of time to reflect, information, etc, I would conclude that you have Y values"; of course I can't use this definition when I am doing the reflection, though! "Values" is at the moment a pre-formal concept (utility theory doesn't directly apply to humans), so it has some representation in people's brains that is hard to extract/formalize.

In any case, I reject any AI design that concludes that it ought to act as if you have X values just because my current models imply that you

... (read more)
1Rohin Shah3yYou could imagine examining a human brain and seeing how it models other humans. This would let you get some normative assumptions out that could inform a value learning technique. I would think of this as extracting an algorithm that could infer human preferences out of a human brain. You could run this algorithm for a long time, in which case it would eventually output Y values, even if you would currently judge the person as having X values.
3Stuart Armstrong3yWe're getting close to something important here, so I'll try and sort things out carefully. In my current approach, I'm doing two things: 1. Finding some components of preferences or proto-preferences within the human brain. 2. Synthesising them together in a way that also respects (proto-)meta-preferences. The first step is needed because of the No Free Lunch in preference learning result. We need to have some definition of preferences that isn't behavioural. And the stated-values-after-reflection approach has some specific problems that I listed here [https://www.lesswrong.com/posts/zvrZi95EHqJPxdgps/why-we-need-a-theory-of-human-values] . Then it took an initial stab at how one could synthesise the preferences in this post [https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately] . If I'm reading you correctly, your main fear is that by focusing on the proto-preferences of the moment, we might end up in a terrible place, foreclosing moral improvements. I share that fear! That's why the process of synthesising values is done in accordance both with meta-preferences and "far" preferences ("I want everyone to live happy worthwhile lives" is a perfectly valid proto-preference). Where we might differ the most, is that I'm very reluctant to throw away any proto-preferences, even if our meta-preferences would typically overrule it. I would prefer to keep it around, with a very low weight. Once we get in the habit of ditching proto-preferences, there's no telling where that process might end up [https://www.lesswrong.com/posts/WeAt5TeS8aYc4Cpms/values-determined-by-stopping-properties] .
Figuring out what Alice wants: non-human Alice

Because once we have these parameters, we can learn the values of any given human.

This doesn't make the problem easier, you have to start somewhere. I agree this could reduce the total computational work required but it doesn't seem any easier conceptually.

Whereas “learn what humans model each other’s values (and rationality) to be” is something that makes sense in the world.

This has the same problem as value learning. If I think you have X values but you actually have Y values (and I would think you have Y values upon further reflection etc) then

... (read more)
1Stuart Armstrong3yWhat do you mean by "you actually have Y values"? What are you defining values to be?
Figuring out what Alice wants: non-human Alice

Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process

Okay, but how do you do that if you don't already have a value learning algorithm? Why is it easier to learn the algorithms/parameters humans use in inferring each other's values, than to just learn their values?

2Stuart Armstrong3yBecause once we have these parameters, we can learn the values of any given human. In contrast, if we learn the values of a given human, we don't get to learn the values of any other one. I'd argue further: these parameters form part of a definition of human values. We can't just "learn human values", as these don't exist in the world. Whereas "learn what humans model each other's values (and rationality) to be" is something that makes sense in the world.
Coherence arguments do not imply goal-directed behavior

It can maximize the utility function that assigns utility 1 if I take the twitch action in time step t, and 0 otherwise. In a standard POMDP setting this always takes the twitch action.

Topological Fixed Point Exercises

Clarifying question for #9:

How does the decomposition into segments/triangles generalize to 3+ dimensions? If you try decomposing a tetrahedron into multiple tetrahedra, you actually get 4 tetrahedra and 1 octahedron, as shown here.

EDIT: answered my own question:

You can decompose an octahedron into 4 tetrahedra. They're irregular, but this is actually fine for the purpose of the lemma.
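
A quick volume check on this count (my arithmetic, added for illustration): the edge-midpoint subdivision of a tetrahedron of volume 1 gives

```latex
4 \cdot \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{2}
\quad \text{(four corner tetrahedra)},
\qquad
1 - \tfrac{1}{2} = \tfrac{1}{2}
\quad \text{(the central octahedron)},
```

and cutting the octahedron along one of its diagonals yields 4 irregular tetrahedra of volume 1/8 each, for 8 = 2^3 simplices in total.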

Asymptotic Decision Theory (Improved Writeup)

Yes, the continuity condition on embedders in the ADT paper would eliminate the embedder I meant. Which means the answer might depend on whether ADT considers discontinuous embedders. (The importance of the continuity condition is that it is used in the optimality proof; the optimality proof can't apply to chicken for this reason).

Asymptotic Decision Theory (Improved Writeup)

In the original ADT paper, the agents are allowed to output distributions over moves.

The fact that we take the limit as epsilon goes to 0 means the evil problem can't be constructed, even if randomization is not allowed. (The proof in the ADT paper doesn't work, but that doesn't mean something like it couldn't possibly work)

It's basically saying "since the two actions A and A′ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero", which is false
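
A minimal counterexample to that step (my example, added for illustration): equal limiting utilities do not force the mixture to collapse onto either action in total variation.

```latex
U(A) = U(A') = 1, \qquad
\pi = \tfrac{1}{2}\delta_A + \tfrac{1}{2}\delta_{A'}
\;\Longrightarrow\;
d_{TV}(\pi, \delta_A) = \tfrac{1}{2} \not\to 0.
```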

You'

... (read more)
2Diffractor3yWasn't there a fairness/continuity condition in the original ADT paper that if there were two "agents" that converged to always taking the same action, then the embedder would assign them the same value? (more specifically, if E_t(|A_t − B_t|) < δ, then E_t(|E_t(A_t) − E_t(B_t)|) < ε) This would mean that it'd be impossible to have E_t(E_t(ADT_{t,ε})) be low while E_t(E_t(straight_t)) is high, so the argument still goes through. Although, after this whole line of discussion, I'm realizing that there are enough substantial differences between the original formulation of ADT and the thing I wrote up that I should probably clean up this post a bit and clarify more about what's different in the two formulations. Thanks for that.
Asymptotic Decision Theory (Improved Writeup)

OK, I checked this more and I'm more suspicious now.

First: in the ADT paper, the asymptotic dominance argument is about the limit of the agent's action as epsilon goes to 0. This limit is not necessarily computable, so the embedder can't contain the agent, since it doesn't know epsilon. So the evil problem doesn't work. The optimality proof might be valid. I didn't understand which specific step you thought was wrong.

Second: This also means the chicken problem is ill-typed, since you can't put an ADT in the environment. But there is a well-typed versi

... (read more)
1Diffractor3yAgreed that the evil problem doesn't work for the original ADT paper. In the original ADT paper, the agents are allowed to output distributions over moves. I didn't like this because it implicitly assumes that it's possible for the agent to perfectly randomize, and I think randomization is better modeled by a (deterministic) action that consults an environmental random-number generator, which may be correlated with other things. What I meant was that, in the version of argmax that I set up, if A is the two constant policies "take blank box" and "take shiny box", then for the embedder F where the opponent runs argmax to select which box to fill, the argmax agent will converge to deterministically randomizing between the two policies, by the logical inductor assigning very similar expected utility to both options such that the inductor can't predict which action will be chosen. And this occurs because the inductor outputting more of "take the blank box" will have F(shiny) converge to a higher expected value (so argmax will learn to copy that), and the inductor outputting more of "take the shiny box" will have F(blank) converge to a higher expected value (so argmax will learn to copy that). So, the original statement in the paper was The issue with this is the last sentence. It's basically saying "since the two actions A and A′ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero", which is false And it is specifically disproved by the second counterexample, where there are two actions that both result in 1 utility, so they're both in the same equivalence class, but a probabilistic mixture between them (as sadtη converges to playing, for all η) gets less than 1 utility. You'll have to be more specific about "who knows what you are". If it unpacks as "opponent only uses the embedder where it is up against [whatever policy you plugged in]", then NeverSwerveBot will
A Rationality Condition for CDT Is That It Equal EDT (Part 1)

COEDT can be thought of as "learning" from an infinite sequence of agents who explore less and less.

Interestingly, the issue COEDT has with sequential decision problems looks suspiciously similar to folk theorems in iterated game theory (which also imply that completely-aligned agents can get a very bad outcome because they will each maximally punish anyone who doesn't play the grim trigger strategy). There might be some kind of folk theorem for COEDT, though there's a complication in that, conditioning on yourself taking a probability-0 action, you ge

... (read more)
A Rationality Condition for CDT Is That It Equal EDT (Part 1)

#4 (implementability): I think of this as the shakiest assumption; it is easy to set up decision problems which violate it. However, I tend to think such setups get the causal structure wrong. Other parents of the action should instead be thought of as children of the action. Furthermore, if an agent is learning about the structure of a situation by repeated exposure to that situation, implementability seems necessary for the agent to come to understand the situation it is in: parents of the action will look like children if you try to perform experiments

... (read more)
1Abram Demski3yI maybe should have clarified that when I say CDT I'm referring to a steel-man CDT which would use some notion of logical causality. I don't think the physical counterfactuals are a live hypothesis in our circles, but several people advocate reasoning which looks like logical causality. Implementability asserts that you should think of yourself as logico-causally controlling your clone when it is a perfect copy.
EDT solves 5 and 10 with conditional oracles

The agents in this post aren't proof-based. Proof-based agents have some issues with weird counterfactuals. Perhaps the only worlds where you take some specific action are ones where PA is inconsistent. (COEDT also has issues since the nearby oracles are not reflective, but it's a different set of issues)

In general queries to reflective-oracle-like world models that have forms like "is the probability of this exactly 0?" are problematic, since they are vulnerable to liar's paradoxes. What if you take action A iff the probability of taking action A is ex

... (read more)
1Vladimir Slepnev3yYeah, my comment was more about proof-based agents rather than oracle-based, sorry about going off topic. In a proof-based setting, conditionals with P(B)=0 aren't necessarily garbage, some of them are intended and we want the agent to find them. Your example might be one of those. The hard part is defining which is which.
EDT solves 5 and 10 with conditional oracles

I realized there is a problem when you have a 3-action problem, where the first 2-action agent chooses between 0 utility and passing control to the second 2-action agent, and the second 2-action agent chooses between 1/2 and 1 expected utility.

The problem is that there's a stable equilibrium where the first agent passes off control and the second agent always chooses 1/2 expected utility. The second agent makes this choice because they think that, if they choose 1, then the first agent will choose 0. The probability that the first agent chooses 0 is 0 bu

... (read more)
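
Writing out the equilibrium just described (my notation, added for illustration): let agent 1 choose between stopping (utility 0) and passing, and agent 2 choose between the 1/2-utility and 1-utility options. In the bad equilibrium,

```latex
\Pr(\text{agent 1 passes}) = 1, \quad
\Pr(\text{agent 2 picks } \tfrac{1}{2}) = 1,
\qquad
\mathbb{E}[U \mid \text{agent 2 picks } 1] = 0 \;<\; \tfrac{1}{2} = \mathbb{E}[U \mid \text{agent 2 picks } \tfrac{1}{2}],
```

because "agent 2 picks 1" has probability 0 in this equilibrium, so the conditional belief that agent 1 then stops is not corrected by anything.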
EDT solves 5 and 10 with conditional oracles

By definition , regardless of . (The subscript to only affects the distribution of )

EDIT: clarified notation in the post

Asymptotic Decision Theory (Improved Writeup)

Nice catch on both the optimality result being wrong and ADT not crashing into itself on the chicken problem! It's really great to know these things. This indicates that we weren't checking decision theory arguments with enough rigor.

... (read more)
Asymptotic Decision Theory (Improved Writeup)

This does not pay up on counterfactual blackmail.

What's counterfactual blackmail?

EDIT: if you meant counterfactual mugging, I think one way to solve this is to use a low amount of computation power to select which agent to emulate, then use a high amount of computation power to run that agent. Of course, this is somewhat unsatisfying, since there isn't a canonical way of choosing how much less computation power to use.

1Diffractor3yYup, I meant counterfactual mugging. Fixed.
Asymptotic Decision Theory (Improved Writeup)

OK, I helped invent ADT so I know it conceptually came after. (I don't think it was "shortly after"; logical EDT was invented very shortly after logical inductors, in early 2016, and ADT was in late 2016). I think you should link to the ADT paper in the intro section so people know what you're talking about.

Asymptotic Decision Theory (Improved Writeup)

The most prominent undesirable feature of this is that it's restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders that are like "pressing the button dispenses 1 utility forever", "pressing the button delivers an electric shock the first time, and then dispenses 1 utility forever"... "pressing the button delivers an electric shock for the first n times, and then dispenses 1 utility forever"... "pressing the button just always delivers an electri

... (read more)
Asymptotic Decision Theory (Improved Writeup)

ADT (asymptotic decision theory) was an initial attempt at decision theory with logical inductors before the standard form (that has exploration steps), which is detailed in this post.

I'm confused by this sentence. First, what is the standard form? ADT was definitely invented after logical EDT with exploration (the thing you link to). Second, why do you link to a post on logical EDT and not to the ADT paper?

1Diffractor3yI think I remember the original ADT paper showing up on agent foundations forum before a writeup on logical EDT with exploration, and my impression of which came first was affected by that. Also, the "this is detailed in this post" was referring to logical EDT for exploration. I'll edit for clarity.