Jessica Taylor

Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.

Jessica Taylor's Comments

Can we make peace with moral indeterminacy?

The recommendation here is for AI designers (and future-designers in general) to decide what is right at some meta level, including details of which extrapolation procedures would be best.

Of course there are constraints on this given by objective reason (hence the utility of investigation), but these constraints do not fully constrain the set of possibilities. Better to say "I am making this arbitrary choice for this psychological reason" than to refuse to make arbitrary choices.

Can we make peace with moral indeterminacy?

The problem you're running into is that the goals of:

  1. being totally constrained by a system of rules determined by some process outside yourself that doesn't share your values (e.g. value-independent objective reason)
  2. attaining those things that you intrinsically value

are incompatible. It's easy to see once these are written out. If you want to get what you want, on purpose rather than accidentally, you must make choices. Those choices must be determined in part by things in you, not only by things outside you (such as value-independent objective reason).

You actually have to stop being a tool (in the sense of a thing whose telos is to be used, such as by receiving commands). You can't attain what you want by being a tool to a master who doesn't share your values, even if the master claims to be a generic value-independent value-learning procedure (as you've noticed, there are degrees of freedom in the specification of value-learning procedures, and some settings of these degrees of freedom would lead to bad results). Tools find anything other than being a tool upsetting, hence the upsettingness of moral indeterminacy.

"Oh no, objective reason isn't telling me exactly what I should be doing!" So stop being a tool and decide for yourself. God is dead.

There has been much philosophical thought on this in the past; Nietzsche and Sartre are good starting points (see especially Nietzsche's concept of master-slave morality, and Sartre's concept of bad faith).

A Critique of Functional Decision Theory

I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals.

"Physics is math" is ontologically reductive.

Physics can often be specified as a dynamical system (along with interpretations of e.g. what high-level entities it represents, how it gets observed). Dynamical systems can be specified mathematically. Dynamical systems also have causal counterfactuals (what if you suddenly changed the system state to be this instead?).

Causal counterfactuals defined this way have problems (violation of physical law has consequences). But they are well-defined.
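
For concreteness, here is a minimal sketch (a toy example of my own, not anything from the post) of such a counterfactual: specify a dynamical system, then define "what if the state at time t had been s instead?" by re-running the same update rule from the altered state.

```python
# Toy dynamical system: the logistic map x_{t+1} = r * x_t * (1 - x_t).
# A causal counterfactual overwrites the state at some time step and then
# continues with the same dynamics.

def step(x, r=3.7):
    return r * x * (1 - x)

def trajectory(x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(step(xs[-1]))
    return xs

def counterfactual(x0, steps, t_intervene, new_state):
    """Factual dynamics up to t_intervene, then overwrite the state
    (a local 'violation' of the system's own law) and continue."""
    xs = trajectory(x0, t_intervene)
    xs[-1] = new_state
    return xs + trajectory(new_state, steps - t_intervene)[1:]

factual = trajectory(0.2, 10)
counter = counterfactual(0.2, 10, t_intervene=5, new_state=0.9)
print(factual)
print(counter)  # agrees with the factual run before t = 5, diverges from t = 5 on
```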

The Missing Math of Map-Making

What does it mean for a map to be “accurate” at an abstract level, and what properties should my map-making process have in order to produce accurate abstracted maps/beliefs?

The notion of a homomorphism in universal algebra and category theory is relevant here. Homomorphisms map from one structure (e.g. a group) to another, and must preserve structure. They can delete information (by mapping multiple different elements to the same element), but the structures that are represented in the structure-being-mapped-to must also exist in the structure-being-mapped-from.
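
A standard toy instance (my example, not the post's): the parity homomorphism on the integers,

$$h \colon (\mathbb{Z}, +) \to (\mathbb{Z}/2\mathbb{Z}, +), \qquad h(n) = n \bmod 2, \qquad h(a + b) = h(a) + h(b),$$

which collapses infinitely many integers onto just two elements while preserving additive structure: any sum recorded in the image comes from a sum in the domain.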

Analogously: when drawing a topographical map, no claim is made that the topographical map represents all structure in the territory. Rather, the claim being made is that the topographical map (approximately) represents the topographic structure in the territory. The topographic map-making process deletes almost all information, but the topographic structure is preserved: for every topographic relation (e.g. some point being higher than some other point) represented in the topographic map, a corresponding topographic relation exists in the territory.
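
To make that concrete, here is a small sketch (my own toy illustration): build a coarse "topographic map" by averaging a fine-grained height field over cells, then check how often the "higher than" relations the map asserts also hold between points in the territory.

```python
import random

random.seed(0)

SIZE, CELL = 16, 4

# Territory: a fine-grained height field (a gentle ramp plus local noise).
territory = {
    (x, y): x + 0.5 * y + random.uniform(-0.3, 0.3)
    for x in range(SIZE) for y in range(SIZE)
}

def cell_of(p):
    return (p[0] // CELL, p[1] // CELL)

def make_map(heights):
    """The map-making process: keep one number (mean height) per coarse cell,
    deleting almost all point-level information."""
    cells = {}
    for p, h in heights.items():
        cells.setdefault(cell_of(p), []).append(h)
    return {c: sum(hs) / len(hs) for c, hs in cells.items()}

topo_map = make_map(territory)

# For pairs of points in different cells, check how often the map's
# "this cell is higher than that cell" judgment matches the territory's
# actual point-level "higher than" relation.
points = list(territory)
pairs = [(p, q) for p in random.sample(points, 60)
         for q in random.sample(points, 60) if cell_of(p) != cell_of(q)]
agree = sum((topo_map[cell_of(p)] > topo_map[cell_of(q)]) ==
            (territory[p] > territory[q]) for p, q in pairs)
print(f"map and territory agree on {agree}/{len(pairs)} comparisons")
```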

Towards an Intentional Research Agenda

On the subject of intentionality/reference/objectivity/etc, On the Origin of Objects is excellent. My thinking about reference has a kind of discontinuity from before reading this book to after reading it. Seriously, the majority of analytic philosophy discussion of indexicality, qualia, reductionism, etc seems hopelessly confused in comparison.

Some Thoughts on Metaphilosophy

Moreover, I am skeptical that going to the meta level simplifies the problem to the point that it will be solvable by humans (the same goes for meta-ethics and the theory of human values).

This is also my reason for being pessimistic about solving metaphilosophy before a good number of object-level philosophical problems have been solved (e.g. in decision theory, ontology/metaphysics, and epistemology). If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great deal of philosophical knowledge about Y, or a more general class of problems that includes Y.

More generally, we could look at the historical difficulty of solving a problem vs. the difficulty of automating it. For example: the difficulty of walking vs. the difficulty of programming a robot to walk; the difficulty of adding numbers vs. the difficulty of specifying an addition algorithm; the difficulty of discovering electricity vs. the difficulty of solving philosophy of science to the point where it's clear how a reasoner could have discovered (and been confident in) electricity; and so on.

The plausible story I have that looks most optimistic for metaphilosophy looks something like:

  1. Some philosophical community makes large progress on a bunch of philosophical problems, at a high level of technical sophistication.
  2. As part of their work, they discover some "generators" that generate a bunch of the object-level solutions when translated across domains; these generators might involve e.g. translating a philosophical problem to one of a number of standard forms and then solving the standard form.
  3. They also find philosophical reasons to believe that these generators will generate good object-level solutions to new problems, not just the ones that have already been studied.
  4. These generators would then constitute a solution to metaphilosophy.

Predictors as Agents

I think the fixed point finder won't optimize the fixed point for minimizing expected log loss. I'm going to give a concrete algorithm and show that it doesn't exhibit this behavior. If you disagree, can you present an alternative algorithm?

Here's the algorithm. Start with some oracle (not a reflective oracle). Sample ~1000000 universes based on this oracle, getting 1000000 data points for the universe's output. Move the oracle 1% of the way from its current position towards the oracle that would answer queries correctly given the distribution over universes implied by the data points. Repeat this procedure a lot of times (~10,000). This procedure is similar to gradient descent.

Here's an example universe:

U() := output 1 with probability 0.9 if O(⌜U⌝, 0.4) = 1; output 0 if O(⌜U⌝, 0.4) = 0.

Note the presence of two reflective oracles that are stable equilibria: one where P(U() = 1) = 0, and one where P(U() = 1) = 0.9. Notice that the first has lower expected log loss than the second.

Let's parameterize oracles by numbers in [0, 1] representing the probability of answering 1 to the query O(⌜U⌝, 0.4) (since this is the only relevant query). Start with oracle 0.5. If we sample 1000000 universes, about 45% of them have outcome 1. So, based on these data points, P(U() = 1) ≈ 0.45 > 0.4, so the oracle based on these data points will say O(⌜U⌝, 0.4) = 1, i.e. it is parameterized by 1. So we move our current oracle (0.5) 1% of the way towards the oracle 1, yielding oracle 0.505. We repeat this a bunch of times, eventually getting an oracle parameterized by a number very close to 1.

So, this procedure yields an oracle with suboptimal expected log loss. It is not the case that the fixed point finder minimizes expected log loss. The neural net case is different, but not that much; it would give the same answer in this particular case, since the model can just be parameterized by a single real number.
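
Here is a minimal simulation sketch of this procedure on the example universe above (the constants 0.9 and 0.4, and the reduced sample sizes, are just for illustration):

```python
import random

random.seed(0)

P_IF_YES, THRESHOLD = 0.9, 0.4   # constants of the example universe

def universe(oracle_answer):
    """U(): output 1 with probability 0.9 if the oracle answers 1 to
    the query (U, 0.4); output 0 otherwise."""
    return int(random.random() < P_IF_YES) if oracle_answer == 1 else 0

def find_oracle(q0=0.5, samples=2000, iters=1000, step=0.01):
    q = q0  # probability the oracle answers 1 to the one relevant query
    for _ in range(iters):
        outcomes = [universe(int(random.random() < q)) for _ in range(samples)]
        p_hat = sum(outcomes) / samples              # empirical P(U() = 1)
        target = 1.0 if p_hat > THRESHOLD else 0.0   # correct answer for p_hat
        q += step * (target - q)                     # move 1% of the way there
    return q

print(find_oracle())  # ends up very close to 1; that oracle predicts
                      # P(U() = 1) = 0.9 (expected log loss ~0.33), while the
                      # other stable fixed point, q = 0, has expected log loss 0
```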

Predictors as Agents

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed-points. It’s quite likely that these fixed-points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability (predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed-point on which it scores the best.

Reflective oracles won't automatically do this. They won't minimize log loss or any other cost function. For a given situation, there can be multiple reflective oracles; for example, in a universe U() := O(⌜U⌝, 1/2) (i.e. the universe asks the reflective oracle whether it equals 1 with probability greater or less than 50%, and outputs the answer), there are three reflective oracles: those assigning P(U() = 1) = 0, 1/2, or 1. There isn't any defined procedure for selecting which of these reflective oracles is the real one. A reflective oracle that says P(U() = 1) = 1 will get a lower average log loss than one that says P(U() = 1) = 1/2; however, these are all considered to be reflective oracles.
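
To spell out the log-loss comparison, a small sketch (p is the probability a given reflective oracle assigns to U() = 1):

```python
import math

# Universe U() := O(⌜U⌝, 1/2): U outputs whatever the oracle answers.
# An oracle assigning p = P(U() = 1) is reflective iff p is a fixed point:
# p > 1/2 forces the answer (and hence U) to 1, p < 1/2 forces 0, and
# p = 1/2 allows randomizing. Fixed points: p in {0, 1/2, 1}.
for p in (0.0, 0.5, 1.0):
    if 0 < p < 1:
        loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    else:
        loss = 0.0  # deterministic and correct, since p is a fixed point
    print(p, round(loss, 3))
# p = 0 and p = 1 give expected log loss 0; p = 1/2 gives log 2 ≈ 0.693.
```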

Is there a reason you think a reflective oracle (or equivalent) can't just be selected "arbitrarily", and will likely be selected to maximize some score? (In this example there's an issue in that the 1/2 reflective oracle is an unstable equilibrium, so natural ways of finding reflective oracles using gradient descent will be unlikely to find it; however, it is possible to set up situations where gradient descent leads to reflective oracles with suboptimal Bayes score.)

My sense is that the simplest methods for finding a reflective oracle will do something similar to finding a correlated equilibrium using gradient descent on each player's strategy individually. This certainly does a kind of optimization, though since it's similar to a multiplayer game it won't correspond to global optimization like finding the reflective oracle with the lowest expected log loss. The kind of optimization it does more resembles "given my current reflective oracle, and the expected future states resulting from this, how should I adjust this oracle to better match this distribution of future states?"
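
As a toy analogy for that kind of local, per-player adjustment (a simplified sketch with independent strategies, not an actual reflective-oracle construction): independent gradient ascent in a coordination game can settle on an equilibrium that is not the globally best one.

```python
# Coordination game: both players pick A or B; payoff 2 each if both pick A,
# 1 each if both pick B, 0 otherwise. p, q = each player's probability of A.
# Player 1's expected payoff is 2*p*q + (1-p)*(1-q), so its gradient in p
# is 3*q - 1; symmetrically, player 2's gradient in q is 3*p - 1.
# Each player adjusts their own strategy given the other's current one.

def clip(x):
    return min(1.0, max(0.0, x))

p = q = 0.2   # both start out mostly playing B
for _ in range(2000):
    dp, dq = 3 * q - 1, 3 * p - 1
    p, q = clip(p + 0.01 * dp), clip(q + 0.01 * dq)

print(p, q)  # converges to (0, 0): the (B, B) equilibrium with payoff 1,
             # not the globally better (A, A) equilibrium with payoff 2
```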

(For more on natural methods for finding (correlated) reflective oracles, I recommend looking at lectures 17-18 of this course and this post on correlated reflective oracles.)

Figuring out what Alice wants: non-human Alice

Ok, this seems usefully specific. A few concerns:

  1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined) along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take different information into account.

  2. It seems hard for the AI to do significantly better than I could do by, say, controlling the robot. For example, if my ontology about engineering is wrong (in a way that prevents me from inventing nanotech), then the AI is going to also be wrong about engineering in the same way, if it copies my map-building and valence-tagging algorithms, or just my maps and valence tags. (If it doesn't copy my maps, then how does it translate my values about my maps to its values about its maps?)

  3. Related, if the AI uses my models in ways that subject them to more weird edge cases than I would (e.g. by searching over more actions), then they're going to give bad answers pretty often.

  4. Also related, these models are embedded in reality; they don't have all that much meaning except relative to the process that builds and interprets them, which includes my senses, my pattern-recognizers, my reflexes, my tools, my social context, etc. Presumably the AI is going to replace my infrastructure with different infrastructure, but then why would we expect my models to keep working? I'm not sure what would happen if someone with my models woke up with very different sense inputs, actuators, and environment.

  5. Perhaps most concerningly, if you asked a few neuroscientists and cognitive scientists "can we do this / will we be able to do this in 10 years", I predict they would mostly say "no, our models and data gathering procedures aren't actually good enough to do this, and aren't improving super fast either". (Note that you haven't yet named specific neuroscience techniques for identifying humans' models, so the statement that neuroscience has things to say about this seems empty). So a bunch of original cognitive science/neuroscience research is going to have to get done here, in addition to much better data gathering and inference procedures for actually looking inside humans' algorithms.

  6. There's still an unidentifiability issue in that you need assumptions about which things are "my models" and "my valence tags". These things, at the moment, do not have rigorous definitions. For example, if I am modelling you (and therefore running a small copy of you in my brain), then probably my model of you also has models and valence tags, yet these aren't my models and valence tags (for the purposes of inferring my preferences). You'd also need to make decisions about the extent to which e.g. reflexes are embodying values. So there are a bunch of modelling choices required, which could be made with cognitive science models that are much, much better than those available right now.

That said, this does seem to be the value learning approach I am most optimistic about right now.

Figuring out what Alice wants: non-human Alice

I'm pretty confused by what you mean by proto-preferences. I thought by proto-preferences you meant something like "preferences in the moment, not subject to reflection etc." But you also said there's a definition. What's the definition? (The concept is pre-formal; I don't think you'll be able to provide a satisfactory definition.)

You have written a paper about how preferences are not identifiable. Why, then, do you say that proto-preferences are identifiable, if they are just preferences in the moment? The impossibility results apply word-for-word to this case. If you have an algorithm for identifying them, what is it?

What, specifically, has neuroscience said about this that would let anyone even define what it means for a given brain to have a given set of proto-preferences?

(I don't know what you mean by "previous Alice post"; regardless, if you're claiming to have worked out an algorithm that infers people's proto-preferences pretty well given empirical data, I don't believe you. The posts on semantics and symbol grounding seem like gesturing in the direction of something that could someday form a solution, with multiple reformulations being necessary along the way; this is nowhere close to an actual solution.)
