Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.



Ok, I misunderstood. (See also my post on the relation between local and global optimality, and another post on coordinating local decisions using MCMC.)

UDT1.0, since it’s just considering modifying its own move, corresponds to a player that’s acting as if it’s independent of what everyone else is deciding, instead of teaming up with its alternate selves to play the globally optimal policy.

I thought UDT by definition pre-computes the globally optimal policy? At least, that's the impression I get from reading Wei Dai's original posts.
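The local-versus-global distinction at issue in this exchange can be illustrated with a toy coordination game. This is only a sketch under stated assumptions, not an implementation of any version of UDT: the payoff table, the action names, and the helper functions are all hypothetical. It shows a joint move that no single player can improve by changing its own move alone, even though a jointly chosen policy does better.

```python
from itertools import product

# Hypothetical payoffs for two copies of the same agent, each choosing "A" or "B".
# The numbers are illustrative assumptions only.
ACTIONS = ("A", "B")
PAYOFF = {("A", "A"): 1, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 2}


def is_local_optimum(joint):
    """True if no single player can raise the payoff by changing only its own move."""
    for player in range(len(joint)):
        for action in ACTIONS:
            changed = tuple(action if i == player else m for i, m in enumerate(joint))
            if PAYOFF[changed] > PAYOFF[joint]:
                return False
    return True


def global_optimum():
    """Joint move with the highest payoff when both moves are chosen together."""
    return max(product(ACTIONS, repeat=2), key=lambda joint: PAYOFF[joint])


if __name__ == "__main__":
    print(is_local_optimum(("A", "A")))  # unilateral changes do not help here
    print(global_optimum())              # but a joint change does
```

In this toy game, a player that only considers modifying its own move stays at ("A", "A"), while a search over whole joint policies finds ("B", "B").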

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.
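A small worked example of how that loss could arise, as a sketch with made-up numbers: the probabilities below and the function names are assumptions for illustration. An agent that treats "human presses" and "button is pressed" as independent prices a contract on both happening at the product of the marginals, and an agent that knows the true correlation can buy that contract at a positive expected profit.

```python
# Hypothetical probabilities, chosen only for illustration.
P_HUMAN_PRESSES = 0.5
P_BUTTON_GIVEN_PRESS = 0.9
P_BUTTON_GIVEN_NO_PRESS = 0.1


def marginal_button_probability():
    """P(button pressed), summing over whether the human presses."""
    return (P_HUMAN_PRESSES * P_BUTTON_GIVEN_PRESS
            + (1 - P_HUMAN_PRESSES) * P_BUTTON_GIVEN_NO_PRESS)


def price_assuming_independence():
    """Price of a contract paying 1 if both events occur, per the agent that assumes independence."""
    return P_HUMAN_PRESSES * marginal_button_probability()


def true_joint_probability():
    """Actual probability that the human presses and the button ends up pressed."""
    return P_HUMAN_PRESSES * P_BUTTON_GIVEN_PRESS


def expected_profit_for_buyer():
    """Expected profit per contract for a buyer who knows the events are correlated."""
    return true_joint_probability() - price_assuming_independence()


if __name__ == "__main__":
    print(price_assuming_independence(), true_joint_probability(), expected_profit_for_buyer())
```

The stronger the actual correlation, the larger the gap between the two prices, and so the more the independence-assuming agent expects to lose per contract.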

There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.

Instincts like curiosity are more helpful at the beginning of life; over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of innate curiosity, that's not obviously a big problem from a human values perspective, but it's unclear.

From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.

I think use of AI tools could have similar results to human cognitive enhancement, which I expect to basically be helpful. They'll have more problems with things that are enhanced by stuff like "bigger brain size" rather than "faster thought" and "reducing entropic error rates / wisdom of the crowds" because they're trained on humans. One can in general expect more success on this sort of thing by having an idea of what problem is even being solved. There's a lot of stuff that happens in philosophy departments that isn't best explained by "solving the problem" (which is under-defined anyway) and could be explained by motives like "building connections", "getting funding", "being on the good side of powerful political coalitions", etc. So psychology/sociology of philosophy seems like an approach to understand what is even being done when humans say they're trying to solve philosophy problems.

Something approximating utility function optimization over partial world configurations. What scope of world configuration space is optimized by effective systems depends on the scope of the task. For something like space exploration, the scope of the task is such that accomplishing it requires making trade-offs over a large subset of the world, and efficient ways of making these trade-offs are parametrized by a utility function over this subset.

What time-scale and spatial scope the "pick thoughts in your head" optimization is over depends on what scope is necessary for solving the problem. Some problems like space exploration have a necessarily high time and space scope. Proving hard theorems has a smaller spatial scope (perhaps ~none) but a higher temporal scope. Although, to the extent the distribution over theorems to be proven depends on the real world, having a model of the world might help prove them better.

Depending on how the problem-solving system is found, it might be that the easily-findable systems that solve the problem distribution sufficiently well will not only model the world but care about it, because the general consequentialist algorithms that do planning cognition to solve the problem would also plan about the world. This of course depends on the method for finding problem-solving systems, but one could imagine doing hill climbing over ways of wiring together a number of modules that include optimization and world-modeling modules, and easily-findable configurations that solve the problem well might solve it by deploying general-purpose consequentialist optimization on the world model (as I said, many possible long-term goals lead to short-term compliant problem solving as an instrumental strategy).

Again, this is relatively speculative, and depends on the AI paradigm and problem formulation. It's probably less of a problem for ML-based systems because the cognition of an ML system is aggressively gradient descended to be effective at solving the problem distribution.

The problem is somewhat intensified in cases where the problem relates to already-existing long-term agents such as in the case of predicting or optimizing with respect to humans, because the system at some capability level would simulate the external long-term optimizer. However, it's unclear how much this would constitute creation of an agent with different goals from humans.

Note that beyond not-being-mentioned, such arguments are also anthropically filtered against: in worlds where such arguments have been out there for longer, we died a lot quicker, so we’re not there to observe those arguments having been made.

This anthropic analysis doesn't take into account past observers (see this post).

Competitive paperclip maximization in a controlled setting sounds like it might be fun. The important thing is that it's one thing that's fun out of many things, and variety is important.

What if I’m mainly interested in how philosophical reasoning ideally ought to work?

My view would suggest: develop a philosophical view of normativity and apply that view to the practice of philosophy itself. For example, if it is in general unethical to lie, then it is also unethical to lie about philosophy. Philosophical practice being normative would lead to some outcomes being favored over others. (It seems like a problem if you need philosophy to have a theory of normativity and a theory of normativity to do meta-philosophy and meta-philosophy to do better philosophy, but earlier versions of each theory can be used to make later versions of them, in a bootstrapping process like with compilers.)

I mean normativity to include ethics, aesthetics, teleology, etc. Developing a theory of teleology in general would allow applying that theory to philosophy (taken as a system/practice/etc). It would be strange to have a different normative theory for philosophical practice than for other practices, since philosophical practice is a subset of practice in general; philosophical normativity is a specified variant of general normativity, analogous to normativity about other areas of study. The normative theory is mostly derived from cases other than cases of normative philosophizing, since most activity that normativity could apply to is not philosophizing.

How would you flesh out the non-foundationalist view?

That seems like describing my views about things in general, which would take a long time. The original comment was meant to indicate what is non-foundationalist about this view.

I don’t understand this sentence at all. Please explain more?

Imagine a subjective credit system. A bunch of people think other people are helpful/unhelpful to them. Maybe they help support helpful people and so people who are more helpful to helpful people (etc) succeed more. It's subjective, there's no foundation where there's some terminal goal and other things are instrumental to that.
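One way to read the "(etc)" recursion is as an eigenvector-style fixed point: a person's success score is the sum of the helpfulness ratings they receive, each weighted by the rater's own score. The sketch below computes that by power iteration. The rating numbers, the names, and the function `credit_scores` are hypothetical, and this reading of the recursion is itself an assumption rather than something the comment specifies.

```python
def credit_scores(ratings, iterations=100):
    """Power iteration: score[p] is proportional to sum over raters r of score[r] * ratings[r][p].

    `ratings[rater][ratee]` is how helpful `rater` finds `ratee` (non-negative).
    """
    people = sorted(ratings)
    scores = {p: 1.0 / len(people) for p in people}
    for _ in range(iterations):
        new = {
            p: sum(scores[r] * ratings[r].get(p, 0.0) for r in people)
            for p in people
        }
        total = sum(new.values())
        if total == 0:
            return scores  # nobody rates anybody; keep the previous scores
        scores = {p: value / total for p, value in new.items()}
    return scores


# Hypothetical ratings: alice and bob find each other very helpful,
# both find carol slightly helpful, and carol finds each of them moderately helpful.
RATINGS = {
    "alice": {"bob": 1.0, "carol": 0.2},
    "bob": {"alice": 1.0, "carol": 0.2},
    "carol": {"alice": 0.5, "bob": 0.5},
}

if __name__ == "__main__":
    print(credit_scores(RATINGS))
```

Every score here is defined only in terms of other people's scores and ratings; nothing in the computation plays the role of a terminal goal.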

An intersubjective credit system would be the outcome of something like Pareto optimal bargaining between the people, which would lead to a unified utility function, which would imply some terminal goals and other goals being instrumental.

Speculatively, it's possible to create an intersubjective credit system (implying a common currency) given a subjective credit system.

This might apply at multiple levels. Perhaps individual agents seem to have terminal goals because different parts of their mind create subjective credit systems and then they get transformed into an objective credit system in a way that prevents money pumps, etc. (the usual consequences of not being a VNM agent).
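For readers unfamiliar with money pumps, here is a minimal sketch: an agent with cyclic preferences will pay a small fee for each trade up to something it prefers, and a sequence of such trades returns it to where it started with less money. The items, fee, wealth, and function name are hypothetical choices for illustration.

```python
# Each pair is (preferred, over). The cyclic set violates transitivity.
CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}
TRANSITIVE = {("A", "B"), ("B", "C"), ("A", "C")}


def run_trades(prefers, start_item, offers, fee, wealth):
    """Accept each offered item if the agent prefers it to its current item and can pay the fee."""
    item = start_item
    for offered in offers:
        if (offered, item) in prefers and wealth >= fee:
            item = offered
            wealth -= fee
    return item, wealth


if __name__ == "__main__":
    # The cyclic agent ends up holding its starting item but has paid three fees.
    print(run_trades(CYCLIC, "A", ["C", "B", "A"], fee=1, wealth=10))
    # The transitive agent already holds its favorite item and declines every trade.
    print(run_trades(TRANSITIVE, "A", ["C", "B", "A"], fee=1, wealth=10))
```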

I'm speculating that a certain kind of circular-seeming discourse, where area A is explained in terms of area B and vice versa, might be in some way analogous to a subjective credit network, and there might be some transformation of it that puts foundations on everything, analogous to founding an intersubjective credit network in terminal goals. Some things that look like circular reasoning can be made valid and others can't. The cases I'm considering are ones where your theory of normativity depends on your theory of philosophy, your theory of philosophy depends on your theory of meta-philosophy, and your theory of meta-philosophy depends on your theory of normativity, which seems kind of like a subjective credit system.

Sorry if this is confusing (it's confusing to me too).

