## AI ALIGNMENT FORUMAF

Jessica Taylor

Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.

Blog: unstableontology.com

# Wiki Contributions

Sorted by

Yes I still endorse the post. Some other posts:

Two alternatives to logical counterfactuals (note: I think policy dependent source code works less well than I thought it did at the time of writing)

A critical agential account... (general framework, somewhat underspecified or problematic in places but leads to more specific things like the linear logic post; has similarities to constructor theory)

The axioms of U are recursively enumerable. You run all M(i,j) in parallel and output a new axiom whenever one halts. That's enough to computably check a proof if the proof specifies the indices of all axioms used in the recursive enumeration.

Thanks, didn't know about the low basis theorem.

U axiomatizes a consistent guessing oracle producing a model of T. There is no consistent guessing oracle applied to U.

In the previous post I showed that a consistent guessing oracle can produce a model of T. What I show in this post is that the theory of this oracle can be embedded in propositional logic so as to enable provability preserving translations.

LS shows to be impossible one type of infinitarian reference, namely to uncountably infinite sets. I am interested in showing to be impossible a different kind of infinitarian reference. "Impossible" and "reference" are, of course, interpreted differently by different people.

Ok, I misunderstood. (See also my post on the relation between local and global optimality, and another post on coordinating local decisions using MCMC)

UDT1.0, since it’s just considering modifying its own move, corresponds to a player that’s acting as if it’s independent of what everyone else is deciding, instead of teaming up with its alternate selves to play the globally optimal policy.

I thought UDT by definition pre-computes the globally optimal policy? At least, that's the impression I get from reading Wei Dai's original posts.

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.

There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.

Instincts like curiosity are more helpful at the beginning of life, over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of innate curiosity that's not obviously a big problem from a human values perspective but it's unclear.

From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.