## AI ALIGNMENT FORUM

Jessica Taylor

Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.

Blog: unstableontology.com

# Wiki Contributions

\$1000 USD prize - Circular Dependency of Counterfactuals

Thanks for reading all the posts! I'm not sure where you got the idea that this was to solve the spurious counterfactuals problem; that was in the appendix because I anticipated that a MIRI-adjacent person would want to know how it solves that problem.

The core problem it's solving is that it's a well-defined mathematical framework in which (a) there are, in some sense, choices, and (b) it is believed that these choices correspond to the results of a particular Turing machine. It goes back to the free will vs determinism paradox, and shows that there's a formalism that has some properties of "free will" and some properties of "determinism".

A way that EDT fails to solve 5 and 10 is that it could believe with 100% certainty that it takes \$5, so its expected value for \$10 is undefined. (I wrote previously about a modification of EDT to avoid this problem.)

CDT solves it by constructing physically impossible counterfactuals, which has other problems, e.g. suppose there's a Laplace's demon that searches for violations of physics and destroys the universe if physics is violated; this theoretically shouldn't make a difference but it messes up the CDT counterfactuals.

It does look like your post overall agrees with the view I presented. I would tend to call augmented reality "metaphysics" in that it is a piece of ontology that goes beyond physics. I wrote about metaphysical free will a while ago and didn't post it on LW because I anticipated people would be allergic to the non-physicalist philosophical language.

\$1000 USD prize - Circular Dependency of Counterfactuals

It seems like agents in a deterministic universe can falsify theories in at least some sense. Like they take two different weights, drop them, and see they land at the same time, falsifying the fact that heavier objects fall faster.

The main problem is that it isn't meaningful for their theories to make counterfactual predictions about a single situation; they can create multiple situations (across time and space) and assume symmetry and get falsification that way, but it requires extra assumptions. Basically you can't say different theories really disagree unless there's some possible world / counterfactual / whatever in which they disagree; finding a "crux" experiment between two theories (e.g. if one theory says all swans are white and another says there are black swans in a specific lake, the cruxy experiment looks in that lake) involves making choices to optimize disagreement.

In the second case, I would suggest that what we need is counterfactuals not agency. That is, we need to be able to say things like, "If I ran this experiment and obtained this result, then theory X would be falsified", not "I could have run this experiment and if I did and we obtained this result, then theory X would be falsified".

Those seem pretty much equivalent? Maybe by agency you mean utility function optimization, which I didn't mean to imply was required.

The part I thought was relevant was the part where you can believe yourself to have multiple options and yet be implemented by a specific computer.

\$1000 USD prize - Circular Dependency of Counterfactuals

I previously wrote a post about reconciling free will with determinism. The metaphysics implicit in Pearlian causality is free will (in Drescher's words: "Pearl's formalism models free will rather than mechanical choice."). The challenge is reconciling this metaphysics with the belief that one is physically embodied. That is what the post attempts to do; these perspectives aren't inherently irreconcilable, we just have to be really careful about e.g. distinguishing "my action" vs "the action of the computer embodying me" in the Bayes net and distinguishing the interventions on them.

I wrote another post about two alternatives to logical counterfactuals: one says counterfactuals don't exist, one says that your choice of policy should affect your anticipation of your own source code. (I notice you already commented on this post; just noting it for completeness.)

And a third post, similar to the first, reconciling free will with determinism using linear logic.

I'm interested in what you think of these posts and what feels unclear/unresolved; I might write a new explanation of the theoretical perspective or improve/extend/modify it in response.

Visible Thoughts Project and Bounty Announcement

How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Christiano, Cotra, and Yudkowsky on AI progress

This section seemed like an instance of you and Eliezer talking past each other in a way that wasn't locating a mathematical model containing the features you both believed were important (e.g. things could go "whoosh" while still being continuous):

[Christiano][13:46]

Even if we just assume that your AI needs to go off in the corner and not interact with humans, there’s still a question of why the self-contained AI civilization is making ~0 progress and then all of a sudden very rapid progress

[Yudkowsky][13:46]

unfortunately a lot of what you are saying, from my perspective, has the flavor of, “but can’t you tell me about your predictions earlier on of the impact on global warming at the Homo erectus level”

you have stories about why this is like totally not a fair comparison

I do not share these stories

[Christiano][13:46]

I don’t understand either your objection nor the reductio

like, here’s how I think it works: AI systems improve gradually, including on metrics like “How long does it take them to do task X?” or “How high-quality is their output on task X?”

[Yudkowsky][13:47]

I feel like the thing we know is something like, there is a sufficiently high level where things go whooosh humans-from-hominids style

[Christiano][13:47]

We can measure the performance of AI on tasks like “Make further AI progress, without human input”

Any way I can slice the analogy, it looks like AI will get continuously better at that task

Christiano, Cotra, and Yudkowsky on AI progress

A bunch of this was frustrating to read because it seemed like Paul was yelling "we should model continuous changes!" and Eliezer was yelling "we should model discrete events!" and these were treated as counter-arguments to each other.

It seems obvious from having read about dynamical systems that continuous models still have discrete phase changes. E.g. consider boiling water. As you put in energy the temperature increases until it gets to the boiling point, at which point more energy put in doesn't increase the temperature further (for a while), it converts more of the water to steam; after all the water is converted to steam, more energy put in increases the temperature further.

So there are discrete transitions from (a) energy put in increases water temperature to (b) energy put in converts water to steam to (c) energy put in increases steam temperature.
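The three regimes can be sketched as a piecewise function of the energy put in. This is a rough illustration, not a careful physical model: the constants are approximate textbook values for water at atmospheric pressure, and the function name and default arguments are chosen here for the example.

```python
# Approximate textbook constants for water at atmospheric pressure (assumed values).
C_WATER = 4.186    # J/(g*K), specific heat of liquid water
C_STEAM = 2.0      # J/(g*K), rough specific heat of steam
L_VAPOR = 2260.0   # J/g, latent heat of vaporization
T_BOIL = 100.0     # degrees C


def temperature(energy_j, mass_g=1.0, t_start=20.0):
    """Temperature (deg C) after adding energy_j joules to mass_g grams of water."""
    heat_to_boil = mass_g * C_WATER * (T_BOIL - t_start)
    heat_to_vaporize = mass_g * L_VAPOR
    if energy_j < heat_to_boil:
        # Regime (a): energy raises the water temperature.
        return t_start + energy_j / (mass_g * C_WATER)
    if energy_j < heat_to_boil + heat_to_vaporize:
        # Regime (b): energy converts water to steam at constant temperature.
        return T_BOIL
    # Regime (c): energy raises the steam temperature.
    extra = energy_j - heat_to_boil - heat_to_vaporize
    return T_BOIL + extra / (mass_g * C_STEAM)
```

The function is continuous in `energy_j`, yet its slope changes abruptly at the two regime boundaries, which is the sense in which a continuous model has discrete transitions.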

In the case of AI improving AI vs. humans improving AI, a simple model to make would be one where AI quality is modeled as a variable, $q$, with the following dynamical equation:

$$\frac{dq}{dt} = h + rq$$

where $h$ is the speed at which humans improve AI and $r$ is a recursive self-improvement efficiency factor. The curve transitions from a line at early times (where $h \gg rq$) to an exponential at later times (where $rq \gg h$). It could be approximated as a piecewise function with a linear part followed by an exponential part, which is a more-discrete approximation than the original function, which has a continuous transition between linear and exponential.
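A minimal numerical sketch of such a model, assuming it takes the form `dq/dt = h + r*q`, with `q` the AI quality, `h` the speed of human-driven improvement, and `r` the self-improvement factor. The symbol names, function names, and parameter values are assumptions made for illustration.

```python
import math


def simulate(h, r, q0=0.0, t_end=10.0, dt=1e-3):
    """Forward-Euler integration of dq/dt = h + r*q from t = 0 to t_end."""
    q = q0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        q += (h + r * q) * dt
    return q


def closed_form(h, r, t, q0=0.0):
    """Exact solution q(t) = (q0 + h/r) * exp(r*t) - h/r, valid for r != 0."""
    return (q0 + h / r) * math.exp(r * t) - h / r
```

With `q0 = 0`, the closed form is approximately `h * t` while `r * t` is small (the linear part) and is dominated by `(h / r) * exp(r * t)` once `r * t` is large (the exponential part), with no discontinuity in between.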

This is nowhere near an adequate model of AI progress, but it's the sort of model that would be created in the course of a mathematically competent discourse on this subject on the way to creating an adequate model.

Dynamical systems theory contains many beautiful and useful concepts like basins of attraction which make sense of discrete and continuous phenomena simultaneously (i.e. there are a discrete number of basins of attraction which points fall into based on their continuous properties).
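As a small illustration of basins of attraction, consider the standard one-dimensional system `dx/dt = x - x**3` (an example chosen here, not taken from the discussion above). It has stable fixed points at `+1` and `-1` and an unstable one at `0`, so the continuous starting value `x0` sorts into one of two discrete basins. The function names and step sizes are assumptions for the sketch, and the simple Euler scheme is only reliable for moderate starting values.

```python
def settle(x0, dt=0.01, steps=5000):
    """Integrate dx/dt = x - x**3 with forward Euler and return the final state."""
    x = x0
    for _ in range(steps):
        x += (x - x**3) * dt
    return x


def basin(x0):
    """Label the attractor that x0 flows to: +1, -1, or 0 for the unstable point."""
    if x0 == 0:
        return 0
    return 1 if settle(x0) > 0 else -1
```

Every positive starting value ends up near `+1` and every negative one near `-1`, however close to zero it begins.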

I've found Strogatz's book, Nonlinear Dynamics and Chaos, helpful for explaining the basics of dynamical systems.

What 2026 looks like

This is quite good concrete AI forecasting compared to what I've seen elsewhere, thanks for doing it! It seems really plausible based on how fast AI progress has been going over the past decade and which problems are most tractable.

Modeling naturalized decision problems in linear logic

CDT and EDT have known problems on 5 and 10. TDT/UDT are insufficiently formalized, and seem like they might rely on known-to-be-unformalizable logical counterfactuals.

So 5 and 10 isn't trivial even without spurious counterfactuals.
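One way to see the EDT difficulty concretely: if the agent assigns probability zero to taking the \$10, the conditional expected utility of that action involves dividing by zero and is undefined. The sketch below is only an illustration of that point; the function name, the belief table, and the action labels are hypothetical.

```python
def conditional_expected_utility(beliefs, action):
    """E[utility | action] under a joint distribution over (action, utility).

    Returns None when the action has probability zero, since the
    conditional expectation is then undefined.
    """
    p_action = sum(p for (a, _), p in beliefs.items() if a == action)
    if p_action == 0:
        return None
    weighted = sum(p * u for (a, u), p in beliefs.items() if a == action)
    return weighted / p_action


# Hypothetical beliefs: the agent is certain it takes the $5 bill.
certain_five = {("take_5", 5): 1.0, ("take_10", 10): 0.0}
```

Here `conditional_expected_utility(certain_five, "take_5")` is well defined, while the same call with `"take_10"` returns `None`, so the agent has nothing to compare the \$5 option against.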

What does this add over modal UDT?

• No requirement to do infinite proof search
• More elegant handling of multi-step decision problems
• Also works on problems where the agent doesn't know its source code (of course, this prevents logical dependencies due to source code from being taken into account)

Philosophically, it works as a nice derivation of similar conclusions to modal UDT. The modal UDT algorithm doesn't by itself seem entirely well-motivated; why would material implication be what to search for? On the other hand, every step in the linear logic derivation is quite natural, building action into the logic, and encoding facts about what the agent can be assured of upon taking different actions. This makes it easier to think clearly about what the solution says about counterfactuals, e.g. in a section of this post.

Topological metaphysics: relating point-set topology and locale theory

Reals are still defined as sets of (a, b) rational intervals. The locale contains countable unions of these, but all these are determined by which (a, b) intervals contain the real number.
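To make "a real is determined by which (a, b) intervals contain it" concrete, here is a small sketch that represents the square root of 2 purely as a membership test on rational intervals, using only rational arithmetic. The choice of example and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def sqrt2_in(a, b):
    """True if the open rational interval (a, b) contains sqrt(2).

    Uses only rational arithmetic: a < sqrt(2) iff a < 0 or a*a < 2,
    and sqrt(2) < b iff b > 0 and b*b > 2.
    """
    a, b = Fraction(a), Fraction(b)
    return (a < 0 or a * a < 2) and (b > 0 and b * b > 2)


def meet(i, j):
    """Intersection of two open intervals (may be empty if lo >= hi)."""
    return (max(i[0], j[0]), min(i[1], j[1]))
```

For instance, `sqrt2_in(1, 2)` holds and `sqrt2_in(Fraction(3, 2), 2)` does not, and whenever two intervals both pass the test, so does their `meet`.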

Topological metaphysics: relating point-set topology and locale theory

Good point; I've changed the wording to make it clear that the rational-delimited open intervals are the basis, not all the locale elements. Luckily, points can be defined as sets of basis elements containing them, since all other properties follow. (Making the locale itself countable requires weakening the definition by restricting the sets one forms unions over to a countable family, e.g. by requiring them to be recursively enumerable.)