UDT1.01: The Story So Far (1/10)

Diffractor

We now resume your regularly scheduled LessWrong tradition of decision theory posting.

This is a sequence, be sure to note.

Just the first and last post will be on Alignment Forum, and the whole thing will be linked together.

Epistemic Status: This is mostly just recapping old posts so far. If you're a decision-theory veteran, new stuff only starts arising in the "Computational Intractability" section and further down.

You may have heard of a thing called Updateless Decision Theory. It's been discussed for over a decade by now, but progress on rendering it down into something that could maybe someday be run on a computer has been very slow. Last winter, I decided to try helping out Abram and Martin with their efforts at actually formalizing it into some sort of algorithm that nice things could be proved about. I didn't quite succeed, but did manage to have a fair few novel insights that feel like they got a lot further than previous attempts, and I do see the basic outlines of an algorithm here.

So this is post 1 of a 10-part series which will try to cram some of the insights into the heads of people who are not me.

Why UDT?

Summarizing some old history, standard decision theory works great in a single-player environment, where your only pathway to influence reality is what you actually do. Once you transition to a multiplayer setting, where there are things running around that can make predictions about you, and select their actions based on what you will do, or would do in a different situation, things get a lot spicier.

Four examples of "things selecting their actions based on what you will do" are:

A human going "I will help you gain power iff it's very likely that you'll use some of that power to leave me better off for having done so" to another human.

A bank with a brain scanner going "I will give you a loan iff you seem like the sort of person who would predictably pay back the loan"

An AI thinking about another AI and concluding "This other AI, with a different utility function, will predictably fight me in the future if they get stronger, and won't accept binding precommitments. I can do math about their source code which strongly suggests this. Time to preemptively delete their ass."

A human going "It's late and I don't want to brush my teeth but there's a bunch of future-me's who are in the same situation and will make the same decision for the same reasons, so it's basically one decision tiled across time. FINE I'll brush my teeth."

Four examples of "things selecting their actions based on what you would do" are:

Scott Garrabrant in a game of One Night Ultimate Werewolf going "If I were the werewolf, which I may or may not be, lying and saying I'm not the werewolf would give you information you'd use to make me lose, if you're good at detecting lies. The globally optimal policy here is never answering questions about whose team I'm on in social deception games!" and I go "dammit" because he's the sort of person who does things like that whether or not he's a Villager or Werewolf.

A canny hacker who will threaten an AI if they can see any flaw in the supposedly "threat-resistant" decision theory of an AI. They never actually carry out any threat, just use it to walk away with thousands of dollars. This is a case where, based on what the AI would do in a wide range of scenarios, something decides to steer into the unpleasant ones.

A human trying to rip someone off in a deal because the other party is the sort of person who tries to rip others off in deals, even if they're being honorable in this specific one.

An AI working with clones of itself on a dispersed underwater construction project (being underwater badly messes with radio communications), that's doing a bunch of reasoning of the form "If I was in this alternate situation, I'd do X" and "the best up-front policy for all of us completing this project is Y" to coordinate workloads without needing to constantly swim back and forth to the designated meeting place to communicate with its alternate selves.

As it turns out, the best thing to do in this broader class of situations looks something like "act according to the policy you would have wanted to precommit to, if you had the ability to lock in your policy ahead of time." Just because you lose in a certain scenario, doesn't mean it's the wrong move to take, because predictably taking a losing move in a certain situation might make things better off overall. Other things looking in on your predictable decision, whether they be in your past, or future, or far-distant in space, must also be taken into account when you make a decision, and they can make the situation you're in be more or less likely, a-priori. It's called updateless, because instead of updating and going "oh, I'm in situation X, which has been revealed as the only situation to exist. Time to make situation X go well", you go "oh, I'm in situation X, time to act in such a way as to make things go well overall". Observations are for locating yourself, not for going "the rest of the game tree doesn't exist".

Interestingly enough, these sorts of "updateless" scenarios tend to arise in worst-case reasoning, precisely because worst-case reasoning can be interpreted as playing a game against an adversary with some limited ability to react to your move and make things worse off for you.

When making a strong AI, you'd really like it to be stable, so the nice properties in theory carry over to practice. If it's predictably going to rewrite itself to make decisions in a different way as soon as it gains self-modification ability, then all the work you put into the original decision procedure will be predictably wasted, and you should have been focusing on decision procedures where, when the AI grows up, it'll want to keep them around. So, if standard decision theory will make predictably crappy decisions in any environment that has predictors-of-your-actions, and near-clones running around, that standard decision theory will predictably get overwritten, and research effort would be better focused on whichever decision theories remain stable instead.

Of course, to crystallize these sorts of problems a bit better, a bunch of standardized toy problems have been assembled and discussed to death. For the benefit of people who are just now going "what's up with the decision-theory focus", here are some of the standard toy problems. They typically take the form of assuming a (near)-omniscient predictor, Omega, who likes putting people in weird decision theory scenarios, and has a >1000-to-1 ratio of correct to incorrect predictions. The standardized thought experiments are as follows.

Toy Problems And Terminology

Counterfactual Mugging: Omega commits to flip a coin, and if it comes up tails, ask you for 10 dollars, and if it comes up heads, give you 100 dollars iff you would have given it 10 dollars if the coin was tails. Omega explains this to you and then reveals that the coin came up tails. Do you give it the 10 dollars or not? Well, policies which pay up in this specific situation get 45 more dollars (in expectation) than policies which don't pay in this specific situation.

Parfit's Hitchhiker: You're stranded in the desert and Omega comes up. It will give you a ride out of the desert iff it predicts you'd give it 10,000 dollars upon reaching civilization again. You get a ride. When in civilization again, do you go over to the bank and withdraw some money? Well, policies which pay up in this specific situation get (value of a life - 10,000 dollars) more than policies which don't pay in this specific situation, which just die.

XOR Blackmail: Your house may or may not have termites, 1 percent chance. If it does, it costs 10,000 dollars for termite remediation. You recieve a letter from Omega which says "I have predicted that your house has a termite infestation xor you send 100 dollars to me". Policies which pay up in this specific situation will receive the letter with 99 percent probability, for -199 dollars in expectation. Policies which don't pay up in this specific situation will recieve the letter with 1 percent probability, for -100 dollars in expectation. So don't pay. Omega wasn't providing termite remediation services, just trying to take your money.

To establish a classification that we'll be using in future posts in this sequence, let's use "acausal effects" to denote the effects of a decision on timelines you're not in, because something's predicting what you would do in the situation you're in. Counterfactual mugging is an example of your decision in the tails branch having the acausal effect of Omega giving you money in the heads branch. These are also called cross-branch effects. And lets use "retrocausal effects" to denote the effects of a decision on what happened in your past, because something was predicting what you will do in the future. Parfit's Hitchhiker and XOR Blackmail are examples of your decision later-on having the retrocausal effect of getting saved in the desert/getting the letter sent to you or not. For Parfit's Hitchiker, you're trying to boost the probability of the branch you're in, and for XOR Blackmail, you're trying to minimize the probability of the branch you're in. These are also called retroactive effects. And "causal effects" just denotes the ordinary sort of effect where you affect the future through either completely typical means, or through something in the future acting based on what (they think) you did in the past.

UDT1.0 was the original formulation of UDT (updateless decision theory), which basically says "in a situation, select the action which maximizes a-priori expected utility". And UDT1.1 was the enhanced form, which says "pick a policy at the start which would maximize a-priori expected utility, and in a situation, just act according to what the policy recommends"

Ok, But Why Isn't UDT1.0 Enough?

Ok, so why wouldn't UDT1.0 work? Well, it was later pointed out that, in any situation where you need to correlate your action with what you do in alternate situations, UDT1.0 doesn't necessarily achieve that correlation behavior. If you're in an environment where stuff can predict your actions in future scenarios, or past scenarios, or alternate scenarios, then it's presumably possible to combine these predictions in ways which make it so that your optimal action depends on what the yous in alternate situations are doing. Local optimality doesn't imply global optimality here, you've gotta figure out a policy for what you do in all the possible situations.

Put another way, if you're already postulating an Omega who can predict what you'd do in a wide range of situations, it can presumably implement payoffs that depend on what you do in the various situations. If there are 4 situations that Omega is considering, and in each situation you have two actions available, then Omega can presumably set up any scenario of type (a payoff for each combination of what you do in the four scenarios). So, in full generality, with agents predicting you, it comes down to picking a general policy that, for each of the four scenarios, specifies what action to take. Just winging it by individually selecting your action might not cut it.

If we consider the version of you in a situation as a player in a game, then you're basically in a single-shot really large game against the yous in different situations, and all players have the same utility function. If you were updating, then the single-shot really large game would have the players (the yous in different situations) having different utility functions, because everyone only cares about the specific branch they're on, and the alternate versions of themself can fuck off.

But, even if all players in a game have the same utility function, there can be Nash equilibria that aren't globally optimal. Therefore, if you want global optimality, you've gotta coordinate everyone's moves. Ie, decide on the best policy/observation-to-action mapping at the start of time, and then just act in accordance with that.

UDT1.0, since it's just considering modifying its own move, corresponds to a player that's acting as if it's independent of what everyone else is deciding, instead of teaming up with its alternate selves to play the globally optimal policy.

UDT1.1 is the improved version which, at the start of time, constructs a policy for how to react to every possible situation, and then just follows that policy.

Computational Intractability

Of course, finding a globally optimal policy is ridiculously hard. If your notion of decision-making is "find the best possible mapping of observations to behavior at the start of time" then you run directly into the problem that you at the start of time is kinda dumb and might not make the best decisions, nor be able to forsee all eventualities. This obstacle gets far worse when we take into account our uncertainty over mathematics instead of just uncertainty over the environment, but that's a matter for a different post.

Let's try to formalize a bit of how impractical it is, with a nice toy setting that will get reused over and over again in this sequence of posts.

Lets say you're in an environment where there's a finite set of observations and actions. At every timestep, an observation will show up, and then you can take an action, and you're playing for n timesteps.

$O$ will denote our space of observations. $O^{n}$ is the space of n-digit bitstrings, ie, histories. $O^{\leq n}$ is the space of n-or-less-digit bitstrings. $h$ is often used to denote a string of observations/history. Observation strings/histories will also be referred to as "locations".

There is a tree of depth $n$ , whose nodes can be indexed by $\leq n$ -digit observation strings $O^{\leq n}$ . Ie, the tree of all possible histories.

There is a utility function on completed histories, $U : O^{n} \to [0, 1]$ .

At each node that isn't a leaf, an action in $A$ may be taken. The space of ways to react to bitstrings, $O^{< n} \to A$ , is henceforth denoted as $Π$ , the space of policies. We'll be a bit sloppy in distinguishing between deterministic and probabilistic policies. $π$ denotes a specific policy, and $π (h)$ is the action taken at location $h$ by $π$ , as you'd expect.

There is a function $e$ (for environment) of type $Π \times O^{< n} \to Δ O$ . Ie, given your policy, and the location $h$ you're at, it gives a probability distribution over which observation comes after h. UDT is typically associated with decision problems where something is predicting what you will do in future circumstances, or what you would do if things were different, and the most general form of this is that the probability distribution over what happens depends on your entire policy (what you do in various situations).

Given a policy $π$ and environment $e$ , $π ⋈ e$ denotes the probability distribution on length n histories, $Δ (O^{n})$ , produced by $π$ interacting with $e$ . We'll abuse notation a bit so that $(π ⋈ e) (b)$ for $b$ with $| b | < n$ actually refers to the probability of the final string being a length n extension of $b$ .

So that's our basic setup. There's a tree of observation strings, you can act at each observation string, and the environment may look at your policy and the bitstring you're at to decide which observation to extend it with. And the goal is to maximize expected utility. $E_{π ⋈ e} [U]$

UDT1.1 is then just ${argmax}_{π} E_{π ⋈ e} [U]$

But it's ludicrously hard to compute. The runtime of this is around $| A |^{| O |^{n}} \cdot T (n)$ where $T (n)$ is the time it takes to assess the expected utility of $π ⋈ e$ . After all, there are about $| O |^{n}$ elements of $O^{< n}$ and the space of policies is $O^{< n} \to A$ so there are around $| A |^{| O |^{n}}$ policies to check. So the naive version of UDT 1.1 runs in double-exponential time (wrt bitstring length).

Also, a "typical" environment with the specified type signature takes takes about $| O | \cdot | O |^{n} \cdot | A |^{| O |^{n}}$ numbers to describe, because for all history,policy,observation pairs you have to specify the odds of that observation coming after that history if you played that policy. So there are double-exponentially many environments. This is a whole lot of data. The sorts of environments we have a snowball's chance in hell of learning are presumably describable with far less data.

Implicitly, UDT1.1, despite being "updateless", is still scarfing down an absolutely massive amount of data about the environment to make its decision, except that this "massive amount of data", instead of being learned on the fly, is arbitrarily wadded up and stuck in the problem description at the very start, with no explanation of where it comes from. More on this tidbit, where the data is being snuck in at the very start, in a later post.

To understate things, double-exponential algorithms are terrible if our goal is to one day build a thing, and we should find something better.

But this doesn't look like it can be notably improved without further restrictions. This is because for any policy $π$ you can always come up with a really perverse environment $e_{π}$ that cooordinates the entire tree to give you a good outcome exactly when your policy is $π$ , and a terrible outcome otherwise. Ie, "do exactly what I say in all concievable circumstances and you win, anything else means you lose".

This is about where all the previous research on UDT stalled out. More progress requires making more assumptions.

The Rant About Learning Theory

I've lately become more of a fan of the classic learning theory setting than I used to be. The standard Bayesian story is that you have a prior over which environment you're in, and as you see more, you update your prior over which environment you're in, and act to optimize the environment you seem to be in.

The learning theory story is that you have a set of environments, where, no matter the environment in the set, you want to converge to decent behavior. It's more of a worst-case behavior guarantee than an "on-average according to the prior" guarantee. The link between this, and the Bayesian story, is that if you've got an algorithm that converges to optimal behavior in all the environments in your set, it'll also converge to optimal behavior in any probabilistic mix of the environments in your set. And conversely, if there's an algorithm that'll converge to optimal behavior according to some suitable Bayesian prior, it'll also converge to optimal behavior in all the component environments if it's possible to do so.

A key part of the learning theory story, is that not every set of environments is learnable. There are some sets of environments which are so rich, that there's no strategy which converges to doing well on all of them. No matter your strategy, there will be some environment where that strategy fails horribly. The goal is to get a set of environments that's rich enough that it probably contains the thing you want to learn somewhere in there, but not rich enough to completely sacrifice your ability to learn anything.

If there's some property of n-bit bitstrings, which can be either true or false, then trying to learn what that property is, to make predictions about unobserved bitstrings, is basically trying to learn an unknown function ${0, 1}^{n} \to {0, 1}$ . Now, if your class of functions is sufficiently restricted, and the true property is in that restricted class, you can do alright, because the class is "small enough" that the number of data points required to pin down a specific function grows polynomially with n instead of exponentially, so you don't need to observe too much data to pin down what's going on. But if your class of functions is "all boolean functions", the number of data points required to pin down a specific hypothesis is roughly the amount of data you'd need to write the entire damn lookup table.

And so, the nontrivial claim of the learning theory side of things is that, if the Bayesian side says "I've got a prior over hypotheses and in most of my prior I converge to learning what's going on pretty fast", then the Bayesian side is implicitly assigning large prior probability to a learnable set of hypotheses. (In a vague sense. For the precise definition of learnability, it may fail, but I expect something learnability-like to be present)

Applying this philosophy to the question "can we come up with any remotely reasonable algorithm that behaves kinda sorta like UDT" produces a rather clear diagnosis of where the decade of little progress came from.

There's some class of environments where converging to good behavior in them requires figuring out stuff like "what sorts of things are running around? How well can they predict me? Which sorts of acausal/retrocausal links can actually be learned by experimenting? What sorts of precommitments are the sort that produce good results, and what sorts of precommitments are the sort which lock in stupid decisions for long periods of time? When should I defer to my future self?", and exploiting that environmental structure by turning into the sort of thing that predictably acts as it would want things to predict it acts.

However, the sort of problem setup from earlier, where we're considering all the policy-prediction environments, is obviously far too rich a class of environments to get any sort of reasonable guarantee. It's even worse than trying to learn the class of all boolean functions. And so it should come as no surprise that everyone's implicitly running face-first into the No Free Lunch Theorem when they try to think about UDT.

To make any further progress, we must restrict the sorts of environments we're thinking about, to a class with more structure to exploit, and have the algorithm somehow figure out what sort of setting it's in.

And the objection of "well isn't that updating then?" can only get made because everyone's being unacceptably vague about exactly what that "prior over policy-selection environments" consists of, and handwaving away the fact that the entire problem and all its complications and data about the environment got stuffed into the prior and that argmax computation. In fact, the advance prediction can be made that the more Bayesian side of things will either never be able to make progress on showing a "prior on policy-selection problems with nice properties", or if they do come up with a prior that has nontrivial nice properties, it will implicitly be because that prior is putting a lot of mass on some substantially-more-restricted class of policy-selection environments with a lot more structure to exploit, which is learnable in some sense.

To say more about a piece of math, one must make more assumptions. More properties are true of an apple than true of all the apples in the world, and more can be shown about linear algebra in particular than about modules in general.

We shouldn't expect good performance in every conceivable environment. There are some really messed-up functions out there. But the sorts of functions that we tend to deal with in everyday life are much more manageable functions. There's a no free lunch theorem in learning theory. But in practice we're often faced with problems where it's possible to learn to do well. Loss functions of neural networks aren't convex. And yet gradient descent works pretty well anyways.

And so, we began work on "if you're willing to make some simplifying assumptions that rule out most possible policy-selection environments, and look for a UDT1.0 sort of algorithm to just climb up to a local optimum in policy-space that's probably pretty good in practice, and ignore the cross-branch coordination aspect, how far can you get?" Well, a lot further than this, it turns out.

To be continued.

21

UDT1.01: The Story So Far (1/10)

21