This post owes credit to discussions with Caspar Oesterheld, Scott Garrabrant, Sahil, Daniel Kokotajlo, Martín Soto, Chi Nguyen, Lukas Finnveden, Vivek Hebbar, Mikhail Samin, and Diffractor. Inspired in particular by a discussion with cousin_it.
Short version: The idea here is to combine the Nash bargaining solution with Wei Dai's UDT to give an explicit model of some ideas from Open-Minded Updatelessness by Nicolas Macé, Jesse Clifton, and SMK.

- Start with a prior divided into hypotheses $H_1, \dots, H_n$.
- BATNA: the updateful policy $\pi_u$.
- Actual policy choice maximizes the product of gains from trade across hypotheses:

$$\pi^* \;=\; \arg\max_\pi \; \prod_i \Big( \mathbb{E}[U_i \mid \pi, H_i] \;-\; \mathbb{E}[U_i \mid \pi_u, H_i] \Big)$$

where $U_i$ is the utility function asserted by hypothesis $H_i$.
I wrote an essay arguing for open-minded updatelessness last year, but it was mostly motivation, and lacked a concrete proposal. The present essay fills that gap. I don't think the proposal is perfect, but it does address the problems I raised.
I'm calling my proposal Geometric UDT, due to taking inspiration from Scott Garrabrant's Geometric Rationality, although the connection is a bit weak; what I'm doing is VNM-rational, whereas Geometric Rationality is not.
You might be persuaded by reflective consistency concerns, and conclude that if humanity builds machine superintelligence, it had better use updateless decision theory (UDT)[1].[2] You might separately be convinced that machine superintelligence needs to do some kind of value learning, in order to learn human values, because human values are too difficult for humans to write down directly.[3]
You might then wonder whether these two beliefs stand in contradiction. "Value learning" sounds like it involves updating on evidence. UDT involves never updating on evidence. Is there a problem here?
"Maybe not," you might hope. "UDT is optimal in strictly more situations than updateful DT. This means UDT behaves updatefully in cases where behaving updatefully is optimal, even though it doesn't actually update.[4] Maybe value learning is one of those cases!"
You might do the math as follows:
Suppose that, before building machine superintelligence, we narrow down human values to two possibilities: puppies, or rainbows. We aren't sure which of these represents human values, but we're sure it is one of the two, and we assign them even odds.
In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.
|                   | Rainbow World | Puppy World |
|-------------------|---------------|-------------|
| Maximize Rainbows | +100          | 0           |
| Maximize Puppies  | 0             | +90         |
The machine superintelligence will be able to observe enough information about humans to distinguish puppy-world from rainbow-world within 3 seconds of being switched on. There are four policies it could follow:

| Policy | Rainbow World | Puppy World | Expected Utility |
|--------|---------------|-------------|------------------|
| Always maximize rainbows | +100 | 0 | 50 |
| Always maximize puppies | 0 | +90 | 45 |
| Rainbows in rainbow-world, puppies in puppy-world | +100 | +90 | 95 |
| Puppies in rainbow-world, rainbows in puppy-world | 0 | 0 | 0 |
The highest EV is to do the obvious value-learning thing; so, there's no problem. UDT behaves updatefully, as is intuitively desirable!
Unfortunately, not all versions of this problem work out so nicely.
Some hypotheses will "play nice" like the example above, and updateless value learning will work fine.
However, there are some versions of "valuing puppies" and "valuing rainbows" which value puppies/rainbows regardless of which universe the puppies/rainbows are in. These utility functions are called "nosy neighbors" because they care about what happens in other realities, not just their own.[5] (We could call the non-nosy hypotheses "nice neighbors".)
Nosy Neighbors: Technical Details
Utility functions are standardly modeled as random variables. A random variable is a function from worlds (aka "outcomes") to real numbers; you feed the world to the utility function, and the utility function outputs a score for the world.
A world, in turn, can be understood as a truth-valuation: a function from propositions to true/false. Feed a proposition to a world, and the world will tell you whether the proposition was true or false.[6] You can imagine the utility function looking at the assignment of propositions to true/false in order to compute the world's score. For example, at each location L, the utility function could check the proposition puppy(L), which is true if there's a puppy at L. The score of a world could be the total number of true instances of puppy(L).[7]
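To make this concrete, here's a minimal Python sketch (the number of locations and proposition names like `puppy(L)` are just illustrative): a world is a truth-valuation over propositions, and the puppy-counting utility function is a map from worlds to scores.

```python
from typing import Dict

# A world is a truth-valuation: a map from propositions (here, strings) to True/False.
World = Dict[str, bool]

LOCATIONS = range(3)  # a tiny toy universe with three locations

def puppy_utility(world: World) -> float:
    """Score a world by counting the true instances of puppy(L)."""
    return sum(1.0 for L in LOCATIONS if world.get(f"puppy({L})", False))

# A world with puppies at locations 0 and 2 scores 2.0:
example_world = {"puppy(0)": True, "puppy(1)": False, "puppy(2)": True}
print(puppy_utility(example_world))  # 2.0
```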
In order to model uncertainty between utility functions, what I'm really suggesting is a utility function which behaves like one function or another depending on some fact. When I describe "puppy world" vs "rainbow world", I have in mind that there is some fact we are uncertain about (the fact of "what human values are") which is one way in case humans value puppies, and another way in case humans value rainbows. The utility function encodes our value uncertainty by first checking this fact, and then, if it is one way, proceeding to act like the puppy-valuing utility function (scoring the world based on puppies); otherwise, it acts like the rainbow-valuing function (scoring the world based on rainbows).
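Continuing the toy sketch, the uncertain utility function can be written as a single function that first checks the value-fact and then scores the world accordingly (the proposition name `humans_value_puppies` is just a stand-in for whatever fact settles what human values are):

```python
from typing import Dict

World = Dict[str, bool]
LOCATIONS = range(3)

def count(world: World, predicate: str) -> float:
    """Count the true instances of predicate(L) across locations."""
    return sum(1.0 for L in LOCATIONS if world.get(f"{predicate}({L})", False))

def uncertain_utility(world: World) -> float:
    """Check the value-fact first, then score the world accordingly."""
    if world.get("humans_value_puppies", False):
        return count(world, "puppy")    # behave like the puppy-valuing function
    return count(world, "rainbow")      # behave like the rainbow-valuing function

# In a puppy-world, rainbows contribute nothing to the score:
w = {"humans_value_puppies": True, "puppy(0)": True, "rainbow(1)": True}
print(uncertain_utility(w))  # 1.0
```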
Naively, you might think that this means "nosy neighbors" don't make sense: the score of a world only depends on stuff in that world. With our uncertain utility function, the puppy-score is only computed in puppy-world. It can't check for puppies in rainbow-world. What could I possibly mean, when I say that the utility function "values puppies/rainbows regardless of which universe the puppies/rainbows are in"?
What I have in mind is that we believe in some sort of counterfactual information. In puppy-world, there should be a fact of the matter about "what would the machine superintelligence have done in rainbow world" -- some propositions tracking this contingency.
You don't have to be a realist about counterfactuals or possible worlds (a "modal realist") in order to believe this. There doesn't literally have to be a fact of the matter of what actually would happen in a non-real world. There just have to be good proxies, to cause problems. For example, there could be a fact of the matter of what happens when someone thinks through "what would have happened in rainbow-world?" in detail. Maybe at some point someone runs a detailed simulation of rainbow-world within puppy-world. This is sufficient to cause a problem.
This sort of thing is quite difficult to rule out, actually. For example, if you grant that mathematical propositions are true/false, that is enough: mathematics will have, somewhere inside it, a specification of both rainbow-world and puppy-world (or at least, adequately good approximations thereof).
I think of such propositions as "windows to other worlds" which exist inside a single world. Nosy Neighbor hypotheses are utility functions which depend on those propositions.[8]
If you care about what happens in the Game of Life (in the abstract, not just instances you see in front of you) then your values are nosy-neighbor values. If a puppy-loving utility hypothesis checks for puppy-like structures in Game of Life and adds them to the score, that's a nosy-neighbor hypothesis.
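Here's a toy sketch of what a nosy-neighbor hypothesis might look like in the same representation; the `sim_puppy(L)` propositions are my illustrative stand-in for "windows to other worlds", e.g. puppy-like structures in a detailed simulation (or a Game of Life) running inside the actual world:

```python
from typing import Dict

World = Dict[str, bool]
LOCATIONS = range(3)

def count(world: World, predicate: str) -> float:
    """Count the true instances of predicate(L) across locations."""
    return sum(1.0 for L in LOCATIONS if world.get(f"{predicate}({L})", False))

def nosy_puppy_utility(world: World) -> float:
    """A nosy-neighbor puppy hypothesis: it scores ordinary puppies AND puppy-like
    structures seen through 'windows to other worlds' -- here, propositions about
    a detailed simulation running inside the actual world."""
    return count(world, "puppy") + count(world, "sim_puppy")

# A world with no actual puppies, but containing a detailed simulation of puppy-world:
w = {"puppy(0)": False, "sim_puppy(0)": True, "sim_puppy(1)": True}
print(nosy_puppy_utility(w))  # 2.0 -- the hypothesis still cares about those puppies
```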
Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I'll assume they're nosy enough that they value puppies/rainbows in other universes exactly as much as in their own. There are four policies:

| Policy | Rainbow World | Puppy World | Expected Utility |
|--------|---------------|-------------|------------------|
| Always maximize rainbows | +200 | 0 | 100 |
| Always maximize puppies | 0 | +180 | 90 |
| Rainbows in rainbow-world, puppies in puppy-world | +100 | +90 | 95 |
| Puppies in rainbow-world, rainbows in puppy-world | +100 | +90 | 95 |
Hence, in this scenario, UDT will choose to always make rainbows.
This shows that in the presence of nosy neighbors, the naive concern can be vindicated: UDT has trouble with "value learning".
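As a sanity check, here's a small Python sketch that recomputes the expected values of the four policies from the stated payoffs and the 50/50 prior, with and without the nosy-neighbor assumption:

```python
WORLDS = ["rainbow", "puppy"]            # equally probable, per the 50/50 prior
PAYOFF = {"rainbow": 100, "puppy": 90}   # value of tiling a universe with the favored thing

# A policy maps the observed world to the thing the AI tiles that universe with.
POLICIES = {
    "always rainbows": {"rainbow": "rainbow", "puppy": "rainbow"},
    "always puppies":  {"rainbow": "puppy",   "puppy": "puppy"},
    "value learning":  {"rainbow": "rainbow", "puppy": "puppy"},
    "anti-learning":   {"rainbow": "puppy",   "puppy": "rainbow"},
}

def expected_value(policy, nosy):
    ev = 0.0
    for actual in WORLDS:  # the world that turns out to be actual (probability 1/2 each)
        # The correct value hypothesis is the one matching the actual world.
        if nosy:
            # Nosy neighbor: favored tilings count in *every* universe, not just the actual one.
            score = sum(PAYOFF[actual] for w in WORLDS if policy[w] == actual)
        else:
            # Non-nosy: only the actual universe's tiling matters.
            score = PAYOFF[actual] if policy[actual] == actual else 0
        ev += 0.5 * score
    return ev

for nosy in (False, True):
    evs = {name: expected_value(p, nosy) for name, p in POLICIES.items()}
    print(f"nosy={nosy}:", evs, "-> best:", max(evs, key=evs.get))
# nosy=False: value learning wins with EV 95
# nosy=True:  always rainbows wins with EV 100
```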
One response to this problem might be to assign nosy-neighbor hypotheses probability zero. The issue I have with this solution is that human values may well be nosy.
Instead, I propose we treat this as a bargaining problem between the hypotheses.
I'll use the Nash Bargaining Solution. This has some particularly nice properties, but the choice here is mostly out of habit, and we could certainly consider other options.
We start by choosing a BATNA (Best Alternative to Negotiated Agreement). This is a policy which we'd default to if the parties at the bargaining table (in our case, the hypotheses) were not able to come to an agreement.
My proposal is to set the BATNA to the policy an updateful decision theory would choose. For example, we could use updateful EDT, or we could use Thompson sampling based on the updated hypothesis weights. Either way, a hypothesis gets "more control" as it becomes more probable. This starting-point for bargaining gives each hypothesis a minimal guarantee of expected utility: puppy-world will get at least as much expected utility as if humans had built an updateful machine superintelligence instead of an updateless one.
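Here's a minimal sketch of the two BATNA options just mentioned, using made-up post-observation weights for the toy hypotheses: updateful EDT picks the posterior-best action, while Thompson sampling draws a hypothesis in proportion to its posterior weight and lets it choose. Either way, a more probable hypothesis gets more control.

```python
import random

# Toy post-observation state: posterior weights over the value hypotheses
# (made-up numbers), and each hypothesis's score for each available action.
posterior = {"rainbow-values": 0.9, "puppy-values": 0.1}
action_scores = {
    "make rainbows": {"rainbow-values": 100, "puppy-values": 0},
    "make puppies":  {"rainbow-values": 0,   "puppy-values": 90},
}

def updateful_edt(posterior, action_scores):
    """Pick the action that maximizes posterior-expected utility."""
    def post_eu(action):
        return sum(posterior[h] * score for h, score in action_scores[action].items())
    return max(action_scores, key=post_eu)

def thompson_sample(posterior, action_scores):
    """Sample a hypothesis in proportion to its posterior weight, then let it choose."""
    h = random.choices(list(posterior), weights=list(posterior.values()))[0]
    return max(action_scores, key=lambda a: action_scores[a][h])

print(updateful_edt(posterior, action_scores))    # 'make rainbows' at these weights
print(thompson_sample(posterior, action_scores))  # usually 'make rainbows', occasionally 'make puppies'
```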
Once we've chosen a BATNA, the Nash Bargaining Solution maximizes the product of the gains-from-trade:

$$\pi^* \;=\; \arg\max_\pi \; \prod_i \Big( \mathbb{E}[U_i \mid \pi, H_i] \;-\; \mathbb{E}[U_i \mid \pi_u, H_i] \Big)$$

Here $H_i$ ranges over the hypotheses, $U_i$ is the utility function hypothesis $H_i$ asserts, and $\pi_u$ is the BATNA policy chosen above.
This ensures a Pareto-optimal policy (unlike the updateful policy). This can only increase the expected utility every individual hypothesis expects to get, in comparison to what they'd get from the updateful policy. You can imagine that the hypotheses are trading away some things they'd control (in the updateful policy) in exchange for consideration from others. The puppy-maximizer might produce some rainbows in exchange for the rainbow-maximizer producing some puppies, but you won't get an all-rainbow policy like we did before, because we're not allowed to completely ignore any hypothesis.
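To see the bargaining step in action on the nosy puppy/rainbow example, here's a toy grid search over policies contingent on the observation, taking the updateful (value-learning) policy as the BATNA. The percent parameterization is my own and everything is hard-coded to the payoffs above.

```python
# Toy grid search for the Nash bargaining step in the nosy puppy/rainbow example.
# A contingent policy is parameterized in whole percents:
# (r_R, r_P) = chance of tiling with rainbows after observing rainbow-world / puppy-world.

def utilities(r_R, r_P):
    """Expected utility of each *nosy* hypothesis under a contingent policy;
    nosy hypotheses count their favored tiling in both universes."""
    u_rainbow = 100 * (r_R + r_P) / 100               # 100 per universe tiled with rainbows
    u_puppy = 90 * ((100 - r_R) + (100 - r_P)) / 100  # 90 per universe tiled with puppies
    return u_rainbow, u_puppy

# BATNA: the updateful policy -- rainbows in rainbow-world, puppies in puppy-world.
batna = utilities(100, 0)  # (100.0, 90.0)

def nash_product(r_R, r_P):
    gains = [u - b for u, b in zip(utilities(r_R, r_P), batna)]
    # Only Pareto improvements over the BATNA are admissible.
    return gains[0] * gains[1] if min(gains) >= 0 else float("-inf")

best = max(((r_R, r_P) for r_R in range(101) for r_P in range(101)),
           key=lambda p: nash_product(*p))

print(nash_product(100, 0))    # 0.0  -- the value-learning BATNA itself
print(nash_product(100, 100))  # -inf -- "always rainbows" drops the puppy hypothesis below its BATNA
print(nash_product(*best))     # 0.0  -- no strict Pareto improvement exists in this toy case
```

In this particular example there is no strict Pareto improvement over the BATNA (any policy that gives the rainbow hypothesis more must give the puppy hypothesis less), so the bargain just ratifies a policy utility-equivalent to value learning; the important point is that "always rainbows" is excluded, because it pushes the puppy hypothesis below its guarantee.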
Another reasonable option is to adjust the weight of each hypothesis based on its probability, raising each hypothesis's gain to the power of its prior probability $p_i$:

$$\pi^* \;=\; \arg\max_\pi \; \prod_i \Big( \mathbb{E}[U_i \mid \pi, H_i] \;-\; \mathbb{E}[U_i \mid \pi_u, H_i] \Big)^{p_i}$$
The probability of a hypothesis already adjusts what it gets in the BATNA, but this additional adjustment makes the solution less dependent on how we partition the probability distribution into hypotheses.
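As a quick illustration of the partition point: if a hypothesis with probability $p_i$ and gain $g_i(\pi)$ is split into two identical copies, each carrying probability $p_i/2$, the probability-weighted product is unchanged, whereas the unweighted product would square that hypothesis's influence:

$$\big(g_i(\pi)\big)^{p_i/2}\cdot\big(g_i(\pi)\big)^{p_i/2} \;=\; \big(g_i(\pi)\big)^{p_i}, \qquad \text{whereas} \qquad g_i(\pi)\cdot g_i(\pi) \;=\; \big(g_i(\pi)\big)^{2}.$$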
Since the chosen policy will be Pareto-optimal, Complete Class Theorems guarantee that it can be interpreted as maximizing expected utility with respect to some prior; that is, it'll still be UDT-rational. The interesting thing is that it isn't UDT-optimal with respect to our honest prior (the one we use to define the BATNA). Geometric UDT advises us to use a fair prior (in the Nash Bargaining Solution sense) instead of our honest prior. The honest prior has a chance of completely screwing over humans (completely ignoring the correct value-hypothesis even in the face of overwhelming evidence), whereas the fair prior does not.
I've framed things in this post in terms of value uncertainty, but I believe everything can be re-framed in terms of uncertainty about what the correct prior is (which connects better with the motivation in my previous post on the subject).
One issue with Geometric UDT is that it doesn't do very well in the presence of some utility hypotheses which are exactly or approximately negative of others: even if there is a Pareto-improvement, the presence of such enemies prevents us from maximizing the product of gains-from-trade, so Geometric UDT is indifferent between such improvements and the BATNA. This can probably be improved upon.
In this essay, UDT means UDT 1.1.
I'm not claiming there's a totally watertight argument for this conclusion given this premise; I'm only claiming that if you believe something like this you should probably care about what I'm doing here.
Even a simple strategy like "trust whatever the humans tell you about what they want" counts as value learning in this sense; the important thing is that the AI system doesn't start out totally confident about what humans want, and it observes things that let it learn.
This isn't a strictly true mathematical assertion; in reality, we need to make more assumptions in order to prove such a theorem (eg, I'm not defining what 'optimality' means here). The point is more that this is the sort of thing someone who is convinced of UDT is inclined to believe (they'll tend to be comfortable with making the appropriate additional assumptions such that this is true).
This phenomenon was discovered and named by Diffractor (private communication).
In logic, it is more typical to understand a world as a truth-valuation like this, so that worlds are functions from propositions to {true, false}. In probability, it is more typical to reverse things, treating a proposition (aka "event") as a set of worlds, so that given a world, you can check if it is in the set (so a proposition can be thought of as a function from worlds to {true, false}).
This distinction doesn't matter very much, at least not for our purposes here.
This particular utility function will not be well-defined if there are infinitely many locations, since the sum could fail to converge. There are many possible solutions to this problem, but the discussion goes beyond the present topic.
We can also get a nosy-neighbor effect without putting terminal utility on other worlds, if we believe that what happens in other worlds impacts our world. For example, maybe in puppy-world, a powerful being named Omega simulates what happens in rainbow-world, and creates or destroys some puppies accordingly. Caring about what happens in other worlds is then induced indirectly through beliefs.