
Reframing Impact
Attainable Utility Landscape: How The World Is Changed

by Alex Turner
10th Feb 2020
7 min read
Previous: Seeking Power is Often Convergently Instrumental in MDPs
Next: The Catastrophic Convergence Conjecture

5 comments, sorted by top scoring
[-]Pattern5y20
Going to the green state means you can't get to the purple state as quickly.
On a deep level, why is the world structured such that this happens? Could you imagine a world without opportunity cost of any kind?

In a complete graph, all nodes are directly connected.


Equivalently, we assumed the agent isn't infinitely farsighted (γ<1); if it were, it would be possible to be in "more than one place at the same time", in a sense (thanks to Rohin Shah for this interpretation).

The opposite of this is that if it were possible for an agent to be in more than one place at the same time, they could be infinitely farsighted. (Possibly as a consequence of FTL.)

[-]Alex Turner5y20

In a complete graph, all nodes are directly connected.

Surprisingly, unless you're talking about $K_1$ (the complete 1-graph), opportunity cost still exists in $K_n$ ($n>1$). Each round, you choose where to go next (and you can go to any state immediately). Going to one state next round means you can't go to a different state next round, so for any given action there exists a reward function which incurs opportunity cost.

Definition. We say opportunity cost exists at a state $s$ if there exist child states $s_1, s_2$ of state $s$ such that $V^*_R(s_1) \neq V^*_R(s_2)$ for some reward function $R$. That is, $s$ has successor states with different (optimal) AUs for some reward function.

The opposite of this is that if it were possible for an agent to be in more than one place at the same time, they could be infinitely farsighted. (Possibly as a consequence of FTL.)

Things get weird here, depending on your theory of identity and how that factors into the planning / reward process? Can you spell this out some more?

[-]Rafael Harth5y10

The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.

The early part is slightly confusing, though. I thought AU is a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

I did interpret the first exercise as "you planned to go onto the moon" and came up with stuff like "how valuable are the stones I can take home" and "how pleasant will it be to hang around."

One thing I noticed is that the formal policies don't allow for all possible "strategies." In the graph we had to reconstruct, I can't start at s1, then go to s1 once and then go to s3. So you could think of the larger set ΠL where the policies are allowed to depend on the time step. But I assume there's no point unless the reward function also depends on the time step. (I don't know anything about MDPs.)

Am I correct that a deterministic transition function is a function $T : S \times A \to S$ and a non-deterministic one is a function $T : S \times A \times S \to [0,1]$?

[-]Alex Turner5y20

Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

yes

One thing I noticed is that the formal policies don't allow for all possible "strategies."

yeah, this is because those are “nonstationary” policies - you change your mind about what to do at a given state. A classic result in MDP theory is that you never need these policies to find an optimal policy.

Am I correct that a deterministic transition function is

yup!

[-]Stuart Armstrong5y10

I find the existing MDP isomorphisms/equivalences to be pretty lacking.

I have a paper on equivalences (and counterfactual equivalences, which is stronger) for POMDPs: https://arxiv.org/abs/1801.03737


(This is one interpretation of the prompt, in which you haven't chosen to go to the moon. If you imagined yourself as more prepared, that's also fine.)

If you were plopped onto the moon, you'd die pretty fast. Maybe the "die as quickly as possible" AU is high, but not much else - not even the "live on the moon" AU! We haven't yet reshaped the AU landscape on the moon to be hospitable to a wide range of goals. Earth is special like that.

AU landscape as a unifying frame

Attainable utilities are calculated by winding your way through possibility-space, considering and discarding possibility after possibility to find the best plan you can. This frame is unifying.

Sometimes you advantage one AU at the cost of another, moving through the state space towards the best possibilities for one goal and away from the best possibilities for another goal. This is opportunity cost.

Sometimes you gain more control over the future: most of the best possibilities make use of a windfall of cash. Sometimes you act to preserve control over the future: most Tic-Tac-Toe goals involve not ending the game right away. This is power.

Other people usually objectively impact you by decreasing or increasing a bunch of your AUs (generally, by changing your power). This happens for an extremely wide range of goals because of the structure of the environment.

Sometimes, the best possibilities are made unavailable or worsened only for goals very much like yours. This is value impact.

Sometimes a bunch of the best possibilities go through the same part of the future: fast travel to random places on Earth usually involves the airport. This is instrumental convergence.

Exercise: Track what’s happening to your various AUs during the following story: you win the lottery. Being an effective spender, you use most of your cash to buy a majority stake in a major logging company. Two months later, the company goes under.

Technical appendix: AU landscape and world state contain equal information

In the context of finite deterministic Markov decision processes, there's a wonderful handful of theorems which basically say that the AU landscape and the environmental dynamics encode each other. That is, they contain the same information, just with different emphasis. This supports thinking of the AU landscape as a "dual" of the world state.

Let $\langle S, A, T, \gamma \rangle$ be a rewardless deterministic MDP with finite state and action spaces $S, A$, deterministic transition function $T$, and discount factor $\gamma \in (0,1)$. As our interest concerns optimal value functions, we consider only stationary, deterministic policies: $\Pi := A^S$.

The first key insight is to consider not policies, but the trajectories induced by policies from a given state; to not look at the state itself, but the paths through time available from the state. We concern ourselves with the possibilities available at each juncture of the MDP.

To this end, for $\pi \in \Pi$, consider the mapping $\pi \mapsto (I - \gamma T^\pi)^{-1}$ (where $T^\pi(s,s') := T(s, \pi(s), s')$); in other words, each policy $\pi$ maps to a function taking each state $s_0$ to a discounted state visitation frequency vector $f^\pi_{s_0}$, which we call a possibility. The meaning of each frequency vector is: starting in state $s_0$ and following policy $\pi$, what sequence of states $s_0, s_1, \ldots$ do we visit in the future? States visited later in the sequence are discounted according to $\gamma$: the sequence $s_0 s_1 s_2 s_2 \ldots$ would induce $1$ visitation frequency on $s_0$, $\gamma$ visitation frequency on $s_1$, and $\frac{\gamma^2}{1-\gamma}$ visitation frequency on $s_2$.

The possibility function $F$ outputs the possibilities available at a given state: $F(s) := \{ f^\pi_s \mid \pi \in \Pi \}$.

Put differently, the possibilities available are all of the potential film-strips of how-the-future-goes you can induce from the current state.
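
To make this concrete, here is a minimal numpy sketch (the three-state graph, `transitions`, and the helper names are my own illustration, not the post's example): it enumerates the stationary deterministic policies of a small deterministic MDP and reads each possibility $f^\pi_{s_0}$ off a row of $(I-\gamma T^\pi)^{-1}$.

```python
import numpy as np
from itertools import product

# Hypothetical rewardless deterministic MDP: transitions[s] lists the states
# reachable from s in one step. (Not the exercise MDP from the post.)
transitions = {0: [0, 1], 1: [0, 2], 2: [2]}
gamma = 0.75
n = len(transitions)

def possibility(policy, s0):
    """f^pi_{s0}: discounted state visitation frequencies, row s0 of (I - gamma T^pi)^{-1}."""
    T_pi = np.zeros((n, n))
    for s, s_next in enumerate(policy):
        T_pi[s, s_next] = 1.0            # under pi, state s deterministically moves to s_next
    e = np.zeros(n)
    e[s0] = 1.0
    return e @ np.linalg.inv(np.eye(n) - gamma * T_pi)

def possibility_set(s0):
    """F(s0): the distinct possibilities over all stationary deterministic policies."""
    policies = product(*(transitions[s] for s in range(n)))
    return {tuple(np.round(possibility(pi, s0), 4)) for pi in policies}

for s0 in range(n):
    print(s0, sorted(possibility_set(s0)))   # each vector's L1 norm is 1/(1 - gamma) = 4
```

Each printed vector is one such film-strip: the start state gets frequency at least 1, and later visits are discounted by powers of $\gamma$.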

Possibility isomorphism

We say two rewardless MDPs $M$ and $M'$ are isomorphic up to possibilities if they induce the same possibilities. Possibility isomorphism captures the essential aspects of an MDP's structure, while being invariant to state representation, state labelling, action labelling, and the addition of superfluous actions (actions whose results are duplicated by other actions available at that state). Formally, $M \simeq_F M'$ when there exists a bijection $\phi : S \to S'$ (letting $P_\phi$ be the corresponding $|S|$-by-$|S'|$ permutation matrix) satisfying $F_M(s) = \{ P_\phi f' \mid f' \in F_{M'}(\phi(s)) \}$ for all $s \in S$.

This isomorphism is a natural contender[1] for the canonical (finite) MDP isomorphism:

Theorem: M and M′ are isomorphic up to possibilities iff their directed graphs are isomorphic (and they have the same discount rate).
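
As a rough illustration of this definition (my own sketch, not code from the post), a brute-force check of possibility isomorphism just searches over bijections between the state sets. Here `F_M` and `F_Mp` are assumed to map each state $0, \dots, n-1$ to its set of possibility tuples:

```python
from itertools import permutations

def isomorphic_up_to_possibilities(F_M, F_Mp):
    """Search for a bijection phi with F_M(s) = {P_phi f' : f' in F_M'(phi(s))} for all s.
    Exponential in |S|; illustration only. Assumes possibilities are stored as exactly
    matching tuples (round them to a fixed precision first)."""
    n = len(F_M)
    if n != len(F_Mp):
        return None
    for phi in permutations(range(n)):               # phi[s] = image of state s in M'
        def reindex(f):                              # (P_phi f')[s] = f'[phi[s]]
            return tuple(f[phi[s]] for s in range(n))
        if all(set(F_M[s]) == {reindex(f) for f in F_Mp[phi[s]]} for s in range(n)):
            return phi                               # a witnessing bijection
    return None
```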

Representation equivalence

Suppose I give you the following possibility sets, each containing the possibilities for a different state:

$$\left\{ \begin{pmatrix} 4 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ .75 \\ 2.25 \end{pmatrix}, \begin{pmatrix} 1.4375 \\ 4 - 1.4375 \\ 0 \end{pmatrix} \right\} \qquad \left\{ \begin{pmatrix} 0 \\ 0 \\ 4 \end{pmatrix} \right\} \qquad \left\{ \begin{pmatrix} 0 \\ 1 \\ 3 \end{pmatrix}, \begin{pmatrix} 4 - 1.4375 \\ 1.4375 \\ 0 \end{pmatrix}, \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix} \right\}$$

Exercise: What can you figure out about the MDP structure? Hint: each entry in the column corresponds to the visitation frequency of a different state; the first entry is always $s_1$, the second $s_2$, and the third $s_3$.

You can figure out everything: $\langle S, A, T, \gamma \rangle$, up to possibility isomorphism. Solution here.

How? Well, the $L_1$ norm of each possibility vector is always $\frac{1}{1-\gamma}$, so you can easily deduce $\gamma = .75$. The state with only a single possibility must be isolated, so we can mark that down in our graph; also, it corresponds to the third entry, $s_3$.

The other two states correspond to the "1" entries in their possibilities (a possibility always places frequency at least $1$ on its starting state), so we can mark that down. The rest follows straightforwardly.

Theorem: Suppose the rewardless MDP M has possibility function F. Given only F,[2] M can be reconstructed up to possibility isomorphism.
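
Here is a sketch of how such a reconstruction can go (my own illustration; `reconstruct` assumes `F` maps each state $0, \dots, n-1$ to its set of possibility tuples). It leans on the recursion $f^\pi_s = e_s + \gamma f^\pi_{s'}$, where $s'$ is the successor of $s$ under $\pi$:

```python
import numpy as np

def reconstruct(F):
    """Recover gamma and the one-step transition graph from possibility sets alone (sketch)."""
    some_f = next(iter(next(iter(F.values()))))
    gamma = 1 - 1 / sum(some_f)              # every possibility has L1 norm 1/(1 - gamma)
    edges = set()
    for s, possibilities in F.items():
        for f in possibilities:
            g = np.array(f, dtype=float)
            g[s] -= 1.0                      # strip the visit to s at time 0 ...
            g /= gamma                       # ... leaving a possibility of the successor state
            for s2, poss2 in F.items():
                if any(np.allclose(g, f2) for f2 in poss2):
                    edges.add((s, s2))       # so some action leads from s to s2
    return gamma, edges
```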

In MDPs, the "AU landscape" is the set of optimal value functions for all reward functions over states in that MDP. If you know the optimal value functions for just |S| reward functions, you can also reconstruct the rewardless MDP structure.[3]

From the environment (rewardless MDP), you can deduce the AU landscape (all optimal value functions) and all possibilities. From possibilities, you can deduce the environment and the AU landscape. From the AU landscape, you can deduce the environment (and thereby all possibilities).

All of these encode the same mathematical object.
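
One direction of this equivalence is especially easy to compute: for a reward function $R$ over states, the optimal value is the best discounted return over the state's possibilities, $V^*_R(s) = \max_{f \in F(s)} f \cdot R$. A minimal sketch, using hypothetical possibility sets (with $\gamma = .75$) rather than any example from the post:

```python
import numpy as np

F = {  # hypothetical possibility sets for a 3-state MDP
    0: [np.array([4.0, 0.0, 0.0]), np.array([1.0, 0.75, 2.25])],
    1: [np.array([0.0, 1.0, 3.0]), np.array([3.0, 1.0, 0.0])],
    2: [np.array([0.0, 0.0, 4.0])],
}

def optimal_value(s, R):
    """AU of goal R at state s, read directly off the possibilities."""
    return max(f @ np.asarray(R, dtype=float) for f in F[s])

print(optimal_value(0, [0, 0, 1]))   # 2.25: the best AU for "be in state 2" available from state 0
```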

Technical appendix: Opportunity cost

Opportunity cost is when an action you take makes you more able to achieve one goal but less able to achieve another. Even this simple world has opportunity cost:

Going to the green state means you can't get to the purple state as quickly.

On a deep level, why is the world structured such that this happens? Could you imagine a world without opportunity cost of any kind? The answer, again in the rewardless MDP setting, is simple: "yes, but the world would be trivial: you wouldn't have any choices". Using a straightforward formalization of opportunity cost, we have:

Theorem: Opportunity cost exists in an environment iff there is a state with more than one possibility.

Philosophically, opportunity cost exists when you have meaningful choices. When you make a choice, you're necessarily moving away from some potential future but towards another; since you can't be in more than one place at the same time, opportunity cost follows. Equivalently, we assumed the agent isn't infinitely farsighted (γ<1); if it were, it would be possible to be in "more than one place at the same time", in a sense (thanks to Rohin Shah for this interpretation).
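
As a toy numerical check of the theorem (a hypothetical two-state world, not an example from the post): suppose state 0 can either stay put or move to state 1, which only self-loops. State 0 then has two possibilities, and the indicator reward on state 1 already values them differently, so opportunity cost exists:

```python
import numpy as np

gamma = 0.75
stay  = np.array([1 / (1 - gamma), 0.0])      # possibility: loop on state 0 forever
leave = np.array([1.0, gamma / (1 - gamma)])  # possibility: 0 -> 1 -> 1 -> ...
R     = np.array([0.0, 1.0])                  # indicator reward on state 1

# Two distinct possibilities at state 0, so by the theorem opportunity cost exists:
# the two choices attain different utility for this reward function.
print(stay @ R, leave @ R)                    # 0.0 vs 3.0
```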

While understanding opportunity cost may seem like a side-quest, each insight is another brick in the edifice of our understanding of the incentives of goal-directed agency.

Notes

  • Just as game theory is a great abstraction for modelling competitive and cooperative dynamics, AU landscape is great for thinking about consequences: it automatically excludes irrelevant details about the world state. We can think about the effects of events without needing a specific utility function or ontology to evaluate them. In multi-agent systems, we can straightforwardly predict the impact the agents have on each other and the world.
  • “Objective impact to a location” means that agents whose plans route through the location tend to be objectively impacted.
  • The landscape is not the territory: AU is calculated with respect to an agent's beliefs, not necessarily with respect to what really "could" or will happen.

  1. The possibility isomorphism is new to my work, as are all other results shared in this post. This apparent lack of basic theory regarding MDPs is strange; even stranger, this absence was actually pointed out in two published papers!

    I find the existing MDP isomorphisms/equivalences to be pretty lacking. The details don't fit in this margin, but perhaps in a paper at some point. If you want to coauthor this (mainly compiling results, finding a venue, and responding to reviews), let me know and I can share what I have so far (extending well beyond the theorems in my recent work on power). ↩︎

  2. In fact, you can reconstruct the environment using only a limited subset of possibilities: the non-dominated possibilities. ↩︎

  3. As a tensor, the transition function $T$ has size $|A| \cdot |S|^2$, while the AU landscape representation only has size $|S|^2$. However, if you're just representing $T$ as a transition function, it has size $|A| \cdot |S|$. ↩︎
