Summary. In this post, we present the formal framework we adopt during the sequence, and the simplest form of the type of aspiration-based algorithms we study. We do this for a simple form of aspiration-type goals: making the expectation of some variable equal to some given target value. The algorithm is based on the idea of propagating aspirations along time, and we prove that the algorithm gives a performance guarantee if the goal is feasible. Later posts discuss safety criteria, other types of goals, and variants of the basic algorithm.

Assumptions

In line with the working hypotheses stated in the previous post, we assume more specifically the following in this post:

The agent is a general-purpose AI system that is given a potentially long sequence of tasks, one by one, which it does not know in advance. Most aspects of what we discuss focus on the current task only, but some aspects relate to the fact that there will be further, unknown tasks later (e.g., the question of how much power the agent shall aim to retain at the end of the task).

It possesses an overall world model that represents a good enough general understanding of how the world works.

Whenever the agent is given a task, an episode begins and its overall world model provides it with a (potentially much simpler) task-specific world model that represents everything that is relevant for the time period until the agent gets a different task or is deactivated, and that can be used to predict the potentially stochastic consequences of taking certain actions in certain world states.

That task-specific world model has the form of a (fully observed) Markov Decision Process (MDP) that however does not contain a reward function R but instead contains what we call an evaluation function related to the task (see 2nd to next bullet point).

As a consequence of a state transition, i.e., of taking a certain actiona in a certain states and finding itself in a certain successor states′, a certain task-relevant evaluation metric changes by some amount. Importantly, we do not assume that the evaluation metric inherently encodes things of which more is better. E.g., the evaluation metric could be global mean temperature, client's body mass, x coordinate of the agent's right thumb, etc. We call the step-wise change in the evaluation metric the received Delta in that time step, denoted δ. We call its cumulative sum over all time steps of the episode the Total, denotedτ. Formally, Delta and Total play a similar role for our aspiration-based approach as the concepts of "reward" and "return" play for maximization-based approaches. The crucial difference is that our agent is not tasked to maximize Total (since the evaluation metric does not have the interpretation of "more is better") but to aim for some specific value of the Total.

The evaluation function contained in the MDP specifies the expected value of δ for all possible transitions: Eδ(s,a,s′).^{[1]}

First challenge: guaranteeing the fulfillment of expectation-type goals

The challenge in this post is to design a decision algorithm for tasks where the agent's goal is to make the expected (!) Total equal (!) a certain value E∈R which we call the aspiration value. ^{[2]} This is a crucial difference from a "satisficing" approach that would aim to make expected Total at least as large asE and would thus still be happy to maximize Total. Later we consider other types of tasks, both less restrictive ones (including those related to satisficing) and more specific ones that also care about other aspects of the resulting distribution of Total or states.

It turns out that we can guarantee the fulfillment of this type of goal under some weak conditions!

Notice that a special case of such expectation-type goals is making sure that the probability of reaching a certain set of acceptable terminal states equals a certain value, because we can simply assume that each such acceptable terminal state gives 1 Delta and all others give zero Delta. We will come back to that special case in the next post when discussing aspiration intervals.

Example: Shopping for apples

The agent is a certain household's AI butler. Among all kinds of other tasks, roughly once a week it is tasked to go shopping for apples. When it gets this task in the morning, an apple shopping episode begins, which ends when the agent returns to the household to get new instructions or is called by a household member on its smartphone, or when its battery runs empty.

The relevant evaluation metric for this task is the household's and agent's joint stock of apples. It changes by some Delta each time the agent gets handed some apples by a merchant or takes some from the supermarket's shelf or when some apples fall from its clumsy hands or get stolen by some robot hater (or when some member of the household eats an apple).

As the household's demand for apples is 21 apples per week on average, the task in a single apple shopping episode is to buy a certain number E of apples in expectation. They also want to maintain some small stock for hard times, say about 5 apples. So the household's policy is to set the aspiration to E=21+(5−x), where x is its current stock of apples. As this is a recurrent task, it is perfectly fine if this aspiration is only fulfilled in expectation if the Total doesn't vary too much, since over many weeks (and assuming long-lived apples), the deviations will probably average out and the stock will vary around 26 apples right after a shopping mission and 5 apples just before the next shopping mission. In reality, the household would of course also want the variance to be small, but that is a performance criterion we will only add later.

Possible generalizations (can be skipped safely)

In later posts, we will generalize the above assumptions in the following ways:

Instead of as a single value E∈R that expected Total shall equal, the task can be given as an aspiration intervalE=[e–,¯¯¯e] into which expected Total shall fall (e.g., "buy about 21+(5−x)±2 apples").

Instead of a single evaluation metric (stock of apples), there can be d>1 many evaluation metrics (stock of apples, stock of pears, and money in pocket), and the task can be given as a convex aspiration setE⊆Rd (e.g., buy at least two apples and one pear but don't spend more than 1 Euro per item).

Instead of in terms of evaluation metrics, the task could be given in terms of the terminal state of the episode, by specifying a particular "desired" state, a set of "acceptable" states, a desired probability distribution of terminal states, or a set of acceptable probability distribution of terminal states. For example, demanding that the expected number of stocked apples after the episode be 26 is the same as saying that all probability distributions of terminal states are acceptable that have the feature that the expected number of stocked apples is 26. A more demanding task would then be to say that only those probability distributions of terminal states are acceptable for which the expected number of stocked apples is 26, its standard deviation is at most 3, its 5 per cent quantile is at least 10, and the 5 per cent quantile of the number of surviving humans is at least 8 billion.

The world model might have a more general form than an MDP: to represent different forms of uncertainty, it might be an only partially observed MDP (POMDP), or an ambiguous POMDP (APOMDP); to represent complex tasks, it might be a hierarchical MDP whose top-level actions (e.g., buy 6 apples from this merchant) are represented as lower-level MDPs with lower-level aspirations specified in terms of auxiliary evaluation metrics (e.g., don't spend more than 5 minutes waiting at this merchant), and its lowest level might also have continuous rather than discrete time (if it represents, e.g., the continuous control of a robot's motors).

Notation

We focus on a single episode for a specific task.

Environment. We assume the agent's interaction with the environment consists of an alternating sequence of discrete observations and actions. As usual, we formalize this by assuming that after each observation, the agent chooses an action a and then "calls" a (potentially stochastic) function step(a) provided by the environment that returns the next observation.^{[3]}

World model. The episode's world model, M, is a finite, acyclic MDP. The model's discrete time counter, t, advances whenever the agent makes an observation. From the sequence of observations made until time t, the world model constructs a representation of the state of the world, which we call the model state or simply the state and denote by st, and which also contains information about the time index t itself.^{[4]} The set of all possible (model) states is a finite set S. A subset S⊤ of states is considered terminal. If the state is terminal, the episode ends without further action or Delta, which represents the fact that the agent becomes inactive until given a new task. Otherwise, the agent chooses an actionat∈A. The world model predicts the consequences of each possible action by providing a probability distribution PM(st+1|st,at) for the successor state st+1. It also predicts the expected Delta for the task-relevant evaluation metric as a function of the state, action, and successor state: Eδt+1=Eδ(st,at,st+1).

The following two graphs depict all this. Entities occurring in the world model are blue, those in the real world red, green is what we want to design, and dotted things are unknown to the agent:

We hide the process of constructing the next state from the previous state, action, and next observation by simply assuming that the agent can call a version of the function step that is given the current state and action and returns the successor state constructed from the next observation: st+1←step(st,at).

Goal. The goal is given by an aspiration valueE(s0)∈R. The task is to choose actions so that the expected Total,

Eτ=E(s0,a0,s1,a1,…)∑tEδ(st,at,st+1),

equals E(s0).

Auxiliary notation for interval arithmetic. We will use the following abbreviations:

x:λ:z=x(1−λ)+λz (interpolation between x and z),

x∖y∖z=y−xz−x (relative position of y in interval [x,z], with the convention that 00=12),

x[y]z=min{max{x,y},z} ("clipping'' y to interval [x,z]).

Sequential decisions based on propagated aspirations

Main idea

Our agent will achieve the goal by

propagating the aspiration along the trajectory as we go from states st via actions at to successor states st+1, leading to an alternating sequence of state aspirationsE(st) and action aspirationsE(st,at).

sequentially deciding on the next action at on the basis of the current state aspiration E(st) and suitably chosen action-aspirations E(st,a) for all possible actions a∈A.

For both aspiration propagation and decision making, the agent uses some auxiliary quantities that it computes upfront at the beginning of the episode from the world model as follows.

Feasibility intervals

Similar to what is done in optimal control theory, the agent computes^{[5]} the V- and Q-functions of the hypothetical policy that would maximize expected Total, here denoted ¯¯¯¯V and ¯¯¯¯Q, by solving the respective Bellman equations

with ¯¯¯¯V(s)=0 for terminal states s∈S⊤. It also computes the analogous quantities for the hypothetical policy that would minimize expected Total, denoted V–– and Q––:

The eventual use of these intervals will be to rescale aspirations from step to step. Before we come to that, however, we can already prove a first easy fact about goals of the type "make sure that expected Total equals a certain value":

Lemma: Trivial guarantee

If the world model predicts state transitions correctly, then there is a decision algorithm that fulfills the goal Eτ=E(s0)if and only if the episode's starting aspiration E(s0) is in the initial state's feasibility interval V(s0).

Proof. The values ¯¯¯¯V(s0) and V––(s0) are, by definition, the expected Total of the maximizing resp. minimizing policy, and hence it is clear that there cannot be a policy which attains E(s0) in expectation if E(s0) is larger than ¯¯¯¯V(s0) or smaller than V––(s0).

Conversely, assuming that E(s0) lies inside the interval V(s0), the following procedure fulfills the goal:

We compute the relative position of E(s0) inside V(s0), p=E(s0)−V––(s0)¯¯¯¯V(s0)−V––(s0)∈[0,1].

With probability p, we use the maximizing policy ¯¯¯π throughout the episode, and with probability 1−p, we use the minimizing policy π–– throughout the episode.

This fulfills the goal, since the correctness of the model implies that, when using ¯¯¯π or π––, we actually get an expected Total of ¯¯¯¯V(s0) resp. V––(s0). □

Of course, using this "either maximize or minimize the evaluation metric" approach would be catastrophic for safety. For example, if we tasked an agent with restoring Earth's climate to a pre-industrial state, using as our evaluation metric the global mean temperature, this decision algorithm might randomize, with carefully chosen probability, between causing an ice age and inducing a runaway greenhouse effect! This is very different from what we want, which is something roughly similar to pre-industrial climate.

Another trivial idea is to randomize in each time step t between the action with the largest ¯¯¯¯Q(st,a) and the one with the smallest Q––(st,a), using a fixed probability p′ resp. 1−p′. Since expected Total is a continuous function of p′ which varies between V––(s0)and ¯¯¯¯V(s0), by the Intermediate Value Theorem there exists some value of p′ for which this algorithm gives the correct expected Total; however, it is unclear how to compute the right p′ in practice.

If the episode consists of many time steps, this method might not lead to extreme values of the Total, but it would still make the agent take an extreme action in each time step. Intuition also suggests that the agent's behavior would be less predictable and fluctuate more than in the first version, where it consistently maximizes or minimizes after the initial randomization, and that this is undesirable.

So let us study more intelligent ways to guarantee that Eτ=E(s0).

In order to avoid extreme actions, our actual decision algorithm chooses "suitable" intermediate actions which it expects to allow it to fulfill the goal in expectation. When in state s, it does so by

setting action-aspirations E(s,a) for each possible action a∈A on the basis of the current state-aspiration E(s) and the action's feasibility interval Q(s,a), trying to keep E(s,a) close to E(s),

choosing in some arbitrary way an "under-achieving" action a− and an "over-achieving" action a+ w.r.t. E(s) and these computed action-aspirations E(s,a),

choosing probabilistically between a− and a+ with suitable probabilities,

executing the chosen action a and observing the resulting successor state s′, and

propagating the action-aspiration E(s,a) to the new state-aspiration E(s′) by rescaling between the feasibility intervals of a and s′.

More precisely: First compute (or learn) the functions V––,¯¯¯¯V,Q––, and ¯¯¯¯Q. Then, given state s and a feasible state-aspiration E(s)∈V(s),

For all available actions a∈A, compute action-aspirations E(s,a)=Q––(s,a)[E(s)]¯¯¯¯Q(s,a)∈Q(s,a).

Pick some actions a−,a+∈A with E(s,a−)≤E(s)≤E(s,a+); these necessarily exist because E(s)∈V(s).

Compute p=E(s,a−)∖E(s)∖E(s,a+)∈[0,1].

With probability p, let a=a+, otherwise let a=a−.

Execute action a in the environment and observe the successor state s′←step(s,a).

Compute λ=Q––(s,a)∖E(s,a)∖¯¯¯¯Q(s,a)∈[0,1] and the successor state's state-aspiration E(s′)=V––(s′):λ:¯¯¯¯V(s′)∈V(s′).

If we add the state- and action aspirations as entities to the diagram of Fig. 1, we get this:

Example: Shopping for apples, revisited with math

We return to the apple-shopping scenario mentioned above, which we model by the following simple state-action diagram:

Our agent starts at home (state s) and wishes to obtain a certain number of apples, which are available at a market (state m). It can either walk to the market (action a), which will certainly succeed, or take public transportation (action b), which gives a 2/3 chance of arriving successfully at the market and a 1/3 chance of not reaching the market before it closes and returning home empty-handed. Of course, the agent can also decide to take the null action (action c) and simply stay home the entire day doing nothing. Once it reaches the market m, the agent can buy either one or two packs of three apples (actions m1 and m2, respectively) before returning home at the end of the day (state t).

To apply Algorithm 1, we first compute the Q- and V-functions for the maximizing and minimizing policies. Since there are no possible cycles in this environment, straightforwardly unrolling the recursive definitions from the back ("backward induction") yields:

V––(t)=0

¯¯¯¯V(t)=0

Q––(m,m1)=3*

¯¯¯¯Q(m,m1)=3

Q––(m,m2)=6

¯¯¯¯Q(m,m2)=6*

V––(m)=3

¯¯¯¯V(m)=6

Q––(s,a)=3

¯¯¯¯Q(s,a)=6*

Q––(s,b)=2

¯¯¯¯Q(s,b)=4

Q––(s,c)=0*

¯¯¯¯Q(s,c)=0

V––(s)=0

¯¯¯¯V(s)=6

(The asterisks* mark which actions give the V values)

Suppose that the agent is in the initial state s and has the aspiration E(s)=2.5.

First, we calculate the action-aspirations: if I were to take a certain action, what would I aspire to? Here, the state-aspiration E(s) lies within the action's feasibility set Q(s,b), so the action-aspiration E(s,b) is simply set equal to E(s). By contrast, the intervals Q(s,a) and Q(s,c) do not contain the point E(s), and so E(s) is clipped to the nearest admissible value:

Next, we choose an over-achieving action a+ and an under-achieving action a−. Suppose we arbitrarily choose actions c (do nothing) and a (walk to the market).

We calculate the relative position of the aspiration E(s) between E(s,c) and E(s,a): p=E(s,c)∖E(s)∖E(s,a)=56.

We roll a die and choose our next action to be either c with probability 1−p=16 or a with probability p=56.

We take the chosen action, walking to the market or doing nothing depending on the result of the die roll, and observe the consequences of our actions! In this case, there are no surprises, as the transitions are deterministic.

We rescale the action-aspiration to determine our new state-aspiration. If we chose action a, we deterministically transitioned to state m, and so the feasibility interval Q(s,a) is equal to V(m) and no rescaling is necessary (in other words, the rescaling is simply the identity map): we simply set our new state-aspiration to be E(m)=E(s,a)=3. Likewise, if we took action c, we end up in state t with state-aspiration E(t)=0.

Suppose now that we started with the same initial aspiration E(s)=2.5, but instead chose action b as our over-achieving action in step 2. In this case, algorithm execution would go as follows:

Determine action-aspirations as before.

Choose a−=c,a+=b.

Since E(s,b) is exactly equal to our state-aspiration E(s) and E(s,a) is not, p is 1!

Hence, our next action is deterministically b.

We execute action b and observe whether public transportation is late today (ending up in state t) or not (which brings us to state m).

For the rescaling, we determine the relative position of our action-aspiration in the feasibility interval: λ=Q––(s,b)∖E(s,b)∖¯¯¯¯Q(s,b)=1/4. If we ended up in state m, our new state-aspiration is then E(m)=V––(m):λ:¯¯¯¯V(m)=3.75; if we ended up in state t, the state-aspiration is E(t)=0.

These examples demonstrate two cases:

If neither of the action feasibility intervals Q(a+) nor Q(a−) contain the state aspiration E(s), we choose between taking action a+ and henceforth minimizing, or taking action a− and henceforth maximizing, with the probability p that makes the expected total match our original aspiration E(s).

If exactly one of the feasibility intervals Q(s,a+) or Q(s,a−) contains E(s), then we choose the corresponding action with certainty, and propagate the aspiration by rescaling it.

There is one last case, where both feasibility intervals contain E(s); this is the case, for example, if we choose E(s)=3.5 in the above environment. Execution then proceeds as follows:

We determine action-aspirations, as shown here:

Suppose now that we choose actions a and b as a+ and a−. (If action c is chosen, then we are in the second case shown above.)

p is defined as the relative position of E(s) between E(s,a−) and E(s,a+), but in this case, these three values are equal! We have chosen above (see auxiliary notation) that p=12 in this case, but any other probability would also be acceptable.

We toss a coin to decide between actions a and b.

The chosen action is performed, either taking us deterministically to the market m if we chose action a or randomizing between m and t if we chose action b.

We propagate the action-aspiration to state-aspiration as before. If we chose action a, then we have Q(s,a)=V(m) and so the new state-aspiration is E(m)=E(s,a)=3.5. If we chose action b, then the new state-aspiration is E(m)=V––(s,m):34:¯¯¯¯V(s,m)=5.25 if we reached the market and E(t)=0 otherwise.

Now that we have an idea of how the decision algorithm works, it is time to prove its correctness.

Theorem: Algorithm 1 fulfills the goal

If the world model predicts state transitions correctly and the episode-aspiration E(s0) is in the initial state's feasibility interval, E(s0)∈V(s0), then decision algorithm 1 fulfills the goal Eτ=E(s0).

Proof.

First, let us observe that algorithm 1 preserves feasibility: if we start from state s0 with state-aspiration E(s0)∈V(s0), then for all states s and actions a visited, we will have E(s)∈V(s) and E(s,a)∈Q(s,a

Summary.In this post, we present the formal framework we adopt during the sequence, and the simplest form of the type of aspiration-based algorithms we study. We do this for a simple form of aspiration-type goals: making the expectation of some variable equal to some given target value. The algorithm is based on the idea of propagating aspirations along time, and we prove that the algorithm gives a performance guarantee if the goal is feasible. Later posts discuss safety criteria, other types of goals, and variants of the basic algorithm.## Assumptions

In line with the working hypotheses stated in the previous post, we assume more specifically the following in this post:

world modelthat represents a good enough general understanding of how the world works.episodebegins and its overall world model provides it with a (potentially much simpler) task-specific world model that represents everything that is relevant for the time period until the agent gets a different task or is deactivated, and that can be used to predict the potentially stochastic consequences of taking certain actions in certain world states.Markov Decision Process (MDP)that however does not contain a reward function R but instead contains what we call anevaluation functionrelated to the task (see 2nd to next bullet point).state transition,i.e., of taking a certainactiona in a certainstates and finding itself in a certainsuccessor states′, a certain task-relevantevaluation metricchanges by some amount. Importantly, we donotassume that the evaluation metric inherently encodes things of which more is better. E.g., the evaluation metric could be global mean temperature, client's body mass, x coordinate of the agent's right thumb, etc. We call the step-wise change in the evaluation metric the receivedDeltain that time step, denoted δ. We call its cumulative sum over all time steps of the episode theTotal,denotednottasked to maximize Total (since the evaluation metric does not have the interpretation of "more is better") but to aim forsome specific valueof the Total.evaluation functioncontained in the MDP specifies the expected value of δ for all possible transitions: Eδ(s,a,s′).^{[1]}## First challenge: guaranteeing the fulfillment of expectation-type goals

The challenge in this post is to design a

decision algorithmfor tasks wherethe agent's goal is to make theE∈Rexpected (!)Total equal (!) a certain valuewhich we call theaspiration value.^{[2]}This is a crucial difference from a "satisficing" approach that would aim to make expected Totalat least as large asE and would thus still be happy to maximize Total. Later we consider other types of tasks, both less restrictive ones (including those related to satisficing) and more specific ones that also care about other aspects of the resulting distribution of Total or states.It turns out that we can

guaranteethe fulfillment of this type of goal under some weak conditions!Notice that a special case of such expectation-type goals is making sure that the

probability of reaching a certain set of acceptable terminal statesequals a certain value, because we can simply assume that each such acceptable terminal state gives 1 Delta and all others give zero Delta. We will come back to that special case in the next post when discussing aspiration intervals.## Example: Shopping for apples

evaluation metricfor this task is the household's and agent's joint stock of apples. It changes by some Delta each time the agent gets handed some apples by a merchant or takes some from the supermarket's shelf or when some apples fall from its clumsy hands or get stolen by some robot hater (or when some member of the household eats an apple).in expectation. They also want to maintain some small stock for hard times, say about 5 apples. So the household's policy is to set the aspiration to E=21+(5−x), where x is its current stock of apples. As this is a recurrent task, it is perfectly fine if this aspiration is only fulfilled in expectation if the Total doesn't vary too much, since over many weeks (and assuming long-lived apples), the deviations will probably average out and the stock will vary around 26 apples right after a shopping mission and 5 apples just before the next shopping mission. In reality, the household would of course also want the variance to be small, but that is a performance criterion we will only add later.## Possible generalizations (can be skipped safely)

In later posts, we will

generalizethe above assumptions in the following ways:aspiration intervalE=[e–,¯¯¯e] into which expected Total shall fall (e.g., "buy about 21+(5−x)±2 apples").aspiration setE⊆Rd (e.g., buy at least two apples and one pear but don't spend more than 1 Euro per item).terminal stateof the episode, by specifying a particular "desired" state, a set of "acceptable" states, a desired probability distribution of terminal states, or a set of acceptable probability distribution of terminal states. For example, demanding that the expected number of stocked apples after the episode be 26 is the same as saying that all probability distributions of terminal states are acceptable that have the feature that the expected number of stocked apples is 26. A more demanding task would then be to say that only those probability distributions of terminal states are acceptable for which the expected number of stocked apples is 26, its standard deviation is at most 3, its 5 per cent quantile is at least 10, and the 5 per cent quantile of the number of surviving humans is at least 8 billion.partially observedMDP (POMDP), or anambiguousPOMDP (APOMDP); to represent complex tasks, it might be ahierarchicalMDP whose top-level actions (e.g., buy 6 apples from this merchant) are represented as lower-level MDPs with lower-level aspirations specified in terms of auxiliary evaluation metrics (e.g., don't spend more than 5 minutes waiting at this merchant), and its lowest level might also havecontinuousrather than discrete time (if it represents, e.g., the continuous control of a robot's motors).## Notation

We focus on a single episode for a specific task.

Environment.We assume the agent's interaction with the environment consists of an alternating sequence of discrete observations and actions. As usual, we formalize this by assuming that after each observation, the agent chooses an action a and then "calls" a (potentially stochastic) function step(a) provided by the environment that returns the next observation.^{[3]}World model.The episode's world model, M, is a finite, acyclic MDP. The model's discrete time counter, t, advances whenever the agent makes an observation. From the sequence of observations made until time t, the world model constructs a representation of the state of the world, which we call themodel stateor simply thestateand denote by st, and which also contains information about the time index t itself.^{[4]}The set of all possible (model) states is a finite set S. A subset S⊤ of states is consideredterminal. If the state is terminal, the episode ends without further action or Delta, which represents the fact that the agent becomes inactive until given a new task. Otherwise, the agent chooses anactionat∈A. The world model predicts the consequences of each possible action by providing a probability distribution PM(st+1|st,at) for thesuccessorstate st+1. It also predicts theexpected Deltafor the task-relevant evaluation metric as a function of the state, action, and successor state: Eδt+1=Eδ(st,at,st+1).The following two graphs depict all this. Entities occurring in the world model are blue, those in the real world red, green is what we want to design, and dotted things are unknown to the agent:

We hide the process of constructing the next state from the previous state, action, and next observation by simply assuming that the agent can call a version of the function step that is given the current state and action and returns the successor state constructed from the next observation: st+1←step(st,at).

Eτ=E(s0,a0,s1,a1,…)∑tEδ(st,at,st+1),Goal.The goal is given by anaspiration valueE(s0)∈R. The task is to choose actions so that theexpected Total,equals E(s0).

Auxiliary notation for interval arithmetic.We will use the following abbreviations:(interpolation between x and z),

(relative position of y in interval [x,z], with the convention that 00=12),

("clipping'' y to interval [x,z]).

## Sequential decisions based on propagated aspirations

## Main idea

Our agent will achieve the goal by

propagatingthe aspiration along the trajectory as we go from states st via actions at to successor states st+1, leading to an alternating sequence ofstate aspirationsE(st) andaction aspirationsE(st,at).sequentially decidingon the next action at on the basis of the current state aspiration E(st) and suitably chosen action-aspirations E(st,a) for all possible actions a∈A.For both aspiration propagation and decision making, the agent uses some auxiliary quantities that it computes upfront at the beginning of the episode from the world model as follows.

## Feasibility intervals

Similar to what is done in optimal control theory, the agent computes

¯¯¯¯V(s)=maxa∈A¯¯¯¯Q(s,a),¯¯¯¯Q(s,a)=Es′∼PM(⋅|s,a)(Eδ(s,a,s′)+¯¯¯¯V(s′)),^{[5]}the V- and Q-functions of the hypothetical policy that would maximize expected Total, here denoted ¯¯¯¯V and ¯¯¯¯Q, by solving the respective Bellman equationswith ¯¯¯¯V(s)=0 for terminal states s∈S⊤. It also computes the analogous quantities for the hypothetical policy that would minimize expected Total, denoted V–– and Q––:

V––(s)=mina∈AQ––(s,a),Q––(s,a)=Es′∼PM(⋅|s,a)(Eδ(s,a,s′)+V––(s′)),with V––(s)=0 for terminal states s∈S⊤. These define the state's and action's

V(s)=[V––(s),¯¯¯¯V(s)],Q(s,a)=[Q––(s,a),¯¯¯¯Q(s,a)].(F)feasibility intervals,The eventual use of these intervals will be to

rescaleaspirations from step to step. Before we come to that, however, we can already prove a first easy fact about goals of the type "make sure that expected Total equals a certain value":## Lemma: Trivial guarantee

If the world model predicts state transitions correctly, then there is a decision algorithm that fulfills the goalEτ=E(s0)if and only if the episode's starting aspirationE(s0)is in the initial state's feasibility intervalV(s0).Proof.The values ¯¯¯¯V(s0) and V––(s0) are, by definition, the expected Total of the maximizing resp. minimizing policy, and hence it is clear that there cannot be a policy which attains E(s0) in expectation if E(s0) is larger than ¯¯¯¯V(s0) or smaller than V––(s0).Conversely, assuming that E(s0) lies inside the interval V(s0), the following procedure fulfills the goal:

relative positionof E(s0) inside V(s0), p=E(s0)−V––(s0)¯¯¯¯V(s0)−V––(s0)∈[0,1].This fulfills the goal, since the correctness of the model implies that, when using ¯¯¯π or π––, we actually get an expected Total of ¯¯¯¯V(s0) resp. V––(s0). □

Of course, using this "either maximize or minimize the evaluation metric" approach would be catastrophic for safety. For example, if we tasked an agent with restoring Earth's climate to a pre-industrial state, using as our evaluation metric the global mean temperature, this decision algorithm might randomize, with carefully chosen probability, between causing an ice age and inducing a runaway greenhouse effect! This is very different from what we want, which is something roughly similar to pre-industrial climate.

Another trivial idea is to randomize in each time step t between the action with the largest ¯¯¯¯Q(st,a) and the one with the smallest Q––(st,a), using a fixed probability p′ resp. 1−p′. Since expected Total is a continuous function of p′ which varies between V––(s0)and ¯¯¯¯V(s0), by the Intermediate Value Theorem there exists some value of p′ for which this algorithm gives the correct expected Total; however, it is unclear how to compute the right p′ in practice.

If the episode consists of many time steps, this method might not lead to extreme values of the Total, but it would still make the agent take an extreme action in each time step. Intuition also suggests that the agent's behavior would be less predictable and fluctuate more than in the first version, where it consistently maximizes or minimizes after the initial randomization, and that this is undesirable.

So let us study more intelligent ways to guarantee that Eτ=E(s0).

## Decision Algorithm 1: steadfast action-aspirations, rescaled state-aspirations

In order to avoid extreme actions, our actual decision algorithm chooses "suitable" intermediate actions which it expects to allow it to fulfill the goal in expectation. When in state s, it does so by

More precisely: First compute (or learn) the functions V––,¯¯¯¯V,Q––, and ¯¯¯¯Q. Then, given state s and a feasible state-aspiration E(s)∈V(s),

and the successor state's state-aspiration E(s′)=V––(s′):λ:¯¯¯¯V(s′)∈V(s′).

If we add the state- and action aspirations as entities to the diagram of Fig. 1, we get this:

## Example: Shopping for apples, revisited with

mathWe return to the apple-shopping scenario mentioned above, which we model by the following simple state-action diagram:

Our agent starts at home (state s) and wishes to obtain a certain number of apples, which are available at a market (state m). It can either walk to the market (action a), which will certainly succeed, or take public transportation (action b), which gives a 2/3 chance of arriving successfully at the market and a 1/3 chance of not reaching the market before it closes and returning home empty-handed. Of course, the agent can also decide to take the null action (action c) and simply stay home the entire day doing nothing.

Once it reaches the market m, the agent can buy either one or two packs of three apples (actions m1 and m2, respectively) before returning home at the end of the day (state t).

To apply Algorithm 1, we first compute the Q- and V-functions for the maximizing and minimizing policies. Since there are no possible cycles in this environment, straightforwardly unrolling the recursive definitions from the back ("backward induction") yields:

(The asterisks* mark which actions give the V values)

Suppose that the agent is in the initial state s and has the aspiration E(s)=2.5.

Suppose we arbitrarily choose actions c (do nothing) and a (walk to the market).

Suppose now that we started with the same initial aspiration E(s)=2.5, but instead chose action b as our over-achieving action in step 2. In this case, algorithm execution would go as follows:

If we ended up in state m, our new state-aspiration is then E(m)=V––(m):λ:¯¯¯¯V(m)=3.75; if we ended up in state t, the state-aspiration is E(t)=0.

These examples demonstrate two cases:

There is one last case, where both feasibility intervals contain E(s); this is the case, for example, if we choose E(s)=3.5 in the above environment. Execution then proceeds as follows:

Now that we have an idea of how the decision algorithm works, it is time to prove its correctness.

## Theorem: Algorithm 1 fulfills the goal

If the world model predicts state transitions correctly and the episode-aspirationE(s0)is in the initial state's feasibility interval,E(s0)∈V(s0), then decision algorithm 1 fulfills the goalEτ=E(s0).Proof.First, let us observe that algorithm 1 preserves feasibility: if we start from state s0 with state-aspiration E(s0)∈V(s0), then for all states s and actions a visited, we will have E(s)∈V(s) and E(s,a)∈Q(s,a