We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter approaches 1. Finally, we demonstrate a polynomialtime algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.
Background
Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function by aggressively optimizing a proxy , we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function whose expectation w.r.t. some reference probability measure is bounded by 1. can be thought of as defining the "domain" within which is wellbehaved (for example it can be the probability measure of choices made by humans). We can then seek to maximize while constraining by a fixed bound :
Alternatively, we can choose some parameter and maximize the minimal guaranteed expectation of :
These two formulations are almost equivalent since both amount to finding a strategy that is Pareto efficient w.r.t. to the two objectives and . For the tradeoff is governed by the parameter and for the tradeoff is governed by the parameter . Indeed, it is easy to see that any is also optimal for the first criterion if we take , and any is also optimal for the latter criterion for an appropriate choice of (it needs to be a subderivative of the Pareto frontier).
In the following, we will use as our starting point the second formulation, which can be thought of as a zerosum game in which is the utility function of an agent whose strategy set is , and is chosen by the adversary. The quantilal strategy is the Nash equilibrium of the game.
This formulation seems natural if we take (a measure of how "optimistic" is in the domain ) and . In particular, the quantilum value (the value of the game) is a lower bound on the expectation of .
In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace by the set of policies. However, if it is possible to make structural assumptions about and , we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both and are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).
Notation
Given a set , the notation will denote the set of finite strings over alphabet , i.e.
denotes the space of infinite strings over alphabet , equipped with the product topology and the corresponding Borel sigmaalgebra. Given and , is the th symbol of the string (in our conventions, so the string begins from the 0th symbol.) Given and , the notation means that is a prefix of .
Given a set , and , the notation will indicate the prefix of of length . That is, and .
Given a measurable space , we denote the space of probability measures on .
Given measurable spaces and , the notation means that is a Markov kernel from to . Given , is the corresponding probability measure on . Given measurable, . Given , . Given , is the composition of and , and when , is the th composition power.
Results
A finite MDP is defined by a nonempty finite set of states , a nonempty finite set of actions , a transition kernel and a reward function . To specify the utility function, we also need to fix a time discount function . This allows defining by
We also fix an initial distribution over states . In "classical" MDP theory, it is sufficient to consider a deterministic initial state , since the optimal policy doesn't depend on the initial state anyway. However, quantilization is different since the worstcase cost function depends on the initial conditions.
We now assume that the cost function (or, the true utility function ) has the same form. That is, there is some penalty function s.t.
Given a policy (where the factor represents the past history the factor represents the current state), we define in the usual way (the distribution over histories resulting from ). Finally, we fix . We are now ready to define quantilization in this setting
Definition 1
is said to be quantilal relatively to reference policy when
We also define the quantilum value by
( cannot be since taking yields a lower bound of .)
In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure from some top fraction of actions ranked by expected utility. Here, we don't have an analogous description, but we can in some sense evaluate the infimum over .
For any , define by
For any , the notation signifies the Renyi divergence of order :
In general, .
Proposition 1
is quantilal relatively to reference policy if and only if
Also, we have
If the maximization in Proposition 1 was over arbitrary rather than of the form , we would get ordinary quantilization and sampling would be equivalent to sampling some top fraction of . However, in general, the image of the operator is some closed convex set which is not the entire .
So far we considered arbitrary (nonstationary) policies. From classical MDP theory, we know that an optimal policy can always be chosen to be Markov:
Definition 2
is said to be a Markov policy, when there is some s.t. .
Note that a priori it might be unclear whether there is even a nonstationary quantilal policy. However, we have
Proposition 2
For any , there exists a Markov policy which is quantilal relatively to .
Now assume that our time discount function is geometric, i.e. there exists s.t. . Then it is known than an optimal policy can be chosen to be stationary:
Definition 3
is said to be a stationary policy, when there is some s.t. .
Once again, the situation for quantilal policies is analogous:
Proposition 3
If is geometric, then for any , there exists a stationary policy which is quantilal relatively to .
What is not analogous is that an optimal policy can be chosen to be deterministic whereas, of course, this is not the case for quantilal policies.
It is known that the value of an optimal policy depends on the parameters as a piecewise rational function, and in particular it converges as and has a Taylor expansion at . The quantilum value has the same property.
Proposition 4
is a piecewise rational continuous function of , , the matrix elements of , the values of , the values of and the values of , with a final number of "pieces".
Corollary 1
Assume that is a stationary policy. Then, converges as , holding all other parameters fixed (in the sense that, is fixed whereas changes as a function of ). It is analytic in for some interval and therefore can be described by a Taylor expansion around inside this interval.
Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it's deterministic) and there is a Blackwell policy i.e. a fixed policy which is optimal for any sufficiently close to 1. And, it is easy to see that for a constant policy, the value is a rational function of . On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless the quantilum value is still piecewise rational.
Finally, we address the question of the computational complexity of quantilization. We prove the following
Proposition 5
Assume geometric time discount. Assume further that , , , , and are rational numbers. Then:
a. There is an algorithm for computing which runs in time polynomial in the size of the input , , , , and . Also, if is stationary and are rational, then are also rational and can be computed from , , and in polynomial time.
b. Given an additional rational input parameter , there is an algorithm for computing a stationary policy which is an equilibrium in the zerosum game associated with quantilization, which runs in time polynomial in the size of the input and .
EDIT: In fact, it is possible to do better and compute an exact quantilal policy in polynomial time.
Future Directions
To tease a little, here are some developments of this work that I'm planning:

Apply quantilization to reinforcement learning, i.e. when we don't know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications on how humans avoid traps.

Unify that formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of quantilum value rather than optimal value (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to further develop that by unifying it with the ideas of catastrophe DRL.

Deal with more general environments, e.g. POMDPs and continuous state spaces.
Proofs
Proposition A.1
Proof of Proposition A.1
The definition of implies that
Also
Observe that
It follows that for any that satisfies the constraint , we have and therefore
To show that the inequality can be arbitrarily close to equality, choose s.t. and . If , we can define by
Clearly and . We get
In the case , we can take any and define by
Clearly and . We get
Since , we can make this expression arbitrarily low and therefore the infimum is which is what we need since in this case .
Proposition 1 now follows immediately from Proposition A.1.
We will use the notation (the space of all policies). We also have (the space of Markov policies) and (the space of stationary policies). Mildly abusing notation, we will view as a subspace of and as a subspace of .
Proposition A.2
For any , there exists some policy which is quantilal relatively to .
Proof of Proposition A.2
is the product of a countable number of copies of (indexed by ). is naturally a topological space (a simplex), and we can thereby equip by the product topology. By Tychonoff's theorem, is compact. Moreover, it is easy to see that is a continuous mapping. Finally, observe that is lower semicontinuous in (since it is the maximum of a number of continuous functions) and therefore is upper semicontinuous in . By the extreme value theorem, it follows that a quantilal policy exists.
For any , we define by
Proof of Proposition 2
By Proposition A.2, there is a quantilal policy . We define by
We now prove by induction that for any , .
For , we have .
For any and any , we have
To complete the induction step, observe that, by definition of
Now, for any , and therefore . We conclude that is also quantilal.
Proof of Proposition 3
By Proposition 2, there is which is a Markov quantilal policy. We have
Taking the expected value of this equation w.r.t. we get
Also, we have
It follows that
Define by
We get
Define the linear operator by the matrix
Viewing as a subset of , we get
On the other hand, we have
Therefore, and is also quantilal.
Proposition A.3
Assume geometric time discount. Consider any . Define the linear operators and by the matrices
Define the linear operator by
Define by
Define as follows
Here, vector inequalities are understood to be componentwise.
Then,
In particular, if , the above expression describes
Proof of Proposition A.3
Denote . As usual in extensiveform games, behavioral strategies are equivalent to mixed strategies and therefore the image of the pushforward operator is the same as the image of . It follows that
is a compact Polish space (in the sense of the product topology) and therefore is compact in the weak topology. By Sion's minimax theorem
Now consider any . implies (looking at the component) that
Therefore, there is some s.t.
The above is the Bellman equation for a modified MDP with reward function . Denote the version of the operator for the initial state (instead of ). We get
Observing that , we get
On the other hand, for any s.t. and (these inequalities correspond to the component of ), there is s.t.
Namely, this is the solution of the Bellman equation for the reward function . Therefore, we have
Taking the infimum of both sides over inside the domain we get
Proof of Proposition 4
Consider the characterization of in Proposition A.3. By general properties of systems of linear inequalities, there is some s.t.
Here, the notation means taking the submatrix of consisting of the rows corresponding to . Similarly, is the subvector of consisting of the components corresponding to .
(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)
Moreover, has to be a single point . Indeed, if it has more than one point then it contains a straight line . The projection of on the second has to be a single point , otherwise some point in violates the inequality which would contradict . Therefore, the projection of on the first is also a straight line . As in the proof of Proposition A.3, for any , is an upper bound on the value of state in the MDP with reward function . In particular, if then . However, since is a line, there must be some s.t. can be any real number for some . This is a contradiction.
Denote the projection operator on the first direct summand. It follows that we can always choose s.t. and we have
For each , the expression on the right hand side is a rational function of and the matrix elements of and which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless vanishes identically, in which case this can ignored). So, always equals one of a finite set of rational functions (corresponding to difference choices of ). The continuity of also easily follows from its characterization as .
Proposition A.4
Assume geometric time discount and stationary . Then, is a rational function of , , and with rational coefficients.
Proof of Proposition A.4
Define the linear operator by the matrix
We have
Proof of Corollary 1
By Propositions 4 and A.4, is a continuous piecewise rational function of with a finite number of pieces. Two rational functions can only coincide at a finite number of points (since a polynomial can only have a finite number of roots), therefore there is only a finite number of values of in which "switches" from one rational function to another. It follows that there is some s.t. is a fixed rational function of in the interval .
Moreover, it always holds that
The first inequality holds since, the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of . The second inequality is a consequence of the requirement that is nonnegative.
It follows that cannot have a pole at and therefore must converge to a finite value there.
Proof of Proposition 5
The algorithm for is obtained from Proposition A.3 using linear programming.
The claim regarding for stationary follows from Proposition A.4.
We now describe the algorithm for computing an equilibrium.
For any , define by
Consider any . We define by
Let be the operator for the MDP with transition kernel , and define
It is easy to see that if an only if there is quantilal s.t. . Indeed, the MDP with kernel is equivalent to the original MDP under the constraint that, when in state , any action has to be taken with the minimal probability . In particular (since we constrained the possible policies but not the possible penalties). So, if as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value . Conversely, if the new MDP has a stationary policy with guaranteed value then it can be used to construct the necessary .
Using Proposition A.3, we can compute for any rational by linear programming. This allows us to find the desired policy by binary search on , one (state,action) pair at a time.