We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov, (ii) it can be chosen to be stationary when the time discount is geometric, and (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges as the discount parameter approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.


Background

Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function by aggressively optimizing a proxy $u$, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function $c$ whose expectation w.r.t. some reference probability measure $\zeta$ is bounded by 1. $\zeta$ can be thought of as defining the "domain" within which $u$ is well-behaved (for example, it can be the probability measure of choices made by humans). We can then seek to maximize $\mathbf{E}_\xi[u]$ over the chosen distribution $\xi$ while constraining $\mathbf{E}_\xi[c]$ for every admissible cost function $c$.
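For concreteness, the single-decision problem of Taylor 2015 can be sketched as follows; this is an illustration under assumed notation ($A$ is the set of options, $\Delta A$ the set of probability distributions over it, $\xi \in \Delta A$ the distribution we choose, and $q \in (0,1]$ the quantile parameter), not notation fixed by the text above:

$$\max_{\xi \in \Delta A} \; \mathbf{E}_{x \sim \xi}[u(x)] \quad \text{s.t.} \quad \sup_{c \ge 0 \,:\, \mathbf{E}_{x \sim \zeta}[c(x)] \le 1} \mathbf{E}_{x \sim \xi}[c(x)] \;\le\; \frac{1}{q}$$

For $\xi$ supported inside the support of $\zeta$, the worst-case expected cost equals the largest density ratio $\max_x \xi(x)/\zeta(x)$ (the supremum is attained by concentrating the entire cost budget on the maximizing $x$), so the constraint reduces to $\xi(x) \le \zeta(x)/q$ for every $x$. An optimizer is then the $q$-quantilizer of Taylor 2015: sample $x$ from $\zeta$ conditioned on $u(x)$ lying in the top $q$-quantile of $u$ under $\zeta$ (up to tie-breaking at the quantile boundary).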