We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov, (ii) it can be chosen to be stationary when the time discount is geometric, and (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges as the discount parameter λ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.
Background
Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function U*: A → R by aggressively optimizing a proxy U: A → R, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function C: A → [0, ∞) whose expectation E_ζ[C] w.r.t. some reference probability measure ζ ∈ ΔA is bounded by 1. ζ can be thought of as defining the "domain" within which U is well-behaved (for example, it can be the probability measure of choices made by humans). We can then seek to choose a distribution ξ ∈ ΔA that maximizes E_ξ[U] while keeping the worst-case value of E_ξ[C] bounded, where the worst case is over all cost functions C satisfying E_ζ[C] ≤ 1.
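As a concrete illustration of this construction, below is a minimal sketch of a Taylor-style q-quantilizer over a finite action set; the function name `quantilizer` and the parameters `proxy_utility`, `zeta`, and `q` are illustrative and not notation from this post. It samples from ζ conditioned on the top-q quantile of actions ranked by U, which keeps the density ratio dξ/dζ at most 1/q and hence E_ξ[C] ≤ 1/q for any admissible cost.

```python
import numpy as np

def quantilizer(actions, proxy_utility, zeta, q, rng=None):
    """Sample an action from the top-q fraction of the reference measure zeta,
    ranked by the proxy utility U (illustrative sketch of Taylor-style
    quantilization over a finite action set, not the MDP construction of this post).

    Guarantee: for any cost C >= 0 with E_zeta[C] <= 1, the sampling
    distribution xi satisfies d(xi)/d(zeta) <= 1/q, hence E_xi[C] <= 1/q.
    """
    rng = np.random.default_rng() if rng is None else rng
    zeta = np.asarray(zeta, dtype=float)
    order = np.argsort([-proxy_utility(a) for a in actions])  # highest U first
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += zeta[i]
        if mass >= q:  # stop once the top-q quantile of zeta is covered
            break
    probs = zeta[kept] / zeta[kept].sum()  # xi = zeta conditioned on the kept set
    return actions[kept[rng.choice(len(kept), p=probs)]]

# Example: zeta uniform over 5 actions, proxy utility U(a) = a, q = 0.4
# samples uniformly from the two highest-utility actions {3, 4}.
print(quantilizer(list(range(5)), lambda a: a, [0.2] * 5, 0.4))
```

An exact q-quantilizer would split the probability mass of the boundary action so that exactly a q-fraction of ζ is kept; conditioning on a slightly larger set, as above, only strengthens the cost bound at a small price in proxy utility.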