First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis of description complexity , the corresponding optimal policy has description complexity , since the operation "find the optimal policy" has description complexity.
Taking computational resource bounds into account makes things more complicated. For some computing might be intractable, even though itself is "efficiently computable" in some sense. For example we can imagine an that has exponentially many states plus some succinct description of the transition kernel.
One way to deal with it is using some heuristic for optimization. But, then the description complexity is still .
Another way to deal with it is restricting ourselves to the kind of hypotheses for which is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is probably still something like , where this time is incomplete/fuzzy (but I don't yet know how PSRL for incomplete/fuzzy hypothesis should work).
Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function and a policy . We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel (or and the reward function ):
Assuming that our incomplete prior assigns weight to this incomplete hypothesis, we get a sort of Occam's razor for policies.
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?
Maybe problems that don't have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we're trying to solve, the assumption that it has simple solutions is probably either useful (if it's true) or costless (if it isn't). In other words: "look for your missing key under the lamppost, not because it's probably there, but because you'll only ever find it if it's there".
One way of solving the problem of action selection is to treat our own actions as random variables, then build a probabilistic model, and condition that probability distribution on things going well for us in the future. See, for example, Planning By Probabilistic Inference, human brains (I would argue), upside-down RL (kinda, I think?), etc.
In this formulation, instrumental Occam's razor is just a special case of epistemic Occam's razor, it seems to me.