Similar but not exactly.
I mean that you take some known distribution (the training distribution) as a
starting point. But when sampling actions, you do so from a shifted or truncated
distribution, to favour higher-reward policies.
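Concretely, here is a minimal sketch of the truncation option (a quantilizer-style rule; the action set, reward values, and cutoff q are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Base (training) distribution over a small discrete action set,
# plus an estimated reward for each action (all numbers invented).
actions = np.array(["a", "b", "c", "d", "e"])
base_probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
est_reward = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

def sample_truncated(q=0.5):
    """Sample from the base distribution restricted to the top-q fraction
    of its probability mass, ranked by estimated reward (a q-quantilizer)."""
    order = np.argsort(-est_reward)        # highest-reward actions first
    cum = np.cumsum(base_probs[order])
    keep = order[cum <= q + 1e-12]         # keep only the top-q mass
    if keep.size == 0:                     # degenerate case: keep the single best action
        keep = order[:1]
    p = base_probs[keep] / base_probs[keep].sum()
    return rng.choice(actions[keep], p=p)

print([sample_truncated() for _ in range(5)])
```

A shifted distribution would instead reweight the whole base distribution towards higher-reward actions, rather than cutting off the low-reward tail.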
In the decision transformers I linked, the AI is playing a variety of different
games, where the programmers might not know what a good future reward value
would be. So they let the AI itself predict the future reward, but with the
predicted distribution shifted towards higher rewards.
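Something in the spirit of that shift can be sketched as exponentially tilting the model's predicted return distribution before sampling a target return (the return bins, logits, and the tilt strength kappa below are invented, and this is not claimed to be the exact procedure from the linked paper):

```python
import torch

def shift_return_distribution(return_logits, return_values, kappa=2.0):
    """Sketch: take a predicted distribution over returns-to-go (logits over
    a discretized grid of return values) and reweight it towards higher
    returns before sampling the target return. kappa controls the shift."""
    log_p = torch.log_softmax(return_logits, dim=-1)
    # Exponential tilt: p(R) * exp(kappa * R), renormalized.
    return torch.softmax(log_p + kappa * return_values, dim=-1)

# Toy example: five return bins; the model mostly expects middling returns.
return_values = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
return_logits = torch.tensor([0.5, 1.5, 2.0, 0.5, -1.0])

target_dist = shift_return_distribution(return_logits, return_values)
target_return = return_values[torch.multinomial(target_dist, 1)]
# The sampled target return then conditions the transformer's action prediction.
```

The point is that the model's own prediction stays in the loop; the shift only biases which of its predicted returns gets used as the conditioning target.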
I discussed this a bit more after posting the above comment, and there is
something I want to add about the comparison.
With quantilizers, if you know the probability of DOOM under the base
distribution, you get an upper bound on DOOM for the quantilizer. This is not
the case for the type of probability shift used by the linked decision
transformer.
DOOM = an unforeseen catastrophic outcome, i.e. something that would not be
labelled as very bad by the AI's reward function but is in reality VERY BAD.
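To spell out the bound I have in mind (the standard quantilizer argument, not anything specific to the linked paper): if the base distribution puts probability $p$ on DOOM actions, a $q$-quantilizer samples only from the top $q$ fraction of the base distribution's mass, so it can upweight any event by at most a factor of $1/q$:

$$P_{\text{quantilizer}}(\text{DOOM}) \;\le\; \frac{1}{q}\,P_{\text{base}}(\text{DOOM}) \;=\; \frac{p}{q}$$

Reward-conditioned sampling gives no analogous factor, since conditioning on a high target return can concentrate probability on actions that were arbitrarily unlikely under the base distribution.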
You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?