All of Paul Bricman's Comments + Replies

You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?

1 Linda Linsefors 1y
Similar but not exactly. I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions, you do so from a shifted or truncated distribution to favour higher-reward policies.

In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI predict the future reward, but with the distribution shifted towards higher rewards.

I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison.

With quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on the probability of DOOM for the quantilizer. This is not the case for the type of probability shift used in the linked decision transformer.

DOOM = unforeseen catastrophic outcome. It would not be labelled as very bad by the AI's reward function, but is in reality VERY BAD.
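To make the comparison concrete, here is a toy sketch. It assumes a quantilizer that keeps only the top-q fraction of base probability mass (ranked by reward) and renormalizes, so no outcome can become more than 1/q times as likely as it was under the base distribution. For the decision-transformer side it uses an exponential tilt towards higher reward as a stand-in; that choice, the function names, and all the numbers are illustrative assumptions, not taken from the linked work.

```python
import numpy as np

def quantilize(base_probs, rewards, q):
    """Keep the top-q probability mass (ranked by reward) and renormalize."""
    order = np.argsort(-rewards, kind="stable")  # highest reward first
    out = np.zeros_like(base_probs, dtype=float)
    remaining = q
    for i in order:
        take = min(base_probs[i], remaining)  # never exceed the base mass
        out[i] = take
        remaining -= take
        if remaining <= 1e-12:
            break
    return out / out.sum()

def exponential_tilt(base_probs, rewards, kappa):
    """Reweight the base distribution by exp(kappa * reward) (assumed shift)."""
    weights = base_probs * np.exp(kappa * (rewards - rewards.max()))
    return weights / weights.sum()

# Hypothetical toy setup: four actions; the last one is DOOM, yet the
# reward function scores it highest (it is not labelled as bad).
base_probs = np.array([0.70, 0.20, 0.09, 0.01])
rewards = np.array([0.0, 1.0, 2.0, 10.0])
is_doom = np.array([False, False, False, True])

q = 0.1
p_doom_base = base_probs[is_doom].sum()
bound = p_doom_base / q  # quantilizer guarantee: P(DOOM) <= P_base(DOOM) / q

quant = quantilize(base_probs, rewards, q)
tilt = exponential_tilt(base_probs, rewards, kappa=2.0)
p_doom_quant = quant[is_doom].sum()
p_doom_tilt = tilt[is_doom].sum()

print(f"base P(DOOM)        = {p_doom_base:.4f}")
print(f"quantilizer bound   = {bound:.4f}")
print(f"quantilizer P(DOOM) = {p_doom_quant:.4f}")
print(f"tilted P(DOOM)      = {p_doom_tilt:.4f}")
```

In this toy case the quantilizer lands exactly on its bound, while the tilted distribution puts almost all of its mass on DOOM. Nothing caps how far the tilt can amplify a rare, high-reward outcome.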