This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle.

Extremely broad, dense reward functions constrain training-compatible goal sets

Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed...
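To make the "ground truth for every output" point concrete, here's a minimal sketch of the dense per-token supervision a GPT-like predictor gets. All names and shapes here are illustrative, not from any particular codebase: every position in every sequence receives its own loss term the moment the output exists, with no episode to roll out first.

```python
import torch
import torch.nn.functional as F

# Toy setup: `logits` stands in for what a GPT-like model emits over a
# batch of sequences; `tokens` is the ground-truth next token at each
# position. (Random tensors here, purely for shape illustration.)
batch, seq_len, vocab = 4, 128, 50257
logits = torch.randn(batch, seq_len, vocab)          # model outputs
tokens = torch.randint(0, vocab, (batch, seq_len))   # ground truth

# Dense supervision: every single position gets its own loss term,
# computed against a known target immediately. There is no gap between
# the output and its evaluation, and no "episode" that must finish
# before credit is assigned.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab),   # (batch * seq_len, vocab)
    tokens.reshape(-1),          # (batch * seq_len,)
    reduction="none",
).reshape(batch, seq_len)

loss = per_token_loss.mean()     # averaged, but gradients flow per token
```

Contrast this with a sparse episodic reward, where one scalar arrives only after many outputs; here the evaluation signal is as broad and dense as the output stream itself.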
You could describe the behavior of an untuned GPT-like model[1] using a (peculiar) utility function. The fact that the loss function and training didn't explicitly involve a reward function doesn't mean a utility function can't represent what's learned, after all. Coming from the opposite direction, you could also train a predictor...
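One standard way to see this equivalence (the notation U, q, and p below is mine, not from the original training setup): for a fixed context x, treat the model's output distribution q as the "action" and define a per-context utility as the expected log score under the data distribution. The behavior that maximizes this utility is exactly the behavior that minimizes the prediction loss.

```latex
% Per-context "utility" of emitting distribution q given context x:
% it is just the negative cross-entropy against the data distribution,
% and by Gibbs' inequality it is maximized exactly at q = p(.|x).
% So "maximize U" and "minimize prediction loss" pick out the same
% behavior; the implied utility is broad (defined at every context)
% and dense (scored at every token).
U(q; x) \;=\; \mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[\log q(y)\right]
\;=\; -\,H\!\big(p(\cdot \mid x),\, q\big),
\qquad
\arg\max_{q} \, U(q; x) \;=\; p(\cdot \mid x).
```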
This is part of a maybe-series where I braindump safety notes while waiting on training runs to complete. It's mostly talking to myself, but talking to myself in public seems somewhat more productive. The content of this post is not guaranteed to be novel, interesting, or correct, though I...