I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'... (read more)
Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](ht... (read more)
I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:
- Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.
- Planner concludes that any sequence of actions involving pressing lever wi... (read more)
Thanks for the post, this is my favourite formalisation of optimisation so far!
One concern I haven't seen raised so far, is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si... (read more)
I feel like there are three facets to "norms" v.s. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t... (read more)