I'm not too deeply familiar with alignment research, but it's always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing the uncertainty of future rewards; the disable action is then almost always suboptimal because it removes a source of information about rewards.

For instance, we could set up an environment where no rewards are ever given directly: the agent must maintain a distribution $D$ over possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional "hand-of-god" handing it $a^*_s$, the optimal action for some state $s$. The agent must then work backwards from this optimal action to update $D$, and reason from the updated distribution of rewards to $\Pi$, the current distribution of optimal policies implied by its knowledge of rewards. Such an agent, presented with an action $a_{\text{disable}}$ that would prevent future "hand-of-god" optimal-action outputs, would not choose it, because doing so would mean not being able to further constrain $\Pi$, which makes its expected future reward smaller.

Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.   
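
In the meantime, here is a minimal sketch of the kind of set-up I have in mind. Everything in it (the particle approximation of $D$, the Boltzmann-rationality update, the toy dynamics) is my own construction for illustration, not an existing implementation:

```python
import numpy as np

# Toy "hand-of-god" corrigibility sketch (my own construction): the agent never
# observes rewards directly.  It keeps a weighted set of sampled reward
# hypotheses (a crude particle approximation of its reward distribution D) and
# re-weights them every time it is handed an optimal action, using a
# Boltzmann-rationality likelihood as in Bayesian IRL.

N_STATES, N_ACTIONS, N_PARTICLES = 25, 4, 500
GAMMA, BETA = 0.95, 5.0
rng = np.random.default_rng(0)

# Placeholder deterministic dynamics: T[s, a] = next state.
T = rng.integers(N_STATES, size=(N_STATES, N_ACTIONS))

def q_values(reward, iters=200):
    """Value iteration for one reward hypothesis of shape (S, A)."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(iters):
        q = reward + GAMMA * q.max(axis=1)[T]
    return q

# The agent's current belief D: reward-function particles plus weights.
particles = rng.normal(size=(N_PARTICLES, N_STATES, N_ACTIONS))
qs = np.stack([q_values(r) for r in particles])
weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)

def hand_of_god_update(state, optimal_action):
    """Bayesian update of D: hypotheses under which the observed action looks
    optimal in `state` gain weight, the rest lose it."""
    global weights
    logits = BETA * qs[:, state]                               # (P, A)
    likelihood = np.exp(logits - logits.max(axis=1, keepdims=True))
    likelihood /= likelihood.sum(axis=1, keepdims=True)
    weights *= likelihood[:, optimal_action]
    weights /= weights.sum()

# Simulate corrections coming from a hidden "true" reward function.
true_q = q_values(rng.normal(size=(N_STATES, N_ACTIONS)))
for s in rng.integers(N_STATES, size=20):
    hand_of_god_update(s, int(true_q[s].argmax()))

# Posterior uncertainty shrinks with each correction; an action that disables
# the channel forfeits that shrinkage.  The piece still missing here is a
# planner that explicitly values the expected future reduction in uncertainty.
posterior_q = np.einsum("p,psa->sa", weights, qs)
entropy = -(weights * np.log(weights + 1e-12)).sum()
print(f"posterior entropy over reward hypotheses: {entropy:.2f}")
```

The interesting experiment would then be whether a planner that accounts for the information value of future corrections reliably avoids $a_{\text{disable}}$ even when disabling looks locally convenient.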

  1. Different networks for each game
  2. They train for 220k steps for each agent and mention that 100k steps take 7 hours on 4 GPUs (no mention of which GPUs, but maybe an RTX 3090 would be a good guess?)
  3. They don't mention it
  4. They are explicitly motivated by robotics control, so yes, they expect this to help in that direction. I think the main problem is that robotics requires more complicated reward-shaping to obtain desired behaviour. In Atari the reward is already computed for you and you just need to maximise it, whereas when designing a robot to put dishes in a dishwasher the rewards need to be crafted by humans. Going from "Desired Behavior -> Rewards for RL" is harder than "Rewards for RL -> Desired Behavior".
  5. I am somewhat surprised by the simplicity of the three methods described in the paper; I update towards "dumb and easy improvements over current methods can lead to drastic changes in performance".

As far as I can see, their improvements are:

  1. Learn the environment dynamics by self-supervision instead of relying only on reward signals, meaning that they don't learn the dynamics end-to-end like in MuZero: for them, the loss function for the environment dynamics is completely separate from the RL loss function. (I was wrong, they in fact add a similarity loss to the loss function of MuZero that provides extra supervision for learning the dynamics, but gradients from rewards still reach the dynamics and representation networks.) A sketch of this kind of similarity loss is below this list.
  2. Instead of having the dynamics model predict future rewards, have it predict a time-window averaged reward (what they call "value prefix"). This means that the model doesn't need to get the timing of the reward *exactly* right to get a good loss, and so lets the model have a conception of "a reward is coming sometime soon, but I don't quite know exactly when". (See the second sketch below.)
  3. As training progresses, old trajectories sampled with an earlier policy are no longer very useful to the current model, so as each stored trajectory gets older, they replace more of it in memory with a model-predicted continuation (third sketch below). I guess it's like replacing your memories from when you were 10 years old with imagined "what would I have done" sequences; the older the memories, the more of them you replace with your imagined decisions.
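
To make improvement 1 concrete, here is a rough sketch of the kind of similarity loss I mean: a SimSiam-style term that pulls the latent state predicted by the dynamics network towards the latent the representation network produces from the real next observation. This is my own paraphrase under assumed module names (`encoder`, `dynamics`, `projector`, `predictor`), not the authors' code:

```python
import torch
import torch.nn.functional as F

def consistency_loss(encoder, dynamics, projector, predictor, obs, action, next_obs):
    """Self-supervised similarity term added on top of the usual MuZero losses.

    encoder:   observation -> latent state        (representation network)
    dynamics:  (latent, action) -> next latent    (dynamics network)
    projector/predictor: small MLP heads, SimSiam-style.
    All module names here are placeholders, not the paper's identifiers.
    """
    predicted_latent = dynamics(encoder(obs), action)   # imagined next latent
    with torch.no_grad():                                # stop-gradient on the target side
        target_latent = projector(encoder(next_obs))
    online = predictor(projector(predicted_latent))
    # Negative cosine similarity: pull the imagined latent toward the real one.
    return -F.cosine_similarity(online, target_latent, dim=-1).mean()
```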
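
For improvement 2, a minimal sketch of how I imagine the value-prefix target being built: instead of regressing each step's reward, regress the accumulated reward of the unroll window so far, so a reward predicted one step too early or too late still mostly counts. The function names and the exact discounting are my guesses, not taken from the paper:

```python
import torch

def value_prefix_targets(rewards, gamma=0.997):
    """Targets for the 'value prefix' head (my reading of the idea): predict
    the accumulated discounted reward of the window so far rather than r_t at
    each unroll step, so timing errors inside the window are forgiven.

    rewards: (batch, unroll_steps) tensor of environment rewards.
    returns: (batch, unroll_steps) tensor of prefix targets.
    """
    discounts = gamma ** torch.arange(rewards.shape[1], dtype=rewards.dtype)
    return torch.cumsum(rewards * discounts, dim=1)

def value_prefix_loss(predicted_prefix, rewards, gamma=0.997):
    """MSE between the predicted prefix at each unroll step and its target."""
    return torch.nn.functional.mse_loss(
        predicted_prefix, value_prefix_targets(rewards, gamma)
    )
```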
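
And for improvement 3, a loose sketch of the "replace the tail of stale trajectories with the current model's imagination" idea: shrink the horizon of real rewards as the trajectory ages and bootstrap the rest from the current model's value estimate. All names here (`reanalyse_value`, the linear age schedule) are placeholders I made up for illustration:

```python
def corrected_value_target(rewards, horizon_full, age, max_age, reanalyse_value, gamma=0.997):
    """Shrink the n-step bootstrap horizon for stale trajectories and fill the
    rest in with the *current* model's value estimate (my loose paraphrase of
    the off-policy correction).

    rewards:         list of real rewards observed after time t.
    horizon_full:    n-step horizon used for fresh trajectories.
    age / max_age:   how stale this trajectory is relative to the replay buffer.
    reanalyse_value: callable giving the current model's value l steps ahead.
    """
    # The older the trajectory, the fewer real environment steps we trust.
    horizon = max(1, int(horizon_full * (1 - age / max_age)))
    target = sum(gamma ** i * rewards[i] for i in range(min(horizon, len(rewards))))
    # Bootstrap the remainder from the current model instead of stale returns.
    return target + gamma ** horizon * reanalyse_value(horizon)
```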