Geoffrey Irving

Research Director at the UK AI Safety Institute (AISI). Previously, DeepMind, OpenAI, Google Brain, etc.

Wiki Contributions


I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function.  But I also think RL can be sensibly described as trying to increase reward, and generally don't understand the section of the document that says it obviously is not doing that.  And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that cause agents to learn algorithms, then the agents will be trying to increase reward.

Reading through the section again, it seems like the claim is that my first sentence "debate is motivated by agents being optimized to increase reward" is categorically different than "debate is motivated by agents being themselves motivated to increase reward".  But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.

This is a great post!  Very nice to lay out the picture in more detail than LTSP and the previous LTP posts, and I like the observations about the trickiness of the n-way assumption.

I also like the "Is it time to give up?" section.  Though I share your view that it's hard to get around the fundamental issue: if we imagine interpretability tools telling us what the model is thinking, and assume that some of the content that must be communicated is statistical, I don't see how that communication doesn't need some simplifying assumption to be interpretable to humans (though the computation of P(Z) or equivalent could still be extremely powerful).  So then for safety we're left with either (1) probabilities computed from an n-way assumption are powerful enough that the gap to other phenomena the model sees is smaller than available safety margins or (2) something like ELK works and we can restrict the model to only act based on the human-interpretable knowledge base.