Extraction of human preferences 👨→🤖

Hello Matthew,

I'm Mislav, one of the team members who worked on this project. Thank you for your thoughtful comment.

Yes, you understood what we did correctly. We wanted to check whether human preferences are "learned by default" by comparing the performance of a human preference predictor trained just on the environment data and a human preference predictor trained on the RL agent's internal state.

As for your question about environments, I agree with you. There are probably some environments (like the gridworld environment we used) where the human preference is too easy to learn. In other environments, the human preference is too hard to learn, and then there is a middle ground where it is neither.

One of our team members (I think it was Riccardo) had the idea of investigating the following research question: "What kinds of environments are suitable for the agent to learn human preferences by default?" As you stated, in that case it would be useful to investigate the properties (features) of the environment and draw some conclusions about what characterizes the environments where the RL agent can learn human preferences by default.

This is a research direction that could build on our work here.

As for your question about why and how we chose the human preference for a particular environment: to be honest, I think we were mostly guided by our intuition. Nevan and Riccardo experimented with a lot of different environment setups in the VizDoom environment. Arun and I worked on setting up the PySC2 environment. However, training the agent on PySC2 demanded a lot of resources and was pretty unstable, and the VizDoom results turned out to be negative, so we decided not to experiment further on other environments. So to recap, I think we were mostly guided by our intuition about which human preferences would be too easy, too hard, or just right to predict, and we course-corrected based on the experimental results.

Best,
Mislav

Vaniver (4y)

Thanks for sharing negative results!

If I'm understanding you correctly, the structure looks something like this:

We have a toy environment where human preferences are both exactly specified and consequential.
We want to learn how hard it is to discover the human preference function, and whether it is 'learned by default' in an RL agent that's operating in the world and just paying attention to consequences.
One possible way to check whether it's 'learned by default' is to compare the performance of a predictor trained just on environmental data, a predictor trained just on the RL agent's internal state, and a predictor extracted from the RL agent.

The relative performance of those predictors should give you a sense of whether the environment or the agent's internal state gives you a clearer signal of the human's preferences.

It seems to me like there should be some environments where the human preference function is 'too easy' to learn on environmental data (naively, the "too many apples" case should qualify?) and cases where it's 'too hard' (like 'judge how sublime this haiku is', where the RL agent will also probably be confused), and then there's some goldilocks zone where the environmental predictor struggles to capture the nuance and the RL agent has managed to capture the nuance (and so the human preferences can be easily exported from the RL agent).

Does this frame line up with yours? If so, what are the features of the environments that you investigated that made you think they were in the goldilocks zone? (Or what other features would you look for in other environments if you had to continue this research?)

Mislav Jurić (4y)

Technique	Notes	Test AUC
Agent fine-tuning		0.91
Training a CNN on the environment observations	Baseline	0.89
Training the human preference predictor on reduced activations		0.82
Finding agent subnetworks for human preference prediction	75% of original agent network	0.73

Technique	Notes	Test AUC
Agent fine-tuning		0.82
Finding agent subnetworks for human preference prediction	50% of original agent network	0.75
Training a CNN on the environment observations	Baseline	0.82
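
For readers comparing the Test AUC columns above, here is a minimal sketch of how that metric can be computed from a predictor's scores on held-out data. It is not the project's evaluation code: the function name `roc_auc`, the predictor names, and all labels and scores are made up for illustration, and the numbers it prints are unrelated to the results in the tables.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    AUC equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example,
    with ties counted as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up held-out labels (assumption): 1 = the state satisfies the human
# preference, 0 = it does not.
test_labels = [1, 1, 1, 0, 0, 0]

# Made-up scores from two hypothetical predictors, one per input source.
predictor_scores = {
    "cnn_on_observations": [0.9, 0.8, 0.4, 0.5, 0.45, 0.1],
    "probe_on_agent_activations": [0.7, 0.6, 0.3, 0.65, 0.35, 0.1],
}

auc_by_predictor = {
    name: roc_auc(test_labels, scores) for name, scores in predictor_scores.items()
}

# Print the predictors from highest to lowest test AUC.
for name, auc in sorted(auc_by_predictor.items(), key=lambda kv: -kv[1]):
    print(f"{name}: test AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to scores that carry no information about the label and 1.0 to a perfect ranking, so differences between rows are differences in how well each predictor ranks preferred states above non-preferred ones.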

AI ALIGNMENT FORUM