Mislav Jurić — AI Alignment Forum

Hello Matthew,

I'm Mislav, one of the team members that worked on this project. Thank you for your thoughtful comment.

Yes, you understood what we did correctly. We wanted to check whether human preferences are "learned by default" by comparing the performance of a human preference predictor trained just on the environment data and a human preference predictor trained on the RL agent's internal state.

As for your question related to environments, I agree with you. There are probably some environments (like the gridworld environment we used) where the human preference is too easy to learn. On other environments, the human preference is too hard to learn and then there's the golden middle.

One of our team members (I think it was Riccardo) had the idea of investigating the research question which could be posed as follows: "What kinds of environments are suitable for the agent to learn human preferences by default?". As you stated, in that case it would be useful to investigate the properties (features) of the environment and make some conclusions about what characterizes the environments where the RL agent can learn human preferences by default.

This is a research direction that could build up on our work here.

As for your question on why and how did we choose what the human preference will be in a particular environment: to be honest, I think we were mostly guided by our intuition. Nevan and Riccardo experimented with a lot of different environment setups in the VizDoom environment. Arun and me worked on setting up the PySC2 environment, but since training the agent on the PySC2 environment demanded a lot of resources, was pretty unstable and the VizDoom environment results turned out to be negative, we decided not to experiment on other environments further. So to recap, I think that we were mostly guided by our intuition on what would be too easy, too hard or just right of a human preference to predict and we course corrected by the experimental results.

Best,
Mislav

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments