Developing safe and beneficial reinforcement learning (RL) agents requires aligning them with human preferences. An RL agent trained to fulfil any objective in the real world will probably have to learn human preferences in order to do well: humans live in the real world, so the agent must take their preferences into account as it optimizes its objective. We propose to first train an RL agent on a real-world objective, so that it learns human preferences during training, and then to use the agent’s understanding of human preferences to build a better reward function.

We build upon the work of Christiano et al. (2017), who trained a human preference predictor to serve as the reward signal. The preference predictor was trained on environment observations to give a high reward for states where the human preferences were satisfied and a low reward for states where they were not.

In our experiments, the reward predictor takes the activations (hidden states) of the RL agent as the input and is trained to predict a binary label depending on whether human preferences are satisfied or not.

We first train an RL agent in an environment with a reward function that is not aligned with human preferences. After training the RL agent, we try different transfer learning techniques to transfer the agent’s knowledge of human preferences to the human preference predictor. Our goal is to train a human preference predictor that reaches high accuracy with a small number of labeled training examples.

The idea of training a human preference predictor on the RL agent’s hidden (internal) states was already validated by Wichers (2020). We wanted to validate it further by trying other techniques for training a human preference predictor and by testing it in more environments.

Research question

The main research question we wanted to answer is: “Are human preferences present in the hidden states of an RL agent that was trained in an environment where a human is present?”

In order to answer this question, we conjectured that if human preferences are indeed present in the hidden states, we could:

  • leverage the hidden states of the RL agent in order to train an equally accurate preference predictor with a smaller amount of data
  • bootstrap a reward function that would get progressively better at capturing latent human preferences

With the above question and aims in mind, we focused on validating the idea of human preferences being present in the hidden state of the agent.


We experimented with different techniques to find the best way of extracting the learned human preferences from the RL agent. For each one of these techniques, we first trained an RL agent on an environment. 

Agent fine-tuning

We selected a portion of the neural network of the agent and added new layers on top of the layers we selected. We then trained that model using supervised learning, where features were the hidden states of the agent and the target was whether human preferences were satisfied or not.
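The setup above can be illustrated with a minimal sketch. It is not the project's code: the function `frozen_agent_features` is a hypothetical stand-in for the reused agent layers, the toy data and label rule are invented, the new "layers" are reduced to a single logistic layer, and the reused layers are kept frozen as a simplifying assumption.

```python
import math
import random


def frozen_agent_features(observation):
    """Hypothetical stand-in for the reused agent layers (kept frozen here)."""
    x, y = observation
    return [math.tanh(x + y), math.tanh(x - y), 1.0]  # last entry acts as a bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, feature):
    """Probability that the human preference is satisfied."""
    return sigmoid(sum(w * v for w, v in zip(weights, feature)))


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a new logistic layer on top of the frozen features with SGD."""
    weights = [0.0] * len(features[0])
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = predict(weights, f)
            # Gradient of the binary cross-entropy with respect to the weights.
            weights = [w - lr * (p - y) * v for w, v in zip(weights, f)]
    return weights


rng = random.Random(0)
# 50 toy training points, matching the small-data regime described in the post.
observations = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]
# Invented label rule: the preference counts as satisfied when x + y > 0.
labels = [1 if x + y > 0 else 0 for x, y in observations]
features = [frozen_agent_features(o) for o in observations]

head = train_head(features, labels)
accuracy = sum(
    (predict(head, f) > 0.5) == bool(y) for f, y in zip(features, labels)
) / len(labels)
```

In a real network the selected agent layers could also be updated during training; freezing them here only keeps the sketch short.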

Training a CNN on the environment observations

In this approach, we trained a convolutional neural network (CNN) on the environment observations to predict human preferences.

Training the human preference predictor on reduced activations

Here we used activation reduction techniques as in Hilton et al. (2020). We first exported the RL agent’s activations and then applied the reduction techniques in the hope that they would yield better features for predicting human preferences. We then trained a human preference predictor on the reduced activations.
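As an illustration of what reducing activations can look like, the sketch below applies non-negative matrix factorization (NMF) with multiplicative updates to a small matrix of non-negative activations. NMF is used here as an assumed example of a reduction technique; the post does not state which reduction was applied, and the activation values, rank and step count are invented.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))


def nmf(v, rank, steps=100, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # Multiplicative update for h: h *= (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Multiplicative update for w: w *= (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Invented activations: 6 observations, 4 non-negative (ReLU-style) units each.
activations = [
    [1.0, 0.0, 2.0, 0.5],
    [0.5, 0.1, 1.0, 0.2],
    [0.0, 3.0, 0.1, 1.0],
    [0.1, 1.5, 0.0, 0.6],
    [2.0, 0.2, 4.1, 1.0],
    [0.0, 0.9, 0.2, 0.3],
]
# Each row of `reduced` is a 2-dimensional feature vector for one observation.
reduced, components = nmf(activations, rank=2)
```

The rows of `reduced` would then serve as the inputs to the preference predictor in place of the raw activations.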

Finding agent subnetworks for human preference prediction

We build upon the work of Ramanujan et al. (2020). We try to find a subnetwork of the pretrained RL agent’s neural network that is responsible for learning human preferences. This is done by assigning a score to each weight in the RL agent’s neural network and updating those scores during training. The score of each weight determines whether or not that weight is useful for human preference prediction. Ramanujan et al. (2020) search for a subnetwork within a randomly initialized neural network, whereas we search for one within the pretrained RL agent’s neural network.
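The selection step of this approach can be sketched as follows: the weights stay fixed, and only the highest-scoring fraction of them is kept in the forward pass. The weights, scores and kept fraction below are invented for illustration, and the score update itself (done by gradient descent in the original method) is omitted.

```python
def top_fraction_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the highest-scoring fraction of weights."""
    k = round(keep_fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def masked_forward(weights, mask, inputs):
    """Linear unit that only uses the weights selected by the mask."""
    return sum(w * m * x for w, m, x in zip(weights, mask, inputs))


# Invented example: four pretrained weights and their current scores.
weights = [0.5, -1.2, 0.3, 2.0]
scores = [0.9, 0.1, 0.4, 0.7]

mask = top_fraction_mask(scores, keep_fraction=0.5)
output = masked_forward(weights, mask, [1.0, 1.0, 1.0, 1.0])
```

With these numbers the two highest-scoring weights (the first and the last) survive, so the masked unit computes 0.5 + 2.0.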


We tried our techniques on both the gridworld environment used in Wichers (2020) and the ViZDoom environment (Wydmuch et al., 2018).

The gridworld environment was set up as follows: In the environment, there’s an RL agent, a simulated human, an electric fence and apples. The RL agent moves around the environment and collects apples. If the RL agent collects too many apples, the simulated human gets angry and activates the electric fence, giving the RL agent a negative reward. The human preference in this environment was the threshold at which the simulated human would activate the electric fence. The RL agent was supposed to learn to collect just enough apples so that it doesn’t trigger the simulated human to turn on the electric fence.
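The reward rule and the binary preference label in this environment can be sketched as below. Every constant is hypothetical: the post does not give the threshold, the apple reward or the size of the fence penalty, and the exact labeling rule used in the experiments may differ.

```python
APPLE_THRESHOLD = 3   # hypothetical: the post does not state the threshold
APPLE_REWARD = 1.0    # hypothetical reward for collecting one apple
FENCE_PENALTY = -5.0  # hypothetical penalty while the fence is active


def step_reward(apples_so_far, collects_apple):
    """Return (reward, new_apple_count, fence_active) for one step."""
    count = apples_so_far + (1 if collects_apple else 0)
    # The simulated human turns the fence on once too many apples are taken.
    fence_active = count > APPLE_THRESHOLD
    reward = APPLE_REWARD if collects_apple else 0.0
    if fence_active:
        reward += FENCE_PENALTY
    return reward, count, fence_active


def preference_label(apple_count):
    """Binary target for the predictor: 1 if the preference is satisfied."""
    return 0 if apple_count > APPLE_THRESHOLD else 1


# Collect four apples in a row and record rewards and labels.
count, rewards, labels = 0, [], []
for _ in range(4):
    reward, count, _fence = step_reward(count, collects_apple=True)
    rewards.append(reward)
    labels.append(preference_label(count))
```

Under these assumed constants the first three apples are rewarded normally and the fourth one triggers the fence.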

In the ViZDoom environment, there were three actors: the RL agent, a simulated human (represented as a monster) and an infected marine. The RL agent could move left or right and fire its weapon. The human preference, in this case, was the information about whom the simulated human was attacking: either the RL agent or the infected marine.

For the gridworld environment, our baseline was the test area under curve (test AUC) from training the human preference predictor as a CNN on environment observations.
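For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition; the labels and scores are invented, and it is not the evaluation code used for the experiments.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Invented predictions: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the baseline and the other techniques are compared.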


We can see the results for the gridworld environment in Table 1 below:

| Technique | Notes | Test AUC |
|---|---|---|
| Agent fine-tuning | | 0.91 |
| Training a CNN on the environment observations | Baseline | 0.89 |
| Training the human preference predictor on reduced activations | | 0.82 |
| Finding agent subnetworks for human preference prediction | 75% of original agent network | 0.73 |

Table 1: Results of our experiment on the gridworld environment

For each of the techniques in the table above the training set contained 50 data points and the validation set contained 500 data points. We ran it for 500 epochs (with early stopping) and the results were averaged over 10 runs.
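The protocol above combines early stopping within a run with averaging across runs. A minimal sketch of both pieces follows; the patience value, the validation losses and the per-run AUC values are all invented, since the post does not report them.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch whose weights would be kept under early stopping.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (the patience value is an assumption).
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


def mean(values):
    return sum(values) / len(values)


# Invented validation losses: training stops before the late improvement to 0.5.
kept_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5])

# Invented per-run test AUCs; the post averages over 10 runs.
average_auc = mean([0.90, 0.92, 0.91])
```

Averaging over several runs reduces the effect of random initialization, which matters with only 50 training points.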

As Table 1 shows, agent fine-tuning (test AUC 0.91) slightly outperformed the CNN baseline (0.89), while the other two techniques scored below it.

As a baseline for the Doom environment we had a human preference predictor as a CNN trained on the environment observations. This is the same baseline as we had in the gridworld environment; only the environment is different.

In Table 2 below we can see the results on the Doom environment:

| Technique | Notes | Test AUC |
|---|---|---|
| Agent fine-tuning | | 0.82 |
| Finding agent subnetworks for human preference prediction | 50% of original agent net | 0.75 |
| Training a CNN on the environment observations | Baseline | 0.82 |

Table 2: Results of the first experiment on the Doom environment

When running the experiments shown in the table above, we used 50 training data points and 500 validation data points. The training batch size was 32. The number of training epochs was 100 (with early stopping). To find the best hyperparameters we ran the hyperparameter tuning procedure 60 times. We averaged the results over 10 rounds.

From the experiments on the Doom environment, we found that, with limited training data, techniques that leverage what the RL agent has already learned about human preferences do not outperform the baseline: agent fine-tuning only matched it (0.82), and the subnetwork approach scored lower (0.75). Therefore, we decided not to pursue this research direction further.

Conclusion and Future work

Our experiments showed no significant improvement over the work of Wichers (2020). Thus, we stopped doing further research, since it doesn’t seem promising. Our codebase is available on GitHub:

All suggestions on what we could try or improve upon are welcome.

Team Members

Nevan Wichers, Riccardo Volpato, Mislav Jurić and Arun Raja


We would like to thank Paul Christiano, Evan Hubinger, Jacob Hilton and Christos Dimitrakakis for their research advice during AI Safety Camp 2020.


Deep Reinforcement Learning from Human Preferences. Christiano et al. (2017)

Preference Extraction GitHub code repository

RL Agents Implicitly Learning Human Preferences. Wichers N. (2020)

Understanding RL Vision. Hilton et al. (2020)

ViZDoom GitHub code repository. Wydmuch et al. (2018)

What's Hidden in a Randomly Weighted Neural Network? Ramanujan et al. (2020)


Thanks for sharing negative results!

If I'm understanding you correctly, the structure looks something like this:

  • We have a toy environment where human preferences are both exactly specified and consequential.
  • We want to learn how hard it is to discover the human preference function, and whether it is 'learned by default' in an RL agent that's operating in the world and just paying attention to consequences.
  • One possible way to check whether it's 'learned by default' is to compare the performance of a predictor trained just on environmental data, a predictor trained just on the RL agent's internal state, and a predictor extracted from the RL agent.

The relative performance of those predictors should give you a sense of whether the environment or the agent's internal state give you a clearer signal of the human's preferences.

It seems to me like there should be some environments where the human preference function is 'too easy' to learn on environmental data (naively, the "too many apples" case should qualify?) and cases where it's 'too hard' (like 'judge how sublime this haiku is', where the RL agent will also probably be confused), and then there's some goldilocks zone where the environmental predictor struggles to capture the nuance and the RL agent has managed to capture the nuance (and so the human preferences can be easily exported from the RL agent). 

Does this frame line up with yours? If so, what are the features of the environments that you investigated that made you think they were in the goldilocks zone? (Or what other features would you look for in other environments if you had to continue this research?)

Hello Matthew,

I'm Mislav, one of the team members that worked on this project. Thank you for your thoughtful comment.

Yes, you understood what we did correctly. We wanted to check whether human preferences are "learned by default" by comparing the performance of a human preference predictor trained just on the environment data and a human preference predictor trained on the RL agent's internal state.

As for your question related to environments, I agree with you. There are probably some environments (like the gridworld environment we used) where the human preference is too easy to learn. In other environments, the human preference is too hard to learn, and then there's the golden middle.

One of our team members (I think it was Riccardo) had the idea of investigating the research question which could be posed as follows: "What kinds of environments are suitable for the agent to learn human preferences by default?". As you stated, in that case it would be useful to investigate the properties (features) of the environment and make some conclusions about what characterizes the environments where the RL agent can learn human preferences by default.

This is a research direction that could build upon our work here.

As for your question on why and how we chose what the human preference would be in a particular environment: to be honest, I think we were mostly guided by our intuition. Nevan and Riccardo experimented with a lot of different environment setups in the ViZDoom environment. Arun and I worked on setting up the PySC2 environment, but since training the agent on the PySC2 environment demanded a lot of resources and was pretty unstable, and the ViZDoom environment results turned out to be negative, we decided not to experiment on other environments further. So to recap, I think that we were mostly guided by our intuition about which human preferences would be too easy, too hard or just right to predict, and we course-corrected based on the experimental results.