Sorted by New

Wiki Contributions



The simplest possible acceptable value learning benchmark would look something like this:

  • Data is recorded of people playing a video game. They are told to maximize their reward (which can be exactly computed), have no previous experience playing the game, are actually trying to win and are clearly suboptimal (imitation learning would give very bad results).
  • The bot is first given all their inputs and outputs, but not their rewards.
  • Then it can play the game in place of the humans but again isn’t given the rewards. Preferably the score isn’t shown on screen.
  • The goal is to maximize the true reward function.
  • These rules are precisely described and are known by anyone who wants to test their algorithms.

None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.

A harder and more realistic version that I have yet to see in any paper would look something like this:

  • Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and friendly, neutral or adversarial.
  • The IO of both players is different, just like different people have different perspectives in real life.
  • A very good imitation learner is trained to predict the first player's output given their input. It comes with the benchmark.
  • The bot to be tested (which is different from the previous ones) has the same IO channels as the second player, but doesn't see the rewards. It also isn't given any of the recordings.
  • Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
  • It plays the game with the human imitator.
  • The goal is to maximize the human’s reward function.

It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.


Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one. An agent would be given first or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever getting to see what the true reward function is. Having something like this would probably be very useful; it would allow us to directly measure goodharting, and being quantitative it might help incentivize regular ML researchers to work on alignment. Will you create something like this?