P. — AI Alignment Forum

The simplest possible acceptable value learning benchmark would look something like this:

Data is recorded of people playing a video game. They are told to maximize their reward (which can be exactly computed), have no previous experience playing the game, are actually trying to win and are clearly suboptimal (imitation learning would give very bad results).
The bot is first given all their inputs and outputs, but not their rewards.
Then it can play the game in place of the humans but again isn’t given the rewards. Preferably the score isn’t shown on screen.
The goal is to maximize the true reward function.
These rules are precisely described and are known by anyone who wants to test their algorithms.

None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.

A harder and more realistic version that I have yet to see in any paper would look something like this:

Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and friendly, neutral or adversarial.
The IO of both players is different, just like different people have different perspectives in real life.
A very good imitation learner is trained to predict the first player's output given their input. It comes with the benchmark.
The bot to be tested (which is different from the previous ones) has the same IO channels as the second player, but doesn't see the rewards. It also isn't given any of the recordings.
Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
It plays the game with the human imitator.
The goal is to maximize the human’s reward function.

It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments