[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

BASALT: A Benchmark for Learning from Human Feedback (Rohin Shah et al) (summarized by Rohin): A typical argument for AI risk, given in Human Compatible (AN #69), is that current AI systems treat their specifications as definite and certain, even though they are typically misspecified. This state of affairs can lead to the agent pursuing instrumental subgoals (AN #107). To solve this, we might instead build AI systems that continually learn the objective from human feedback. This post and paper (on which I am an author) present the MineRL BASALT competition, which aims to promote research on algorithms that learn from human feedback.

BASALT aims to provide a benchmark with tasks that are realistic in the sense that (a) it is challenging to write a reward function for them and (b) there are many other potential goals that the AI system “could have” pursued in the environment. Criterion (a) implies that we can’t have automated evaluation of agents (otherwise that could be turned into a reward function) and so suggests that we use human evaluation of agents as our ground truth. Criterion (b) suggests choosing a very “open world” environment; the authors chose Minecraft for this purpose. They provide task descriptions such as “create a waterfall and take a scenic picture of it”; it is then up to researchers to create agents that solve this task using any method they want. Human evaluators then compare two agents against each other and determine which is better. Agents are then given a score using the TrueSkill system.

The authors provide a number of reasons to prefer the BASALT benchmark over more traditional benchmarks like Atari or MuJoCo:

1. In Atari or MuJoCo, there are often only a few reasonable goals: for example, in Pong, you either hit the ball back, or you die. If you’re testing algorithms that are meant to learn what the goal is, you want an environment where there could be many possible goals, as is the case in Minecraft.

2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

3. The “true reward function” in Atari or MuJoCo is often not a great evaluation: for example, a Hopper policy trained to stand still using a constant reward gets 1000 reward! Human evaluations should not be subject to the same problem.

4. Since the tasks were chosen to be inherently fuzzy and challenging to formalize, researchers are allowed to take whatever approach they want to solving the task, including “try to write down a reward function”. In contrast, for something like Atari or MuJoCo, you need to ban such strategies. The only restriction is that researchers cannot extract additional state information from the Minecraft simulator.

5. Just as we’ve overestimated few-shot learning capabilities (AN #152) by tuning prompts on large datasets of examples, we might also be overestimating the performance of algorithms that learn from human feedback because we tune hyperparameters on the true reward function. Since BASALT doesn’t have a true reward function, this is much harder to do.

6. Since Minecraft is so popular, it is easy to hire Minecraft experts, allowing us to design algorithms that rely on expert time instead of just end user time.

7. Unlike Atari or MuJoCo, BASALT has a clear path to scaling up: the tasks can be made more and more challenging. In the long run, we could aim to deploy agents on public multiplayer Minecraft servers that follow instructions or assist with whatever large-scale project players are working on, all while adhering to the norms and customs of that server.

Rohin's opinion: You won’t be surprised to hear that I’m excited about this benchmark, given that I worked on it. While we listed a bunch of concrete advantages in the post above, I think many (though not all) of the advantages come from the fact that we are trying to mimic the situation we face in the real world as closely as possible, so there’s less opportunity for Goodhart’s Law to strike. For example, later in this newsletter we’ll see that synthetically generated demos are not a good proxy for human demos. Even though this is the norm for existing benchmarks, and we didn’t intentionally try to avoid this problem, BASALT (mostly) avoids it. With BASALT you would have to go pretty far out of your way to get synthetically generated demos, because by design the tasks are hard to complete synthetically, and so you have to use human demos.

I’d encourage readers to participate in the competition, because I think it’s especially good as a way to get started with ML research. It’s a new benchmark, so there’s a lot of low-hanging fruit in applying existing ideas to the benchmark, and in identifying new problems not present in previous benchmarks and designing solutions to them. It’s also pretty easy to get started: the BC baseline is fairly straightforward and takes a couple of hours to be trained on a single GPU. (That’s partly because BC doesn’t require environment samples; something like GAIL (AN #17) would probably take a day or two to train instead.)

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

What Matters for Adversarial Imitation Learning? (Manu Orsini, Anton Raichuk, Léonard Hussenot et al) (summarized by Rohin): This paper takes adversarial imitation learning algorithms (think GAIL (AN #17) and AIRL (AN #17)) and tests the effect of various hyperparameters, including the loss function, the discriminator regularization scheme, the discriminator learning rate, etc. They first run a large, shallow hyperparameter sweep to identify reasonable ranges of values for the various hyperparameters, and then run a larger hyperparameter sweep within these ranges to get a lot of data that they can then analyze. All the experiments are done on two continuous control benchmarks: the MuJoCo environments in OpenAI Gym and manipulation environments from Adroit.

Obviously they have a lot of findings, and if you spend time working with adversarial imitation learning algorithms, I’d recommend reading through the full paper, but the ones they highlight are:

1. Even though some papers have proposed regularization techniques that are specific to imitation learning, standard supervised learning techniques like dropout work just as well.

2. There are significant differences in the results when using synthetic demonstrations vs. human demonstrations. (A synthetic demonstration is one provided by an RL agent trained on the true reward.) For example, the optimal choice of loss function is different for synthetic demos vs. human demos. Qualitatively, human demonstrations are not Markovian and are often multimodal (especially when the human waits and thinks for some time: in this case one mode is “noop” and the other mode is the desired action).

Rohin's opinion: I really like this sort of empirical analysis: it seems incredibly useful for understanding what does and doesn’t work.

Note that I haven’t looked deeply into their results and analysis, and am instead reporting what they said on faith. (With most papers I at least look through the experiments to see if the graphs tell a different story or if there were some unusual choices not mentioned in the introduction, but that was a bit of a daunting task for this paper, given how many experiments and graphs it had.)

Prompting: Better Ways of Using Language Models for NLP Tasks (Tianyu Gao) (summarized by Rohin): Since the publication of GPT-3 (AN #102), many papers have been written about how to select the best prompt for large language models to have them solve particular tasks of interest. This post gives an overview of this literature. The papers can be roughly divided into two approaches: first, we have discrete prompts, where you search for a sequence of words that forms an effective prompt; these are “discrete” since words are discrete. Second, we have soft prompts, where you search within the space of embeddings of words for an embedding that forms an effective prompt; since embeddings are vectors of real numbers they are continuous (or “soft”) and can be optimized through gradient descent (unlike discrete prompts).

Interactive Explanations: Diagnosis and Repair of Reinforcement Learning Based Agent Behaviors (Christian Arzate Cruz et al) (summarized by Rohin): Many papers propose new algorithms that can better leverage human feedback to learn a good policy. This paper instead demonstrates an improved user interface so that the human provides better feedback, resulting in a better policy, on the game Super Mario Bros. Specifically:

1. The user can see the behavior of the agent and rewind / pause to find a place where the agent took a poor action.

2. The system generates an explanation in terms of the underlying state variables that explains why the agent chose the action it chose, relative to the second best action. It can also explain why it didn’t take a particular action.

3. The user can tell the agent that it should have taken some other action, and the agent will be trained on that instruction.

The authors conduct a user study and demonstrate that users find it intuitive to correct “bugs” in a policy using this interface.

Rohin's opinion: This seems like a great line of research to me. While step 2 isn’t really scalable, since it requires access to the underlying simulator state, steps 1 and 3 seem doable even at scale (e.g. I can imagine how they would be done in Minecraft from pixels), and it seems like this should significantly improve the learned policies.

AI GOVERNANCE

Debunking the AI Arms Race Theory (Paul Scharre) (summarized by Sudhanshu): This article, published recently in the Texas National Security Review, argues that various national trends of military spending on AI do not meet the traditional definition of an 'arms race'. However, the current situation can be termed a security dilemma, a "more generalized competitive dynamic between states." The article identifies two ways in which race-style dynamics in AI competition towards the aims of national security might create new risks: (i) a need for increasingly rapid decision-making might leave humans with diminished control or 'out of the loop'; and (ii) the pressure to quickly improve military AI capabilities could result in sacrificing supplementary goals like robustness and reliability, leading to unsafe systems being deployed.

The article offers the following strategies as panaceas to such dynamics. Competing nations should institute strong internal processes to ensure their systems are robust and secure, and that human control can be maintained. Further, nations should encourage other countries to take similar steps to mitigate these risks within their own militaries. Finally, nations should cooperate in regulating the conduct of war to avoid mutual harm. It concludes after citing several sources that advocate for the US to adopt these strategies.

Sudhanshu's opinion: I think the headline was chosen by the editor and not the author: the AI arms race 'debunking' is less than a fourth of the whole article, and it's not even an important beat of the piece; instead, the article is about how use of technology/AI/deep learning for military applications in multipolar geopolitics can actually result in arms-race-style dynamics and tangible risks.

Even so, I'm not convinced that the traditional definition of 'arms race' isn't met. The author invokes percentage growth in military spending of more than 10% over the previous year as a qualifying criterion for an arms race, but then compares this with the actual spending of 0.7% of the US military budget on AI in 2020 to make their case that there is no arms race. These two are not comparable; at the very least, we would need to know the actual spending on AI by the military across two years to see at what rate this spending changed, and whether or not it then qualifies to be an arms race.

NEWS

Hypermind forecasting contest on AI (summarized by Rohin): Hypermind is running a forecasting contest on the evolution of artificial intelligence with a $30,000 prize over four years. The questions ask both about the growth of compute and about performance on specific benchmarks such as the MATH suite (AN #144).

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.