How DeepMind's Generally Capable Agents Were Trained

Thanks for writing this up, I found this summary much clearer and more approachable than the paper. I also basically agree with your own points, with the caveat that I think the distinction between curiosity and curriculum gets blurry in meta-learning contexts like this. I wish there were better metrics and benchmarks for data efficiency in this regard, so that we could do things like measure improvements in units of that metric.

I’m pretty pessimistic about this line of research for a number of reasons that I think support and complement the ones you mentioned. (I used to work on much smaller/simpler RL in randomized settings: https://openai.com/blog/safety-gym/)

My first pessimism is that this research is super expensive (I believe prohibitively so) to scale to real-world settings. In this case, a huge amount of effort went into designing and training the models and building the environment distribution (plus a huge amount of effort tuning it, since that always ends up being a big issue in RL in my experience). The path forward is to make more and more realistic worlds, which is extremely difficult, since RL agents will learn to hack physics simulators like there’s no tomorrow. [Footnote: I don’t have the reference on hand, but someone’s been tracking “RL Agents Hacking Environments” in a Google Sheet, and I feel that’s relevant here.] Even in the super-constrained world of autonomous vehicles, it has taken huge teams, tons of resources, and years to make simulators that are good enough to do real-world training on, and probably none of those have complex contact physics!

My second pessimism is that these agents are likely to be very hard to align, in that we will have a hard time specifying human values in terms of the limited goal syntax (let alone evaluating every partial sequence of human values, like you point out). There’s going to be a huge inferential gap between humans and almost every approach to building these systems.

My third pessimism comes from my experience working with Deep RL: the huge data-efficiency problem, which arises in part because the agent needs to figure out every aspect of the world from scratch. (This is in addition to the fact that it doesn’t have much incentive to understand the parts of the world that don’t seem relevant to the goal.) It seems almost certain that we’ll need to somehow give these systems a high-quality prior, either via pretraining or otherwise. In this case, my pessimism is about the lack of external knowledge used as a prior, which is fixable by swapping out some part of the system for a pretrained model.

(As a hypothetical example, having the goal network include a pretrained language model, and specifying the goals in natural language, would make me less pessimistic about it understanding human values.)

I feel like Alex Irpan’s great “Deep Reinforcement Learning Doesn’t Work Yet” is still true here, so linking that too: https://www.alexirpan.com/2018/02/14/rl-hard.html

I came here from this post: https://www.lesswrong.com/posts/WerwgmeYZYGC2hKXN/promising-posts-on-af-that-have-fallen-through-the-cracks

Unsolicited feedback on the post (please feel free to totally ignore this or throw it out): I thought this was well written and clear. I am both lazy and selfish, and when authors have crisp ideas, I like it when they are represented graphically, even in simple diagrams. At the very least, I think it’s nice to include the most crucial graphs/plots of a paper when summarizing it, and I would have liked those here, too. So +100 in case you were wondering about including things like paper screenshots or simple doodles of concepts.
