What are some non-purely-sampling ways to do deep RL?

This doesn't strike directly at the sampling question, but it is related to several of your ideas about incorporating the differentiable function: Neural Ordinary Differential Equations.

This is being exploited most heavily in the Julia community. The broader pitch is that they have formalized the relationship between differential equations and neural networks. This allows things like:

- applying differential equation tricks to computing the outputs of neural networks
- using neural networks to solve pieces of differential equations
- using differential equations to specify the weighting of information

The last one is the most intriguing to me, mostly because it solves the problem of machine learning models having to start from scratch even in environments where information about the environment's structure is known. For example, you can provide it with Maxwell's Equations and then it "knows" electromagnetism.
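To make the differential-equation/neural-network link concrete, here is a minimal sketch in plain Python (not the Julia libraries mentioned in this answer). It treats a stack of residual layers with shared weights as forward-Euler integration of dh/dt = f(h) from t = 0 to t = 1. The one-unit `layer`, the weights `w` and `b`, and the function names are all illustrative assumptions, not anything taken from the paper or from DifferentialEquations.jl.

```python
import math


def layer(h, w, b):
    """A one-unit 'layer' f(h) = tanh(w * h + b); w and b are made-up weights."""
    return math.tanh(w * h + b)


def residual_stack(h0, w, b, num_layers):
    """Residual network with shared weights: h <- h + (1/N) * f(h).

    Each residual update is one forward-Euler step of size 1/N for the
    ODE dh/dt = f(h), integrated from t = 0 to t = 1.
    """
    h = h0
    dt = 1.0 / num_layers
    for _ in range(num_layers):
        h = h + dt * layer(h, w, b)
    return h


if __name__ == "__main__":
    # Deeper stacks approximate the same continuous trajectory more closely.
    for depth in (2, 10, 100, 1000):
        print(depth, residual_stack(0.5, -1.0, 0.2, depth))
```

As the number of layers grows, the output settles toward the solution of the underlying ODE, which is the sense in which ODE solver tricks (adaptive step sizes and so on) can stand in for a fixed stack of layers.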

There is a blog post about the paper and using it with the DifferentialEquations.jl and Flux.jl libraries. There is also a good talk by Christopher Rackauckas about the approach.

It is mostly about using ML in the physical sciences, which seems to be going by the name Scientific ML now.

For the Alignment Newsletter:

Summary: A deep reinforcement learning agent trained by reward samples alone may predictably lead to a proxy alignment issue: the learner could fail to develop a full understanding of what behavior it is being rewarded for, and thus behave unacceptably when it is taken off its training distribution. Since we often use explicit specifications to define our reward functions, Evan Hubinger asks how we can incorporate this information into our deep learning models so that they remain aligned off the training distribution. He names several possibilities for doing so, such as giving the deep learning model access to a differentiable copy of the reward function during training, and fine-tuning a language model so that it can map natural language descriptions of a reward function into optimal actions.

Opinion: I'm unsure, though leaning skeptical, whether incorporating a copy of the reward function into a deep learning model would help it learn. My guess is that if someone did that with a current model it would make the model harder to train, rather than making anything easier. I will be excited if someone can demonstrate at least one feasible approach to addressing proxy alignment that does more than sample the reward function.

My opinion (also going in the newsletter):

I'm skeptical of this approach. Mostly this is because I'm generally skeptical that an intelligent agent will consist of a separate "planning" part and "reward" part. However, if that were true, then I'd think that this approach could plausibly give us some additional alignment, but can't solve the entire problem of inner alignment. Specifically, the reward function encodes a _huge_ amount of information: it specifies the optimal behavior in all possible situations you could be in. The "intelligent" part of the net is only ever going to get a subset of this information from the reward function, and so its plans can never be perfectly optimized for that reward function, but instead could be compatible with any reward function that would provide the same information on the "queries" that the intelligent part has produced.

For a slightly-more-concrete example, for any "normal" utility function U, there is a utility function U' that is "like U, but also the best outcomes are ones in which you hack the memory so that the 'reward' variable is set to infinity". To me, wireheading is possible because the "intelligent" part doesn't get enough information about U to distinguish U from U', and so its plans could very well be optimized for U' instead of U.

This is rather abstract / complex so I'd be interested in suggestions for how to make it more understandable.

You mean stuff like model-predictive control and planning? You can use backprop to do gradient ascent over a sequence of actions if you have a differentiable environment and/or reward model. This also has a lot of application to image CNNs: reversing GANs to encode an image for editing, optimizing to maximize a particular class (like maximally 'dog' or 'NSFW' images) etc. I cover some of the uses and history in https://www.gwern.net/Faces#reversing-stylegan-to-control-modify-images
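As a rough illustration of gradient ascent over an action sequence, here is a toy sketch in plain Python. It assumes a made-up 1-D environment with known dynamics x_{t+1} = x_t + a_t and a known reward (negative squared distance to a goal at the end, minus a quadratic action cost). The gradient is derived by hand for this toy case and stands in for what backprop would compute through a differentiable environment or reward model; all names and constants are hypothetical.

```python
def rollout(x0, actions):
    """Known toy dynamics: each action is simply added to the state."""
    x = x0
    for a in actions:
        x = x + a
    return x


def total_reward(x0, actions, goal, action_cost):
    """Known reward: end near the goal while keeping actions small."""
    x_final = rollout(x0, actions)
    return -(x_final - goal) ** 2 - action_cost * sum(a * a for a in actions)


def reward_gradient(x0, actions, goal, action_cost):
    """Hand-derived d(total_reward)/d(a_t); note d(x_final)/d(a_t) = 1 here."""
    x_final = rollout(x0, actions)
    return [-2.0 * (x_final - goal) - 2.0 * action_cost * a for a in actions]


def plan(x0, goal, horizon=5, action_cost=0.1, lr=0.05, steps=500):
    """Gradient ascent directly on the action sequence, with no reward sampling."""
    actions = [0.0] * horizon
    for _ in range(steps):
        grad = reward_gradient(x0, actions, goal, action_cost)
        actions = [a + lr * g for a, g in zip(actions, grad)]
    return actions


if __name__ == "__main__":
    best = plan(x0=0.0, goal=1.0)
    print(best, total_reward(0.0, best, 1.0, 0.1))
```

For this toy problem the optimum can be checked by hand: setting the gradient to zero gives every action equal to (goal - x0) / (horizon + action_cost). With a real environment the gradient would come from automatic differentiation rather than a closed form.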

My most recent suggestion in this vein was about OpenAI/Christiano's preference learning: using gradient ascent directly on trajectories/strings, which avoids explicit sampling and rating in an environment.

Hmmm... not sure if this is exactly what I want. I'd prefer not to assume too much about the environment dynamics. I'm not sure whether this is related to what you're talking about, but one way to do model-based planning with an explicit reward function, without assuming much about the environment dynamics, might be to learn all the dynamics necessary for model-based planning in a model-free way (like MuZero) *except* for the reward function, and then include the reward function explicitly.
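A very small sketch of that split, in plain Python: the transition model is learned from sampled experience, while the reward is an explicit function the planner calls directly. This is not MuZero (there is no latent state, no neural network, and no tree search with a learned value function); the five-state chain environment, the lookup-table model, and every name here are assumptions made up for illustration.

```python
import itertools
import random

N_STATES = 5
ACTIONS = (-1, +1)


def true_step(state, action):
    """Ground-truth dynamics; only used to generate experience."""
    return min(max(state + action, 0), N_STATES - 1)


def reward(state):
    """Known, explicit reward function: used as code, never learned."""
    return 1.0 if state == N_STATES - 1 else 0.0


def learn_dynamics(num_transitions=500, seed=0):
    """Learn a transition table from randomly sampled (state, action) pairs."""
    rng = random.Random(seed)
    model = {}
    for _ in range(num_transitions):
        s = rng.randrange(N_STATES)
        a = rng.choice(ACTIONS)
        model[(s, a)] = true_step(s, a)
    return model


def plan_first_action(model, state, horizon=4):
    """Exhaustively search action sequences using learned dynamics + explicit reward."""
    best_return, best_action = float("-inf"), None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model.get((s, a), s)  # unseen transition: assume the state stays put
            total += reward(s)        # reward comes from the explicit function
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action


if __name__ == "__main__":
    model = learn_dynamics()
    print(plan_first_action(model, state=0))
```

One caveat this sketch hides: a MuZero-style model plans in a learned latent space, so applying an explicit reward function there would need some way to map latent states back to something the reward code can read.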

Yep—that's the adversarial training approach to this problem. The problem is that you might not be able to sample all the relevant highly uncertain points (e.g. because you don't know exactly what the deployment distribution will be), which means you have to do some sort of relaxed adversarial training instead, which introduces its own issues.

Conventionally in machine learning, if you want to learn to minimize some loss or maximize some expected return, you do so by sampling a bunch of losses/rewards and training on those. Since the model only ever sees the loss or reward function through the lens of those specific samples, this basic approach introduces a proxy alignment problem.

For example, suppose you train an RL agent to maximize its future discounted return according to some reward function r. Furthermore, suppose there exists some other reward function r′ such that r and r′ give equivalent samples on the training distribution, but diverge elsewhere. If you just train your agent via evaluating r on a bunch of samples, however, then even if your model is in some sense trying to do the right thing, it has no possible way of knowing whether r or r′ is the right generalization.
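A toy version of this in plain Python, with two made-up reward functions standing in for r and r′: they return identical values on every state drawn from the training distribution (here assumed to be the interval [0, 1]) but disagree outside it. The specific functions and the choice of interval are illustrative assumptions only.

```python
import random


def r(state):
    """Intended reward: stay close to 0.5, everywhere."""
    return -abs(state - 0.5)


def r_prime(state):
    """Proxy reward: same as r on [0, 1], but rewards running off to large states."""
    if 0.0 <= state <= 1.0:
        return -abs(state - 0.5)
    return abs(state)


# Training distribution: states sampled uniformly from [0, 1].
rng = random.Random(0)
train_states = [rng.uniform(0.0, 1.0) for _ in range(1000)]

# No amount of sampling from this distribution can tell r and r_prime apart.
agree_on_train = all(r(s) == r_prime(s) for s in train_states)

# Off the training distribution, the two rewards diverge.
off_distribution_state = 3.0
diverge_off_train = r(off_distribution_state) != r_prime(off_distribution_state)

if __name__ == "__main__":
    print(agree_on_train, diverge_off_train)
```

An agent that only ever sees the sampled reward values receives exactly the same training signal under either function, which is the ambiguity the paragraph above describes.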

In many cases, however, we know exactly what r is: we have explicit code for it and everything (or at least some sort of natural language description of it). Yet we still only make use of r via sampling/evaluation. Of course, in many RL settings, you actually do only know how to evaluate r, not inspect it in any other way. However, I think a system that only works in settings where you have more access to the reward function than that can still do quite a lot. Even if you explicitly know an environment’s reward function, it can still be quite difficult to figure out the optimal policy (think Go, for example), so an ML system that can figure it out for you is quite powerful.

So, here’s my question: at least for environments in which you have a known reward function, what are some ways of making use of that information in training a deep learning model other than evaluating that reward function on a bunch of samples? I’m also interested in ways of doing this in non-RL settings, though I still mostly only want to focus on deep learning approaches—there are certainly ways of doing this in more classical machine learning, but I’m less interested in those.

Some possibilities that I’ve considered so far:

[…] except for the reward function and then include the reward function explicitly.

I’m sure there are other possibilities that I haven’t thought about yet, however, possibly including papers on this in the literature that I’m not familiar with. Any ideas?