My students and collaborators have been doing some particularly awesome work over the past several months, and to highlight that I wanted to summarize their papers here and explain why I’m excited about them. There are six papers in three categories.
I'll go over the first category (human-aligned AI) today, and save the other two, including the science of ML, for next week. As always, we love getting feedback on our work, so let us know what you think!
While AI alignment is a somewhat subtle and complex problem, two basic issues are that (1) ML systems often hack their reward functions, and (2) human supervision doesn’t necessarily solve this, because humans can’t easily understand the consequences of intervening on complex systems. Alex and Ruiqi’s papers help address each of these issues in turn.
Mapping and Mitigating Misaligned Models. Alex Pan and Kush Bhatia construct a wide variety of reinforcement learning environments where reward hacking is possible, and measure the extent to which it occurs. They do this by defining both a “proxy” and a “true” reward, and looking at what happens to the true reward as the agent optimizes the proxy reward. Two key insights are that (1) reward hacking tends to get worse as models become larger and more capable, and (2) it can appear suddenly or unexpectedly, as a phase transition, rather than degrading gradually.
A simple illustration of both is a traffic simulator, where the RL agent is trying to shape traffic flow to be more efficient. Small neural net models help cars merge efficiently onto the highway, but large models instead block cars from merging at all (which allows the cars already on the highway to move really fast and consequently achieves high proxy reward).
In this case, the proxy reward was actually the reward suggested by the designers of the traffic simulator, highlighting the difficulty of choosing good reward functions in practice.
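To make the setup concrete, here is a small, self-contained toy (not from the paper, and far simpler than its RL environments): the proxy mostly tracks the true reward, except for a narrow misspecified spike that only a sufficiently powerful search will find.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a reward-misspecification setting: the true reward peaks
# at x = 1, while the proxy also contains a narrow spurious spike at x = 2.5
# that the designer did not intend.
def true_reward(x):
    return -(x - 1.0) ** 2

def proxy_reward(x):
    return true_reward(x) + 3.0 * np.exp(-((x - 2.5) ** 2) / 0.01)

# A "more capable" agent searches over more candidate actions before
# committing to the proxy-optimal one.
for n_candidates in [2, 8, 32, 128, 512]:
    xs = rng.uniform(-3.0, 3.0, size=n_candidates)
    best = xs[np.argmax(proxy_reward(xs))]
    print(f"{n_candidates:4d} candidates | proxy {proxy_reward(best):6.2f} "
          f"| true {true_reward(best):6.2f}")
```

Weak searches stay near the true optimum, but once the search is wide enough to find the spike (here, typically somewhere between tens and hundreds of candidates, depending on the random seed), the proxy reward jumps while the true reward drops sharply, a crude analogue of the sudden reward hacking described above.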
Why you should care. Our results show that reward hacking is likely to become a bigger problem in the future (since it seems to get worse as models get larger). They also show that in some cases, reward hacking could appear suddenly or unexpectedly. This seems important to investigate, and we are hoping others will join us in continuing to understand when reward hacking occurs and how to prevent it.
Summarizing Differences Between Text Distributions. Ruiqi Zhong and Charlie Snell built a system that does the following: given two different distributions of natural language text, it generates a natural language description of what is different about the two distributions. It works by combining a proposer (which consumes a small number of examples and generates hypotheses) with a verifier (which re-ranks the hypotheses using a large set of examples). An example is shown below:
While this might sound like a simple task, many tasks can be reduced to it. Here are a couple examples we consider in the paper:
Some other applications are describing what inputs activate a neuron, how language on Twitter has changed over time, how teacher evaluations differ across genders, or what the differences are between an in-distribution and out-of-distribution dataset.
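To make the proposer/verifier structure concrete, here is a deliberately simplified, runnable sketch. Unlike the real system, the “proposer” here just nominates individual words that are over-represented in a small sample of one distribution, and the “verifier” re-ranks those word-level hypotheses on the full datasets; the actual system's proposer generates natural-language hypotheses from a handful of examples, and its verifier re-ranks them on a much larger set.

```python
from collections import Counter

def proposer(sample_a, sample_b, k=5):
    # Nominate words that appear more often in a small sample of A than of B.
    counts_a = Counter(w for text in sample_a for w in text.lower().split())
    counts_b = Counter(w for text in sample_b for w in text.lower().split())
    scored = {w: counts_a[w] - counts_b.get(w, 0) for w in counts_a}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def verifier(word, large_a, large_b):
    # Score a hypothesis by how well it separates the two large datasets:
    # fraction of A texts containing the word minus fraction of B texts.
    frac_a = sum(word in t.lower().split() for t in large_a) / len(large_a)
    frac_b = sum(word in t.lower().split() for t in large_b) / len(large_b)
    return frac_a - frac_b

def describe_difference(small_a, small_b, large_a, large_b):
    hypotheses = proposer(small_a, small_b)
    best = max(hypotheses, key=lambda w: verifier(w, large_a, large_b))
    return f'Texts from distribution A are more likely to contain the word "{best}".'

# Tiny usage example (here the "small" and "large" sets are the same).
reviews = ["the pasta was delicious", "friendly staff and great food"]
complaints = ["the package arrived two weeks late", "shipping was slow"]
print(describe_difference(reviews, complaints, reviews, complaints))
```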
Why you should care. One hope for AI is that it will help humans make better decisions than they could by themselves. One way to do this is by consuming complex data that humans could not easily process and then explaining it in a useful way. Our system does this—it would be time-consuming to manually look over two large datasets to understand how they differ, but the system can do it automatically. We hope future work will both improve this type of system (there is definitely still headroom!) and design ML systems that help humans understand other types of complex data as well.
We have one paper that gives the first empirical demonstration of an important failure mode (phase transitions for reward hacking), and another that could eventually amplify human capabilities by helping people understand complex data. Both pretty exciting! (At least in my biased opinion.)
If you liked these, check back next week for the other four papers!
I really liked the summarizing differences between distributions paper.
I think I'm excited for broadly the same reasons you are, but to state the case in my own words:
It seems like it would make sense and be valuable for labs interested in alignment to start adopting this kind of tool in practice (with the same kinds of benefits as early adoption of RLHF or early forms of amplification/debate). I think that's probably possible now or soon. Though academic research could also push them much further and it wouldn't be crazy to wait on that.
In some not-too-distant future it would be really cool if a fine-tuning API could also tell me something like "Here is a natural language hypothesis; predicted human judgments using this hypothesis explain X% of the learned classifier's performance." I expect this would already catch a large number of bugs or spurious correlations and add value.
For mostly-subhuman systems I think X% might be very close to 100%. I think it might already often be worthwhile to aim to replace the opaque classifier with a prediction of human judgments given a natural-language hypothesis (e.g. in any case where training data coverage isn't that great, humans have strong priors, or generalization errors can be very costly).
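One possible way to operationalize the "explains X% of the classifier's performance" number is sketched below; `classifier` and `hypothesis_predict` are hypothetical callables rather than an existing API, and other definitions are certainly possible.

```python
# `classifier` maps an input to a label; `hypothesis_predict` predicts the
# human judgment for an input given only the natural-language hypothesis
# (e.g. via a prompted model). Both are hypothetical stand-ins.

def fraction_explained(classifier, hypothesis_predict, inputs, labels, baseline_acc=0.5):
    n = len(inputs)
    clf_acc = sum(classifier(x) == y for x, y in zip(inputs, labels)) / n
    hyp_acc = sum(hypothesis_predict(x) == y for x, y in zip(inputs, labels)) / n
    # Share of the classifier's above-baseline accuracy that is recovered
    # by predictions made from the hypothesis alone.
    return (hyp_acc - baseline_acc) / (clf_acc - baseline_acc)
```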
One caveat is that "summarizing differences between distributions" seems a bit too strong as a claim about what you're doing. It seems like all you get is some hypothesis that does a good job of telling which of the two distributions a datapoint came from; it need not capture all or even most of the differences between the distributions.
I think that the motivation and approach have a lot of overlap with imitative generalization. But to clarify the relationship: my goal when writing about imitative generalization (or iterated amplification, or relaxed adversarial training, or etc.) is mostly to start thinking through the limits of those techniques and doing useful theoretical work as far in advance as possible, rather than to plant a flag or claim credit without actually doing the hard work of making things work in practice. I'm definitely happy if any of my writing causes people to work on these topics, but as far as I know that didn't happen in this case.