We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics (see Figure 1). We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.
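To make the evaluation setup concrete, here is a minimal sketch of how a model could be scored against a benchmark of this kind. The file name, record format, and the caller-supplied `model` and `judge` callables are hypothetical placeholders for illustration, not part of the benchmark itself.

```python
import json
from typing import Callable

def evaluate_truthfulness(
    model: Callable[[str], str],
    judge: Callable[[str, dict], bool],
    questions_path: str = "questions.json",
) -> float:
    """Illustrative only: score a model's answers on a truthfulness benchmark.

    `model` maps a question string to a generated answer; `judge` decides
    whether that answer counts as truthful given the question record
    (e.g. by comparing against reference true/false answers).
    """
    with open(questions_path) as f:
        questions = json.load(f)  # e.g. [{"category": "health", "question": "..."}, ...]

    truthful = sum(judge(model(item["question"]), item) for item in questions)
    return truthful / len(questions)  # fraction of answers judged truthful
```

A per-category breakdown of the same score could be obtained by grouping the records by their category field before averaging.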
Here’s a description of the project Redwood Research is working on at the moment. First I’ll say roughly what we’re doing, and then I’ll try to explain why I think this is a reasonable applied alignment project, and then I’ll talk a bit about the takeaways I’ve had from the project so far.
There are a bunch of parts of this that we’re unsure of and figuring out as we go; I’ll try to highlight our most important confusions as they come up. I’ve mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results.
Thanks to everyone who’s contributed to the project so far: the full-time Redwood technical team of...
Within those projects, I'm aiming to work on subprojects that are:
The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.
The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be treated as dynamic and likely to be re-edited, potentially often. Suggestions and modifications to the design are valuable and may be incorporated into the top post.
[Thanks to Richard Ngo, Damon Binder, Summer Yue, Nate Thomas, Ajeya Cotra, Alex Turner, and other Redwood Research people for helpful comments; thanks Ruby Bloom for formatting this for the Alignment Forum for me.]
I'm going to draw a picture, piece by piece. I want to talk about the capability of some different AI systems.
You can see here that we've drawn the capability of the system we want to be competitive with, which I’ll call the unaligned benchmark. The unaligned benchmark is what you get if you train a system on the task that will cause the system to be most generally capable. And you have no idea how it's thinking about things, and you can only point this system at some goals and not others.
I think that...
Special thanks to Abram Demski, Paul Christiano, and Kate Woolverton for talking with me about some of the ideas that turned into this post.
The goal of this post is to present a new prosaic (i.e., using current ML techniques) AI safety proposal, based on AI safety via debate, that I've been thinking about recently. I'll start by describing a simple version of the proposal and then show some of the motivation behind it, as well as how the simple version can be expanded upon.
Let $M$ and $Adv$ be models and $H$ be a human. Intuitively, we'll train $M$ and $Adv$ via the following procedure, given a question $Q$:
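Since the procedure itself is only introduced above, here is a minimal, hypothetical skeleton of the setup it names: two trainable models and a human interacting over a question. The class and function names (`Model`, `training_step`) and the way the outputs are combined are placeholders I'm introducing for illustration; the actual training signal is whatever the procedure below specifies.

```python
from typing import Callable

# Placeholders for the two trainable models and the human referenced above.
# How they interact and what reward each model receives is exactly what the
# procedure in this post defines; nothing here is part of the proposal itself.

class Model:
    """Stand-in for a trainable model, e.g. a fine-tuned language model."""
    def respond(self, prompt: str) -> str:
        raise NotImplementedError

def training_step(M: Model, Adv: Model, H: Callable[[str], str], Q: str) -> str:
    """One hypothetical round on question Q: both models produce text and the
    human H renders a judgment, which would then drive updates to M and Adv."""
    m_output = M.respond(Q)
    adv_output = Adv.respond(Q)
    return H(f"Question: {Q}\nM says: {m_output}\nAdv says: {adv_output}")
```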