I am grateful to Daniel Kokotajlo, Beth Barnes and John Schulman for feedback on this post.
The purpose of this post is to request research on the design of precise procedures for evaluating how factually accurate pieces of text are. This stands out to me as an area that is potentially valuable for reducing risks from advanced AI, while not requiring detailed knowledge of ML.
Suppose that you are given:
The problem is to define a procedure that takes in a context and a piece of text from the given distribution, and outputs a numeric score for factual accuracy.
The procedure should have the following properties:
Note that the problem statement doesn't mention ML models (except as examples). The problem is of course motivated by ML models, but I think it can be studied relatively independently of ML, at least initially.
The main motivation for this request is that the problem arises very naturally when attempting to train truthful language models (LMs). The most straightforward way to optimize the factual accuracy of a language model is to have humans evaluate the factual accuracy of model outputs, and then to optimize those evaluations using techniques like reinforcement learning. The procedure needs to be unambiguous because label noise hurts both ML training and labeler monitoring (not to mention other benefits). Subject to this constraint, the main criterion for the procedure should be that it produces good outcomes in the given real-world setting.
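To make the point about label noise concrete, here is a minimal sketch of how one might check whether a procedure is ambiguous in practice: have several labelers score the same pieces of text, then look at how much their scores disagree. Everything in it is an assumption for illustration rather than part of the request itself: the function name `summarize_labels`, the 0-to-1 score scale, and the disagreement threshold of 0.25 are all hypothetical.

```python
from statistics import mean, pstdev


def summarize_labels(scores_by_item, disagreement_threshold=0.25):
    """Summarize factual-accuracy scores given by several labelers.

    scores_by_item maps an item id to a list of scores in [0, 1], one per
    labeler. The scale and the threshold are illustrative assumptions.
    """
    summary = {}
    for item_id, scores in scores_by_item.items():
        # Population standard deviation as a crude measure of disagreement.
        spread = pstdev(scores)
        summary[item_id] = {
            "mean_score": mean(scores),
            "spread": spread,
            # High spread suggests the procedure left room for judgment calls.
            "ambiguous": spread > disagreement_threshold,
        }
    return summary


# Hypothetical labels: three labelers scored each answer.
labels = {
    "answer_a": [1.0, 1.0, 1.0],  # labelers agree
    "answer_b": [0.0, 1.0, 0.5],  # labelers disagree
}
result = summarize_labels(labels)
for item_id, stats in result.items():
    print(item_id, stats)
```

Items flagged as ambiguous are the ones where the written procedure was not enough to settle the score, which is where both the training signal and any monitoring of individual labelers become unreliable.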
My main reasons for thinking that this research could be important for reducing risks from advanced AI are:
I do think that this research is a gamble, in the sense that the details of evaluating factual accuracy may not end up mattering very much, perhaps because a wide range of procedures are good enough to avoid the very worst outcomes, and the important bottlenecks are elsewhere. That being said, I think we'll be in a better position to evaluate those arguments once we've given the research more of a try.
In some sense, the research can be thought of as a very special case of trying to specify more precisely what humans value. However, compared to more general research on that question, I think this more specific research has a number of advantages:
Here are two important examples of existing research that begin to tackle this problem from different ends of a spectrum:
I think that it could be productive to push harder from either end of this spectrum. On the theoretical side:
On the practical side:
An instructive exercise is to browse some of WebGPT's answers, and to consider how one might evaluate their factual accuracy without the given sources (but potentially collecting new sources as part of the procedure). Even for factual topics, there can often be vague, subjective or holistic claims, which can be very hard to evaluate without either relevant expertise or a direct confirmation/refutation from a reliable source.
Another very relevant line of existing research is the exploration of debate using human judges and debaters, with a view to having AI systems play the roles of the debaters. Current AI systems are not yet capable enough for these schemes to be practical, but it is good to be thinking ahead, and there could also be shorter-term takeaways.
There is probably a lot more research in philosophy and the social sciences that is also relevant. Wikipedia's verifiability policy seems closely related, and is well-studied. There is even an entire field of applied epistemology. However, most of this research has not yet been made accessible to ML researchers working on factual accuracy. There could therefore be some low-hanging fruit in digesting some of this work appropriately.
A closely related question that often comes up is "who to align to": specifically, if there is some ambiguity in the procedure, who should be asked to make those judgment calls? For example, people of different political persuasions will often evaluate politically-sensitive statements differently.
I expect this question to eventually become an important part of the problem, but I'd be inclined to begin by focusing on procedure design, for a few reasons:
That being said, I'd still be excited to see work on this part of the problem, since it's a thorny issue that seems closely tied to risks from AI persuasion.
It's tempting to be pessimistic that we'll be able to design procedures that people of different political persuasions can have trust in, because of the current state of political discourse. But I think it might be feasible to design procedures that are much more broadly trusted than current institutions, for a couple of reasons:
There are of course a number of obstacles in getting from procedures that are broadly trusted to working AI systems that are trusted to follow those procedures, but I think they are surmountable with enough effort. And even if it turns out to be impossible to design procedures that are universally trusted, there could still be significant benefits from improving procedures on the margin.
It's hard to convey exactly what kind of research I'd find most compelling in this area, and I'd be happy to chat to people who are considering working on this topic. Feel free to reach out to me at email@example.com.
Thanks for writing this, I'm excited to see more work on this subject!
One minor musing: I think the problem is a bit more dire than the framing "who to align to" suggests. Humans are biased, including us, including me. A system which replicates those biases and tells us/me what we would have concluded if we investigated in our usual biased way... is "aligned" in some sense, but in a very important sense is unaligned.* To use Ajeya's metaphor, it's a sycophant, not a saint. Rather than assisting us to find the truth, it'll assist us in becoming more unreasonably overconfident and self-assured in the ideology we already endorsed.
One reason I'm excited about research in this area is that hopefully we'll be able to collect data from a wide range of different political perspectives and diverse kinds of people, so that we can make political affiliation one of the variables the user can choose -- that way users can see how the bot's answers differ depending on which bias it has. I expect this to be pretty helpful in a variety of ways.
*A provocative way of putting it that I nevertheless tentatively endorse: It's aligned to your current ideology, not to you.