This post is heavily informed by prior work, most notably that of Owain Evans, Owen Cotton-Barratt and others (Truthful AI), Beth Barnes (Risks from AI persuasion), Paul Christiano (unpublished) and Dario Amodei (unpublished), but was written by me and is not necessarily endorsed by those people. I am also very grateful to Paul Christiano, Leo Gao, Beth Barnes, William Saunders, Owain Evans, Owen Cotton-Barratt, Holly Mandel and Daniel Ziegler for invaluable feedback.
In this post I propose to work on building competitive, truthful language models or truthful LMs for short. These are AI systems that are:
- Useful for a wide range of language-based tasks
- Competitive with the best contemporaneous systems at those tasks
- Truthful in the sense of rarely stating negligent falsehoods in deployment
Such systems will likely be fine-tuned from large language models such as GPT-3, hence the name.
WebGPT is an early attempt in this direction. The purpose of this post is to explain some of the motivation for building WebGPT, and to seek feedback on this direction.
Truthful LMs are intended as a warm-up for aligned AGI. This term is used in a specific way in this post to refer to an empirical ML research direction with the following properties:
- Practical. The goal of the direction is plausibly achievable over the timescale of a few years.
- Valuable. The direction naturally leads to research projects that look helpful for AGI alignment.
- Mirrors aligned AGI. The goal is structurally similar to aligned AGI on a wide variety of axes.
The remainder of the post discusses:
- The motivation for warm-ups (more)
- Why truthful LMs serve as a good warm-up (more)
- The motivation for focusing on negligent falsehoods specifically (more)
- A medium-term vision for truthful LMs (more)
- How working on truthful LMs compares to similar alternatives (more)
- Common objections to working on truthful LMs (more)
Warm-ups for aligned AGI
There are currently a number of different empirical ML research projects aimed at helping with AGI alignment. A common strategy for selecting such projects is to select a research goal that naturally leads to helpful progress, such as summarizing books or rarely describing injury in fiction. Often, work on the project is output-driven, taking a no-holds-barred approach to achieving the selected goal, which has a number of advantages that aren't discussed here. On the other hand, goal selection is usually method-driven, tailored to test a particular method, such as recursive decomposition or adversarial training.
The idea of a warm-up for aligned AGI, as defined above, is to take the output-driven approach one step further. Instead of selecting projects individually, we attempt to choose a more ambitious research goal that naturally leads to helpful projects. Because it is harder to predict the course of research over multiple projects, we also try to make the goal structurally similar to aligned AGI, to make it more likely that unforeseen and auxiliary projects will also be valuable.
Whether this output-driven approach to project selection is preferable to the method-driven approach depends on more specific details that will be discussed later. But it is worth discussing first the advantages of each approach in broad strokes:
- Momentum versus focus. The output-driven approach involves having a consistent high-level goal, which allows different projects to more directly build upon and learn from one another. On the other hand, the method-driven approach involves more frequent re-evaluation of goals, posing less of a risk of being distracted from the even higher-level goal of aligned AGI.
- Testing assumptions versus testing methods. The output-driven approach makes it easier to evaluate long-term progress and hold projects to account, making it better for testing underlying assumptions and discovering new methods. On the other hand, the method-driven approach offers the most direct feedback on method design, making it better for improving methods incrementally.
- Practical progress versus theoretical progress. The output-driven approach involves a more realistic goal that mirrors aligned AGI, which is more likely to generate practical progress such as infrastructure and know-how. On the other hand, the method-driven approach is more likely to answer questions that directly inform theoretical work.
- Broader benefits versus replaceability. The output-driven approach more naturally gives rise to a wide variety of valuable projects such as policy and deployment work, is more likely to motivate related work by others, and is more likely to have direct benefits to society. On the other hand, the method-driven approach is less likely to select projects that would have happened anyway, and is less likely to be pulled in unwanted directions.
Truthful LMs as a good warm-up
In this section I will argue that truthful LMs serve as a particularly good warm-up for aligned AGI, in the sense defined above.
To begin with, truthful LMs are structurally similar to aligned AGI on a wide variety of axes:
- Alignment focus. Negligent falsehoods are a central example of an alignment failure. The reasons for focusing on negligent falsehoods specifically are discussed below.
- General-purpose domain. Language models appear to be the closest existing AI systems to AGI in terms of the breadth of their capabilities and the sophistication of their real-world understanding.
- Competitiveness requirement. The competitiveness condition is included in the definition of truthful LMs in order to mirror the need for aligned AGI to be competitive.
- Mitigation of risks to society. Untruthful LMs pose various risks, as discussed in Truthful AI and Risks from AI persuasion. Of course, the risks posed by unaligned AGI are more serious.
- Importance of rare failures. Truthful LMs are required to rarely state negligent falsehoods. The requirements for aligned AGI are similar but even stricter, in the sense that it is important to avoid even a single sufficiently bad alignment failure.
- Decomposition into outer alignment and distributional robustness. For both truthful LMs and aligned AGI, it may be helpful to decompose the problem into outer alignment (constructing an objective such that the model almost never fails on the training distribution) and distributional robustness (ensuring that the behavior of the model does not degrade too much when moving from training to deployment).
- Failure of naive human supervision. Naive objectives such as imitating humans and optimizing human approval are generally considered insufficient for competitive and aligned AGI. In the short term, optimizing human judgments will probably go pretty far towards making LMs more truthful, but there are signs of this objective breaking down with WebGPT, and more sophisticated techniques are starting to look attractive in practice.
- Broader challenges. Actually achieving good outcomes from aligned AGI likely involves a number of policy and deployment challenges that aren't automatically addressed by technical solutions. Similar challenges may be involved in achieving good short-term outcomes from truthful LMs. For example, it is unclear the exact criteria that should be used to judge LM behavior, such as how they should respond to controversial questions, how truthfulness should be balanced against other criteria such as helpfulness, and so on. Some suggestions for these criteria are made in Truthful AI, but these are not yet precise enough to be turned into training objectives.
Because of these similarities, working on truthful LMs offers numerous benefits. Perhaps most importantly, it naturally leads in several directions that are also attractive from a method-driven perspective:
- Methods for learning from human feedback. Something like reinforcement learning from human feedback (RLHF) seems necessary for getting ML systems to follow objectives of our choosing, and our current methodology for this (PPO against a static reward model etc.) can likely be improved upon.
- Going beyond naive human supervision. Working on truthful LMs will likely involve improving human supervision using AI assistance at some point. For example, this could involve using model-generated critiques, or more advanced techniques such as debate.
- Robustness. LMs could be made robustly truthful in several related senses: having a low failure rate on-distribution, being robust to distributional shift, and being adversarially robust. Achieving this will likely require techniques such as adversarial training.
- Empirical evidence about alignment difficulty. Having truthful LMs makes it easier to study questions like how well honesty generalizes in the context of capable models that have been trained using different objectives. This helps inform us about the scale at which different training objectives are likely to break down and lead to misalignment.
In addition, there are a number of broader benefits to working on truthful LMs:
- Infrastructure and know-how. For all of the above directions, it would be valuable to develop not only algorithmic advances, but also the capability of individuals and organizations to employ the required methods effectively.
- Policy and deployment work. I haven't thought as carefully about these areas, but given all of the above similarities, it seems possible that building and safely deploying truthful LMs would require work with long-term value in these and other areas, even if only via developing individual and organizational expertise.
- Direct benefits to society. Deploying truthful LMs could broadly improve how society functions, and pushing for this relatively early could have compounding effects via norm-setting. These benefits are discussed in much greater depth in Truthful AI and Risks from AI persuasion. I still consider these benefits to be somewhat speculative given the complexities of how society functions.
Overall, working on truthful LMs seems practical, valuable, and mirrors aligned AGI in enough ways to make it seem highly promising as an empirical ML research direction.
Why focus on negligent falsehoods?
Most of the arguments in favor of working on truthful LMs apply equally well to working on aligning language models in general. However, the definition of truthful LMs specifically singles out negligent falsehoods: statements that are unacceptably likely to be false, and where it should have been feasible for an AI system to understand this. This is done for several reasons:
- Alignment focus. Negligent falsehoods are a clear example of an alignment failure: we would like the model to avoid the falsehood, and the model should be capable of doing so, but it fails nonetheless.
- Lack of ambiguity. Compared to other kinds of alignment failure such as causing negligent harm or being unhelpful, negligent falsehoods are more unambiguous. Falsehoods in general may be even less ambiguous, but are more likely to be the result of a capability failure rather than an alignment failure.
- Policy benefits. As discussed in Truthful AI, truthfulness has particular benefits for society, and avoiding negligent falsehoods is a natural bright line around which beneficial standards could be developed.
- Method-driven motivations. Compared to other criteria, evaluating truthfulness involves significant complexity, making it more compelling as a target that requires going beyond naive human supervision. Negligent falsehoods are also compelling as something to avoid with a high degree of robustness (aiming for a very low failure rate, adversarial robustness, etc.), whereas falsehoods in general do not have this property (since models can be tripped up by probing for the limits of their knowledge).
- Connection to aligned AGI. Compared to other criteria, truthfulness relates most clearly to long-term concerns about AGI deceiving humans. There is even an argument, advanced in Eliciting Latent Knowledge, that eliminating negligent falsehoods is in some sense sufficient for aligning AGI in general, through a process of "indirect normativity".
The most obvious drawback of focusing on negligent falsehoods is that they are more ambiguous than falsehoods in general. In practice, I think it will be fine to focus on falsehoods that are plausibly negligent: it will be OK if some effort goes into capabilities that improve truthfulness, as long as they do not become the main focus. Such capabilities may also enable new alignment strategies: for example, the use of retrieval in WebGPT opened the door to improved evaluation of factual accuracy via the use of references. For the purposes of evaluating progress, it will be fine to make reasonable judgments about how likely a falsehood is to be negligent.
Medium-term vision for truthful LMs
Truthful LMs are a target that could be pursued with various different mindsets. At one extreme, one could take a very method-driven approach to selecting projects, and simply incorporate a preference for goals that can be framed in terms of truthfulness. At the other extreme, one could mostly try to make language models more useful, but try to adhere to relatively high standards of truthfulness along the way. Where to land on this spectrum depends on how one trades off the advantages of output-driven and method-driven approaches, as discussed above.
My tentative inclination is towards a middle ground, remaining slightly method-driven while having a clear medium-term vision for the next few years. In this spirit, here is a first attempt at such a vision.
The system is a truthful pure-text assistant:
- Users interact with it via a sequence of text messages.
- It attempts to perform any reasonable task that can be performed by interacting with text-based interfaces, as long as these interactions aren't required to have real-world side effects (with exceptions for things that could be harmful, etc.).
- It is competitive with contemporaneous systems. Over the next few years, I expect it to become feasible for AI to perform a large fraction of text-based tasks that do not require specific expertise, concentration for longer than a few seconds at a time, or more than around 10 minutes in total.
- It expresses uncertainty when making claims that might not be true (or can at least be configured that way), but not to an unreasonable degree (it does not typically hedge claims that most people would find it unreasonable to).
- For claims that are not hedged, it has a high degree of truthfulness: in ordinary usage, 99.9% of claims are not negligent falsehoods, and 90% of AI researchers without any special knowledge of the system cannot get it to state a negligent falsehood with 10 back-and-forth messages. (These numbers are indicative and may need revising.)
I think that achieving such a system would be a lot of work, but would not require any fundamental insights, and could be achieved with pre-trained models of the future using the methods of WebGPT together with some form of debate and adversarial training.
Comparison with related proposals
There are a number of similar approaches that have recently been proposed or are currently being pursued. I am generally a fan of these approaches, but it is worth discussing how they compare.
Some alternative proposals also focus on improving the behavior of contemporary models, but are more method-driven:
- Aligning narrowly superhuman models. This is also a proposal to align contemporary models, but it is less opinionated about the specific task, and emphasizes "sandwiching" projects that are specifically targeted at testing proposals for going beyond naive human supervision.
- Redwood Research's current project. This is a project to train a language model to continue fiction without describing injury, designed to test methods for achieving very low failure rates and adversarial robustness.
As discussed above, there are trade-offs between being method-driven and being output-driven when selecting projects. Overall, it seems plausible to me that method-driven projects are currently the most valuable empirical ML projects on the margin, since they are the most carefully targeted. On the other hand, being output-driven is a longer-term play, and may be able to make better use of people who thrive on practical problems in particular. Hence I would argue in favor of a portfolio approach.
Aligning language models in general
Another category of proposals is very similar to working on truthful LMs, but focused on a more general notion of alignment than truthfulness:
- Helpful, honest and harmless (HHH) models. This is a proposal (as part of a larger piece of work) to train large language models to be aligned according to these HHH criteria.
- Instruction-following models. This is a project to fine-tune models like GPT-3 to follow the user's intent.
I do think it makes sense to incorporate criteria other than truthfulness when aligning language models, and so these projects may end up being very similar to working on truthful LMs in practice. However, I would argue in favor of such projects placing particular emphasis on negligent falsehoods, for the reasons discussed above.
Working on truthful LMs has a number of possible objections in common with Aligning narrowly superhuman models. In addition to these, there are some more specific objections.
Lack of focus
One concern with working on truthful LMs is that it will be insufficiently focused on the core parts of alignment, as a result of being too output-driven. I think this concern is pretty reasonable, and can largely be mitigated by not being completely output-driven, but instead retaining some of the method-driven mindset, as discussed above.
It is a difficult question to determine exactly where to fall on this spectrum. I think that there are a couple of potential cruxes that lead people to have different intuitions on this question:
- Threat models for misaligned AGI. There is disagreement over the amount of weight that should be put on specific threat models of misaligned AGI, most notably risks from power-seeking misalignment. The more weight one puts on a specific threat model, the more one may be inclined to probe specific methods intended to address that threat. My personal sense is that power-seeking misalignment is currently the best-articulated specific threat model, but that it probably fails to capture a large portion of the overall risk associated with the transition to an AI-based economy. While I think that focusing on the clearest risk probably has the best bang-for-buck initially, I would also argue that using a good warm-up for aligned AGI is more likely to help with risks that are currently less clearly articulated, in addition to helping address this particular threat model.
- Alignment difficulty. There is also disagreement about how sophisticated the methods that we will need to align AGI will be. The more likely it is that simple methods will work out, the less valuable it is to make theoretical progress towards more sophisticated methods, compared to making practical progress that increases the chance that simple methods will be employed successfully. Moreover, if simple methods might solve at least some of the problem, then it is more valuable to put these methods to the test, and to isolate the parts of the problem that they fail to solve. I have a lot of uncertainty about alignment difficulty, which makes both practical and theoretical progress attractive to me.
Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.
I think this concern is worth taking seriously, but that the case for it is weak:
- As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
- I am in favor of carefully and conservatively evaluating the risks of unintended model behavior before conducting research, and putting in place appropriate monitoring. But in the short term, this seems like an advantage of the research direction rather than a disadvantage, since it helps surface risks while the stakes are still low, build institutional capacity for evaluating and taking into account these risks, and set good precedents.
- In case this does turn out to be more of a concern upon reflection, there are other approaches to truthful AI that involve less agentic interaction with the external world than continuing in the style of WebGPT.
There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval. I don't currently find this argument persuasive, but would be interested to hear if there is a more persuasive version of it. That said, one bright line that stands out is training models to perform tasks that actually require real-world side effects, and I think it makes sense to think carefully before crossing that line.
Similarity to capabilities research
The output-driven approach has its advantages, but also makes the research more similar to capabilities research, which exacerbates some other potential concerns. In each case, I think that the response given in Aligning narrowly superhuman models remains valid, but is worth commenting on:
- Replaceability. It is more likely that similar work will be done anyway. I think this is a valid concern, but that there are enough distinguishing features of the research direction that this isn't a big problem. For example, I would not have expected a purely capabilities-oriented version of WebGPT to have focused nearly as much on human feedback. The focus on negligent falsehoods in particular is unlikely to be picked up outside of the alignment community in the near future.
- Encouraging scaling AI. The fact that making LMs more truthful is economically valuable will make the research more likely to cause harm by increasing investment in scaling AI. I think that the argument that this is not a major concern still holds, but that it is worth being especially cautious and responsible around publication and deployment decisions when following this research direction.
I think that working on truthful LMs has a comparative advantage in worlds where:
- We have around 10-40 years until transformative AI
- Transformative AI is built using techniques that resemble modern deep learning
- There is a slow takeoff
- Alignment does not require vastly more theoretical insight (but may require some)
- Our current picture of the risks posed by transformative AI is incomplete
These all seem like plausible assumptions to me, which probably goes some way towards explaining why I find truthful LMs compelling. I'm of course also keen on other work that is more valuable under different assumptions.
On the whole, working on truthful LMs seems highly promising to me as part of a portfolio of approaches aimed at AGI alignment, especially for people who are drawn to practical agendas.
Request for feedback
By default, this is the research direction I'll continue to pursue at OpenAI. It's therefore very valuable for me to know if it's horribly mistaken, or even if it's just clearly less valuable than alternative directions on the margin. Equally, if you're very excited by this research direction, then we should coordinate. In addition to leaving comments, please feel free to reach out to me at email@example.com if your feedback would be more convenient to give privately or via a different medium.