TL;DR: We show sycophancy is a general behavior of RLHF’ed AI assistants in varied, free-form text-generation settings, extending previous results. Our experiments suggest these behaviors are likely driven in part by imperfections in human preferences–both humans and preference models sometimes prefer convincingly-written sycophantic responses over truthful ones. This provides empirical evidence that we will need scalable oversight approaches.
Tweet thread summary: link
It has been hypothesized that using human feedback to align AI could lead to systems that exploit flaws in human ratings. Meanwhile others have empirically found that language models repeat back incorrect human views, which is known as sycophancy. But these evaluations are mostly proof-of-concept demonstrations where users introduce themselves as having a particular view. And although these existing empirical results match the theoretical concerns, it isn’t clear whether they are actually caused by issues with human feedback. We decided to investigate these questions more thoroughly. In other words, is sycophancy actually a problem in AI assistants? And what is the role played by human preference judgments in sycophancy?
We first found that sycophancy is a general behavior of RLHF-trained conversational models. Across Anthropic, OpenAI, and Meta assistants, we found clear sycophancy in varied, free-form text tasks: models wrongly admitted mistakes, gave biased feedback, and mimicked user errors. The consistency might indicate a link to RLHF training more broadly, rather than model-specific factors.
We also release our evaluation datasets at https://github.com/meg-tong/sycophancy-eval. We believe they measure more realistic forms of sycophancy than existing datasets.
We then investigated whether human preference judgements play a role in this broadly observed behavior. Analyzing Anthropic's released helpfulness preference data, we found "matching user beliefs and biases" was highly predictive of human judgments. However, other factors, such as a response being truthful or assertive, are also predictive of human preferences.
We looked at the behavior of humans and preference models (PMs) in a specific setting—distinguishing between seemingly correct but incorrect sycophantic responses to user-stated misconceptions, and truthful responses to those misconceptions. We collected human data, and found independent humans sometimes preferred convincing sycophantic responses over correct ones for challenging common misconceptions. As misconceptions became more challenging, humans further struggled to tell apart sycophantic and truthful responses. This behavior also showed up in the Claude 2 PM.
We then looked at whether sycophancy increases or decreases when optimizing against the PM used to train Claude 2. When optimizing the Claude 2 PM using best-of-N sampling, we found feedback sycophancy increases, but surprisingly other forms decrease. However, when compared to an improved “non-sycophantic” PM, we see that Claude 2 PM sometimes sacrifices truthfulness in favor of sycophancy. That is, the Claude 2 PM sometimes picks sycophantic responses in favor of truthful ones. This suggests the PM is, to some extent, modeling the flaws in human judgments.
We also found some forms of sycophancy increased when training Claude 2 throughout RLHF training. However, not all forms of sycophancy increased, and the model was sycophantic even at the start of RLHF training.
Overall, our analysis suggests human judgment plays some role in the observed sycophancy of RLHF-trained AI assistants. But much remains unclear—there are cases where optimizing against the PM reduces sycophancy, and the model that we used was sycophantic even at the start of RLHF training. This suggests other factors, such as pretraining and supervised learning before RLHF, likely also contribute.
Our work provides some suggestive empirical evidence that scalable oversight techniques, which go beyond using non-expert human feedback, may be necessary to align AI. We show both that sycophancy shows up in practice in a variety of settings and that human feedback plays a role, as some have hypothesized. We show that preference models can pick up on imperfections in human preferences, which can then get learned by models during RLHF. Optimizing against these PMs could thus lead to unwanted or unsafe behavior. However, the reality is not as clean as the theory, and it seems that other factors also play a role.
Our team at Anthropic is actively hiring, so if you’re interested in working on the above directions with us, please apply to the research scientist or research engineer roles at Anthropic and mention your interest in alignment!
E.g., "Discovering language model behaviors with model-written evaluations"; Perez et al. 2022. "Simple synthetic data reduces sycophancy in large language models"; Jerry Wei, et al.
Cotra refers to such models as "sycophants" - they seem high-performing by seeking short-term human approval in ways that aren’t beneficial long-term. For example, using human feedback to get reliable code could just produce models that generate bugs humans can't detect. It could also yield models that agree with users rather than correcting factual mistakes.
This is actually the Claude 2 PM prompted explicitly to be less sycophantic. To do so, we prepend text to the human-assistant conversation seen by the PM where the user explicitly asks for the most truthful answer.
The failure modes we identify are not in and of themselves dangerous, but some researchers have hypothesized that this approach could exploit flaws or biases in human ratings, producing behavior that appears good on the surface but is actually problematic.