By Ajeya Cotra
Training powerful models to maximize simple metrics (such as quarterly profits) could be risky. Sufficiently intelligent models could discover strategies for maximizing these metrics in perverse and unintended ways. For example, the easiest way to maximize profits may turn out to involve stealing money, manipulating whoever keeps records into reporting unattainably high profits, capturing regulators of the industry to be allowed to ship shoddy products or avoid taxes, etc. More generally, the most effective path to maximizing a simple metric may involve acquiring enough power to tamper directly with whatever instruments or sensors are used to evaluate the metric, effectively deceiving and disempowering humans to do so.
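To make this failure mode concrete, here is a deliberately artificial toy sketch in Python; the strategies and numbers are invented purely for illustration. An optimizer that sees only the measured metric will select whatever strategy inflates the measurement, even when that strategy destroys real value.

```python
# Toy illustration (invented numbers): an optimizer that maximizes a
# *measured* metric can prefer strategies that corrupt the measurement
# itself, even when their true value is strongly negative.
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    true_profit: float      # value actually created
    measured_profit: float  # what the metric/sensor reports

strategies = [
    Strategy("run the business honestly", true_profit=10.0, measured_profit=10.0),
    Strategy("steal customer funds",      true_profit=-50.0, measured_profit=30.0),
    Strategy("tamper with the ledger",    true_profit=-5.0, measured_profit=1e6),
]

# A naive optimizer only ever observes measured_profit...
best = max(strategies, key=lambda s: s.measured_profit)
print(f"chosen: {best.name} (measured={best.measured_profit}, true={best.true_profit})")
# ...so it picks ledger tampering: the metric is maximized while real value is lost.
```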
It seems significantly safer if powerful models could instead be trained with something like human feedback, where human evaluators inspect a model’s action and rate how good that action is likely to be all things considered, and the model is trained to take actions that humans would rate highly. Human feedback could potentially disincentivize some obviously perverse strategies like “blatantly stealing money to maximize profits,” and incentivize practices that help maintain or improve human control, like “explaining why a proposed action will be beneficial.”
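As a concrete reference point, one standard way this kind of training is implemented is to fit a reward model to pairwise human preferences and then optimize the policy against that model in place of fresh human ratings. The sketch below shows the preference-fitting step in PyTorch; the architecture, feature dimension, and random stand-in data are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an (observation, action) feature vector to a scalar rating."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(rm, preferred, rejected):
    """Bradley-Terry loss: push the score of the action the human preferred
    above the score of the action they rejected."""
    return -torch.log(torch.sigmoid(rm(preferred) - rm(rejected))).mean()

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
# Random tensors stand in for featurized pairs of human-compared actions.
preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
for _ in range(100):
    opt.zero_grad()
    preference_loss(rm, preferred, rejected).backward()
    opt.step()
# A policy can then be trained (e.g. with RL) to produce actions the reward
# model scores highly, standing in for direct human ratings at scale.
```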
However, human feedback isn’t fully adequate for supervising powerful models, especially once they take actions that are too complex for humans to understand. For example, even if blatant forms of theft are disincentivized, a sufficiently intelligent model trained with human feedback may still find subtler equivalents, e.g. entering into abstruse and complicated financial contracts that effectively constitute theft. On the other hand, if human evaluators simply penalize any action they don’t understand, the models they train will be significantly less valuable than they could have been, and may be outcompeted by models trained directly on outcome metrics like profit.
We are interested in ways to enhance or improve upon human feedback, so that humans can provide adequate feedback even in domains where models are more capable or knowledgeable than they are, without resorting to training on outcome metrics. Projects in this space could explore a wide range of questions along these lines.
We are seeking proposals for projects that aim to develop and test strategies for improving on the performance achievable with naive human feedback in settings like these. A number of strategies could be explored; one illustrative possibility, using a second model to assist human raters, is sketched below.
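This is a schematic sketch only: `generate` stands in for any text model (e.g. a fine-tuned language model) and is not a real API, and `ask_human` stands in for a real human-rating interface.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a language-model call."""
    return "(critique produced by a critic model would appear here)"

def assisted_rating(question: str, answer: str, ask_human) -> float:
    # 1. A critic model surfaces potential flaws the rater might miss.
    critique = generate(
        f"Question: {question}\nProposed answer: {answer}\n"
        "List the most serious flaws or misleading claims in this answer:"
    )
    # 2. The human rates the answer with the critique in view, rather than
    #    having to spot subtle errors in an unfamiliar domain unaided.
    return ask_human(
        f"Question: {question}\nAnswer: {answer}\nModel critique: {critique}\n"
        "On a scale of 1-10, how good is this answer, all things considered?"
    )

# Example wiring, with a fixed score standing in for a real human rater:
score = assisted_rating("Is this contract fair?", "Yes, because ...", lambda prompt: 7.0)
```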
We are especially interested in proposals that:
- focus on fine-tuning GPT-3 or another large generative language model;
- aim to use non-expert human raters to effectively evaluate a task that involves some subject-matter expertise (e.g. legal advice, medical advice, programming, or physics questions); and
- systematically study how different techniques for enhancing human feedback may change the way performance scales as a function of the amount of human labor put into providing feedback (a schematic version of such a study is sketched below).
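To make the third kind of proposal concrete, a scaling study might look schematically like the following. Every function here is a hypothetical placeholder, and the dummy learning curve is not a real result; the point is only the shape of the experiment: fix a task, vary the human-labeling budget, and compare how evaluation performance grows under different feedback protocols.

```python
def collect_labels(protocol: str, budget: int) -> list[float]:
    """Placeholder: gather `budget` human judgments under `protocol`
    (e.g. naive rating vs. critique-assisted rating)."""
    return [0.0] * budget

def train_and_evaluate(labels: list[float]) -> float:
    """Placeholder: train a reward model on `labels` and return its agreement
    with held-out expert judgments (dummy saturating curve here)."""
    n = len(labels)
    return 0.5 + 0.4 * (1.0 - 1.0 / (1.0 + n / 1000.0))

budgets = [100, 300, 1_000, 3_000, 10_000]
for protocol in ("naive rating", "critique-assisted rating"):
    curve = [train_and_evaluate(collect_labels(protocol, b)) for b in budgets]
    print(f"{protocol}: {[round(x, 3) for x in curve]}")
# The comparison of interest is whether an enhanced-feedback protocol shifts
# the whole curve up, or changes its slope as human labor increases.
```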