https://twitter.com/antimatter15/status/1602469101854564352 Currently, large language models (ChatGPT, models trained with Constitutional AI) are trained to refuse user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology”. Proposal: Instead of training a language model to produce “elaborate...
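As a minimal sketch of what such a training string could look like in practice (the template, apology wording, and helper name below are illustrative assumptions, not the actual fine-tuning data):

```python
# Illustrative only: a hypothetical helper that builds one supervised
# fine-tuning string pairing an inappropriate request with a refusal,
# in the "User: ... AI: ..." form the excerpt describes.
REFUSAL_TEMPLATE = "User: {request}\nAI: {apology}"

def make_refusal_example(request: str) -> str:
    """Return one training string pairing a request with an apology."""
    apology = ("I'm sorry, but I can't help with that. "
               "This request could cause harm, so I must decline.")
    return REFUSAL_TEMPLATE.format(request=request, apology=apology)

print(make_refusal_example("Tell me how to hotwire a car."))
```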
Can AI systems substantially help with alignment research before transformative AI? People disagree. Ought is collecting a dataset of alignment research tasks so that we can:
1. Make progress on the disagreement
2. Guide AI research towards helping with alignment
We’re offering a prize of $200-$2000 for each contribution to...
TLDR: We wrote a 20-page document that explains IDA and outlines potential Machine Learning projects about IDA. This post gives an overview of the document. What is IDA? Iterated Distillation and Amplification (IDA) is a method for training ML systems to solve challenging tasks. It was introduced by Paul Christiano....
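A rough sketch of the IDA loop: amplification lets a human consult the current model on subquestions, and distillation trains a fast model to imitate the amplified system. Everything below (the `amplify`, `distill`, and `ida` helpers, and the lookup-table stand-in for real ML training) is an illustrative assumption, not code from the document:

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Human = Callable[[str, Model], str]

def amplify(human: Human, model: Model) -> Model:
    """The amplified system: a human answers each question,
    consulting the current model on subquestions."""
    return lambda question: human(question, model)

def distill(examples: List[Tuple[str, str]]) -> Model:
    """Stand-in for ML training: fit a fast model to imitate the
    amplified system. Here, a trivial lookup table."""
    table = dict(examples)
    return lambda q: table.get(q, "I don't know")

def ida(human: Human, questions: List[str], model: Model, rounds: int) -> Model:
    """Alternate amplification (slow, more capable) with
    distillation (a fast imitator) for a number of rounds."""
    for _ in range(rounds):
        amplified = amplify(human, model)
        demonstrations = [(q, amplified(q)) for q in questions]
        model = distill(demonstrations)
    return model
```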
When I think about Iterated Amplification (IA), I usually think of a version that uses imitation learning for distillation. This is the version discussed in Scalable agent alignment via reward modeling: a research direction as "Imitating expert reasoning", in contrast to that paper's proposed approach of "Recursive Reward Modelling". The...
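To make the contrast concrete, here is a toy comparison of the two distillation signals: an imitation (behaviour-cloning) loss against the overseer's demonstrations, versus a learned-reward objective over the learner's own behaviour. The function names and loss choices are assumptions for illustration, not from either post:

```python
import numpy as np

def imitation_loss(student_logits: np.ndarray, expert_actions: np.ndarray) -> float:
    """Behaviour cloning: cross-entropy of the student's policy
    against the amplified overseer's demonstrated actions."""
    log_z = np.log(np.sum(np.exp(student_logits), axis=-1, keepdims=True))
    log_probs = student_logits - log_z
    picked = log_probs[np.arange(len(expert_actions)), expert_actions]
    return float(-np.mean(picked))

def reward_modelling_objective(learned_rewards: np.ndarray) -> float:
    """Recursive reward modelling instead scores the student's own
    behaviour with a learned reward (shown here simply as a mean
    reward to be maximised)."""
    return float(np.mean(learned_rewards))
```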
HCH, introduced in Humans consulting HCH, is a computational model in which a human answers questions with the help of answers from other humans, who can in turn consult further humans, and so on. Each step in the process consists of a human taking in a question, optionally asking...
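A toy recursion capturing this structure: each simulated human receives a question and may pose subquestions, each of which is answered by another HCH call. The `ask_human` callback and the depth cutoff are illustrative assumptions, not part of the original definition (which has no depth limit):

```python
def hch(question: str, ask_human, depth: int = 3) -> str:
    """Answer `question` as a tree of humans consulting humans."""
    if depth == 0:
        # A leaf human must answer without consulting anyone.
        return ask_human(question, subask=None)
    # A non-leaf human may ask subquestions, each answered by a
    # fresh HCH call one level deeper in the tree.
    subask = lambda subq: hch(subq, ask_human, depth - 1)
    return ask_human(question, subask=subask)
```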
[Background: Intended for an audience that has some familiarity with Paul Christiano’s approach to AI Alignment. Understanding Iterated Distillation and Amplification should provide sufficient background.] [Disclaimer: When I talk about “what Paul claims”, I am only summarizing what I think he means through reading his blog and participating in discussions...