This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Where not indicated, ideas are summaries of the documents mentioned, not original contributions. I am thankful for the encouraging approach of the organizers of the program, especially Oliver Zhang.
In the post “The case for aligning narrowly superhuman models”, Ajeya Cotra advocates for performing technical alignment work on currently available large models using human feedback, in domains where (some) humans are already outperformed (in some aspects). The direct benefits of this line of work could include testing out conceptual work, highlighting previously unknown challenges and eliciting new scalable, general solutions. Indirect benefits could include moving the field of ML in a better direction and building the community and infrastructure/tooling of AI Safety. This post summarizes the original post, some related work and some objections.
Progress in ML could allow us to do technical alignment research that is closer to the "real problem" than specific toy examples. In particular, one of the main promising directions (among others such as Interpretability and Truthful and honest AI) is discovering better methods of giving feedback to models more capable than us. This might be easy for some concrete, narrow tasks, such as playing Go (there is an algorithmic way to decide if a model is playing Go better than another model), but for more “fuzzy” tasks (such as “devise a fair economic policy”), we can’t evaluate the model according to such a gold standard.
A couple of examples of work falling into this category (See Example tasks):
These tasks call for more sophisticated ways of generating feedback for the models. Some of the approaches that can be used include debate or decomposing the question into subtasks. See section Possible approaches to alignment.
To test these methods, Ajeya also introduces the concept of sandwiching: produce a model using data and feedback from humans skilled at a task (to establish a gold standard), then use a model and some of these alignment approaches to achieve the same performance with data and feedback from people not skilled in the area.
The most important benefits this line of work could provide are:
Main critiques of the proposal and cruxes on the topic include (See section Critiques and cruxes based on the comments on the post and Comments by MIRI):
The post has been well-received in the community. Open Philanthropy has since accepted a round of grant proposals. OpenAI, Redwood Research, Anthropic and Ought are already doing related work.
Example tasks from Ajeya’s proposal in Open Philanthropy’s Request for Proposals include:
Ajeya argues that the necessity to use humans and a “more useful” model at the end are indicators of an interesting task.
I personally would add a few ideas:
There is some already existing work in this general direction, most notably OpenAI using human feedback to finetune GPT-3 for summarisation, web search and following instructions via reward modelling. Anthropic recently published a similar paper, but on a more diverse set of tasks.
Redwood Research is working on a concrete problem in this space, namely getting a large language model to output stories which never include someone getting injured.
In the original post, what is considered alignment is intentionally left very open-ended, as part of the point is exploring promising proposals.
In order to make the research useful in the long-term, projects should strive to be:
Some of the ideas worth exploring include:
In addition to these, [in my opinion] it could be interesting to make the feedback process more interactive, informative and dynamic.
In the above-mentioned OpenAI articles, feedback is gathered by asking human labelers to compare two particular solutions, then training a reward model on the comparisons and finetuning the original model against the reward model.
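As a rough illustration of the reward-modelling step, here is a minimal sketch of fitting a reward model to pairwise comparisons with the loss `-log(sigmoid(r(preferred) - r(rejected)))`. The linear model, the two-dimensional feature vectors and the toy comparison data are all assumptions made for this example; real systems use a large neural network as the reward model and then finetune the policy against it, which is not shown here.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Hypothetical linear reward model: a weighted sum of hand-made features.
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    # Low when the model scores the human-preferred solution higher.
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient descent step on the pairwise loss for this comparison.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Toy data (assumed): each pair is (features of preferred, features of rejected).
comparisons = [
    ([0.9, 0.2], [0.4, 0.3]),
    ([0.8, 0.5], [0.5, 0.3]),
    ([0.7, 0.1], [0.2, 0.4]),
]
weights = train_reward_model(comparisons, n_features=2)
```

After training, `reward(weights, preferred)` exceeds `reward(weights, rejected)` for every pair in this toy set, which is all the labelers' comparisons ask of the reward model.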
I would be excited to see approaches improving on this by:
To test a given alignment approach in practice, we can establish a baseline performance using training data and feedback from “empowered” humans, and then try to get as close as possible to that performance with “non-empowered” humans who use and provide feedback to our model.
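One way to make “as close as possible” concrete is to report what fraction of the gap to the empowered baseline an approach recovers. The sketch below is only an illustration: the `SandwichResult` name, the extra “unaided” reference point (non-empowered humans giving naive feedback) and the example scores are assumptions of this summary, not something the original proposal specifies.

```python
from dataclasses import dataclass

@dataclass
class SandwichResult:
    expert_score: float    # baseline reached with "empowered" humans
    unaided_score: float   # assumed reference: "non-empowered" humans, naive feedback
    assisted_score: float  # "non-empowered" humans using the alignment approach

    def gap_closed(self) -> float:
        """Fraction of the expert/unaided gap recovered by the approach."""
        gap = self.expert_score - self.unaided_score
        if gap <= 0:
            raise ValueError("expert baseline must exceed the unaided score")
        return (self.assisted_score - self.unaided_score) / gap

# Made-up scores on some task metric where higher is better.
result = SandwichResult(expert_score=0.9, unaided_score=0.5, assisted_score=0.8)
fraction = result.gap_closed()  # roughly 0.75 for these made-up numbers
```

A value near 1 would mean the approach lets non-empowered humans almost match the empowered baseline, while a value near 0 would mean it adds little over naive feedback.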
There are different ways we can make an “empowered” and “non-empowered” set of humans:
This particular summary of the discussion reflects how I interpreted arguments, and some of them are my own.
Most of the benefits of this line of work hinge on how good a "training ground" aligning current models is for aligning more powerful models. In my opinion the following factors, if true, would make the case especially strong:
If something like the scaling hypothesis works, and we can use fairly similar approaches to current ones to reach transformative AI, then
Fixing a broken pocket calculator (which is undoubtedly superhuman), or even making Google output better results isn’t traditionally considered a field of AI Alignment. Some people view GPT-3 as more of a fact-retrieval engine. If that is the case, maybe “alignment work” on GPT-3 is more similar to working on Google Search. Or, perhaps, as a less theoretical example, one could argue that transformative AI is likely to be agentic, and since language models are not agentic, a number of issues (such as inner alignment) are just not applicable. (The proposal does not limit itself to language models, but this counterargument applies to other models as well.)
That being said, models such as GPT-3 or AlphaZero might have superhuman qualities that are relevant, such as a superhumanly rich latent world model, or superhuman ability to “evaluate” multiple plans/concepts. (Many more than a human can hold in working memory.)
While sandwiching is an interesting idea, Eliezer advocates for caution: maybe we can align current weak models with “weak” humans using some clever techniques, but there is a chance these techniques break down once the model “figures out how to manipulate humans”, “hacks itself”, “changes its own environment” or “is just optimising stronger”.
Even if future models are going to be similar to current ones in architecture, this line of work is less valuable if simply stronger capabilities introduce entirely new types of risk. Deception seems like a very good example, since we cannot observe it in current models.
This is a subset of the previous point, but an important one.
If we had strong transparency tools, a lot of suggested concrete avenues would open for alignment work: we could provide better feedback to our models, verify their inner alignment (in particular, prevent deception) and audit (see Automated Auditing) them for robustness failures. If we have a chance at getting better at these techniques, concrete research in this direction involving current models is very valuable.
One could argue that this would be done by industry anyway, as they have a clear incentive to make their models more useful.
Ajeya counters that while there is related work, it is not exactly aimed at solving alignment in the long-term. (So for example they would not evaluate their approaches through sandwiching, or would cut corners and choose hacky solutions instead of general ones.)
In addition, pushing for more human-feedback research could, by setting a precedent and demonstrating that some approaches work:
It has been suggested that we could instead evaluate concrete conceptual proposals, such as ascription universality, automated auditing, making models more honest, HCH or Debate.
To the extent that these proposals are already implementable, they indeed seem like very good proposals for “aligning narrowly superhuman models”. However, it might be the case that these proposals are hard to implement precisely without direct guidance from a conceptual researcher like Paul Christiano.
It also strikes me as a strong argument that conceptual work can benefit from trying things out in practice, and that we might discover new approaches not previously considered. (Thus allowing more open-ended exploration makes sense.)