Some quotes:

Our approach

Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:

  1. To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
  2. To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
  3. Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).

We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future.


While this is an incredibly ambitious goal and we’re not guaranteed to succeed, we are optimistic that a focused, concerted effort can solve this problem:

There are many ideas that have shown promise in preliminary experiments, we have increasingly useful metrics for progress, and we can use today’s models to study many of these problems empirically. 

Ilya Sutskever (cofounder and Chief Scientist of OpenAI) has made this his core research focus, and will be co-leading the team with Jan Leike (Head of Alignment). Joining the team are researchers and engineers from our previous alignment team, as well as researchers from other teams across the company.

I've been trying to understand (catch up with) the current alignment research landscape, and this seems like a good opportunity to ask some questions.

  1. This post links to which seems to be a continuation of Geoffrey Irving et al.'s Debate and Jan Leike et al.'s recursive reward modeling, both of which are in turn closely related to Paul Christiano's IDA, so that seems to be the main alignment approach that OpenAI will explore in the near future. Is this basically correct?
  2. From a distance (judging from posts on the Alignment Forum and what various people publicly describe themselves as working on) it looks like research interest in Debate and IDA (outside of OpenAI) has waned a lot over the last 3 years, which seems to coincide with the publication of Obfuscated Arguments Problem which applies to Debate and also to IDA (although the latter result appears to not have been published), making me think that this problem made people less optimistic about IDA/Debate. Is this also basically correct?
  3. Alternatively or in addition (this just occurred to me), maybe people switched away from IDA/Debate because they're being worked on inside OpenAI (and probably DeepMind where Geoffrey Irving currently works) and they want to focus on more neglected ideas?
  1. Yes, we are currently planning to continue pursuing these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training+testing will also be core pieces, but I expect we'll add more over time.

  2. I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't think there was ever good reason to believe that IDA/debate/RRM would scale indefinitely, and I personally don't think that problem will be a big blocker for a while for some of the tasks that we're most interested in (alignment research). My understanding is that many people at DeepMind and Anthropic remain optimistic about debate variants and have been running a number of preliminary experiments (see e.g. this Anthropic paper).

  3. My best guess for the reason why you haven't heard much about it is that people weren't that interested in running on more toy tasks or doing more human-only experiments and LLMs haven't been good enough to do much beyond critique-writing (we tried this a little bit in the early days of GPT-4). Most people who've been working on this recently don't really post much on LW/AF.

Thanks for engaging with my questions here. I'll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by "it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents" in the other subthread?

I'm not entirely sure but here is my understanding:

I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system's hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, and as long as it's not too much it shouldn't be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.

Personally I am more bullish on understanding how we can get agents to help with evaluation of other agents' outputs such that you can get them to tell you about all of the problems they know about. The "offense-defense" balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.

1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.

2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which is what I'm focusing on), and obfuscated arguments are a big part of the obstacle. But there's not much indication yet that it's a practical problem for aligning modestly superhuman systems (while at the same time I think research on decomposition and debate has mostly engaged with more boring practical issues). I don't think obfuscated arguments have been a major part of most people's research prioritization.

3. I think many people are actively working on decomposition-focused approaches. I think it's a core part of the default approach to prosaic AI alignment at all the labs, and if anything is feeling even more salient these days as something that's likely to be an important ingredient. I think it makes sense to emphasize it less for research outside of labs, since it benefits quite a lot from scale (and indeed my main regret here is that working on this for GPT-3 was premature). There is a further question of whether alignment people need to work on decomposition/debate or should just leave it to capabilities people---the core ingredient is finding a way to turn compute into better intelligence without compromising alignment, and that's naturally something that is interesting to everyone. I still think that exactly how good we are at this is one of the major drivers for whether the AI kills us, and therefore is a reasonable topic for alignment people to push on sooner and harder than it would otherwise happen, but I think that's a reasonable topic for controversy.

Thanks for this helpful explanation.

"it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents"

Can you point me to the original claims? While trying to find it myself, I came across which seems to be the most up to date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so should be a good resource for me to read more carefully). Are you perhaps referring to the section "Evaluation is easier than generation"?

Do you have any major disagreements with what's in Jan's post? (It doesn't look like you publicly commented on either Jan's substack or his AIAF link post.)

I don't think I disagree with many of the claims in Jan's post; generally I think his high-level points are correct.

He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet).  Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.

One potential substantive disagreement with Jan's position is that I'm somewhat more scared of AI systems evaluating the consequences of each other's actions and therefore more interested in trying to evaluate proposed actions on paper (rather than needing to run them to see what happens). That is, I'm more interested in "process-based" supervision and decoupled evaluation, whereas my understanding is that Jan sees a larger role for systems doing things like carrying out experiments, with evaluation of results in the same way that we'd evaluate an employee's output.

(This is related to the difference between IDA and RRM that I mentioned above. I'm actually not sure about Jan's all-things-considered position, and I think this piece is a bit agnostic on this question. I'll return to this question below.)

The basic tension here is that if you evaluate proposed actions you easily lose competitiveness (since AI systems will learn things overseers don't know about the consequences of different possible actions) whereas if you evaluate outcomes then you are more likely to have an abrupt takeover where AI systems grab control of sensors / the reward channel / their own computers (since that will lead to the highest reward). A subtle related point is that if you have a big competitiveness gap from process-based feedback, then you may also be running an elevated risk from deceptive alignment (since it indicates that your model understands things about the world that you don't).

In practice I don't think either of those issues (competitiveness or takeover risk) is a huge issue right now. I think process-based feedback is pretty much competitive in most domains, but the gap could grow quickly as AI systems improve (depending on how well our techniques work). On the other side, I think that takeover risks will be small in the near future, and it is very plausible that you can get huge amounts of research out of AI systems before takeover is a significant risk. That said I do think eventually that risk will become large and so we will need to turn to something else: new breakthroughs, process-based feedback, or fortunate facts about generalization.

As I mentioned, I'm actually not sure what Jan's current take on this is, or exactly what view he is expressing in this piece. He says:

Another important open question is how much easier evaluation is if you can’t rely on feedback signals from the real world. For example, is evaluation of a piece of code easier than writing it, even if you’re not allowed to run it? If we’re worried that our AI systems are writing code that might contain trojans and sandbox-breaking code, then we can’t run it to “see what happens” before we’ve reviewed it carefully.

I'm not sure where he comes down on whether we should use feedback signals from the real world, and if so what kinds of precaution we should take to avoid takeover and how long we should expect them to hold up. I think both halves of this are just important open questions---will we need real world feedback to evaluate AI outcomes? In what cases will we be able to do so safely? If Jan is also just very unsure about both of these questions then we may be on the same page.

I generally hope that OpenAI can have strong evaluations of takeover risk (including: understanding their AI's capabilities, whether their AI may try to take over, and their own security against takeover attempts). If so, then questions about the safety of outcomes-based feedback can probably be settled empirically and the community can take an "all of the above" approach. In this case all of the above is particularly easy since everything is sitting on the same spectrum. A realistic system is likely to involve some messy combination of outcomes-based and process-based supervision; we'll just be adjusting dials in response to evidence about what works and what is risky.

"goodness of HCH"

What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?

How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because ...

"instead relies on some claims about offense-defense between teams of weak agents and strong agents"

I'm curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it's up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)

Very nice! I'd say this seems like it's aimed at a difficulty level of 5 to 7 on my table, i.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

Does anyone have guesses or direct knowledge of:

  1. What are OpenAI's immediate plans? For example what are the current/next alignment-focused ML projects they have in their pipeline?
  2. What kind of results are they hoping for at the end of 4 years? Is it to actually "build a roughly human-level automated alignment researcher", or is that a longer-term goal and the 4-year goal is just to achieve some level of understanding of how to build and align such an AI?

I was informed by an OpenAI insider that the 4 year goal is actually “build a roughly human-level automated alignment researcher”.