A challenge for AGI organizations, and a challenge for readers

Eliezer Yudkowsky

Thanks for writing this! I'd be very excited to see more critiques of our approach and it's been great reading the comments so far! Thanks to everyone who took the time to write down their thoughts! :)

I've also written up a more detailed post on why I'm optimistic about our approach. I don't expect this to be persuasive to most people here, but it should give a little bit more context and additional surface area to critique what we're doing.

[-]johnswentworth3y174

My own responses to OpenAI's plan:

Worlds Where Iterative Design Fails, for the RLHF part and also a lot of the general mindset
Rant on Problem Factorization, for the debate etc part
Godzilla Strategies, for the "use AI to aid AI alignment" part

These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI's plan, but I think they cover the most egregious issues.

[-]Jozdien3y30

I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.

I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.

I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.

[-]Daniel Kokotajlo3y*1521

I'm happy to see OpenAI and OpenAI Alignment Team get recognition/credit for having a plan and making it public. Well deserved I'd say. (ETA: To be clear, like the OP I don't currently expect the plan to work as stated; I expect us to need to pivot eventually & hope a better plan comes along before then!)

[-]Algon3y94

What's MIRI's current plan? I can't actually remember, though I do know you've pivoted away from your strategy for Agent Foundations. But that wasn't the only agenda you were working on, right?

[-]Rob Bensinger3y1611

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.
E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).
Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.

Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.

MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"

My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.

If an alignment idea strikes us as having even a tiny scrap of hope, and isn't already funding-saturated, then we're making sure it gets funded. We don't care whether that happens at MIRI versus elsewhere — we're just seeking to maximize the amount of good work that's happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety of different moonshots and speculative ideas that are targeted at the core difficulties of alignment (described in the AGI Ruin and sharp left turn write-ups).
If an idea seems to have a significant amount of hope, and not just a tiny scrap (either at a glance, or after being worked on for a while by others and bearing surprisingly promising fruit), then I expect that MIRI will make that our new organizational focus, go all-in, and pour everything we have into helping with it as much as we can. (E.g., we went all-in on our 2017-2020 research directions, before concluding in late 2020 that these were progressing too slowly to still have significant hope, though they might still meet the "tiny scrap of hope" bar.)

None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.

Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.

Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)

(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)

[-]Raemon3y157

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for a pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have generally demonstrated pretty bad judgment by, um, being capabilities researchers.

[-]VojtaKovarik3y85

My ~2-hour reaction to the challenge:^[1]

(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. Therefore, their plan is perhaps not unarguably harmful, but certainly irresponsible.^[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI's leadership.

(III)^[3] My assumption: To make sense of the text, I will from now assume that the post is endorsed by OpenAI's alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI's capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don't have any inside knowledge into OpenAI. This assumption seems plausible to me, and very sad.)

(IV) A general comment that I would otherwise need to repeat for essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out to not hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.

Assumptions implied by OpenAI's plans, with my reactions:

(V) Easy alignment / warning shots for misaligned AGI:
"Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]" My biggest objection to the whole plan already concerns the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don't need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don't expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
(VI) "AGI alignment" isn't "AGI complete":
This is already acknowledged in the post: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems." However, it isn't exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that "this time, it will be powerful enough to help with alignment"?
(VII) Related assumption: No lethal discontinuities:
The whole post suggests the workflow "new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1". (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don't see evidence for "ability to hold off on capabilities research". What are the organizational procedures allowing this?
(VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn't mention because I didn't have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
(IX) Regarding "outer alignment alignment": Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
It is good to at least acknowledge that there might be other parts of AI alignment than just "figuring out learning from human feedback (& human-feedback augmentation)". However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
(X) Ability to differentially use capabilities progress towards alignment progress:
The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won't also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
(XI) Creating an aligned AI is sufficient for getting AI to go well:
The plan doesn't say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton window strategies and (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, (c) no "explicit assumptions & detection system & course-correction-procedure" for "nothing will go wrong if we just do (b)".

General complaint: The plan is not a plan at all! It's just a meta-plan.

(XII) Ultimately, I would paraphrase the plan-as-stated as: "We don't know how to solve alignment. It seems hard. Let's first build an AI to make us smarter, and then try again." I think OpenAI should clarify whether this is literally true, or whether there is some idea for what the object-level AI alignment plan looks like --- and if so, what it is.
(XIII) For example, the post mentions that "robustness and interpretability research [is important for the plan]". However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn't make it any less of an issue!) This means that the plan is not detailed enough.
As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can't see the step that would fail if X was untrue. This doesn't say anything good about your proof.

^[1]
Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
As far as I know, I came up with points (I), (III), and (XII) myself and I don't remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI's publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)
^[2]
(II) For example, consider the following claim: "We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants." My reaction: Yes, technically speaking this is true. But likewise --- please excuse the jarring analogy --- the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn't it be even better if you personally didn't plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
^[3]
Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).

[-]Aaron_Scher3y10

(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect.

It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.

The use of AI assistants for alignment : capabilities doesn't have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.

[-]VojtaKovarik3y10

> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect.

Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we would want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research".^[1] And then the corresponding more realistic version of the claims would be that:

either (i') AI assistants will fundamentally be able to speed up alignment much more than capabilities
or (ii') the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research
or (iii') both the potential speedup ratios and the adoption rates of AI assistants for capabilities research will be comparable, but somehow we will have enough time to solve alignment anyway.

Comments:

Regarding (iii'): It seems that in the worlds where (iii') holds, you could just as well solve alignment without developing AI assistants.
Regarding (i'): Personally I don't buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i') seems rather hard to me.)
Regarding (ii'): As before, this seems implausible based on the track record :-).
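As a rough illustration of how (i'), (ii') and (iii') differ, here is a minimal sketch of a toy model. Everything in it is an assumption made for illustration: the function name `relative_progress`, the linear way adoption scales the capabilities speedup, and all the numbers are hypothetical and are not taken from OpenAI's post or from this thread.

```python
def relative_progress(alignment_speedup, capabilities_speedup, capabilities_adoption=1.0):
    """Toy model (hypothetical): how much the alignment:capabilities
    progress ratio changes once AI assistants exist.

    alignment_speedup     -- factor by which assistants speed up alignment work
    capabilities_speedup  -- factor by which they could speed up capabilities work
    capabilities_adoption -- fraction (0..1) of that potential speedup actually
                             realized in capabilities research (1.0 = no restriction)
    A result of 1.0 means the ratio is unchanged by the assistants.
    """
    effective_capabilities = 1 + capabilities_adoption * (capabilities_speedup - 1)
    return alignment_speedup / effective_capabilities


# Made-up numbers, one scenario per case in the list above.
scenarios = {
    "(i') assistants favor alignment": relative_progress(3.0, 1.5),
    "(ii') comparable speedup, restricted proliferation": relative_progress(3.0, 3.0, 0.25),
    "(iii') comparable speedup and adoption": relative_progress(3.0, 3.0, 1.0),
}

for name, ratio in scenarios.items():
    print(f"{name}: ratio changes by a factor of {ratio:.2f}")
```

In this toy model, case (iii') leaves the ratio where it started, which is one way to read the comment that in those worlds the assistants do not buy any differential progress.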

^[1]
This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.

[-]Aaron_Scher3y10

Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan.

4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.

I don't know how to link to the specific comment, but here somewhere. Also:

We can focus on tasks differentially useful to alignment research

Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors then yeah, you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like "use LLMs to help with alignment research or get left behind when ML research gets automated". If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn't expect it to be particularly good. The option "everybody agrees to not build AI assistants and we can do alignment research first" is maybe not on the table, or at least it probably doesn't feel like it is to the alignment team at OpenAI.

[-]VojtaKovarik3y10

Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.

But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

[-]VojtaKovarik3y10

(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)

[-]Steven Byrnes3y50

See my comment on Jan’s new post.

[-]Donald Hobson3y30

I have a response here.

https://www.lesswrong.com/posts/3oNZA9wTrFJRH6Sau/my-thoughts-on-openai-s-alignment-plan

[-]Alex Flint3y20

Here is a critique of OpenAI's plan

[-]DaemonicSigil3y20

On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:

Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
If you're training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we'd expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we're just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.
To make the above worse, there's this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones (and maybe all the weird ones, because of corrigibility reasons, though we'd have to be careful about how that's formulated because we don't want the AI trying to kill us because it thinks we'd produce a weird future).)
Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than the human evaluator's reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.
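
The worry above, that a stronger optimizer finds more places where the learned reward is high but a human would score it low, can be illustrated with a very small toy. This is only a sketch under stated assumptions: `human_reward` is a made-up stand-in for human judgment, the "learned reward" is a least-squares line rather than a neural network, and a widening search range stands in for increasing capability. It shows proxy over-optimization outside the training range, not adversarial examples in the strict sense.

```python
def human_reward(x):
    """Hypothetical 'true' human judgment: best at x = 1, worse beyond it."""
    return 2 * x - x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Human feedback only covers actions in [0, 1], where "more x" looks better.
train_xs = [i / 10 for i in range(11)]
slope, intercept = fit_line(train_xs, [human_reward(x) for x in train_xs])


def learned_reward(x):
    """Proxy reward learned from the limited human feedback above."""
    return slope * x + intercept


def optimize(search_limit, steps=100):
    """Pick the action in [0, search_limit] that the learned reward likes most."""
    candidates = [search_limit * i / steps for i in range(steps + 1)]
    return max(candidates, key=learned_reward)


results = []
for limit in (1.0, 2.0, 4.0):  # a wider search stands in for a more capable AI
    x = optimize(limit)
    results.append((limit, x, learned_reward(x), human_reward(x)))
    print(f"limit={limit}: action={x:.2f}, "
          f"learned={learned_reward(x):.2f}, human={human_reward(x):.2f}")
```

In this toy, the learned reward keeps rising as the search range widens while the stand-in human reward falls, because the optimizer pushes into a region the feedback never covered.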

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:

The AI can try and fool the critic just like it would fool humans. It doesn't even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.
It's therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.
On some kinds of task, you'll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made. This is important because we can update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we're primarily interested in, it will be very hard or impossible to get the ground truth list of criticisms. So we won't be able to update the weights of the model that way when training. So in some sense, we're trying to generalize this idea of "a strong and relevant criticism" between these different tasks of differing levels of difficulty.
This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we've pretty much licked mode collapse by now, but "don't collapse everything down to a single mode" and "make sure you've got good coverage of every single mode in existence" are different problems, and I think the second one is much harder.

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

LLMs are super impressive at generating text that is locally coherent for a much broader definition of "local" than was previously possible. They are also really impressive as a compressed version of humanity's knowledge. They're still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don't see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it's already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn't have to type up the paper at the end!
What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don't think that provides much acceleration. People also get inspired while going on a walk or taking a shower.
Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. E.g., in the simulated world, your learner is embodied in a big robot, and there's a door in the environment it can't fit through, but it can program a little robot to go through the door and perform some tasks for it. And there's enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There's a lot of challenges here, though. Like how do you come up with a programming environment that's simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it's not possible at all until the AI is quite close to human-level.

^{^}

We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!

Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.

In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.

^{^}

Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”

^{^}

Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."

AI ALIGNMENT FORUM