I have recently encountered a number of people with misconceptions about OpenAI. Some common impressions are accurate, and others are not. This post is intended to provide clarification on some of these points, to help people know what to expect from the organization and to figure out how to engage with it. It is not intended as a full explanation or evaluation of OpenAI's strategy. 

The post has three sections:

  • Common accurate impressions
  • Common misconceptions
  • Personal opinions

The bolded claims in the first two sections are intended to be uncontroversial, i.e., most informed people would agree with how they are labeled (correct versus incorrect). I am less sure about how commonly believed they are. The bolded claims in the last section I think are probably true, but they are more open to interpretation and I expect others to disagree with them.

Note: I am an employee of OpenAI. Sam Altman (CEO of OpenAI) and Mira Murati (CTO of OpenAI) reviewed a draft of this post, and I am also grateful to Steven Adler, Steve Dowling, Benjamin Hilton, Shantanu Jain, Daniel Kokotajlo, Jan Leike, Ryan Lowe, Holly Mandel and Cullen O'Keefe for feedback. I chose to write this post and the views expressed in it are my own.

Common accurate impressions

Correct: OpenAI is trying to directly build safe AGI.

OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describes trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and has expressed concern about scenarios in which a bad actor is first to build AGI and chooses to misuse it.

Correct: the majority of researchers at OpenAI are working on capabilities. 

Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:

  • Capabilities research: 100
  • Alignment research: 30
  • Policy research: 15

Correct: the majority of OpenAI employees did not join with the primary motivation of reducing existential risk from AI specifically.

My strong impressions, which are not based on survey data, are as follows. Across the company as a whole, a minority of employees would cite reducing existential risk from AI as their top reason for joining. A significantly larger number would cite reducing risk of some kind, or other principles of beneficence put forward in the OpenAI Charter, as their top reason for joining. Among people who joined to work in a safety-focused role, a larger proportion of people would cite reducing existential risk from AI as a substantial motivation for joining, compared to the company as a whole. Some employees have become motivated by existential risk reduction since joining OpenAI.

Correct: most interpretability research at OpenAI stopped after the Anthropic split.

Chris Olah led interpretability research at OpenAI before becoming a cofounder of Anthropic. Although several members of Chris's former team still work at OpenAI, most of them are no longer working on interpretability.

Common misconceptions

Incorrect: OpenAI is not working on scalable alignment.

OpenAI has teams focused both on practical alignment (trying to make OpenAI's deployed models as aligned as possible) and on scalable alignment (researching methods for aligning models that are beyond human supervision, which could potentially scale to AGI). These teams work closely with one another. Its recently-released alignment research includes self-critiquing models (AF discussion), InstructGPT, WebGPT (AF discussion) and book summarization (AF discussion). OpenAI's approach to alignment research is described here, and includes as a long-term goal an alignment MVP (AF discussion).

Incorrect: most people who were working on alignment at OpenAI left for Anthropic. 

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic. Edited to add: this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here.

Incorrect: OpenAI is a purely for-profit organization.

OpenAI has a hybrid structure in which the highest authority is the board of directors of a non-profit entity. The members of the board of directors are listed here. In legal paperwork signed by all investors, it is emphasized that: "The [OpenAI] Partnership exists to advance OpenAI Inc [the non-profit entity]'s mission of ensuring that safe artificial general intelligence is developed and benefits all of humanity. The General Partner [OpenAI Inc]'s duty to this mission and the principles advanced in the OpenAI Inc Charter take precedence over any obligation to generate a profit. The Partnership may never make a profit, and the General Partner is under no obligation to do so."

Incorrect: OpenAI is not aware of the risks of race dynamics.

OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

OpenAI has a Governance team (within Policy Research) that advises leadership and is focused on strategy for avoiding existential risk from AI. In multiple recent all-hands meetings, OpenAI leadership have emphasized to employees the need to scale up safety efforts over time, and encouraged employees to familiarize themselves with alignment ideas. OpenAI's Chief Scientist, Ilya Sutskever, recently pivoted to spending 50% of his time on safety.

Personal opinions

Opinion: OpenAI leadership cares about reducing existential risk from AI.

I think that OpenAI leadership are familiar with, and agree with, the basic case for concern, and appreciate the magnitude of what's at stake. Existential risk is an important factor, but not the only factor, in OpenAI leadership's decision making. OpenAI's alignment work is much more than just a token effort.

Opinion: capabilities researchers at OpenAI have varying attitudes to existential risk.

I think that capabilities researchers at OpenAI have a wide variety of views, including some with long timelines who are skeptical of attempts to mitigate risk now, and others who are supportive but may consider the question to be outside their area of expertise. Some capabilities researchers actively look for ways to help with alignment, or to learn more about it.

Opinion: disagreements about OpenAI's strategy are substantially empirical.

I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

Opinion: I am personally extremely uncertain about strategy-related questions.

I do not spend most of my time thinking about strategy. If I were forced to choose between OpenAI speeding up or slowing down its work on capabilities, my guess is that I would end up choosing the latter, all else equal, but I am very unsure.

Opinion: OpenAI's actions have drawn a lot of attention to large language models.

I think that the release of GPT-3 and the OpenAI API led to significantly increased focus on large language models, and somewhat of a competitive spirit around them. I consider there to be advantages and disadvantages to this. I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee.

Opinion: OpenAI is deploying models in order to generate revenue, but also to learn about safety.

I think that OpenAI is trying to generate revenue through deployment in order to directly create value and in order to fund further research and development. At the same time, it also uses deployment as a way to learn in various ways, and about safety in particular.

Opinion: OpenAI's particular research directions are driven in large part by researchers.

I think that OpenAI leadership has control over staffing and resources that affects the organization's overall direction, but that particular research directions are largely delegated to researchers, because they have the most relevant context. OpenAI would not be able to do impactful alignment research without researchers who have a strong understanding of the field. If there were talented enough researchers who wanted to lead new alignment efforts at OpenAI, I would expect them to be enthusiastically welcomed by OpenAI leadership.

Opinion: OpenAI should be focusing more on alignment.

I think that OpenAI's alignment research in general, and its scalable alignment research in particular, has significantly higher average social returns than its capabilities research on the margin.

Opinion: OpenAI is a great place to work to reduce existential risk from AI.

I think that the Alignment, RL, Human Data, Policy Research, Security, Applied Safety, and Trust and Safety teams are all doing work that seems useful for reducing existential risk from AI.

75 comments

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.

I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.

Again, I might be totally wrong here, but as far as I understand the underlying dynamics is that there was a substantial contingent of people who worked at OpenAI because they cared about safety but worked in a variety of different roles, including many engineering roles. That contingent had pretty strong disagreements with leadership about a mixture of safety and other operating priorities (but I think mostly safety). Dario in-particular had lead a lot of the capabilities research and was dissatisfied with how the organization was run.

Dario left and founded Anthropic, taking a substantial amount of engineering and research talent with him (I don'...

Jacob Hilton:
Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities

AFAICT, no one from OpenAI has publicly explained why they believe that RLHF + amplification will be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.

Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you're right, then the value of information for the rest of the community is huge; if you're wrong, it's an opportunity to change your minds.

Opinion: disagreements about OpenAI's strategy are substantially empirical.

I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff, however you frame it the issue is the same: things look fine early on, and go wrong later.

And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don't see a problem then there isn't a problem (and meanwhile does things which actively select for hiding problems, like e.g. RLHF).

To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.

[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]

I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):

I would expect OpenAI leadership to put more weight on experimental evidence than you...

Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we'll get no experimental evidence before it's too late]

Unless we expect Y to be empty, when we're talking about Y-problems the weighting is irrelevant: we get no experimental evidence.

Weighting of evidence is an issue when dealing with a fixed problem.
It seems here as if it's being used to select the problem: we're going to focus on X-problems because we put a lot of weight on experimental evidence. (obviously silly, so I don't imagine anyone consciously thinks like this - but out-of-distribution intuitions may be at work)

What kind of evidence do you imagine would lead OpenAI leadership to change their minds/approach?
Do you / your-model-of-leadership believe that there exist Y-problems?

Jacob Hilton:
I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.
Joe_Collman:
To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem. Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental. So the question on Y-problems becomes something like:

  • Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
  • ...do you believe there are high-stakes problems for which we'll get no decision-relevant [experimental evidence] before it's too late?

Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.

If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.

Of course, "empirical evidence of power-seeking behavior" is a lot weaker than a magical box. With only that level of empirical evidence, most of the "no empirical feedback" problem would still be present. More on that next.

Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

The key "lack of empirical feedback" property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will ...

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

Richard Ngo:
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work). I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But out of the things John listed, I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do feel like it would have been extremely hard to build rockets if we had to get it right on the very first try.

I think that for rockets, the fact that it is so costly to experiment explains the majority of the difficulty of rocket engineering. I agree you also have very little chance of building a successful space rocket without a good understanding of Newtonian mechanics and some aspects of relativity, but I don't know - if I could just launch a rocket every day without bad consequences, I am pretty sure I wouldn't really need a deep understanding of either of those, or would easily figure out the relevant bits as I kept experimenting.

The reason rocket science relies so much on having solid theoretical models is that we have to get things right in only a few shots. I don't think you really needed any particularly good theory to build trains, for example - just a lot of attempts and tinkering.

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't a great framing, because there are always precursors (e.g. we landed a man on the moon "on our first try" but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are - something where I expect our differing attitudes about the value of empiricism to be the key crux.

David Scott Krueger:
Well, you could probably build a rocket that looks like it works, anyway. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets? Or just launch and watch from the ground? I was imagining the second...)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?

A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI if progress were to be slowed down. Below I'll also discuss two arguments that delaying AI progress would on net reduce alignment risk which I often encountered at OpenAI.

I think that OpenAI has had a meaningful effect on accelerating AI timelines and that this was a significant cost that the organization did not adequately consider (plenty of safety-focused folk pushed back on various accelerating decisions, and this is ultimately related to many departures, though not directly my own). I also think that OpenAI is significantly driven by the desire to do something impactful and to reap the short-term benefits of AI. In significant ...

Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.

Could you clarify this bit? It sounds like you're saying that OpenAI's capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think that there's substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which vaguely gestures at the "around 2017" time frame). I don't see how those are compatible.

One positive consideration is: AI will be built at a time when it is more expensive (slowing later progress). One negative consideration is: there was less time for AI-safety-work-of-5-years-ago. I think that this particular positive consideration is larger than this particular negative consideration, even though other negative considerations are larger still (like less time for growth of AI safety community).

Chris_Leong:
Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half, or just rationalizing.)

Alignment research: 30

Could you share some breakdown for what these people work on? Does this include things like the 'anti-bias' prompt engineering?

It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons).

(Edit: My current guess for full-time equivalents who are doing safety work at OpenAI (e.g. if someone is doing 50% work that a researcher fully focused on capabilities would do and 50% on alignment work, then we count them as 0.5 full-time equivalents) is around 10, maybe a bit less, though I might be wrong here.)

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

Noosphere89:
The real question for Habryka is why he thinks that it's bad for WebGPT to be built in order to get truthful AI. Like, isn't solving that problem quite a significant thing already for alignment?

WebGPT is approximately "reinforcement learning on the internet".

There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.

I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions for AI to go. And my guess is the majority of the effect of this research will be to cause more people to pursue this direction in the future (Adept.AI seems to be pursuing a somewhat similar approach).

Edit: Jacob does talk about this a bit in a section I had forgotten about in the truthful LM post:

Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.

I think this concern is worth taking seriously, but that the case for it is weak:

  • As AI capabilities improve, the level of access to the ext...

The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct. 

  • When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with fewer resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this "boxing benefit" (as claimed by the quote you are objecting to).
  • I assume the harms you are pointing to here are about setting expect...

If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.

Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAIs work). In-general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like Deepmind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).

WebGPT strikes me as on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase economic returns from AI). And then it also has the additional side effect of pushing us into a paradigm of AIs that are much harder to align, so doing alignment work in that paradigm will be slower (as has, I think, a bunch of the RLHF work, though there is a more reasonable case for a commensurate benefit in terms of the technology also being useful for AI alignment).

I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.

I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that's the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).

I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.

I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action space while interfacing with humans and getting direct reinforcement on approval is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don't understand.

I do indeed think the WebGPT work is relevant to both increasing capabilities and increasing the likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).

I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.

That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like "we'll just have AIs competing against each other and box them and make sure they don't have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment". Buck's post on "The prototypical catastrophic AI action is getting root access to its datacenter" also suggests to me that the "AI gets access to the internet" scenario is a thing that he is pretty concerned about.

More broadly, I remember that Carl Shulman said that he thinks that the reference class of "violent revolutions" is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes fro... (read more)

Paul Christiano · 2y
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like "you are in a secure box and can't get out"; they are mostly facts about all the other AI systems you are dealing with. That said, I think you are overestimating how representative these are of the "mainline" hope in most places; I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet---just over human evaluations of the quality of answers or browsing behavior).

But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.

I do think we are likely to be in a bad spot, and talking to people at OpenAI, DeepMind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.

If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn't rely on extensive boxing, I would agree with you more, but I am currently pretty sure they aren't ensuring that, and by default will hope that they can get far enough ahead with boxing-like strategies.

Rohin Shah · 2y
... Who are you talking to? I'm having trouble naming a single person at either OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don't know them that well). At DeepMind there's a small minority who think about boxing, but I think even they wouldn't think of this as a major aspect of their plan. I agree that they aren't aiming for a "much more comprehensive AI alignment solution" in the sense you probably mean it, but saying "they rely on boxing" seems wildly off. My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent-aligned systems, and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent-aligned systems.
Oliver Habryka · 2y
Here is an example quote from the latest OpenAI blogpost on AI Alignment: This sounds to me straightforwardly like the plan of "we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet". I don't know whether "boxing" is the exact right word here, but it's the strategy I was pointing to.
Rohin Shah · 2y
The immediately preceding paragraph is: I would have guessed the claim is "boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned", rather than "after training, the AI system might be trying to pursue its own goals, but we'll ensure it can't accomplish them via boxing". But I can see your interpretation as well.
Oliver Habryka · 2y
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access. I agree that "train a system with internet access, but then remove it, then hope that it's safe" doesn't really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet is that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.
Rohin Shah · 2y
Oh you're making a claim directly about other people's approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree). I was suggesting that the plan was "train a system without Internet access, then add it at deployment time" (aka "box the AI system during training"). I wasn't at any point talking about WebGPT.
Oliver Habryka · 2y
Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don't think it's a particularly risky direction of research (given that it's consuming about 4-5% of OpenAI's research staff).

I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying public AI systems that could potentially be used in unanticipated ways.

I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."

Daniel Kokotajlo · 2y
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren't significant? And if you had instead told them that the risks were significant, they wouldn't have done it?

As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don't remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.

If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me---I would have said so, and I would have discouraged rather than encouraged the project in general since I certainly was aware of it. But most of the channels would have been through other people---those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.

Obviously I have only indirect knowledge about decision-making at OpenAI so those are just guesses (hence "I believe that it likely wouldn't have happened"). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob's involvement) and indeed I'm afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against.

David Scott Krueger · 2y
I don't think the choice is between "smart and boxed" and "less smart and less boxed".  Intelligence (e.g. especially domain knowledge) is not 1-dimensional, and boxing is largely a means of controlling what kind of knowledge the AI has.  We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically-relevant aspects of the world.

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than "I think OpenAI's alignment team is making bad prioritisation decisions".

Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety including I believe the first two authors - it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.

(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)

Yeah, I agree that I am reasoning about people's motivations here, which is iffy, and given the pushback I will be a bit more hesitant to do so, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.

I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by safety concerns, since the case for it strikes me as so weak and the risk as somewhat obviously high, so I am still trying to process that and will probably make some kind of underlying update.

I do think overall I've had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, instead of being motivated primarily by safety. For example, I currently think that the release strategy for the GPT models of OpenAI is much better explained by OpenAI wanti... (read more)

Neel Nanda · 2y
That seems weirdly strong. Why do you think that?
Jacob Hilton · 2y
For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link; you will have to navigate there yourself.)
Oliver Habryka · 2y
I moved that thread over to the AIAF as well!
Larks · 2y
Thanks!

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

Thanks again for writing this.
A few thoughts:

I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models... I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.

Do you believe any general lessons have been learned from this? Specifically, it seems a highly negative pattern if [we can't predict concretely how this is likely to go badly] translates to [we don't see any reason not to go ahead].

I note that there's an asymmetry here: [states of the world we like] are a small target. To the extent that we can't predict the impact of a large-scale change, we should bet on negative impact.

 

OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI...

Questions:

  1. If we're in a scenario with [slow takeoff], [alignment is fairly easy], and [empirical, capabilities-reliant approaches work well], wouldn't we expect alignment to
... (read more)

I also appreciated reading this.

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

Thanks for writing this! I agree with most of the claims you consider to be objective, and appreciate you writing this up so clearly.

Correct: OpenAI is trying to directly build safe AGI.

OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describes trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and have expressed concern about scenarios in which a bad actor is first to build AGI, and chooses to misuse it.

You seem confused about the difference between "paying lip service to X" and "actually trying to do X".

To be clear, this in itself isn't evidence against the claim that OpenAI is trying to directly build safe AI. But it's not much evidence for it, either.

Correct: the majority of researchers at OpenAI are working on capabilities. 

Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:

  • Capabilities research: 100
  • Alignment research: 30
  • Policy research: 15

I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research... (read more)

Calling work you disagree with "lip service" seems wrong and unhelpful.

There are plenty of ML researchers who think that they are doing real work on alignment and that your research is useless. They could choose to describe the situation by saying that you aren't actually doing alignment research. But I think it would be more accurate and helpful if they were to instead say that you are both working on alignment but have big disagreements about what kind of research is likely to be useful.

(To be clear, plenty of folks also think that my work is useless.)

I think Jacob (OP) said "OpenAI is trying to directly build safe AGI." and cited the charter and other statements as evidence of this claim. Then John replied that the charter and other statements are "not much evidence" either for or against this claim, because talk is cheap. I think that's a reasonable point.

Separately, maybe John in fact believes that the charter and other statements are insincere lip service. If so, I would agree with you (Paul) that John's belief is probably incorrect, based on my very limited knowledge. [Where I disagree with OpenAI, I presume that top leadership is acting sincerely to make a good future with safe AGI, but that they have mistaken beliefs about the hardness of alignment and other topics.]

Paul Christiano · 2y
I was replying to:

I definitely do not use "lip service" as a generic term for alignment research I disagree with. I think you-two-years-ago were on a wrong track with HCH, but you were clearly aiming to solve alignment. Same with lots of other researchers today - I disagree with the approaches of most people in the field, but I do not accuse them of not actually doing alignment research.

No, this accusation is specifically for things like RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us), and for things like "AI ethics" work (which are very obviously not even attempting to solve the extinction problem). In general, it has to be not even trying to solve a problem which kills us in order for me to make that sort of accusation.

If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service. I'd think they were pretty stupid about their strategy, but hey, it's alignment, lots of u... (read more)

I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ign... (read more)

Comments on parts of this other than the ITT thing (response to the ITT part is here)...

(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)

I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.

I think your model here completely fails to predict Descartes, Laplace, Von... (read more)

I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is.

I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.

"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isolate and solve them all without iteration. We can't get many useful bits out of empirical feedback if the result is always failure, and always for a long list of reasons.

And of course, if you know anything about modern fabs, you know there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory. (Side note: I remember a good post or thread from the past year on crazy shit fabs need to do, but can't find it; anyone remember that and have a link?)

The way we actually did it was to start with gigantic millimeter-size features, which were relatively easy to manufacture. And then we scaled down s... (read more)

Richard Ngo · 6mo
Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)
Robert Kirk · 2y
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with "there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory"). Obviously this is a spectrum, but I think the chip fab analogy is further towards people believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we'll face). However, they probably don't believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions to problems not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it's going to be much better to try and actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems such that they can then work on those solutions, but this is still all empirical.) Otherwise I do think your ITT does seem reasonable to me, although I don't think I'd put myself in the class of people you're trying to ITT, so that's not much evidence.
Oliver Habryka · 2y
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces) I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)

RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.

The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.
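The contrast being drawn here (a reward model learned from human preferences versus a hand-coded reward function) can be made concrete with a toy sketch. This is an illustration under made-up assumptions, not OpenAI's actual setup: a linear reward model is fitted to noiseless pairwise preferences with the standard Bradley-Terry logistic loss used in RLHF reward modelling, and all features and the "true" preference vector are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_coded_reward(features):
    # A hand-written proxy reward: only the first feature counts
    # (analogous to fixed reward functions like curiosity bonuses).
    return features[0]

def train_reward_model(pairs, dim, lr=0.1, steps=2000):
    """Fit a linear reward model r(x) = w @ x from preference pairs.

    Each pair is (preferred, dispreferred); the Bradley-Terry logistic
    loss pushes r(preferred) above r(dispreferred), as in RLHF reward
    modelling.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        a, b = pairs[rng.integers(len(pairs))]
        margin = w @ a - w @ b
        p = 1.0 / (1.0 + np.exp(-margin))  # model's P(a preferred)
        w += lr * (1.0 - p) * (a - b)      # gradient ascent on log-likelihood
    return w

# Toy "true" human preference: both features matter equally.
true_w = np.array([1.0, 1.0])
xs = rng.normal(size=(200, 2))
pairs = []
for i in range(0, 200, 2):
    a, b = xs[i], xs[i + 1]
    pairs.append((a, b) if true_w @ a >= true_w @ b else (b, a))

w = train_reward_model(pairs, dim=2)

x = np.array([0.0, 1.0])     # good on exactly the feature the proxy ignores
print(hard_coded_reward(x))  # 0.0: the hard-coded proxy is blind to it
print(w @ x > 0)             # True: the learned reward model is not
```

The sketch illustrates the outer-alignment claim (learned rewards track preferences the hand-coded proxy misses) without bearing on the counterargument below, that approval-based rewards still diverge from what humans actually want.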

I think the smiling example is much more analogous than you are making it out to be here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal; it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less Bayesian evidence conveyed by people smiling (like, the variance in smiling is greater than the variance in RLHF approval, and so the amount of information conveyed is actually more), so I don't buy that RLHF conveys more about human preferences in any meaningful way.

FWIW, I personally know some of the people involved pretty well since ~2015, and I think you are wrong about their motivations. 

johnswentworth · 2y
That is plausible; I have made my position here very easy to falsify if I'm wrong.

How? E.g. Jacob left a comment here about his motivations; does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.

Paul's comment does address both of those, especially this part at the end:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.

That does indeed falsify my position, and I have updated the top-level comment accordingly. Thank you for the information.

In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them! 

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]
Ofer · 2y
Sorry, that text does appear in the linked page (in an image).