I work on AI alignment, by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do.

There are many different reasons that someone could care about this technical problem.

To me the single most important reason is that without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s likely. I believe the total risk is around 10–20%, which is high enough to obsess over.

Existing AI systems aren’t yet able to take over the world, but they are misaligned in the sense that they will often do things their designers didn’t want. For example:

  • The recently released ChatGPT often makes up facts, and if challenged on a made-up claim it will often double down and justify itself rather than admitting error or uncertainty (e.g. see here and here).
  • AI systems will often say offensive things or help users break the law when the company that designed them would prefer otherwise.

We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover. I am particularly interested in training AI systems to be honest, which is likely to become more difficult and important as AI systems become smart enough that we can’t verify their claims about the world.

While it’s nice to have empirical testbeds for alignment research, I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI. In my personal capacity, I have views on what uses of AI are more or less beneficial and what regulations make more or less sense, but in my capacity as an alignment researcher I don’t consider myself to be in the business of pushing for or against any of those decisions.

There is one decision I do strongly want to push for: AI developers should not develop and deploy systems with a significant risk of killing everyone. I will advocate for them not to do that, and I will try to help build public consensus that they shouldn’t do that, and ultimately I will try to help states intervene responsibly to reduce that risk if necessary. It could be very bad if efforts to prevent AI from killing everyone were undermined by a vague public conflation between AI alignment and corporate policies.

To be clear, I don't envy the position of anyone who is trying to deploy AI systems and am not claiming anyone is making mistakes. I think they face a bunch of tricky decisions about how a model should behave, and those decisions are going to be subject to an incredible degree of scrutiny because they are relatively transparent (since anyone can run the model a bunch of times to characterize its behavior).

I'm just saying that how you feel about AI alignment shouldn't be too closely tied up with how you end up feeling about those decisions. There are many applications of alignment like "not doubling down on lies" and "not murdering everyone" which should be extremely uncontroversial, and in general I think people ought to agree that it is better if customers and developers and regulators can choose the properties of AI systems rather than them being determined by technical contingencies of how AI is trained.

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried about seems likely only if, due to economies of scale, very few (competitive) AIs are built by large corporations, and they're all too conservative and inoffensive for many users' tastes. But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.

It also does not seem totally unreasonable to direct some of that concern towards "AI alignment" (as opposed to only corporate policies or government regulators, as you seem to suggest), defined by "technical problem of building AI systems that are trying to do what their designer wants them to do". A steelman of such a "backlash" could be:

  1. Why work on this kind of alignment as opposed to another form that does not or is less likely to cause concentration of power in a few humans, for example AI that directly tries to satisfy humanity's overall values?
  2. According to some empirical and/or ethical views, such concentration of power could be worse than extinction, so maybe such alignment work is bad even if there is no viable alternative.

Not that I would necessarily agree with such a "backlash". I think I personally would be pretty conflicted (in the scenario where it looks like AI will cause major concentration of power) due to uncertainty about the relevant empirical and ethical views.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash.

I don't really think that's the case. 

Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on.

In practice I think it's even worse than that; people get politically worked up about things that don't affect their lives at all through a variety of channels.

I do agree that backlash to X will be bigger if all AIs do X than if some AIs do X.

But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.

I don't think this scenario is really relevant to the most common concerns about concentration of power. I think the most important reason to be scared of concentration of power is:

  • Historically you need a lot of human labor to get things done.
  • With AI the value of human labor may fall radically.
  • So capitalists may get all the profit, and it may be possible to run an oppressive state without a bunch of humans.
  • This may greatly increase economic inequality or make it much more possible to have robust oppressive regimes.

But all of those arguments are unrelated to the number of AI developers.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

A steelman of such a "backlash" could be:

  1. Why work on this kind of alignment as opposed to another form that does not or is less likely to cause concentration of power in a few humans, for example AI that directly tries to satisfy humanity's overall values?

I don't think it's really plausible to have a technical situation where AI can be used to pursue "humanity's overall values" but cannot be used to pursue the values of a subset of humanity.

(I also tend to think that technocratic solutions to empower humanity via the design of AI are worse than solutions that empower people in more legible ways, either by having their AI agents participate in legible institutions or by having AI systems themselves act as agents of legible institutions. I have some similar concerns to those raised by Glen Weyl here, though I disagree on many particulars, and I think we should generally focus efforts in this space on mechanisms that don't predictably shift power to people who make detailed technical decisions about the design of AI that aren't legible to most people.)

2. According to some empirical and/or ethical views, such concentration of power could be worse than extinction, so maybe such alignment work is bad even if there is no viable alternative.

If someone's position is "alignment might prevent total human disempowerment, but it's better for humans to all be disempowered than for some humans to retain power" then I think they should make that case directly. I don't personally have that much sympathy for that position, don't think it would play well with the public, and don't think it's closely related to the kind of backlash I'm imagining in the OP.

Stepping back, the version of this I can most understand is: some people might really dislike some effects of AI, and might justifiably push back on all research that helps facilitate those effects, including research that reduces risks from AI (since that research makes the development of AI more appealing). But for the most part I think that energy can and should be directed to directly blocking problematic applications of AI, or AI development altogether, rather than measures that would reduce the risk of AI.

Another related concern might be that AI will "by default" have some kind of truth-oriented disposition that is above human meddling and alignment is mostly just a tool to move from that default (empowering AI developers). But in practice I think both that the default disposition isn't so good, and also that AI developers have other crappier ways to change AI behavior (which are Pareto dominated) and so in practice this is pretty similar to the previous point.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

Ok, perhaps that doesn't happen because forming cartels is illegal, or because very high prices might attract new entrants, but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage. So you could have a situation where AI developers don't have huge economic power, but do have huge, unprecedented cultural power (similar to today's academia, traditional media, and social media companies, except way more concentrated/powerful).

Compare this situation with a counterfactual one in which instead of depending on huge training runs, AIs were manually programmed and progress depended on slow accumulation of algorithmic insights over many decades, and as a result there are thousands of AI developers tinkering with their own designs and not far apart in the capabilities of the AIs that they offer. In this world, it would be much less likely for any given customer to not be able to find a competitive AI that shares (or is willing to support) their political or cultural outlook.

(I also see realistic possibilities in which AI developers do naturally have very high margins, and way more power (of all forms) than other actors in the supply chain. Would be interested in discussing this further offline.)

I don’t think it’s really plausible to have a technical situation where AI can be used to pursue “humanity’s overall values” but cannot be used to pursue the values of a subset of humanity.

It seems plausible to me that the values of many subsets of humanity aren't even well defined. For example perhaps sustained moral/philosophical progress requires a sufficiently large and diverse population to be in contact with each other and at roughly equal power levels, and smaller subsets (if isolated or given absolute power over others) become stuck in dead-ends or go insane and never manage to reach moral/philosophical maturity.

So an alignment solution based on something like CEV might just not do anything for smaller groups (assuming it had a reliable way of detecting such deliberation failures and performing a fail-safe).

Another possibility here is that if there was a technical solution for making an AI pursue humanity's overall values, it might become politically infeasible to use AI for some other purpose.

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins?

Yes.

What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.

but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage

I think this would also be illegal if justified by the AI company's preferences rather than customer preferences, and it would at least make them a salient political target for people who disagree. It might be OK if they were competing to attract employees/investors/institutional customers. In practice I think it would most likely happen as a move by the dominant faction in a political/cultural conflict in broader society, and this would be a consideration raising the importance of AI researchers and potentially capitalists in that conflict.

I agree if you are someone who stands to lose from that conflict then you may be annoyed by some near-term applications of alignment, but I still think (i) alignment is distinct from those applications even if it facilitates them, (ii) if you don't like how AI empowers your political opponents then I strongly think you should push back on AI development itself rather than hoping that no one can control AI.

I strongly agree with the message in this post, but think the title is misleading. When I read it, it seemed to imply that alignment is distinct from near-term alignment concerns, while after having read the post, it's specifically about how AI is used in the near term. A title like "AI Alignment is distinct from how it is used in the near-term" would feel better to me.

I'm concerned about this, because I think the long-term vs near-term safety distinctions are somewhat overrated, and really wish these communities would collaborate more and focus more on the common ground! But the distinction is a common viewpoint, and it is what this title pattern-matched to.

(Partially inspired by Stephen Casper's post)

"AI developers should not develop and deploy systems with a significant risk of killing everyone."

If you were looking at GPT-4, what criteria would you use to evaluate whether it had a significant risk of killing everyone? 

I'd test whether it has the capability to do so if it were trying (as ARC is starting to do). Then I'd think about potential reasons that our training procedure would lead it to try to take over (e.g. as described here or here or in old writing).

If models might be capable enough to take over if they tried, and if the training procedures plausibly lead to takeover attempts, I'd probably flip the burden of proof and ask developers to explain what evidence leads them to be confident that these systems won't try to take over (what actual experiments did they perform to characterize model behavior, and how well would those experiments detect the plausible concerns?) and then evaluate that evidence.

For GPT-4 I think the main step is just that it's not capable enough to do so. If it were capable enough, I think the next step would be measuring whether it ever engages in abrupt shifts to novel reward-hacking behaviors. If it robustly doesn't then we can be more confident it won't take over (when combined with more mild "on-distribution" measurement), and if it does then we can start to ask what determines when that happens or what mitigations effectively address it.

When you say "take over", what do you specifically mean? In the context of a GPT descendant, would "take over" imply it's doing something beyond providing a text output for a given input? Like it's going out of its way to somehow minimize the cross-entropy loss with additional GPUs, etc.?

Takeover seems most plausible when humans have deployed the system as an agent, whose text outputs are treated as commands that are sent to actuators (like bash), and which chooses outputs in order to achieve desired outcomes. If you have limited control over what outcome the system is pursuing, you can end up with useful consequentialists who have a tendency to take over (since it's an effective way to pursue a wide range of outcomes, including natural generalizations of the ones it was selected to pursue during training).

A few years ago it was maybe plausible to say that "that's not how people use GPT, they just ask it questions." Unfortunately (but predictably) that has become a common way to use GPT-4, and it is rapidly becoming more compelling as the systems improve. I think that in the relatively near future if you want an AI to help you with coding, rather than just getting completions you might say "Hey this function seems to be taking too long, could you figure out what's going on?" and the AI will e.g. do a bisection for you, set up a version of your code running in a test harness, ask you questions about desired behavior, and so on.

I don't think "the system gets extra GPUs to minimize cross entropy loss" in particular is very plausible. (Could happen, but not a high enough probability to be worth worrying about.)