I work on AI alignment, by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do.
There are many different reasons that someone could care about this technical problem.
To me the single most important reason is that without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s likely. I believe the total risk is around 10–20%, which is high enough to obsess over.
Existing AI systems aren’t yet able to take over the world, but they are misaligned in the sense that they will often do things their designers didn’t want. For example:
- The recently released ChatGPT often makes up facts, and if challenged on a made-up claim it will often double down and justify itself rather than admitting error or uncertainty (e.g. see here, here).
- AI systems will often say offensive things or help users break the law when the company that designed them would prefer otherwise.
We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover. I am particularly interested in training AI systems to be honest, which is likely to become more difficult and important as AI systems become smart enough that we can’t verify their claims about the world.
While it’s nice to have empirical testbeds for alignment research, I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”
If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI. In my personal capacity, I have views on what uses of AI are more or less beneficial and what regulations make more or less sense, but in my capacity as an alignment researcher I don’t consider myself to be in the business of pushing for or against any of those decisions.
There is one decision I do strongly want to push for: AI developers should not develop and deploy systems with a significant risk of killing everyone. I will advocate for them not to do that, and I will try to help build public consensus that they shouldn’t do that, and ultimately I will try to help states intervene responsibly to reduce that risk if necessary. It could be very bad if efforts to prevent AI from killing everyone were undermined by a vague public conflation between AI alignment and corporate policies.
To be clear, I don't envy the position of anyone who is trying to deploy AI systems and am not claiming anyone is making mistakes. I think they face a bunch of tricky decisions about how a model should behave, and those decisions are going to be subject to an incredible degree of scrutiny because they are relatively transparent (since anyone can run the model a bunch of times to characterize its behavior).
I'm just saying that how you feel about AI alignment shouldn't be too closely tied up with how you end up feeling about those decisions. There are many applications of alignment like "not doubling down on lies" and "not murdering everyone" which should be extremely uncontroversial, and in general I think people ought to agree that it is better if customers and developers and regulators can choose the properties of AI systems rather than them being determined by technical contingencies of how AI is trained.
Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried about seems likely only if, due to economies of scale, very few (competitive) AIs are built by large corporations, and they're all too conservative and inoffensive for many users' tastes. But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.
It also does not seem totally unreasonable to direct some of that concern towards "AI alignment" (as opposed to only corporate policies or government regulators, as you seem to suggest), defined by "technical problem of building AI systems that are trying to do what their designer wants them to do". A steelman of such a "backlash" could be:
Not that I would necessarily agree with such a "backlash". I think I personally would be pretty conflicted (in the scenario where it looks like AI will cause major concentration of power) due to uncertainty about the relevant empirical and ethical views.
I don't really think that's the case.
Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on.
In practice I think it's even worse than that; people get politically worked up about things that don't affect their lives at all through a variety of channels.
I do agree that backlash to X will be bigger if all AIs do X than if some AIs do X.
I don't think this scenario is really relevant to the most common concerns about concentration of power. I think the most important reason to be scared of concentration of power is:
But all of those arguments are unrelated to the number of AI developers.
Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power,or the downstream applications of AI).
I don't think it's really plausible to have a technical situation where AI can be used to pursue "humanity's overall values" but cannot be used to pursue the values of a subset of humanity.
(I also tend to think that technocratic solutions to empower humanity via the design of AI are worse than solutions that empower people in more legible ways, either by having their AI agents participate in legible institutions or by having AI systems themselves act as agents of legible institutions. I have some similar concerns to those raised by Glen Weyl here though I disagree on many particulars, and think we should generally focus efforts in this space on mechanisms that don't predictably shift power to people who make detailed technical decisions about the design of AI that aren't legible to most people.)
If someone's position is "alignment might prevent total human disempowerment, but it's better for humans to all be disempowered than for some humans to retain power" then I think they should make that case directly. I don't have personally have that much sympathy for that position, don't think it would play well with the public, and don't think it's closely related to the kind of backlash I'm imagining in the OP.
Stepping back, the version of this I can most understand is: some people might really dislike some effects of AI, and might justifiably push back on all research that helps facilitate that including research that reduces risks from AI (since that research makes the development of AI more appealing). But for the most part I think that energy can and should be directed to directly blocking problematic applications of AI, or AI development altogether, rather than measures that would reduce the risk of AI.
Another related concern might be that AI will "by default" have some kind of truth-oriented disposition that is above human meddling and alignment is mostly just a tool to move from that default (empowering AI developers). But in practice I think both that the default disposition isn't so good, and also that AI developers have other crappier ways to change AI behavior (which are Pareto dominated) and so in practice this is pretty similar to the previous point.
Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?
Ok, perhaps that doesn't happen because forming cartels is illegal, or because very high prices might attract new entrants, but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage. So you could have a situation where AI developers don't have huge economic power, but do have huge, unprecedented cultural power (similar today's academia, traditional media, and social media companies, except way more concentrated/powerful).
Compare this situation with a counterfactual one in which instead of depending on huge training runs, AIs were manually programmed and progress depended on slow accumulation of algorithmic insights over many decades, and as result there are thousands of AI developers tinkering with their own designs and not far apart in the capabilities of the AIs that they offer. In this world, it would be much less likely for any given customer to not be able to find a competitive AI that shares (or is willing to support) their political or cultural outlook.
(I also see realistic possibilities in which AI developers do naturally have very high margins, and way more power (of all forms) than other actors in the supply chain. Would be interested in discussing this further offline.)
It seems plausible to me that the values of many subsets of humanity aren't even well defined. For example perhaps sustained moral/philosophical progress requires a sufficiently large and diverse population to be in contact with each other and at roughly equal power levels, and smaller subsets (if isolated or given absolute power over others) become stuck in dead-ends or go insane and never manage to reach moral/philosophical maturity.
So an alignment solution based on something like CEV might just not do anything for smaller groups (assuming it had a reliable of way of detecting such deliberation failures and performing a fail-safe).
Another possibility here is that if there was a technical solution for making an AI pursue humanity's overall values, it might become politically infeasible to use AI for some other purpose.
A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.
I think this would also be illegal if justified by the AI company's preferences rather than customer preferences, and it would at least make them a salient political target for people who disagree. It might be OK if they were competing to attract employees/investors/institutional customers and in practice I think it would be most likely happen as a move by the dominant faction in political/cultural conflict in a broader society, and this would be a consideration raising the importance of AI researchers and potentially capitalists in that conflict.
I agree if you are someone who stands to lose from that conflict then you may be annoyed by some near-term applications of alignment, but I still think (i) alignment is distinct from those applications even if it facilitates them, (ii) if you don't like how AI empowers your political opponents then I strongly think you should push back on AI development itself rather than hoping that no one can control AI.
I strongly agree with the message in this post, but think the title is misleading. When I read it, it seemed to imply that alignment is distinct from near-term alignment concerns, while after having read it, it's specifically about how AI is used in the near-term. A title like "AI Alignment is distinct from how it is used in the near-term" would feel better by me.
I'm concerned about this, because I think the long-term vs near-term safety distinctions are somewhat overrated, and really wish these communities would collaborate more and focus more on the common ground! But the distinction is a common view-point, and what this title pattern matched to.
(Partially inspired by Stephen Casper's post)