Paul Christiano


Some AI research areas and their relevance to existential safety

A number of blogs seem to treat [AI existential safety, AI alignment, and AI safety] as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety.

I strongly agree with the benefits of having separate terms and generally like your definitions.

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable or greater than human extinction in terms of their moral significance.”  

I like "existential AI safety" as a term to distinguish from "AI safety" and agree that it seems to be clearer and have more staying power. (That said, it's a bummer that "AI existential safety forum" is a bit of a mouthful.)

If I read that term without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Some AI research areas and their relevance to existential safety

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that's hard because:

I'm not sure what's most natural, but I do consider this a fairly unlikely way of achieving outcome C.

I think the best argument for this kind of outcome is from Wei Dai, but I don't think it gets you close to the "direct democracy" outcome. (Even if you had state control and AI systems aligned with the state, it seems unlikely and probably undesirable for the state to be replaced with an aggregation procedure implemented by the AI itself.)

Some AI research areas and their relevance to existential safety

It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?

The OP writes "contributions to AI alignment are also generally unhelpful to existential safety." I don't think I'm taking a strong stand in favor of putting all our hopes on plan A, I'm trying to understand the perspective on which plan B is much more important even before considering neglectedness.

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different.

I agree that would be premature. That said, I still found it notable that OP saw such a large gap between the importance of CSC and other areas on and off the list (including MARL). Given that I would have these things in a different order (before having thought deeply), it seemed to illustrate a striking difference in perspective. I'm not really trying to take a strong stand, just using it to illustrate and explore that difference in perspective.

Some AI research areas and their relevance to existential safety

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

It's unclear to me how this particular outcome relates to social alignment (or at least to the kinds of research areas in this post). Some possibilities:

  • Does failure to solve social alignment mean that firms and governments cannot use AI to represent their shareholders and constituents? Why might that be? (E.g. what's a plausible approach to atomic alignment that couldn't be used by a firm or government?)
  • Does AI progress occur unevenly such that some group gets much more power/profit, and then uses that power? If so, how would technical progress on alignment help address that outcome? (Why would the group with power be inclined to use whatever techniques we're imagining?) And why would AI progress be so uneven in the first place?
  • Does AI progress somehow complicate the problem of governance or corporate governance such that those organizations can no longer represent their constituents/shareholders? What is the mechanism (or any mechanism) by which this happens? Does social alignment help by making new forms of organization possible, and if so should I just be thinking of it as a way of improving those institutions, or is it somehow distinctive?
  • Do we already believe that the situation is gravely unequal (e.g. because governments can't effectively represent their constituents and most people don't have a meaningful amount of capital) and AI progress will exacerbate that situation? How does social alignment prevent that?

(This might make more sense as a question for the OP, it just seemed easier to engage with this comment since it describes a particular more concrete possibility. My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se.)

Some AI research areas and their relevance to existential safety

If single/single alignment is solved it feels like there are some salient "default" ways in which we'll end up approaching multi/multi alignment:

  • Existing single/single alignment techniques can also be applied to empower an organization rather than an individual. So we can use existing social technology to form firms and governments and so on, and those organizations will use AI.
  • AI systems can themselves participate in traditional social institutions. So AI systems that represent individual human interests can interact with each other e.g. in markets or democracies.

I totally agree that there are many important problems in the world even if we can align AI. That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

For example, let's take the considerations you discuss under CSC:

Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world.  This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren't excited about.

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor.  In short, an “algorithmic government” will be needed to govern “algorithmic society”.  Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use.  However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Again, it's not clear what you expect to happen when existing institutions are empowered by AI and mostly coordinate the activities of AI.

The last line reads to me like "If we were smarter, then our legal system may no longer be up to the challenge," with which I agree. But it seems like the main remedy is "if we were smarter, we would hopefully work on improving our legal system in tandem with the increasing demands we impose on it."

It feels like the salient actions to take to me are (i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research, (ii) work on improving the relative capability of AI at those tasks that seem more useful for guiding society in a positive direction.

I consider (ii) to be one of the most important kinds of research other than alignment for improving the impact of AI, and I consider (i) to be all-around one of the most important things to do for making the world better. Neither of them feels much like CSC (e.g. I don't think computer scientists are the best people to do them) and it's surprising to me that we end up at such different places (if only in framing and tone) from what seem like similar starting points.

Some AI research areas and their relevance to existential safety

Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely, it will just be X). I may be misunderstanding.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that's the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
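The threshold model in that paragraph can be sketched numerically. This is a toy construction of mine (the growth curves, the boost size, and the threshold Y are all made up for illustration, not taken from the post): a developer deploys at the first moment the sum of reliability and speed reaches Y, and a "free" boost to reliability moves the deployed system toward higher reliability and lower speed.

```python
# Toy model: deploy at the first time t where reliability(t) + speed(t) >= Y.
def deployment_point(reliability, speed, Y, times):
    """Return (t, reliability, speed) at the first time the sum crosses Y."""
    for t in times:
        if reliability(t) + speed(t) >= Y:
            return t, reliability(t), speed(t)
    return None

times = [t / 100 for t in range(0, 2000)]  # development time, in small steps
Y = 10.0
speed = lambda t: t              # speed grows with development time
base_rel = lambda t: 0.5 * t     # reliability also grows, more slowly

t0, r0, s0 = deployment_point(base_rel, speed, Y, times)

# A "free" reliability improvement of +2 from safety/reliability research:
boosted_rel = lambda t: base_rel(t) + 2.0
t1, r1, s1 = deployment_point(boosted_rel, speed, Y, times)

# The boosted system is deployed earlier, with higher reliability and
# lower speed, rather than at the same reliability level X.
assert t1 < t0 and r1 > r0 and s1 < s0
```

The assertion holds for any increasing curves, not just these: since both deployments sit exactly at the threshold, the earlier deployment has lower speed, so its reliability must be correspondingly higher.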

I'm not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks. My sense is that most of the locally-contrarian views in this post are driven by locally-contrarian quantitative estimates of various risks. If that's the case, then it seems like the main thing that would shift my view would be some argument about the relative magnitude of risks. I'm not sure if other readers feel similarly.

Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

This is a plausible view, but I'm not sure what negative consequences you have in mind (or how it affects the value of progress in the field rather than the educational value of hanging out with people in the field).

Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety.  Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time?  This is a special case of an OODR question.

That task---how do we test that this system will consistently have property P, given that we can only test property P at training time?---is basically the goal of OODR research. Your prioritization of OODR suggests that maybe you think that's the "easy part" of the problem (perhaps because testing property P is so much harder), or that OODR doesn't make meaningful progress on that problem (perhaps because the nature of the problem is so different for different properties P?). Whatever it is, it seems like that's at the core of the disagreement and you don't say much about it. I think many people have the opposite intuition, i.e. that much of the expected harm from AI systems comes from behaviors that would have been recognized as problematic at training time.

In any case, I see AI alignment in turn as having two main potential applications to existential safety:

  1. AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
  2. AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

Here is one standard argument for working on alignment. It currently seems plausible that AI systems will be trying to do stuff that no one wants and that this could be very bad if AI systems are much more competent than humans. Prima facie, if the designers of AI systems are able to better control what AI systems are trying to do, then those AI systems are more likely to be trying to do what the developers want. So if we are able to give developers that ability, we can reduce the risk of AI competently doing stuff no one wants.

This isn't really a metaphor, it's a direct path for impact. It's unclear if you think that this argument is mistaken because developers will be able to control what their AI systems are trying to do, because they won't be motivated to deploy AI until they have that control, because it's not much better for AI systems to be trying to do what their developers want, because there are other more important reasons that AI systems could be trying to do stuff that no one wants, because there are other risks unrelated to AI trying to do stuff no one wants, or something else altogether.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Like you, I'm opposed to plans where people try to take over the world in order to make it safer. But this looks like a bit of a leap. For example, AI alignment may help us build powerful AI systems that help us negotiate or draft agreements, which doesn't seem like taking over the world to make it safer.

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

What would a corrigible but not-intent-aligned AI system look like?

Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can't be both corrigible and intent aligned.

Hiring engineers and researchers to help align GPT-3

described by Eliezer as “directly, straight-up relevant to real alignment problems.”

Link to thread.

Worth saying that Eliezer still thinks our team is pretty doomed and this is definitely not a general endorsement of our agenda. I feel excited about our approach and think it may yet work, but I believe Eliezer's position is that we're just shuffling around the most important difficulties into the part of the plan that's vague and speculative.

I think it's fair to say that Reflection is on the Pareto frontier of {plays ball with MIRI-style concerns, does mainstream ML research}. I'm excited for a future where either we convince MIRI that aligning prosaic AI is plausible, or MIRI convinces us that it isn't.

Hiring engineers and researchers to help align GPT-3

will these jobs be long-term remote? if not, on what timeframe will they be remote?

We expect to be requiring people to work from the office again sometime next year.

how suitable is the research engineering job for people with no background in ml, but who are otherwise strong engineers and mathematicians?

ML background is very helpful. Strong engineers who are interested in learning about ML are also welcome to apply, though no promises about how well we'll handle those applications in the current round.

Hiring engineers and researchers to help align GPT-3

The team is currently 7 people and we are hiring 1-2 additional people over the coming months.

I am optimistic that our team and other similar efforts will be hiring more people in the future and continuously scaling up, and that over the long term there could be a lot of people working on these issues.

(The post is definitely written with that in mind and the hope that enthusiasm will translate into more than just hires in the current round. Growth will also depend on how strong the pool of candidates is.)
