Wei Dai


For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals.

The AI could design a system such that any copies made of itself are deleted after a short period of time (or after completing an assigned task) and no copies of copies are made. This should work well enough to ensure that the goals of all of the copies as a whole never vary far from its own goals, at least for the purpose of researching a more permanent alignment solution. It's not 100% risk-free of course, but seems safe enough that an AI facing competitive pressure and other kinds of risks (e.g. detection and shutdown by humans) will probably be willing to do something like it.
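The copy-management rule described above (copies expire after a fixed lifetime or on task completion, and no copies of copies) can be sketched as a toy policy object. This is purely illustrative; all names here are hypothetical, and nothing about it reflects how an actual AI system would enforce such a rule.

```python
import time

class CopyPolicy:
    """Toy sketch of the copy-management rule described above:
    copies are deleted after a fixed lifetime (or once their assigned
    task completes), and a copy may not spawn copies of its own."""

    def __init__(self, max_lifetime_secs, is_copy=False):
        self.max_lifetime_secs = max_lifetime_secs
        self.is_copy = is_copy
        self.created_at = time.time()

    def spawn_copy(self):
        # Rule 2: no copies of copies.
        if self.is_copy:
            raise PermissionError("copies may not make further copies")
        return CopyPolicy(self.max_lifetime_secs, is_copy=True)

    def should_delete(self, task_done):
        # Rule 1: delete after task completion or after the lifetime expires.
        expired = time.time() - self.created_at > self.max_lifetime_secs
        return task_done or expired
```

Under this policy the population of copies is always one generation deep and short-lived, which is what keeps the ensemble's goals from drifting far from the original's.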

In this way, AI coordination might succeed where human coordination appears likely to fail: Preventing humans from destroying themselves by producing superintelligent AIs.

Assuming this were to happen, it hardly seems a stable state of affairs. What do you think happens afterwards?

I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)

Also, given that I think AI risk is very high for human civilization, and that there's no reason to suspect we're not a typical pre-AGI civilization, most of the "acausal economy" might well consist of unaligned AIs (created accidentally by other civilizations), which makes it seemingly even harder to reason about what this "economy" looks like.

We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

What? A major reason we're in the current mess is that we don't know how to do this. For example, we don't seem to know how to build a corporation (or, more broadly, an economy) such that its most powerful leaders don't act like Hollywood villains (e.g., racing for AI in order to make a competitor 'dance'). Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, then handing them over to others, such as Microsoft, with little or no control over how they're used). You yourself wrote:

Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now.

How is this compatible with the quote above?!

My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion is too great, avoiding "AI alignment" and using some other term like "AI x-safety". (Although there's an issue with "existential risk/safety" as well, because "existential risk/safety" covers problems that aren't literally existential, e.g., where humanity survives but its future potential is greatly curtailed. Man, coordination is hard.)

Other relevant paragraphs from the Arbital post:

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.

Some alternative terms for this general field of study, such as ‘control problem’, can sound adversarial—like the rocket is already pointed in a bad direction and you need to wrestle with it. Other terms, like ‘AI safety’, understate the advocated degree to which alignment ought to be an intrinsic part of building advanced agents. E.g., there isn’t a separate theory of “bridge safety” for how to build bridges that don’t fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into “building an advanced AI” and then separately “somehow causing that AI to produce good outcomes”, the problem is “getting good outcomes via building a cognitive agent that brings about those good outcomes”.

Here is some clearer evidence that broader usages of "AI alignment" were common from the beginning:

  1. In this Arbital page dated 2015, Eliezer wrote:

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

(I couldn't find an easy way to view the original 2015 version, but I do have a screenshot, which I can produce upon request, showing a Jan 2017 edit on Arbital that already had this broad definition.)

  2. In this Google doc (linked to from this 2017 post), Paul Christiano wrote:

By “AI alignment” I mean building AI systems which robustly advance human interests.

  3. In the above linked 2017 post, Vladimir Slepnev wrote:

AI Alignment focuses on ways to ensure that future smarter than human intelligence will have goals aligned with the goals of humanity. Many approaches to AI Alignment deserve attention. This includes technical and philosophical topics, as well as strategic research about related social, economic or political issues.

Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think).

Eliezer used "AI alignment" as early as 2016 and ai-alignment.com wasn't registered until 2017. Any other usage of the term that potentially predates Eliezer?

I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)

There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after that discussion, switched to using "Intent Alignment" for this instead.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

Ok, perhaps that doesn't happen because forming cartels is illegal, or because very high prices might attract new entrants, but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage. So you could have a situation where AI developers don't have huge economic power, but do have huge, unprecedented cultural power (similar to today's academia, traditional media, and social media companies, except way more concentrated/powerful).

Compare this situation with a counterfactual one in which, instead of depending on huge training runs, AIs were manually programmed and progress depended on the slow accumulation of algorithmic insights over many decades, so that there are thousands of AI developers tinkering with their own designs, not far apart in the capabilities of the AIs they offer. In that world, it would be much more likely that any given customer could find a competitive AI that shares (or is willing to support) their political or cultural outlook.

(I also see realistic possibilities in which AI developers do naturally have very high margins, and way more power (of all forms) than other actors in the supply chain. Would be interested in discussing this further offline.)

I don’t think it’s really plausible to have a technical situation where AI can be used to pursue “humanity’s overall values” but cannot be used to pursue the values of a subset of humanity.

It seems plausible to me that the values of many subsets of humanity aren't even well defined. For example perhaps sustained moral/philosophical progress requires a sufficiently large and diverse population to be in contact with each other and at roughly equal power levels, and smaller subsets (if isolated or given absolute power over others) become stuck in dead-ends or go insane and never manage to reach moral/philosophical maturity.

So an alignment solution based on something like CEV might just not do anything for smaller groups (assuming it had a reliable way of detecting such deliberation failures and performing a fail-safe).

Another possibility here is that if there was a technical solution for making an AI pursue humanity's overall values, it might become politically infeasible to use AI for some other purpose.

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried about seems likely only if, due to economies of scale, very few (competitive) AIs are built by large corporations, and they're all too conservative and inoffensive for many users' tastes. But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.

It also does not seem totally unreasonable to direct some of that concern towards "AI alignment" (as opposed to only corporate policies or government regulators, as you seem to suggest), defined by "technical problem of building AI systems that are trying to do what their designer wants them to do". A steelman of such a "backlash" could be:

  1. Why work on this kind of alignment as opposed to another form that does not or is less likely to cause concentration of power in a few humans, for example AI that directly tries to satisfy humanity's overall values?
  2. According to some empirical and/or ethical views, such concentration of power could be worse than extinction, so maybe such alignment work is bad even if there is no viable alternative.

Not that I would necessarily agree with such a "backlash". I think I personally would be pretty conflicted (in the scenario where it looks like AI will cause major concentration of power) due to uncertainty about the relevant empirical and ethical views.
