List of resolved confusions about IDA

Wei Dai

AI Alignment is a confusing topic in general, but even compared to other alignment topics, IDA seems especially confusing. Some of it is surely just due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However there's some chance that I'm still confused about some of these issues and just don't realize it. I've included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.)

I will try to maintain this list as a public reference so please provide your own resolved confusions in the comments.

alignment = intent alignment

At some point Paul started using "alignment" refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using "alignment" to describe. He eventually settled upon "intent alignment" as the formal term to describe his narrower problem, but occasionally still uses just "aligned" or "alignment" as shorthand for it. Source

short-term preferences ≠ narrow preferences

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so). Source

preferences = "actual" preferences (e.g., preferences-on-reflection)

When Paul talks about preferences he usually means "actual" preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that's a good way to find someone's "actual" preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source

corrigibility ≠ based on short-term preferences

I had misunderstood Paul to be using "corrigibility to X" as synonymous with "based on X's short-term preferences". Actually "based on X's short-term preferences" is a way to achieve corrigibility to X, because X's short-term preferences likely includes "be corrigible to X" as a preference. "Corrigibility" itself means something like "allows X to modify the agent" or a generalization of this concept. Source

act-based = based on short-term preferences-on-reflection

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based". Source

act-based corrigibility

Evan Hubinger used "act-based corrigibility" to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I'm not sure if he still endorses using the term this way.) Source

learning user preferences for corrigibility isn't enough for corrigible behavior

Because an act-based agent is about "actual" preferences not "current" preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user's "actual" preferences, or if the user's "actual" preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the "actual" preferences include something even better than corrigibility.) Source

distill ≈ RL

Summaries of IDA often describe the "distill" step as using supervised learning, but Paul and others working on IDA today usually have RL in mind for that step. Source

outer alignment problem exists? = yes

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

corrigible to the user? ≈ no

IDA is typically described as being corrigible to the user. But in reality it would be trying to satisfy a combination of preferences coming from the end user, the AI developer/overseer, and even law enforcement or other government agencies. I think this means that "corrigible to the user" is very misleading, because the AI is actually not likely to respect the user's preferences to modify (most aspects of) the AI or to be "in control" of the AI. Sources: this comment and a talk by Paul at an AI safety workshop

strategy stealing ≠ literally stealing strategies

When Paul says "strategy stealing" he doesn't mean observing and copying someone else's strategy. It's a term borrowed from game theory that he's using to refer to coming up with strategies that are as effective as someone else's strategy in terms of gaining resources and other forms of flexible influence. Source

This is a great post! I know there's been lots of conversations here and elsewhere about this topic, often going for dozens of comments, and I felt like a lot of them needed summarising else they'd be lost to history. Thanks for summarising them briefly and linking back to them.

Thanks! Yeah, one of my motivations for this post is that I was losing track of these discussions myself and falling back into confusion that was already cleared up. For example, after reading one of Paul's latest clarifications, I had a strong feeling that he had told me that already on a previous occasion, but I couldn't remember when. Another push came from my discussion with Raymond Arnold (Raemon) about distillation where we talked about how it's weird to summarize a debate/disagreement as one of the participants, and it kind of made me realize that summarizing resolved confusions has less of this problem.

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).

I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

Short-term preferences refer to the most useful action I can take next, given my ultimate goals. This is to be contrasted with my current best guess about the outcome of that process. It's what I would want, not what I do want.

An AI optimising for my short-term preferences may reasonably say "No, don't take this action, because you'd actually prefer this alternative action if you only thought longer. It fits your true short-term preferences, you're just mistaken about them." This is in contrast with something you might call narrow preferences, which is where you tell the AI to do what you said anyway.

You have a section titled

learning user preferences for corrigibility isn't enough for corrigible behavior

Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?

I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow-preferences.

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).

I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

You have a section titled

learning user preferences for corrigibility isn't enough for corrigible behavior

Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?

I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow-preferences.

38

List of resolved confusions about IDA

38

alignment = intent alignment

short-term preferences ≠ narrow preferences

preferences = "actual" preferences (e.g., preferences-on-reflection)

corrigibility ≠ based on short-term preferences

act-based = based on short-term preferences-on-reflection

act-based corrigibility

learning user preferences for corrigibility isn't enough for corrigible behavior

distill ≈ RL

outer alignment problem exists? = yes

corrigible to the user? ≈ no

strategy stealing ≠ literally stealing strategies

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences