Why is pseudo-alignment "worse" than other ways ML can fail to generalize?

Jul 20, 2020

So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. In that sense, whether it's really a “new” sort of robustness problem matters less than the paper's analysis of that problem. That being said, I do think that at least the focus on mesa-optimization was fairly novel, in that it cashed out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).

I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce.

I agree with that, and I think the sentence you're quoting is meant for a different sort of reader, one with a less clear concept of ML. One way to interpret the passage that might help is that it's just saying that guarantees about global optima don't necessarily translate to local optima or to the actual models you might find in practice.

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.

I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization $\subset$ Objective generalization without capability generalization $\subset$ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.

More detail on the capability vs. objective robustness picture is also available here and here.

David Scott Krueger (formerly: capybaralet), 5y

I disagree with the framing that: "pseudo-alignment is a type of robustness/distributional shift problem". This is literally true based on how it's defined in the paper. But I think in practice, we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).

Rohin Shah

Jul 18, 2020*

According to me (and at least some if not all of the authors of that paper disagree with me), the main point is highlighting the possibility of capabilities generalizing while objectives do not. I agree that this is a failure mode that we knew about before, but it's not one that people were paying much attention to. At the very least, when people said they worked on "robustness", they weren't distinguishing between capability failure vs. objective failure (though of course the line between these is blurry).
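
To make the distinction concrete, here is a toy sketch (not taken from the paper or from this thread). It assumes a hypothetical 1-D corridor in which the coin always sits at the right end during training, so a policy that learned "walk right" is indistinguishable on-distribution from one that learned "go to the coin". All names (`walk_right_policy`, `seek_coin_policy`, `rollout`) and the corridor layout are assumptions made purely for illustration.

```python
def walk_right_policy(position, coin, length):
    """Proxy objective: always head for the right end of the corridor."""
    return 1 if position < length - 1 else 0


def seek_coin_policy(position, coin, length):
    """Intended objective: head for wherever the coin actually is."""
    if position < coin:
        return 1
    if position > coin:
        return -1
    return 0


def rollout(policy, start, coin, length, max_steps=50):
    """Run a policy until it stops moving and return its final position."""
    position = start
    for _ in range(max_steps):
        step = policy(position, coin, length)
        if step == 0:
            break
        position += step
    return position


LENGTH, START = 7, 3
train_coin, test_coin = LENGTH - 1, 0  # coin moves to the left end at test time

for name, policy in [("walk_right", walk_right_policy),
                     ("seek_coin", seek_coin_policy)]:
    train_end = rollout(policy, START, train_coin, LENGTH)
    test_end = rollout(policy, START, test_coin, LENGTH)
    print(name,
          "| train success:", train_end == train_coin,
          "| test success:", test_end == test_coin)
```

Both policies succeed in training. At test time the "walk right" policy still navigates competently to the wall (its capabilities carry over), but it ends up at the wrong place because the objective it learned was only a proxy.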

Charlie Steiner, 5y

Although on the other hand, decade+ old arguments about the instrumental utility of good behavior while dependent on humans have more or less the same format. Seeing good behavior is better evidence of intelligence (capabilities generalizing) than it is of benevolence (goals 'generalizing').

The big difference is that the olde-style argument would be about actual agents being evaluated by humans, while the mesa-optimizers argument is about potential configurations of a reinforcement learner being evaluated by a reward function.

habryka, 5y

(Really minor formatting nitpick, but it's the kind of thing that really trips me up while reading: you forgot a closing parenthesis somewhere in your comment.)

Rohin Shah, 5y

Fixed, thanks.

abramdemski

Apr 13, 2022

So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.

The frame of "generalization failures" naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has "correctly learned" (in the sense of internalizing knowledge for its own use), but still does not perform as intended.

The mesa-optimizers paper defines inner optimizers as performing "search". I think there are some options here and we can define things slightly differently.

In Selection vs Control, I split "optimization" up into two types: "selection" (which includes search, and also weaker forms of selection such as mere sampling bias), and "control" (which implies actively steering the world in a direction, but doesn't always imply search, EG in the case of a thermostat).
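
A toy contrast may help here (a sketch under my own assumptions, not code from the Selection vs Control post). `select_best` optimizes by evaluating explicit candidates and keeping one, while `thermostat_step` steers a temperature toward a setpoint with a fixed reactive rule and never compares alternatives. The function names, the heater dynamics, and the numbers are all hypothetical.

```python
def select_best(candidates, score):
    """Selection: evaluate each candidate explicitly and keep the best one."""
    return max(candidates, key=score)


def thermostat_step(temperature, setpoint, heater_power=1.0, heat_loss=0.5):
    """Control: a fixed rule reacting to the current state; nothing is searched."""
    heater_on = temperature < setpoint
    return temperature + (heater_power if heater_on else 0.0) - heat_loss


# Selection: search over candidate temperatures for the one closest to 20.
best = select_best(range(15, 26), score=lambda t: -abs(t - 20))

# Control: apply the rule repeatedly; the room is steered toward the setpoint.
temperature = 15.0
for _ in range(40):
    temperature = thermostat_step(temperature, setpoint=20.0)

print("selected:", best, "| controlled temperature:", temperature)
```

In this sketch the thermostat ends up within half a degree of the setpoint without ever enumerating options, which is the sense in which control need not imply search.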

In mesa-search vs mesa-control, I applied this distinction to mesa-optimization, arguing that mesa-optimizers which do not use search could still present a danger.

Mesa-controllers are a form of inner optimizer which reliably steers the world in a particular direction. These are distinguished from 'mere' generalization failure because generalization failures do not usually have such an impact. If we define pseudo-alignment in this way, you could say we are defining it by its impact. Clearly, more impactful generalization failures are more concerning. However, you might think it's a little weird to invent entirely new terminology for this case, rather than referring to it as "impactful generalization failures".

Mesa-searchers are a form of inner optimizer characterized by performing internal search. You could say that they're clearly computing something coherent, just not what was desired (which may not be the case for 'mere' generalization failures). These are more clearly a distinct phenomenon, particularly if the intended behavior didn't involve search. (It would seem odd to call them "searching generalization failures" imho.) But the safety concerns are less directly obvious.

It's only when we put these two together that we have something both distinct and of safety concern. Looking for "impactful generalization failures" gives us relatively little to grab onto. But it's particularly plausible that mesa-searchers will also be mesa-controllers, because the machinery for complex planning is present. So, this combination might be particularly worth thinking about.

we'll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so -- somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)

I tend to agree with this line of thinking. IE, it seems intuitive to me that highly robust alignment technology would rely on arguments that don't explicitly mention inner optimization anywhere, because those failure modes are ruled out via the same general arguments which rule out other failure modes. However, it also seems plausible to me that it's useful to think about inner alignment along the way.

You wanted a convincing example. I think The Solomonoff Prior Is Malign could be such an example. Before becoming aware of this argument, it seemed pretty plausible to me that the Solomonoff prior described a kind of rational ideal for induction. This isn't a full "here's a safety argument that would go through if we assumed no-inner-optimizers", but in a parallel universe where we have access to infinite computation, it could be close to that. (EG, someone could argue that the chance of Solomonoff Induction resulting in generalization failures is very low, and then change their mind when they hear the Solomonoff-is-malign argument.)

Also, it seems highly plausible to me that inner alignment is a useful thing to have in mind for "less than highly robust" alignment approaches (approaches which seek to grab the lower-hanging fruit of alignment research, to create systems that are aligned in the worlds where achieving alignment isn't so hard after all). These approaches can, for example, employ heuristics which make it somewhat unlikely that inner optimizers will emerge. I'm not very interested in that type of alignment research, because it seems to me that alignment technology needs to be backed up with rather tight arguments in order to have any realistic chance of working; but it makes sense for some people to think about that sort of thing.

AI ALIGNMENT FORUM