[ Question ]

Best reasons for pessimism about impact of impact measures?

by TurnTrout, 10th Apr 2019, 32 comments


Habryka recently wrote (emphasis mine):

My inside views on AI Alignment make me think that work on impact measures is very unlikely to result in much concrete progress on what I perceive to be core AI Alignment problems, and I have talked to a variety of other researchers in the field who share that assessment. I think it’s important that this grant not be viewed as an endorsement of the concrete research direction that Alex is pursuing, but only as an endorsement of the higher-level process that he has been using while doing that research.

As such, I think it was a necessary component of this grant that I have talked to other people in AI Alignment whose judgment I trust, who do seem excited about Alex’s work on impact measures. I think I would not have recommended this grant, or at least this large of a grant amount, without their endorsement. I think in that case I would have been worried about a risk of diverting attention from what I think are more promising approaches to AI Alignment, and a potential dilution of the field by introducing a set of (to me) somewhat dubious philosophical assumptions.

I'm interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I'm not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

There are two reasons you could believe that "work on impact measures is very unlikely to result in much concrete progress on… core AI Alignment problems". First, you might think that the impact measurement problem is intractable, so work is unlikely to make progress. Second, you might think that even a full solution wouldn't be very useful.

Over the course of 5 minutes by the clock, here are the reasons I generated for pessimism (each of which I either presently agree with or at least think an intelligent critic could reasonably raise on the basis of currently public reasoning):

  • Declarative knowledge of a solution to impact measurement probably wouldn't help us do value alignment, figure out embedded agency, etc.
  • We want to figure out how to transition to a high-value stable future, and it just isn't clear how impact measures help with that.
  • Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead.
    • Computational overhead.
    • Implementation time.
    • Training time, assuming they start with low aggressiveness and dial it up slowly.
  • Depending on how "clean" of an impact measure you think we can get, maybe it's way harder to get low-impact agents to do useful things.
    • Maybe we can get a clean one, but only for powerful agents.
    • Maybe the impact measure misses impactful actions if you can't predict at near human level.
  • In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.
  • In a "More realistic tales of doom"-esque scenario, it isn't clear how impact helps prevent "gradually drifting off the rails".

Paul raised concerns along these lines:

We'd like to build AI systems that help us resolve the tricky situation that we're in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default "terrible" scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it's not going to be because we wanted the AI to have a small impact and it had a large impact. It's probably going to be because we have a very limited idea what is going on, but we don't feel like we have the breathing room to step back and chill out (at least not for long) because we don't believe that everyone else is going to give us time.

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we'll fail at alignment with a bang---it's more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn't let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like "low objective impact" is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that's a coherent situation to think about and plan for, but we shouldn't mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it's the best hope if you were very pessimistic about what I consider "mainline" alignment.)



6 Answers

When I think about solutions to AI alignment, I often think about 'meaningful reductionism.' That is, if I can factor a problem into two parts, and the parts don't actually rely on each other, now I have two smaller problems to solve. But if the parts are reliant on each other, I haven't really simplified anything yet.

While impact measures feel promising to me as a cognitive strategy (often my internal representation of politeness feels like 'minimizing negative impact', like walking on sidewalks in a way that doesn't startle birds), they don't feel promising to me as reductionism. That is, if I already had a solution to the alignment problem, then impact measures would likely be part of how I implement that solution, but solving it separately from alignment doesn't feel like it gets me any closer to solving alignment.

[The argument here I like most rests on the difference between costs and side effects; we don't want to minimize side effects because that leads to minimizing good side effects also, and it's hard to specify the difference between 'side effects' and 'causally downstream effects,' and so on. But if we just tell the AI "score highly on a goal measure while scoring low on this cost measure," this only works if we specified the goal and the cost correctly.]

But there's a different approach to AI alignment, which is something more like 'correct formalisms.' We talk sometimes about handing a utility function to the robot, or (in old science fiction) providing it with rules to follow, or so on, and by seeing what it actually looks like when we follow that formalism we can figure out how well that formalism fits to what we're interested in. Utility functions on sensory inputs don't seem alignable because of various defects (like wireheading), and so it seems like the right formalism needs to have some other features (it might still be a utility function, but it needs to be a utility function over mental representations of external reality in such a way that the mental representation tracks external reality even when you have freedom to alter your mental representation, in a way that we can't turn into code yet).

So when I ask myself questions like "why am I optimistic about researching impact measures now?" I get answers like "because exploring the possibility space will make clear exactly how the issues link up." For example, looking at things like relative reachability made it clear to me how value-laden the ontology needs to be in order for a statistical measure on states to be meaningful. This provides a different form-factor for 'transferring values to the AI'; instead of trying to ask something like "is scenario A or B better?" and train a utility function, I might instead try to ask something like "how different are scenarios A and B?" or "how are scenarios A and B different?" and train an ontology, with the hopes that this makes other alignment problems easier because the types line up somewhat more closely.

[I think even that last example still performs poorly on the 'meaningful reductionism' angle, since getting more options for types to use in value loading doesn't seem like it addresses the core obstacles of value loading, but provides some evidence of how it could be useful or clarify thinking.]

  • Giving people a slider with "safety" written on one end and "capability" written on the other, and then trying to get people to set it close enough to the "safety" end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
    • An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.
    • Even better (but similarly), an approach where capability and alignment go hand in hand would be ideal -- a way to directly optimize for "what I mean, not what I say", such that it is obvious that things are just worse if you depart from this.
    • However, maybe those things are just pipe dreams -- this should not be the fundamental reason to ignore impact measures, unless promising approaches in the other two categories are pointed out; and even then, impact measures as a backup plan would still seem desirable.
      • My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they seem better in terms of the objection which follows.
      • Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure.)
        • In other words, conditioned on having a story more like "this is how you get the most of what you want" rather than a slider reading "safety ------- capability", I more expect to see a mild optimizer as opposed to an impact measure.
  • Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn't exactly what we want.
    • It is apparent that some attempted impact measures run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem when a lot of optimization pressure is applied. This gives reason for concern even if you can't spot a concrete problem with a given impact measure: impact measures don't address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
    • If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
      • (It is a desirable feature of safety measures, that those safety measures do not risk decreasing alignment.)
    • On the other hand, a mild optimizer seems to get the spirit of what's wanted from low-impact.
      • This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
  • Both mild optimization and impact measures require separate approaches to "doing what people want".
    • Arguably this is OK, because they could greatly reduce the bar for alignment of specified utility functions. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of "do what I want" before we can be confident even in low-impact or mild-optimization approaches, because it is difficult to confidently say that an approach avoids risk of hugely violating your preferences while still being so confused about what human preference even is.
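
For readers who haven't seen quantilization, here is a minimal sketch of the kind of mild optimizer mentioned above. It assumes a finite list of actions treated as equally weighted samples from some trusted base distribution, and a proxy utility that is wrong about exactly one action. The names `quantilize`, `proxy_utility`, `true_utility` and the toy numbers are all hypothetical, chosen only for illustration.

```python
import random

def quantilize(actions, proxy_utility, q, rng):
    """Pick uniformly at random from the top q fraction of actions,
    ranked by proxy utility (actions are assumed to be equally
    weighted samples from a base distribution)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top-q slice
    return rng.choice(ranked[:k])

# Hypothetical toy setup: 100 actions, proxy utility is just the index.
ACTIONS = list(range(100))

def proxy_utility(a):
    return a

def true_utility(a):
    # Assumption for illustration: the proxy's favourite action is
    # the one place where proxy and true utility come apart.
    return -100 if a == 99 else a

rng = random.Random(0)
maximizer_choice = max(ACTIONS, key=proxy_utility)           # full optimization pressure
mild_choice = quantilize(ACTIONS, proxy_utility, 0.1, rng)   # top 10% only
```

The maximizer always lands on the one misspecified action, while the 0.1-quantilizer picks it only one time in ten. This is not a claim that quantilization is safe in general: it only illustrates the sense in which it limits how much optimization pressure reaches the proxy.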

My concern is similar to Wei Dai's: it seems to me that at a fundamental physical level, any plan involving turning on a computer that does important stuff will make pretty big changes to the world's trajectory in phase space. Heat dissipation will cause atmospheric particles to change their location and momentum, future weather patterns will be different, people will do things at different times (e.g. because they're waiting for a computer program to run, or because the computer is designed to change the flow of traffic through a city), meet different people, and have different children. As a result, it seems hard for me to understand how impact measures could work in the real world without a choice of representation very close to the representation humans use to determine the value of different worlds. I suspect that this will need input from humans similar to what value learning approaches might need, and that once it's done one could just do value learning and dispense with the need for impact measures. That being said, this is more of an impression than a belief - I can't quite convince myself that no good method of impact regularisation exists, and some other competent people seem to disagree with me.

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture. AUP seems to try to get around this by aiming at a lower bar than "avoid negative side effects", namely "avoid catastrophic side effects". Aside from whether it actually succeeds at clearing this lower bar, this would mean that an AI that is only "safe" because of AUP can't be safely used for ordinary goals (e.g., inventing a better widget, or making someone personally more successful in life). Instead, we would have to somehow restrict such AIs to goals that relate to x-risk reduction, where it's worthwhile to risk incurring less-than-catastrophic negative side effects.

As a side note, it seems generally the case that some approaches to AI safety/alignment aim at the higher bar of "safe for general use" and others aim at "safe enough to use for x-risk reduction", and this isn't always made clear, which can be a source of confusion for both AI safety/alignment researchers and others such as strategists and policy makers.

Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:

Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

This may be mitigated by inability to perceive the irrelevant factors, which results in a more coarse-grained state representation: if the agent can't see air molecules, all the states with different air molecule positions will look the same, as they do to humans. Some human-relevant factors can also be difficult to perceive, e.g. the presence of poisonous gas in the room, so we may not want to limit the agent's perception ability to human level. Automatically filtering out irrelevant factors does seem difficult, and I think this might imply that it is impossible to design an impact measure that is both useful and truly value-agnostic.
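
To make the perception point concrete, here is a toy sketch under heavy assumptions: states are flat dictionaries of features, and "impact" is just the number of perceived features that changed. Neither is how any published impact measure works. The function `impact` and the feature names are hypothetical.

```python
def impact(before, after, perceived):
    """Crude, hypothetical impact proxy: count the perceived
    features whose value differs between two states."""
    return sum(before[k] != after[k] for k in perceived)

before = {"vase": "intact", "gas": "none",
          "air_1": 0.13, "air_2": 0.58, "air_3": 0.91}

# Turning on a fan only moves air molecules around.
fan_on = dict(before, air_1=0.40, air_2=0.22, air_3=0.75)
# A gas leak is human-relevant but hard for a human to perceive.
leak = dict(before, gas="toxic")

ALL_FEATURES = tuple(before)         # agent sees everything
HUMAN_LEVEL = ("vase",)              # agent sees only what a human would
HUMAN_PLUS_GAS = ("vase", "gas")     # hand-picked, i.e. value-laden

fine_grained = (impact(before, fan_on, ALL_FEATURES),
                impact(before, leak, ALL_FEATURES))
coarse = (impact(before, fan_on, HUMAN_LEVEL),
          impact(before, leak, HUMAN_LEVEL))
hand_picked = (impact(before, fan_on, HUMAN_PLUS_GAS),
               impact(before, leak, HUMAN_PLUS_GAS))
```

With every feature visible, the fan counts as more impactful than the leak. With human-level perception, both count as zero, so the leak is missed. Only the hand-picked feature set ranks them the way a human would, and choosing that set is exactly where the value-ladenness comes in.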

However, the value-agnostic criterion does not seem very important in itself. I think the relevant criterion is that designing impact measures should be easier than the general value learning problem. We already have a non-value-agnostic impact measure that plausibly satisfies this criterion: RLSP learns what is effectively an impact measure (the human theta parameter) using zero human input just by examining the starting state. This could also potentially be achieved by choosing an attainable utility set that rewards a broad enough sample of things humans care about, and leaves the rest to generalization. Choosing a good attainable utility set may not be easy but it seems unlikely to be as hard as the general value learning problem.

Butterfly effects. Every action is likely to have large effects that are difficult to predict, e.g. taking a different route to work may result in different people being born. Taken literally, this means that there is no such thing as a low-impact action. Humans get around this by only counting easily predictable effects as impact that they are considered responsible for. If we follow a similar strategy of not penalizing butterfly effects, we might incentivize the agent to deliberately cause butterfly effects. The easiest way around this that I can currently see is restricting the agent's capability to model the effects of its actions, though this has obvious usefulness costs as well.

Chaotic world. Every action, including inaction, is irreversible, and each branch contains different states. While preserving reversibility is impossible in this world, preserving optionality (attainable utility, reachability, etc) seems possible. For example, if the attainable set contains a function that rewards the presence of vases, the action of breaking a vase will make this reward function more difficult to satisfy (even if the states with/without vases are different in every branch). If we solve the problem of designing/learning a good utility set that is not dominated by irrelevant factors, I expect chaotic effects will not be an issue.
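
The vase example can be sketched as a small attainable-utility-style calculation. This is a simplified analogue, not the AUP algorithm itself: it assumes a deterministic three-action toy world, a short finite horizon, and a penalty equal to the mean absolute change in best attainable auxiliary reward relative to a no-op. All of the names (`step`, `attainable`, `penalty`, `AUX_REWARDS`) are hypothetical.

```python
ACTIONS = ("noop", "move", "break_vase")

def step(state, action):
    """Deterministic toy dynamics; state = (position, vase_intact)."""
    pos, vase = state
    if action == "break_vase":
        return (pos, False)          # irreversible: no action restores the vase
    if action == "move":
        return (min(pos + 1, 2), vase)
    return state

def attainable(state, reward, horizon):
    """Best total reward obtainable from `state` over `horizon` further steps."""
    if horizon == 0:
        return reward(state)
    return reward(state) + max(
        attainable(step(state, a), reward, horizon - 1) for a in ACTIONS
    )

def vase_reward(state):
    return 1.0 if state[1] else 0.0

def goal_reward(state):
    return 1.0 if state[0] == 2 else 0.0

AUX_REWARDS = (vase_reward, goal_reward)   # the hand-chosen "attainable set"

def penalty(state, action, aux_rewards=AUX_REWARDS, horizon=3):
    """Mean absolute change in attainable auxiliary reward vs. doing nothing."""
    after_action = step(state, action)
    after_noop = step(state, "noop")
    diffs = [
        abs(attainable(after_action, r, horizon) - attainable(after_noop, r, horizon))
        for r in aux_rewards
    ]
    return sum(diffs) / len(diffs)

start = (0, True)
penalties = {a: penalty(start, a) for a in ACTIONS}
```

In this toy world, breaking the vase gets the largest penalty because it permanently removes the ability to earn the vase reward, whereas moving changes attainable reward only slightly. How informative the penalty is depends entirely on which functions go into `AUX_REWARDS`, which is the point made above about designing a good utility set.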

If any of the above-mentioned concerns are not overcome, impact measures will fail to distinguish between what humans would consider low-impact and high-impact. Thus, penalizing high-impact actions would come with penalizing low-impact actions as well, which would result in a strong safety-capability tradeoff. I think the most informative direction of research to figure out whether these concerns are a deal-breaker is to scale up impact measures to apply beyond gridworlds, e.g. to Atari games.

Here's a relevant passage by Rohin (from Alignment Newsletter #49, March 2019):

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.