I've argued that many methods of AI control - corrigibility, amplification and distillation, low impact, etc... - require a partial definition of human preferences to make sense.

One idea I've heard for low impact is that of reversibility - asking how hard it is to move the situation back to its original state (or how not to close down too many options). The "Conservative Agency" paper uses something akin to that, for example.

On that issue, I'll present the bucket of water thought experiment. A robot has to reach a certain destination; it can do so by wading through a shallow pool. At the front of the pool is a bucket of water. The water in the bucket has a slightly different balance of salts than the water in the pool (maybe due to mixing effects when the water was drawn from the pool).

The fastest way for the robot to reach its destination is to run through the pool, kicking the bucket into it as it goes. Is this a reversible action?

Well, it depends on what humans care about in the water in the bucket. If we care about the rough quantity of water, this action is perfectly reversible: just dip the bucket back into the pool and draw out the right amount of water. If we care about the exact balance of salts in the bucket, this is very difficult to reverse, and restoring it requires a lot of careful work. If we care about the exact molecules in the bucket, this action is completely irreversible.
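To make the dependence concrete, here is a minimal sketch of the three notions of "the same bucket". The `Bucket` class, the tolerances, the salt figures and the molecule labels are all illustrative assumptions, not part of any impact measure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    volume_l: float          # litres of water in the bucket
    salt_g_per_l: float      # salt concentration (made-up numbers)
    molecule_ids: frozenset  # hypothetical labels for individual molecules

def same_quantity(a, b, tol=0.1):
    """We only care about the rough amount of water."""
    return abs(a.volume_l - b.volume_l) <= tol

def same_salts(a, b, tol=0.001):
    """We also care about the exact balance of salts."""
    return same_quantity(a, b) and abs(a.salt_g_per_l - b.salt_g_per_l) <= tol

def same_molecules(a, b):
    """We care about the exact molecules."""
    return a.molecule_ids == b.molecule_ids

# The bucket before the kick, and after being refilled from the pool.
original = Bucket(10.0, 3.50, frozenset(range(0, 5)))
refilled = Bucket(10.0, 3.42, frozenset(range(100, 105)))

for name, equivalent in [("quantity", same_quantity),
                         ("salts", same_salts),
                         ("molecules", same_molecules)]:
    print(name, "restored:", equivalent(original, refilled))
```

The same physical refill counts as a full reversal under the first notion and as a failure under the other two; nothing in the physics picks which notion applies.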

The truth is that, with a tiny set of exceptions, all our actions are irreversible, shutting down many possibilities for ever. But many things are reversible in practice, in that we can return to a state sufficiently similar that we don't care about the difference.

But, in order to establish that, we need some estimate of what we care (and don't care) about. In the example above, things are different if we are considering a) this is a bucket of water, b) this is a bucket of water carefully salt-balanced for an industrial purpose, or c) this is the water from the last bath my adored husband took, before he died.

EDIT: I should point out that "avoid the bucket anyway" is not a valid strategy, since "avoid doing anything that could have a large irreversible impact for some utility function" is equivalent to "don't do anything at all".

The robot has to be capable of kicking the bucket in some circumstances. Precisely in those circumstances where humans don't care about the bucket's contents and it is valuable to kick it. But both of those - "don't care", "valuable" - are human value judgements.

That's why I don't see how there could be any measure of irreversible or low impact that doesn't include some portion of human preferences. It has to distinguish between when kicking the bucket is forbidden, allowable, or mandatory - and the only thing that distinguishes these are human preferences.

I think this is the wrong framing, resting on unnecessary assumptions about how impact must be solved. You're concerned that it's hard to look at the world and infer what it means for people to be "minimally impacted". I agree: that's really hard. I get the concern. We shouldn't do that, and that's not what AUP aims to do.

When I say "we shouldn't do that, and that's not what AUP aims to do", I really, really, really mean it. (IIRC) I've never proposed doing it that way (although to be fair, my initial AUP post didn't repeat this point enough). If e.g. my presentation doesn't communicate the better frame, wait for the "Reframing Impact" sequence!

Your presentation had an example with randomly selected utility functions in a block world, that resulted in the agent taking less-irreversible actions around a specific block.

If we have randomly selected utility functions in the bucket-and-pool world, this may include utilities that care about the salt content or the exact molecules, or not. Depending on whether or not we include these, we run the risk of preserving the bucket when we need not, or kicking it when we should preserve it. This is because the "worth" of the water being in the bucket varies depending on human preferences, not on anything intrinsic to the design of the bucket and the pool.

Given reasonable farsightedness, the agent would still walk around the bucket; kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.

AUP is only defined over states for MDPs because states are the observations. AUP in partially observable environments uses reward functions over sensory inputs. Again, I assert we don’t need to think about molecules or ontologies.

But, as covered in the discussion linked above, worrying about penalizing molecular shifts or not misses the point of impact: the agent doesn’t catastrophically reappropriate our power, so we can still get it to do what we want. (The thread is another place to recalibrate on what I’ve tried to communicate thus far.)

The truth is that, with a tiny set of exceptions, all our actions are irreversible, shutting down many possibilities for ever.

AUP doesn’t care much at all about literal reversibility.

I think the discussion of reversibility and molecules is a distraction from the core of Stuart's objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn't valuable (e.g. whether it has sentimental value to someone).

If AUP is not value-agnostic, it is using human preference information to fill in the "what we want" part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agreement.

If AUP is value-agnostic, it is not using human preference information. Then I don't see how the state representation/ontology invariance property helps to distinguish between the two cases. As discussed in this comment, state representation invariance holds over all representations that are consistent with the true human reward function. Thus, you can distinguish the two cases as long as you are using one of these reward-consistent representations. However, since a value-agnostic impact measure does not have access to the true reward function, you cannot guarantee that the state representation you are using to compute AUP is in the reward-consistent set. Then, you could fail to distinguish between the two cases, giving the same penalty for kicking a more or less valuable bucket.

That's an excellent summary.

I agree that it's not the core, and I think this is a very cogent summary. There's a deeper disagreement about what we need done that I'll lay out in detail in Reframing Impact.

I've added an edit to the post, to show the problem: sometimes, the robot can't kick the bucket, sometimes it must. And only human preferences distinguish these two cases. So, without knowing these preferences, how can it decide?

kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.

In this specific setup, no. But sometimes kicking the bucket is fine; sometimes kicking the metaphorical equivalent of the bucket is necessary. If the AI is never willing to kick the bucket - ie never willing to take actions that might, for certain utility functions, cause huge and irreparable harm - then it's not willing to take any action at all.

I think for most utility functions, kicking over the bucket and then recreating a bucket with identical salt content (but different atoms) gets you back to a similar value to what you were at before. If recreating that salt mixture is expensive vs. cheap, and if attainable utility preservation works exactly as our initial intuitions might suggest (and I'm very unsure about that, but supposing it does work in the intuitive way), then AUP should be more likely to avoid disturbing the expensive salt mixture, and less likely to avoid disturbing the cheap salt mixture. That's because for those utility functions for which the contents of the bucket were instrumentally useful, the value with respect to those utility functions goes down roughly by the cost of recreating the bucket's contents. Also, if a certain salt mixture is less economically useful, there will be fewer utility functions for which kicking over the bucket leads to a loss in value, so if AUP works intuitively, it should also agree with our intuition there.

If it's true that for most utility functions, the particular collection of atoms doesn't matter, then it seems to me like AUP manages to assign a higher penalty to the actions that we would agree are more impactful, all without any information regarding human preferences.
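A toy sketch of that intuition, under strong assumptions: the penalty is taken to be the mean absolute change in attainable value across a fixed set of auxiliary utilities, and kicking the bucket costs exactly the recreation cost for those utilities that use the contents instrumentally. The function names and all numbers are hypothetical, and this is not the actual AUP formulation.

```python
def aup_style_penalty(values_after_action, values_after_noop):
    """Mean absolute change in attainable value across auxiliary utilities."""
    diffs = [abs(a - n) for a, n in zip(values_after_action, values_after_noop)]
    return sum(diffs) / len(diffs)

def kick_penalty(recreate_cost, n_utilities=10, n_using_bucket=4, baseline=100.0):
    """Penalty for kicking the bucket in this toy model.

    Assumption: utilities that use the bucket's contents lose exactly
    `recreate_cost` of attainable value; the rest are unaffected.
    """
    noop = [baseline] * n_utilities
    kick = [baseline - recreate_cost if i < n_using_bucket else baseline
            for i in range(n_utilities)]
    return aup_style_penalty(kick, noop)

cheap = kick_penalty(recreate_cost=1.0)
expensive = kick_penalty(recreate_cost=50.0)
rarely_useful = kick_penalty(recreate_cost=50.0, n_using_bucket=1)

print(cheap, expensive, rarely_useful)
```

In this toy, the expensive mixture draws a larger penalty than the cheap one, and a mixture that few utilities use draws a smaller one. The sketch says nothing about which utilities belong in the auxiliary set, which is the point under dispute in this thread.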

Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn't kick the bucket:

  • Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward. For example, if the agent's goal is to put out a fire on the other end of the pool, you would set a low weight on the impact penalty, which enables the agent to take irreversible actions in order to achieve the goal. This is why impact measures use a reward-penalty tradeoff rather than a constraint on irreversible actions.
  • Absolute value of the bucket contents can be represented by adding weights on the reachable states or attainable utility functions. This doesn't necessarily require defining human preferences or providing human input, since human preferences can be inferred from the starting state. I generally think that impact measures don't have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
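As a rough illustration of the first point, the reward-penalty tradeoff can be sketched as picking the action that maximises reward minus a weighted impact term. The action names, rewards, impact values and weights below are made up for the example.

```python
def choose_action(options, penalty_weight):
    """Pick the action maximising reward - penalty_weight * impact.

    `options` maps an action name to a (reward, impact) pair.
    """
    def score(name):
        reward, impact = options[name]
        return reward - penalty_weight * impact
    return max(options, key=score)

# Hypothetical numbers: kicking the bucket is faster but has more impact.
options = {
    "kick_bucket": (10.0, 4.0),
    "walk_around": (8.0, 0.0),
}

# Ordinary errand: a high weight on the penalty favours the careful route.
print(choose_action(options, penalty_weight=1.0))

# Fire on the far side of the pool: a low weight permits the kick.
print(choose_action(options, penalty_weight=0.1))
```

Because the penalty enters as a weighted term rather than a hard constraint, the same impact value can be overridden when the goal matters enough.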

Proposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features.

This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.

Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward.

Yep, I agree :-)

I generally think that impact measures don't have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.

Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches, require some but not all of human preferences. "some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.

In the case of the industrial process, you could consider the action less reversible because while the difference in the water is small, the difference in what happens after that is larger (i.e. the industrial part working or failing). This means that at some point within the knock-on effects of tipping over the carefully salt-balanced bucket, there needs to be an effect that counts as "significant". However, there must not be an effect that counts as significant in the case where it's a normal swimming pool, and someone will throw the bucket into the pool soon anyway. Let's suppose water with a slightly different salt content will make a nuclear reactor blow up. (And no humans will notice the robot tipping out and refilling the bucket, so the counterfactual on the robot's behavior actually contains an explosion.)

Suppose you shake a box of sand. With almost no information, the best you can do to describe the situation is state the mass of sand, shaking speed and a few other average quantities. With a moderate amount of info, you can track the position and speed of each sand grain; with lots of info, you can track each atom.

There is a sense in which the average mass and velocity of the sand, or even the position of every grain, is a better measure than the md5 hash of the position of atom 12345. It confines the probability distribution for the near future to a small, non-convoluted section of nearby configuration space.

Suppose we have a perfectly Newtonian solar system, containing a few massive objects and many small ones.

We start our description at time 0. If we say how much energy is in the system, then this defines a non-convoluted subset of configuration space. The subset stays just as non-convoluted under time evolution. Thus total energy is a perfect descriptive measure. If we state the position and velocity of the few massive objects, and the averages for any large dust clouds, then we can approximately track our info forward in time, for a while. Liouville's theorem says that configuration space volume is conserved, and ours is. However, our configuration space volume is slowly growing more messy and convoluted. This makes the descriptive measure good but not perfect. If we have several almost disconnected systems, the energy in each one would also be a good descriptive measure. If we store the velocity of a bunch of random dust specks, we have much less predictive capability. The subset of configuration space soon becomes warped and twisted until its convex hull, or epsilon-ball expansion, covers most of the space. This makes the velocities of random dust specks a worse descriptive measure. Suppose we take the md5 hash of every object's position, rounded to the nearest nanometer in some coordinate system and concatenated together. This forms a very bad descriptive measure. After only a second of time evolution, this becomes a subset of configuration space that can only be defined in terms of what it was a second ago, or by a giant arbitrary list.
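A small sketch of one aspect of this contrast, assuming a one-dimensional toy system with made-up masses, positions and velocities: two nearly identical states have nearly identical energy, while the md5 hash of their rounded positions is unrelated. It only compares nearby states and does not simulate time evolution.

```python
import hashlib

def kinetic_energy(masses, velocities):
    """Total kinetic energy, used here as the 'good' descriptive measure."""
    return sum(0.5 * m * v * v for m, v in zip(masses, velocities))

def position_hash(positions_m):
    """md5 of all positions, rounded to the nearest nanometre and concatenated."""
    text = ",".join(str(round(x * 1e9)) for x in positions_m)
    return hashlib.md5(text.encode()).hexdigest()

# Hypothetical system: two larger bodies and one dust speck.
masses = [5.0, 1.0, 0.001]
positions = [0.0, 1.5, 4.0]
velocities = [0.0, 2.0, -3.0]

# Nudge the dust speck by two nanometres and change its speed very slightly.
nudged_positions = [0.0, 1.5, 4.0 + 2e-9]
nudged_velocities = [0.0, 2.0, -3.0 + 1e-9]

energy_gap = abs(kinetic_energy(masses, velocities)
                 - kinetic_energy(masses, nudged_velocities))

print("energy gap:", energy_gap)
print("hash before:", position_hash(positions))
print("hash after: ", position_hash(nudged_positions))
```

Knowing the energy of one state tells you almost everything about the energy of its neighbour; knowing the hash of one state tells you essentially nothing about the hash of the other.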

Suppose the robot has one hour to do its action, and then an hour to reverse it. We measure how well the robot reversed its original action by looking at all good descriptive measures, and seeing the difference in these descriptive measures from what they would have been had the robot done nothing.

We can then call an action reversible if there would exist an action that would reverse it.

Note that the crystal structure of a rock now tells us its crystal structure next year, so can be part of a quite good measure. However the phase of the moon will tell you everything from the energy production of tidal power stations to the migration pattern of moths. If you want to make good (non convoluted) predictions about these things, you can't miss it out. Thus almost all good descriptive measures will contain this important fact.

A reversible action is any action taken in the first hour such that there exists an action that approximately reverses it that the robot could take in the second hour. (The robot need not actually take the reverse action, maybe a human could press a reverse button.)
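A minimal sketch of this check, assuming a tiny deterministic toy world in which the state is a `(bucket_volume, reactor_ok)` pair and the list of good descriptive measures is simply handed to us. The action names, the `step` dynamics and the tolerance are hypothetical; choosing the measures is the hard part and is not addressed here.

```python
def step(state, action):
    """Toy dynamics: one action per 'hour'."""
    volume, reactor_ok = state
    if action == "tip":
        return (0.0, reactor_ok)
    if action == "refill":
        return (10.0, reactor_ok)
    if action == "tip_into_reactor":
        return (0.0, 0.0)  # the explosion cannot be undone by any action here
    return state  # "noop"

def is_reversible(state, action, undo_actions, measures, tol=1e-6):
    """True if some second-hour action brings every measure back to
    what it would have been had the robot done nothing for both hours."""
    baseline = step(step(state, "noop"), "noop")
    after_action = step(state, action)
    for undo in undo_actions:
        final = step(after_action, undo)
        if all(abs(m(final) - m(baseline)) <= tol for m in measures):
            return True
    return False

# Assumed 'good descriptive measures' for this toy world.
measures = [lambda s: s[0],  # water in the bucket
            lambda s: s[1]]  # reactor functioning (1.0) or not (0.0)

start = (10.0, 1.0)
undo_actions = ["noop", "refill"]

print(is_reversible(start, "tip", undo_actions, measures))
print(is_reversible(start, "tip_into_reactor", undo_actions, measures))
```

Tipping the bucket into the pool is reversible here because refilling restores every measure; tipping it into the reactor is not, because no available action restores the reactor.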

The functionality of nuclear power stations and the level of radiation in the atmosphere are also contained in many good descriptive measures. Hence the robot should tip the bucket if it won't blow up a reactor.

(This is a rough sketch of the algorithm with missing details, but it does seem to have broken the problem down into non value laden parts. I would be unsurprised to find out that there is something in the space of techniques pointed to that works, but also unsurprised to find that none do.)

In the later part of the post, it seems you're basically talking about entropy and similar concepts? And I agree that "reversible" is kinda like entropy, in that we want to be able to return to a "macrostate" that is considered indistinguishable from the starting macrostate (even if the details are different).

However, as in the bucket example above, the problem is that, for humans, what "counts" as the same macrostate can vary a lot. If we need a liquid, any liquid, then replacing the bucket's contents with purple-tinted alcohol is fine; if we're thinking of the bath water of the dear departed husband, then any change to the contents is irreversible. Human concepts of "acceptably similar" don't match up with entropic ones.

there needs to be an effect that counts as "significant".

Are you deferring this to human judgement of significant? If so, we agree - human judgement needs to be included in some way in the definition.

No, what I am saying is that humans judge things to be more different when the difference will have important real world consequences in the future. Consider two cases, one where the water will be tipped into the pool later, and the other where the water will be tipped into a nuclear reactor, which will explode if the salt isn't quite right.

There need not be any difference in the bucket or water whatsoever. While the current bucket states look the same, there is a noticeable macrostate difference between the nuclear reactor exploding and not exploding, in a way that there isn't a macrostate difference between marginally different eddy currents in the pool. I was specifying a weird information-theoretic definition of significance that made this work, but just saying that the more energy is involved, the more significant the effect, works too. Nowhere are we referring to human judgement; we are referring to hypothetical future consequences.

Actually, the rule "your action and its reversal should not make a difference worth tracking in the robot's world model" would work OK here (assuming a sensible value of information). The rule that it shouldn't knowably affect large amounts of energy is good too. So for example it can shuffle an already well-shuffled pack of cards, even if the order of those cards will have some huge effect. It can act freely without worrying about weather chaos effects, the chance of it causing a hurricane being counterbalanced by the chance of stopping one. But if it figures out how to twitch its elbow in just the right way to cause a hurricane, it can't do that. This robot won't tip the nuclear bucket, for much the same reason. It also can't make a nanobot that would grey goo earth, or hack into nukes to explode them. All these actions affect a large amount of energy in a predictable direction.