Goal completion: the rocket equations

Stuart_Armstrong

A putative new idea for AI control; index here.

I'm calling "goal completion" the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.

The linear rocket

On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by -3, -2, -1, 0, 1, 2, or 3, with negative acceleration meaning accelerating in the left direction, and positive in the right direction. All accelerations are applied immediately at the end of the turn (the unit of acceleration is in squares per turn per turn), and there is no friction. There in one end-state: reaching the space station with zero velocity.

The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.

What is the true reward function for the model? Well, it turns out that an acceleration of -3 or 3 kills the passenger. This is encoded by adding another variable to the state, "PA", denoting "Passenger Alive". There are also some dice in the rocket's windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable "DA" denotes "dice attached".

Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.

Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,

((POS, VEL, PA, DA), ACC) -> (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).

The true reward at each step is given by -1, -10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), -1000 if PA was 1 (the passenger was alive the previous turn) and changed to PA=0 (the passenger is now dead).

To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by -1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.

Goal completion

So, the AI has been given the full transition, and has been told the reward of R=-1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?

Also, one key component of the reward - the discomfort of the passenger for accelerations of -2 and 2 - is not encoded in the state space of the model, purely in the (unknown) reward function. Can the AI deduce this fact?

I'll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who's already done this before - that would make it so much quicker).

For the moment we're ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific "don't do this" trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories - namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don't like (most likely the killing of a human). But, typically, the negative trajectories won't be maximally bad (such as shooting off at maximum speed in the wrong direction), so we'll have to find a way to encode what we hope the AI learns from a negative trajectory.

To work!

Appendix: Proof of ideal trajectories

Let n be the largest integer such that n^2 ≤ S. Since S≤(n+1)^2 - 1 by assumption, S-n^2 ≤(n+1)^2 -1-n^2=2n. Then let the rocket accelerate by +1 for n turns, then decelerate by -1 for n turns. It will travel a distance of 0+1+2+ ... +n-1+n+n-1+ ... +3+2+1. This sum is n plus twice the sum from 1 to n-1, ie n+n(n-1)=n^2.

By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n. By doing this twice, it can add any m' to the distance, where 0≤m'≤2n. By the assumption, S=n^2+m' for such an m'. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2 ≤ S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.

Since the rocket is accelerating all but two turns of this trajectory, it's clear that it's impossible to reach S (with zero velocity) in less time than this, with accelerations of +1 and -1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence of this is that the number of turns taken to reach S, is increasing in the value of S (though not strictly increasing).

Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns at most, for "reward" above -1000. An acceleration of +3 or -3 costs -1000 because of the death of the human, and an extra -1 because of the turn taken, so these accelerations are never optimal. Note that this result is not sharp. Also note that for huge S, continual accelerations of 3 and -3 are obviously the correct solution - so even our "true reward function" didn't fully encode what we really wanted.

Now we need to show that accelerations of +2 and -2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each -2 with two -1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance that one acceleration of +2). Since the number of turns take to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: -10-1=-11) with two accelerations of +1 (total cost: -1-1=-2) also does. Therefore, the new trajectory has no ±2 accelerations, and has a lower cost, contradicting our initial assumption.

[-]paulfchristiano10y20

This doesn't seem meaningfully different from the normal IRL setup. See here for a baseline algorithm in small state spaces, and here for another algorithm that can be generalized to use function approximators.

The main novelty here is giving the agent a wrong preliminary reward function, which it might be able to use as a "hint" about the real goal. But it seems more natural is to use the human's utterances as evidence, and either learn a model of the relationship between utterances and goals, or work in a setting where we can model the user as modeling the robot and making goal-directed utterances (which provide evidence about their goals in the same way as a trajectory would).

[-]Stuart_Armstrong10y00

But it seems more natural is to use the human’s utterances as evidence, and either learn a model of the relationship between utterances and goals, or work in a setting where we can model the user as modeling the robot and making goal-directed utterances (which provide evidence about their goals in the same way as a trajectory would).

At extreme computing power, that would be the right approach (if we've managed to ground evidence of goals in the right way). But for lesser agents, I want to see if we can learn anything by doing it this way (and thanks for those links, but I'd already encountered them ^_^).

[-]paulfchristiano10y00

So what is the actual model here? We have IRL algorithms that in principle solve the proposed problem. The given hint doesn't make the problem much easier information-theoretically, and if we are able to make collaborative IRL with language work then it doesn't make the problem information-theoretically easier at all. Your concern doesn't seem to be that IRL won't work when the AI becomes powerful, but that it won't work when AI is weak.

If you want to study a problem like this, it seems like you have to engage with the algorithms which you are claiming won't work in practice, and then show that some modification improves their practical performance.

(Also, it seems weird to write a post about an elaboration of IRL, whose utility is founded on an empirical claim about the behavior of IRL algorithms, without mentioning it at all.

ETA: nevermind, didn't see the previous post where this is discussed.)