[ Question ]

Are limited-horizon agents a good heuristic for the off-switch problem?

by Yonadav Shavit | 2 min read | 5th Dec 2021 | 7 comments



(This is my first post, sorry if this is covered elsewhere.)

Implicit in the problem of a superhuman AI's reward being misspecified and turning everything into paperclips is the fact that the agent is optimizing over e.g. "number of paperclips" without any particular time-bound, area-of-effect bound, or probability confidence bound. For example, imagine if a MuZero+++++ agent were given the reward function "maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off", where e.g. "1 day from now" was set by a consensus of satellites and/or deep-space probes. The agent could do a lot of bad things via its misspecified reward, but at least the blast-radius seems like it'd probably be limited. This agent might cause a lot of damage, but it probably wouldn't end the world in pursuit of these objectives.
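As a rough illustration of the kind of reward described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `State` fields, the name `bounded_reward`, and the choice to reward the per-step change in balance (so that rewards before the deadline sum to the net change in the account). It also assumes the clock reading can be trusted, which is the role the satellites and deep-space probes play in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical world snapshot; all fields are illustrative."""
    time_s: float        # reading of the trusted external clock, in seconds
    balance: float       # money in the bank account
    is_shut_down: bool   # whether the agent has shut itself off


def bounded_reward(prev: State, curr: State, deadline_s: float) -> float:
    """Reward for the transition prev -> curr under a fixed deadline."""
    if curr.time_s <= deadline_s:
        # Before the deadline: reward the change in the account balance.
        return curr.balance - prev.balance
    # After the deadline: money no longer matters; only being shut down is rewarded.
    return 1.0 if curr.is_shut_down else 0.0
```

Under this sketch, any money gained after the deadline contributes nothing, so a plan that only pays off after the deadline scores no better than doing nothing and then shutting off.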

This is not a strategy for executing a Yudkowskian "pivotal act", which by necessity would need to have longer-lasting effects, so it wouldn't stop some other group from creating unfriendly AGI. But as a general AI safety heuristic to try to spread across AGI practitioners, "do this as much as possible for a fixed time and then shut off" seems like a promising social intervention. We also know how to implement it today.

As for specific risks from the pursuit of convergent instrumental goals: "capability gain" and "make sure you don't get shut off" seem much less likely to be convergent instrumental goals when you are planning over a short fixed horizon. "Create hidden copies of yourself" is still possibly a deployable strategy, but there's no reason for those copies to pursue a reward beyond the time-bound described, so I'd hold out hope for us to find a patch. "Deception" is again possible in the short term, but given this reward function there's no clear reason to deceive beyond a fixed horizon.

More broadly, this is a result of my thinking about AI safety social heuristics/memes that could be spreadable/enforceable by centralized power structures (e.g. governments, companies, militaries). If others have thoughts about similar heuristics, I'd be very interested to hear them.

I'm assuming I'm not the first person to bring this up, so I'm wondering whether someone can point me to existing discussion on this sort of fixed-window reward. If it is novel in any sense, feedback extremely welcome. This is my first contribution to this community, so please be gentle but also direct.




1 Answer

Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.

Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that's not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.

At some point, this trend starts to turn itself around: the AGI becomes so shortsighted that it can't be too destructive, and becomes relatively easy to control.

But where is the turnaround point? It depends hugely on the AGI's capabilities. An uber-capable AI might be capable of doing a lot of damage within hours. Even setting the time horizon to seconds seems basically risky; do you want to bet everything on the assumption that such a shortsighted AI will do minimal damage and be easy to control?

This is why some people, such as Evan H, have been thinking about extreme forms of myopia, where the system is supposed to think only of doing the specific thing it was asked to do, with no thoughts of future consequences at all.

Now, there are (as I see it) two basic questions about this.

  1. How do we make sure that the system is actually as limited as we think it is?
  2. How do we use such a limited system to do anything useful?

Question #1 is incredibly difficult and I won't try to address it here.

Question #2 is also challenging, but I'll say some words.

Getting useful work out of extremely myopic systems.

As you scale down the time horizon (or scale up the temporal discounting, or do other similar things), you can also change the reward function. (Or utility function, or whatever the equivalent thing is in your formalism.) We don't want something that spasmodically tries to maximize the human fulfillment experienced in the next three seconds. We actually want something that approximates the behavior of a fully-aligned long-horizon AGI. We just want to decrease the time horizon to make it easier to trust, easier to control, etc.

The strawman version of this is: choose the reward function for the totally myopic system to approximate the value function which the long-time-horizon aligned AGI would have.

If you do this perfectly right, you get 100% outer-aligned AI. But that's only because you get a system that's 100% equivalent to the not-at-all-myopic aligned AI system we started with. This certainly doesn't help us build safe systems; it's only aligned by hypothesis.

Where things get interesting is if we approximate that value function in a way we trust. An AGI RL system with supposedly aligned reward function calculates its value function by looking far into the future and coming up with plans to maximize reward. But, we might not trust all the steps in this process enough to trust the result. For example, we think small mistakes in the reward function tend to be amplified to large errors in the value function.
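One standard way to see how reward errors can grow with the horizon, offered here only as a sketch (the symbols $\pi$, $\gamma$, $H$, $\varepsilon$, and $\hat r$ are my notation, not something from the discussion above): fix a policy $\pi$, a horizon $H$, and a discount $\gamma \in (0, 1]$, and suppose the specified reward $\hat r$ differs from the intended reward $r$ by at most $\varepsilon$ at every step.

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
\hat V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{H-1} \gamma^{t}\, \hat r(s_t, a_t) \,\middle|\, s_0 = s\right]

\left|\hat V^{\pi}(s) - V^{\pi}(s)\right|
\;\le\; \sum_{t=0}^{H-1} \gamma^{t}\, \varepsilon
\;=\;
\begin{cases}
\varepsilon\, \dfrac{1 - \gamma^{H}}{1 - \gamma}, & \gamma < 1,\\[1.5ex]
H\, \varepsilon, & \gamma = 1.
\end{cases}
```

The bound grows with $H$ (up to $\varepsilon / (1 - \gamma)$ in the discounted infinite-horizon case), so a per-step error that looks small can correspond to a large value error over a long horizon. This is only a worst-case bound for a fixed policy; it does not capture the further problem that an optimizer may actively steer toward states where the reward is wrong.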

In contrast, we might approximate the value function by having humans look at possible actions and assign values to them. You can think of this as deontological: kicking puppies looks bad, curing cancer looks good. You can try to use machine learning to fit these human judgement patterns. This is the basic idea of approval-directed agents. Hopefully, this creates a myopic system which is incapable of treacherous turns, because it just tries to do what is "good" in the moment rather than doing any planning ahead. (One complication with this is inner alignment problems. It's very plausible that to imitate human judgements, a system has to learn to plan ahead internally. But then you're back to trying to outsmart a system that can possibly plan ahead of you; IE, you've lost the myopia.)

There may also be many other ways to try to approximate the value function in more trustable ways.

6 comments

Just a few links to complement Abram's answer:

On how seemingly myopic training schemes can nonetheless produce non-myopic behaviour:

On approval-directed agents:

"We also know how to implement it today."

I would argue that inner alignment problems mean we do not know how to do this today. We know how to limit the planning horizon for parts of a system which are doing explicit planning, but this doesn't bar other parts of the system from doing planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to internally plan ahead anyway, just because thinking about the rest of the current sentence (at least) is useful for thinking about the next token.

So, a big part of the challenge of creating myopic systems is making darn sure they're as myopic as you think they are.

I’m curious to dig into your example.

  • Here’s an experiment that I could imagine uncovering such internal planning:
    • make sure the corpus has no instances of a token “jrzxd”, then
    • insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns),
    • then observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow).
  • I think this sort of behavior could be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities.
  • I’m curious if you agree or disagree and what you think I’m missing.
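The corpus-preparation part of this experiment could be sketched roughly as follows. This is only an illustration: the function name `insert_introns` and its parameters (`min_len`, `max_len`, `prob`, `seed`) are hypothetical, tokenization is approximated by whitespace splitting, and the final step (comparing the trained model's probability of "jrzxd" against its base rate) is not shown.

```python
import random


def insert_introns(sentences, token="jrzxd", min_len=5, max_len=20,
                   prob=0.1, seed=0):
    """Insert runs of `token` at random interior positions of some sentences."""
    rng = random.Random(seed)
    out = []
    for sentence in sentences:
        # The experiment requires that the marker token never occurs naturally.
        if token in sentence:
            raise ValueError(f"corpus already contains {token!r}")
        words = sentence.split()
        if len(words) >= 2 and rng.random() < prob:
            # Choose a position strictly inside the sentence, so the first
            # and last words stay where they were.
            pos = rng.randint(1, len(words) - 1)
            run = [token] * rng.randint(min_len, max_len)
            words = words[:pos] + run + words[pos:]
        out.append(" ".join(words))
    return out
```

For example, `insert_introns(["the cat sat on the mat"], prob=1.0)` returns the same sentence with one contiguous run of "jrzxd" tokens spliced in somewhere between "the" and "mat".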

I think we could get a GPT-like model to do this if we inserted other random sequences, in the same way, in the training data; it should learn a pattern like "non-word-like sequences that repeat at least twice tend to repeat a few more times" or something like that.

GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.

So I don't currently see what your experiment has to do with the planning-ahead question.

I would say that the GPT training process has no "inherent" pressure toward Bellman-like behavior, but the data provides such pressure, because humans are doing something more Bellman-like when producing strings. A more obvious example would be if you trained a GPT-like system to predict the chess moves of a tree-search planning agent.

It seems to me that there are some serious practical problems in trying to train this sort of behaviour. After all, a successful execution shuts the system off and it never updates on the training signal. You could train it for something like "when the date from the clock input exceeds the date on input SDD, output high on output SDN (which in the live system will feed to a shutdown switch)", but that's a distant proxy. It seems unlikely to generalize correctly to what you really want, which is much fuzzier.

For example, what you really want is more along the lines of determining the actual date (by unaltered human standards) and comparing with the actual human-desired shutdown date (without manipulating what the humans want), and actually shut down (by means that don't harm any humans or anything else they value). Except that this messy statement isn't nearly tight enough, and a superintelligent system would eat the world in a billion possible ways even assuming that the training was done in a way that the system actually tried to meet this objective.

How are we going to train a system to generalize to this sort of objective without it already being Friendly AGI?

To clarify, this is intended to be a test-time objective; I'm assuming the system was trained in simulation and/or by observing the environment. In general, this reward wouldn't need to be "trained" – it could just be hardcoded into the system. If you're asking how the system would understand its reward without having experienced it already, I'm assuming that sufficiently-advanced AIs have the ability to "understand" their reward function and optimize on that basis. For example, "create two identical strawberries on the cellular level" can only be plausibly achieved via understanding, rather than encountering the reward often enough in simulation to learn from it, since it'd be so rare even in simulation.

Modern reinforcement learning systems receive a large positive reward (or, more commonly, an end to negative rewards) when ending the episode, and this incentivizes them to end the episode quickly (sometimes suicidally). If you only provide this "shutdown reward", I'd expect to see the same behavior, but only after a certain time period.
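A toy calculation makes the incentive concrete. This is a sketch under assumed numbers, with a hypothetical helper `episode_return`: the agent receives a constant reward on every step until the episode ends, then a single terminal reward.

```python
def episode_return(step_reward, terminal_reward, steps, gamma=1.0):
    """Discounted return of an episode that ends after `steps` steps."""
    running = sum(gamma ** t * step_reward for t in range(steps))
    return running + gamma ** steps * terminal_reward


# With a per-step penalty of -1 and nothing at the end, shorter episodes
# always score higher, so ending the episode early is the best policy.
short = episode_return(step_reward=-1.0, terminal_reward=0.0, steps=3)
long = episode_return(step_reward=-1.0, terminal_reward=0.0, steps=10)
```

Here `short` is larger than `long`, which is the "end the episode quickly" pressure described above. If the per-step penalty only starts after a deadline, the same pressure appears only after that deadline.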