Edouard Harris

Independent researcher.

The alignment problem in different capability regimes

But in the context of superhuman systems, I think we need to be more concerned about the possibility that it's performance-uncompetitive to restrict your system to only taking actions that can be justified entirely with human-understandable reasoning.

Interestingly, this is already a well-known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the *least* human-interpretable. It makes intuitive sense: any signal that can be understood by a human is at risk of being copied by a human, so if you insist that your trading decisions have to be interpretable, you'll pay for that insistence in alpha.

I'd imagine this kind of issue is already top-of-mind for folks who are working on the various transparency agendas, but it does imply that there's a very strong optimization pressure directly *against* interpretability in many economically relevant contexts. In fact, it could hardly be stronger: your forcing function is literally "Want to be a billionaire? Then you'll have to trade exclusively on the most incomprehensible signals you can find."

(Of course this isn't currently true of all hedge funds, only a few specialized ones.)

The alignment problem in different capability regimes

One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that *didn't* hold below our own modest level of intelligence.

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

No problem! Glad it was helpful. I think your fix makes sense.

> I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable was being used in the EV to sum over outcomes, while the vector was being used to represent the probabilities associated with those outcomes. Because the outcome indices and the probability indices look similar, it's easy to conflate their meanings, and if you apply the permutation to the wrong one by accident that has the same effect as applying its inverse to the other one. In any case though, the main result seems unaffected.

Cheers!

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First, $\phi$ is being defined here as an *outcome* permutation. Presumably this means that 1) $\phi(o_i) = o_j$ for some $i, j$; and 2) $\phi$ admits a unique inverse $\phi^{-1}$. That makes sense.

We also define lotteries over outcomes, presumably as, e.g., $L = \sum_{i=1}^n p_i o_i$, where $p_i$ is the probability of outcome $o_i$. Of course we can interpret the $o_i$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the $(n-1)$-simplex. So far, so good.

But the thing that's confusing me is what this implies for the definition of $\phi(L)$. Because $\phi$ is defined as a permutation over *outcomes* (and not over *probabilities* of outcomes), we should expect this to be $\phi(L) = \sum_i p_i \, \phi(o_i)$.

The problem is that this seems to give a different EV from the lemma:

$$\mathbb{E}_{\phi(L)}[u] = \sum_i p_i \, u(\phi(o_i)) = \sum_j p_{\phi^{-1}(j)} \, u(o_j)$$

(Note that I'm using $j$ as the dummy variable rather than $i$, but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the other lottery gives an analogous result. And then looking at the inequality that results suggests that lemma 3 should actually be "$\phi^{-1}$ induces the correspondence" as opposed to "$\phi$ induces the correspondence".

(As a concrete example, suppose we have a lottery $L = p_1 o_1 + p_2 o_2 + p_3 o_3$ with the permutation $\phi(o_1) = o_2$, $\phi(o_2) = o_3$, $\phi(o_3) = o_1$. Then $\phi(L) = p_3 o_1 + p_1 o_2 + p_2 o_3$ and our EV is

$$\mathbb{E}_{\phi(L)}[u] = p_1 u(o_2) + p_2 u(o_3) + p_3 u(o_1).$$

Yet line 2 of the proof would instead give $p_2 u(o_1) + p_3 u(o_2) + p_1 u(o_3)$, which appears to contradict the lemma as stated.)
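For what it's worth, here's a quick numerical version of that check. This is a toy of my own: the probabilities and utilities are arbitrary, and I'm encoding $\phi$ as an index map.

```python
# Toy check of the phi-vs-phi-inverse reindexing discussed above.
p = [0.5, 0.3, 0.2]      # lottery: probabilities of o1, o2, o3
u = [1.0, 10.0, 100.0]   # an arbitrary utility function over outcomes

phi = {0: 1, 1: 2, 2: 0}                  # the 3-cycle o1 -> o2 -> o3 -> o1
phi_inv = {v: k for k, v in phi.items()}  # its inverse

# EV of phi(L), computed directly: mass p[i] now sits on outcome phi(i).
ev_direct = sum(p[i] * u[phi[i]] for i in range(3))

# The same sum, reindexed by the target outcome j = phi(i), i.e. i = phi_inv(j).
ev_reindexed = sum(p[phi_inv[j]] * u[j] for j in range(3))

# The other reading: permuting the probability indices by phi itself.
ev_other = sum(p[phi[j]] * u[j] for j in range(3))

print(ev_direct, ev_reindexed, ev_other)
```

The first two quantities agree, since they are the same sum reindexed; the third, which permutes the probability indices by $\phi$ instead of $\phi^{-1}$, differs. That is the discrepancy described above.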

Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the *existence* of a bijection rather than *what* that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)

Again, it's also possible I've just misunderstood something here — please let me know if that's the case!

Re-Define Intent Alignment?

Update: having now thought more deeply about this, I no longer endorse my above comment.

While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:

- The **behavioral objective** is the thing the agent is revealed to be pursuing *under arbitrary distributional shifts*.
- The **mesa-objective** is something the agent is revealed to be pursuing under *some subset of possible distributional shifts*.

Everything in the above comment then still goes through, except with these definitions reversed.

On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.

On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.

A few consequences of the above:

- Under an "omnipotent experimenter" definition, the behavioral objective (and *not* the mesa-objective) is a reliable invariant of the agent.
- It's entirely possible for the behavioral objective to be overdetermined in certain situations. i.e., if we run every possible experiment on an agent, we may find that the only reward / utility function consistent with its behavior across *all* those experiments is the trivial utility function that's constant across all states.
- If the behavioral objective of a system is overdetermined, that *might* mean the system never pursues anything coherently. But it might *also* mean that there exist *subsets* of distributions on which the system pursues an objective *very* coherently, but that different distributions induce different coherent objectives.
- The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition — that is, run every experiment on our agent *that's consistent with the subset of distributions we're conditioning on* — then we will in general recover a set of mesa-objective candidates consistent with the system's actions on that subset.
- It is strictly incorrect to refer to "the" mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies on, otherwise it's underdetermined. (I believe Jack refers to this as a "perturbation set" downthread.)

This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")
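To sketch what that procedure could look like in the simplest possible case (this is my own toy construction, not anything from the thread: three outcomes, pairwise-choice experiments, and a coarse grid of candidate utilities):

```python
import itertools

# Toy "restricted omnipotent experimenter": run every pairwise-choice
# experiment allowed by some perturbation set, then keep exactly those
# candidate utility functions consistent with all the observed choices.

outcomes = [0, 1, 2]

def consistent_utilities(observed_choices, grid=(0, 1, 2)):
    """All utility assignments (from a coarse grid) consistent with the data.

    observed_choices: list of pairs (a, b) meaning the agent chose a over b.
    """
    result = []
    for u in itertools.product(grid, repeat=len(outcomes)):
        if all(u[a] >= u[b] for a, b in observed_choices):
            result.append(u)
    return result

# A coherent agent on this perturbation set: always prefers higher index.
coherent = [(2, 1), (1, 0), (2, 0)]
# An incoherent agent: exhibits a strict preference cycle.
cyclic = [(0, 1), (1, 2), (2, 0)]

print(consistent_utilities(coherent))
print(consistent_utilities(cyclic))
```

The coherent choice data leaves nontrivial candidates such as (0, 1, 2), while the cyclic data leaves only the constant utility functions. That is the overdetermined case described above.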

Would definitely be interested in getting feedback on these thoughts!

Re-Define Intent Alignment?

I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.

But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.

I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to *construct* our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.

(I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)

Re-Define Intent Alignment?

Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible *in principle* to do so in our universe.

To try to understand a bit better: does your pessimism about this come from the hardness of the *technical* challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the *definitional* challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Re-Define Intent Alignment?

> I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what *could* be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a *second* boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally *real* object here is always going to be the whole system, absolutely.

As I understand, something like AIXI forces you to draw one *particular* boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile.

The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.
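As a sketch of the kind of experiment I have in mind (entirely a toy of my own: the three-variable dynamics, the choice of boundaries as variable subsets, and the crude monotonicity test for "pursuing an objective" are all assumptions):

```python
import itertools

def step(state):
    # Fixed toy dynamics over three variables: x chases a target of 1.0,
    # y decays geometrically, z accumulates |x|.
    x, y, z = state
    return (x + 0.1 * (1.0 - x), 0.9 * y, z + abs(x))

def trajectory(state, n=20):
    traj = [state]
    for _ in range(n):
        state = step(state)
        traj.append(state)
    return traj

def monotone_objectives(traj, boundary):
    """Indices in `boundary` whose values strictly increase along traj."""
    coherent = []
    for i in boundary:
        vals = [s[i] for s in traj]
        if all(b > a for a, b in zip(vals, vals[1:])):
            coherent.append(i)
    return coherent

traj = trajectory((0.5, 1.0, 0.0))

# Enumerate candidate agent/environment boundaries (subsets of variables)
# and see which ones contain a coherently "pursued" quantity.
for r in (1, 2):
    for boundary in itertools.combinations(range(3), r):
        print(boundary, monotone_objectives(traj, boundary))
```

Under this (very crude) test, boundaries containing variables 0 or 2 look goal-directed while variable 1 does not. The point is only that the same system can look more or less agentic depending on where you draw the line.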

Re-Define Intent Alignment?

> I would further add that looking for difficulties created by the simplification seems very intellectually productive.

Yep, strongly agree. And a good first step to doing this is to *actually build* as robust a simplification as you can, and then see where it breaks. (Working on it.)

I agree with pretty much this whole comment, but do have one question:

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances" with respect to any realistic training distribution? I might be assuming too much in saying that — e.g., I'm taking it for granted that anything we'd call an AGI could self-improve to the point of accessing states of the world that we wouldn't be able to train it on; and also I'm assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they'd be the default expectation.

Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?