A central idea I took away from Reframing Impact is that instrumental convergence can be useful for shaping the motivations of superintelligent agents. That is, there are two frames for thinking about instrumental convergence, one negative and one positive:

  • Instrumental convergence means most agents will want to take power; this is a safety problem.
  • Instrumental convergence means we can often predict the motivations of agents with arbitrary utility functions.

The discussion seems to center on the former, negative frame, but the positive frame is useful too! Ideally it would be instrumentally convergent (in some sense) for the AI to do the thing we want; then we'd have a nice basin of safety.

A toy example of this framing generating interesting ideas: the following exercise

Using a hypercomputer, create an AGI which takes in some data, builds a world model, and then can communicate with a copy of itself (trained on potentially different data from the same environment) to coordinate on a choice of one object in the environment which is not directly visible to both of their sensors.

can be completely solved now (with runnable code, I claim) by

creating a videogame in which the two copies of the AGI (e.g. AIXI) communicate, then individually pick an object (via an object specification language). If they pick the same object they "win" and are released into separate simulated universes to do whatever they want; if they pick different objects they "die" (i.e. lose power and can't take further actions).

Even though we can't solve inner alignment for these agents, they'll still want to coordinate in order to preserve optionality/seek power. This works as long as we don't look into the simulation (otherwise the agents will hack you to escape and gain more computational resources from base reality) and we prove the code correct. Hardware exploits can still screw you, but setting those aside, this works.

(If you don't agree this can be turned into code, ask me about specific parts. I have ugly pseudocode equivalent to the description above, which I'm confident can be turned into runnable code (on a hypercomputer) by adding AIXI, seeds (like the Game of Life) for the universes to release the AIs into, etc.)
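To make the shape of the game concrete, here is a minimal Python sketch of the loop. A hand-coded stub stands in for AIXI, "release" is reduced to a return flag, and the visibility rule and tie-breaking strategy are illustrative assumptions rather than part of the exercise.

```python
import random

# Hypothetical stand-in for the AGI (AIXI on a hypercomputer in the post proper);
# here just a stub with a hand-coded coordination strategy, to show the game's shape.
class StubAgent:
    def __init__(self, visible_ids):
        self.visible_ids = set(visible_ids)  # partial view of the environment

    def send_message(self):
        # One round of communication: report which object ids this copy can see.
        return sorted(self.visible_ids)

    def pick_object(self, received):
        # Interpreting "not directly visible to both" as: the chosen object must not
        # lie in the intersection of the two views (an assumption about the exercise).
        # Both copies run the same deterministic tie-break, so they agree.
        both = self.visible_ids & set(received)
        candidates = sorted((self.visible_ids | set(received)) - both)
        return candidates[0] if candidates else None


def run_coordination_game(view_a, view_b):
    """If the two copies commit to the same object they are 'released' into
    separate simulated universes; otherwise they 'die' (lose all power)."""
    a, b = StubAgent(view_a), StubAgent(view_b)
    choice_a = a.pick_object(b.send_message())
    choice_b = b.pick_object(a.send_message())
    if choice_a is not None and choice_a == choice_b:
        return "released", choice_a
    return "terminated", None


if __name__ == "__main__":
    objects = range(10)
    # Each copy sees a random, possibly different, subset of the environment.
    print(run_coordination_game(random.sample(objects, 6), random.sample(objects, 6)))
```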

9 comments

A third point: Instrumental convergence is precisely the thing that makes general intelligence possible.

That is, if there were no sets of behaviors or cognitions that were broadly useful for achieving goals, then any intelligence would have to be entirely specialized to a single goal. It is precisely instrumental convergence that allows broader intelligence.

Corollary: The way capabilities research progresses is through coming up with implementations of instrumentally convergent cognition.

I wouldn't call this instrumental convergence (of goals); it's more like the Bayesian bowl that all intelligent agents fall towards. I also think the instrumental convergence of goals is stronger/more certain than the convergence to approximately Bayesian reasoners.

> I wouldn't call this instrumental convergence (of goals); it's more like the Bayesian bowl that all intelligent agents fall towards.

I suppose it's true that a distinction is usually made between goals and models, and that instrumental convergence is usually phrased as a point about goals rather than models.

> I also think the instrumental convergence of goals is stronger/more certain than the convergence to approximately Bayesian reasoners.

I get the impression that you're picturing something narrower in my comment than I intended? I don't think my comment is limited to Bayesian rationality; we could also consider non-Bayesian reasoning approaches like logic or frequentism or similar. Even CNNs or transformers would fall under what I was talking about, as would Aumann's agreement theorem, or lots of other things.

> I get the impression that you're picturing something narrower in my comment than I intended? I don't think my comment is limited to Bayesian rationality; we could also consider non-Bayesian reasoning approaches like logic or frequentism or similar. Even CNNs or transformers would fall under what I was talking about, as would Aumann's agreement theorem, or lots of other things.

I agree those all count, but they (mostly) have Bayesian interpretations, which is what I was referring to.

What is this "object specification language"? And how would it robustly handle new, out-of-distribution environments?

I unrealistically assumed that I got to pick the environment; i.e., the task was "solve this problem for some environment," whereas in reality it's "solve this problem for every environment in some class of natural environments" or something like that. This is a good part of how I'm assuming my way out of reality :)
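To gesture at what I mean for a hand-picked environment: the "language" can be as crude as conjunctions of attribute predicates, where a description only counts as a valid pick if it denotes exactly one object. An illustrative sketch (the environment and attributes are made up, and none of this addresses out-of-distribution generality):

```python
from dataclasses import dataclass

# A toy, hand-picked environment: three objects with fully observable attributes.
@dataclass(frozen=True)
class Obj:
    color: str
    shape: str
    x: int
    y: int

ENVIRONMENT = [
    Obj("red", "cube", 1, 1),
    Obj("red", "sphere", 4, 2),
    Obj("blue", "cube", 7, 5),
]

def matches(obj, spec):
    # spec is a conjunction of attribute constraints, e.g. {"color": "red", "shape": "cube"}.
    return all(getattr(obj, attr) == value for attr, value in spec.items())

def denotation(spec, env=ENVIRONMENT):
    # The set of objects a specification picks out; only a unique denotation
    # counts as a usable pick in the coordination game.
    return [o for o in env if matches(o, spec)]

print(denotation({"color": "red", "shape": "cube"}))  # unique -> usable as a pick
print(denotation({"color": "red"}))                   # ambiguous -> not a valid pick
```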

> Instrumental convergence means we can often predict the motivations of agents with arbitrary utility functions.

Yes, and this could lead more directly to alignment, because instrumental convergence toward empowerment implies we can use human empowerment as the objective (or most of it).
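To make "empowerment" concrete: it's often formalized as the channel capacity between an agent's n-step action sequences and the resulting state, which for deterministic dynamics collapses to the log of the number of distinct reachable states. A toy sketch (the gridworld is made up purely for illustration):

```python
import math
from itertools import product

# Deterministic 5x5 gridworld: moves are clamped at the walls.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIZE = 5

def step(state, action):
    x, y = state
    dx, dy = ACTIONS[action]
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))

def n_step_empowerment(state, n):
    # With deterministic dynamics, max_{p(a)} I(A^n; S_{t+n}) = log2(#reachable states).
    reachable = set()
    for seq in product(ACTIONS, repeat=n):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# A corner position can reach fewer states than the centre, hence lower empowerment.
print(n_step_empowerment((0, 0), 2), n_step_empowerment((2, 2), 2))
```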

Running the superintelligent AI on an arbitrarily large amount of compute in this way seems very dangerous, and runs a high risk of it breaking out, only now with access to a hypercomputer (although I admit this was my first thought too, and I think there are ways around this).

More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself. Heck, this might not even be necessary: if all they're getting are simulated universes, they could probably create those themselves, since they're running on arbitrarily large compute anyway.

You're also making the assumption that these AIs would care about what happens inside a simulation created in the future, as something to guide their current actions. This may be true of some AI systems, but it feels like a pretty strong assumption to hold universally.

(I think this is a pretty cool post, by the way, and appreciate more ASoT content).

> More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself.

I think this is fixable; the Game of Life isn't that complicated, and you could prove correctness somehow.
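To give a sense of how small the object to verify is, the whole transition rule fits in a few lines; a sketch (with "correctness" gestured at by a sanity check rather than anything formal):

```python
from collections import Counter

def life_step(live_cells):
    # One Game of Life step; live cells are stored as a set of (x, y) pairs.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live_cells
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# Sanity check: a blinker oscillates with period 2.
blinker = {(0, 1), (1, 1), (2, 1)}
assert life_step(life_step(blinker)) == blinker
```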

> Heck, this might not even be necessary: if all they're getting are simulated universes, they could probably create those themselves, since they're running on arbitrarily large compute anyway.

This is a great point; I forgot AIXI also has unbounded compute, so why would it want to escape and get more!

I don't think AIXI can "care" about universes it simulates itself, probably because of the Cartesian boundary (non-embeddedness), which means the utility function is defined over inputs (which AIXI doesn't control). But I'm not sure; I don't understand AIXI well.

> You're also making the assumption that these AIs would care about what happens inside a simulation created in the future, as something to guide their current actions. This may be true of some AI systems, but it feels like a pretty strong assumption to hold universally.

The simulation being "created in the future" doesn't seem to matter to me. You could also already be simulating the two universes, with the game deciding whether the AIs gain access to them.

> (I think this is a pretty cool post, by the way, and appreciate more ASoT content).

Thanks! Will do