Implied "utilities" of simulators are broad, dense, and shallow

I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I hoped that world-optimizer AIs would be too slow to train for the real world, and that we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.

But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example, if you can make an AI imitate a conversation between humans, then maybe you can make an AI that makes real-world plans as fast as a committee of 10 smart humans conversing at 1000x speed. For extra fun, allow the imitated committee to send network packets and read responses; for extra extra fun, give them access to a workbench for improving their own AI. I'd say this gets awfully close to a world-optimizer that could plausibly defeat the rest of humanity, if the imitator it's running on is good enough (GPT-6 or something). And of course there's no law saying it'll be friendly: you could prompt the inner humans with "you want to destroy real humanity" and watch the fireworks.

^

Using the term as in this post.

^

It assumes a mesaoptimizer was found and that it outperforms other implementations, including other mesaoptimizers, that would have more closely matched the dense output constraints.

It assumes enough space for the mesaoptimizer to learn the relevant kind of instrumentality that would operate beyond a single invocation.

It assumes the mesaoptimizer is able to learn and hold onto misaligned goals early while simultaneously having sufficient capability to realize it needs to hide that misalignment on relevant predictions.

It assumes the extra complexity implied by the capable misalignment is small enough that SGD can accidentally hop into its basin.

And so on.

^

I also note the conspicuous coincidence that some of the most capable architectures yet devised are low-instrumentality, or rely on world models that are. It would not surprise me if low instrumentality, and the training constraints/hints that it typically corresponds to, often implies relative ease of training at a particular level of capability, which seems like a pretty happy accident.

^

I'm working on some experiments, but if you have ideas, please do those experiments too. This feels like a weird little niche that has very little empirical data available.

^

While you could construct such a function for an agent that ends up matching the behavior of a more restricted utility function and its accompanying intermediate behaviors, the usefulness-achieving assumption here is that the dense utility function was chosen to be not-that.

AI ALIGNMENT FORUM
AF

