Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.

Distinguishing claims about training vs deployment

But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions 'robustly instrumental.'
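As a rough illustration of "probability of being optimal across reward functions", here is a minimal Monte Carlo sketch. Everything in it is an assumption made for the example: the five-state deterministic environment `TRANSITIONS`, the discount `GAMMA`, state-based rewards drawn i.i.d. uniformly from [0, 1], and the helper names. It is not the formal definition, just one way to estimate the quantity in a toy case.

```python
import random

# Hypothetical deterministic environment: TRANSITIONS[state][action] -> next state.
# From state 0, "left" leads to a single absorbing state, while "right" leads to
# a state from which two different absorbing states are reachable.
TRANSITIONS = {
    0: {"left": 1, "right": 2},
    1: {"stay": 1},
    2: {"up": 3, "down": 4},
    3: {"stay": 3},
    4: {"stay": 4},
}
GAMMA = 0.9  # assumed discount rate


def optimal_values(reward, n_iters=200):
    """Value iteration for a state-based reward: V(s) = r(s) + GAMMA * max V(s')."""
    values = {s: 0.0 for s in TRANSITIONS}
    for _ in range(n_iters):
        values = {
            s: reward[s] + GAMMA * max(values[nxt] for nxt in actions.values())
            for s, actions in TRANSITIONS.items()
        }
    return values


def optimal_actions(reward, state):
    """Return every action at `state` whose successor attains the best value."""
    values = optimal_values(reward)
    best = max(values[nxt] for nxt in TRANSITIONS[state].values())
    return [a for a, nxt in TRANSITIONS[state].items() if abs(values[nxt] - best) < 1e-9]


def optimality_probability(state, n_samples=1000, seed=0):
    """Estimate, per action, the fraction of sampled reward functions it is optimal for."""
    rng = random.Random(seed)
    counts = {a: 0 for a in TRANSITIONS[state]}
    for _ in range(n_samples):
        reward = {s: rng.random() for s in TRANSITIONS}  # i.i.d. uniform rewards
        for action in optimal_actions(reward, state):
            counts[action] += 1
    return {a: c / n_samples for a, c in counts.items()}
```

Calling `optimality_probability(0)` returns an estimate for each action at state 0. In this toy graph `right` keeps two absorbing states reachable, so one would expect it to be optimal for most sampled reward functions; the environment stays fixed throughout and only the reward function varies.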

Distinguishing claims about training vs deployment

The first ambiguity I dislike here is that you could either be describing the *emergence* of instrumentality as robust, or the *trait* of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

One possibility is that we have to individuate these "instrumental convergence"-adjacent theses using different terminology. I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just *are*.

However, it doesn't make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is *not* about goal-directed behavior, then it no longer makes sense to talk about 'instrumentality' from the system's perspective. For example, I'm not sure it makes sense to say 'edge detectors are robustly instrumental for this network structure on this dataset after X epochs'.

(These are early thoughts; I wanted to get them out, and may revise them later or add another comment)

EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find 'robust instrumentality' to be better as an informal handle, but its formal operationalization seems better for precise thinking.

Formal Solution to the Inner Alignment Problem

civilization that realizes it's in a civilization

"in a *simulation*", no?

Deducing Impact

The spoiler seems to be empty?

TurnTrout's shortform feed

In an interesting parallel to John Wentworth's *Fixing the Good Regulator Theorem*, I have an MDP result that says:

Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for |S| reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).

Roughly: being able to complete linearly many tasks in the state space means you have enough information to model the entire environment.

Fixing The Good Regulator Theorem

Later information can “choose many different games” - specifically, whenever the posterior distribution of system-state given two possible values is different, there must be *at least one* value under which optimal play differs for the two values.

Given your four conditions, I wonder if there's a result like "optimally power-seeking agents (minimizing information costs) must model the world." That is, I think power is about being able to achieve a wide range of different goals (to win at 'many different games' the environment could ask of you), and so if you want to be able to sufficiently accurately estimate the expected power provided by a course of action, you have to know how well you can win at all these different games.

Fixing The Good Regulator Theorem

Okay, I agree that if you remove their determinism & full observability assumption (as you did in the post), it seems like your construction should work.

I still think that the original paper seems awful (because it's their responsibility to justify choices like this in order to explain how their result captures the intuitive meaning of a 'good regulator').

Fixing The Good Regulator Theorem

### Motte and bailey

### The original theorem seems even dumber than John points out

Status: strong opinions, weakly held. I'm not a control theorist; I'm not only *ready* to eat my words, but I've already set the table.

As I understand it, the original good regulator theorem seems even dumber than you point out.

First, the original paper doesn't make sense to me. Not surprising, old papers are often like that, and I don't know any control theory... but here's John Baez **also** getting stuck, giving up, and deriving his own version of what he imagines the theorem should say:

when I tried to read Conant and Ashby’s paper, I got stuck. They use some very basic mathematical notation in nonstandard ways, and they don’t clearly state the hypotheses and conclusion of their theorem...

However, I have a guess about the essential core of Conant and Ashby’s theorem. So, I’ll state that, and then say more about their setup.

Needless to say, I looked around to see if someone else had already done the work of figuring out what Conant and Ashby were saying...

An *unanswered* StackExchange question asks whether anyone has a rigorous proof:

As pointed out by the authors of [3], the importance and generality of this theorem in control theory makes it comparable in importance to Einstein's E = mc² for physics. However, as John C. Baez carefully argues in a blog post titled The Internal Model Principle it's not clear that Conant and Ashby's paper demonstrates what it sets out to prove. I'd like to add that many other researchers, besides myself, share John C. Baez' perspective.

Hello?? Isn't this one of the fundamental results of control theory? Where's a good proof of it? It's been cited 1,317 times and confidently brandished to make sweeping claims about how to e.g. organize society or make an ethical superintelligence.

It seems plausible that people just read the confident title (*Every good regulator of a system must be a model of that system* - of course the paper proves the claim in its title...), saw the citation count / assumed other people had checked it out (yay information cascades!), and figured it must be correct...

The paper starts off by introducing the components of the regulatory system:

OK, so we're going to be talking about how regulators which ensure *good* outcomes also model their environments, right? Sounds good.

Wait...

Later...

We're talking about the *entire outcome space* again. In the introduction we focused on regulators ensuring 'good' states, but we immediately gave that up to talk about entropy.

Why does this matter? Well...

John writes:

Also, though I don’t consider it a “problem” so much as a choice which I think most people here will find more familiar:

- The theorem uses entropy-minimization as its notion of optimality, rather than expected-utility-maximization

I suppose my intuition is that this is actually a significant problem.

At first glance, Good Regulator seems to basically prove something like 'there's a deterministic optimal policy wrt the observations', but even *that's* too generous - it proves that there's a deterministic way to minimize outcome entropy. But what does that guarantee us - how do we know that's a 'good regulator'? Like, imagine an environment with a strong "attractor" outcome, like the streets being full of garbage. The regulator can try to push against this, but they can't always succeed due to the influence of random latent variables (this cuts against the determinism assumption, but you note that this can be rectified by reincluding ). However, by sitting back, they *can* ensure that the streets are full of garbage.

The regulator does so, minimizes the entropy over the unconditional outcome distribution, and is christened a 'good regulator' which has built a 'model' of the environment. In reality, we have a deterministic regulator which does nothing, and our streets are full of trash.
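To make the garbage example concrete, here is a small sketch comparing the two notions of optimality. The numbers are invented for illustration: `push_back` is assumed to yield clean streets 70% of the time, `sit_back` yields garbage with certainty, and `UTILITY` simply scores clean streets as 1 and garbage as 0. None of these values come from the paper or from John's post.

```python
import math

# Hypothetical outcome distributions induced by each regulator policy.
OUTCOME_DISTS = {
    "push_back": {"clean": 0.7, "garbage": 0.3},  # fights the attractor, sometimes fails
    "sit_back": {"clean": 0.0, "garbage": 1.0},   # does nothing; garbage is certain
}

# Hypothetical utilities: we actually want clean streets.
UTILITY = {"clean": 1.0, "garbage": 0.0}


def entropy_bits(dist):
    """Shannon entropy of an outcome distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_utility(dist):
    """Expected utility of an outcome distribution under UTILITY."""
    return sum(p * UTILITY[outcome] for outcome, p in dist.items())


# The policy an entropy-minimizing criterion selects...
entropy_minimizer = min(OUTCOME_DISTS, key=lambda a: entropy_bits(OUTCOME_DISTS[a]))
# ...versus the policy an expected-utility criterion selects.
utility_maximizer = max(OUTCOME_DISTS, key=lambda a: expected_utility(OUTCOME_DISTS[a]))
```

Under these assumed numbers the entropy criterion picks `sit_back` (zero outcome entropy, zero expected utility), while the expected-utility criterion picks `push_back`. The two criteria come apart exactly when the most predictable outcome is a bad one.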

Now, I think it's possible I've misunderstood, so I'd appreciate correction if I have. But *if* I haven't, and *if* no control theorists have in fact repaired and expanded this theorem before John's post...

If that's true, what the heck happened? Control theorists just left a $100 bill on the ground for decades? A quick !scholar search doesn't reveal any good critiques.

Fixing The Good Regulator Theorem

the regulator could just be the identity function: it takes in and returns . This does not sound like a “model”.

What is the type signature of the regulator? It's a policy on state space, and it returns states as well? Are those its "actions"? (From the point of view of the causal graph, I suppose the outcome just depends on whatever the regulator outputs, and the true state, so maybe it's not important *what* the regulator outputs. Just that by the original account, any deterministic regulator could be "optimal", even if it doesn't do meaningful computation.)

I'd considered 'attractive instrumentality' a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of 'attractive' isn't 'having attractor-like properties.'