# Alex Turner

Oregon State University PhD student working on AI alignment.

# Sequences

How important are MDPs for AGI (Safety)?

The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

Even though I agree it's unrealistic, MDPs are still easier to prove things in, and I still think that they can give us important insights. For example, if I had started with more complex environments when I was investigating instrumental convergence, I would've spent a ton of extra time grappling with the theorems for little perceived benefit. That is, the MDP framework let me more easily cut to the core insights. Sometimes it's worth thinking about more general computable environments, but probably not always.

The human side of interaction

Why do we even believe that human values are good?

Because they constitute, by definition, our goodness criterion? It's not like we have two separate modules - one for "human values", and one for "is this good?". (ETA: or are you pointing out how our values might shift over time as we reflect on our meta-ethics?)

Perhaps the typical human behaviour amplified by possibilities of a super-intelligence would actually destroy the universe.

If I understand correctly, this is "are human behaviors catastrophic?" - not "are human values catastrophic?".

TurnTrout's shortform feed

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

Although the point is more easily made in the deterministic environments, impact doesn't happen in expectation for optimal agents in stochastic environments, either. This is by conservation of expected AU (this is the point I was making in The Gears of Impact).
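One way to read "conservation of expected AU" is as the tower property of conditional expectation. The following is only a sketch in notation that does not appear in the comment itself: assume $u$ is a utility function over complete histories, $h_t$ is the history up to time $t$, and $V_u(h_t)$ is the attainable utility, i.e. the expected value of $u$ given $h_t$ when the agent follows a policy $\pi^*_u$ that is optimal for $u$.

$$
\mathbb{E}_{h_{t+1} \sim \pi^*_u}\!\left[ V_u(h_{t+1}) \,\middle|\, h_t \right] = V_u(h_t)
\quad\Longrightarrow\quad
\mathbb{E}_{h_{t+1} \sim \pi^*_u}\!\left[ V_u(h_{t+1}) - V_u(h_t) \,\middle|\, h_t \right] = 0 .
$$

Under these assumptions, individual outcomes can raise or lower AU in a stochastic environment, but the expected change is zero. In a deterministic environment $h_{t+1}$ is fixed given $h_t$ and the optimal action, so the change is zero on every step, not just in expectation.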

Similar things can be said about power gain – when we think an agent is gaining power... gaining power compared to what? The agent "always had" that power, in a sense – the only thing that happens is that we realize it.

This line of argument makes me more pessimistic about there being a clean formalization of "don't gain power". I do think that the formalization of power is correct, but I suspect we are doing something heuristic and possibly kludgy when we think about someone else gaining power.

Attainable Utility Preservation: Scaling to Superhuman

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

Coherence arguments do not imply goal-directed behavior

because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence;

That's right; that would prove too much.

namely X = "the reward function is typical". Does that sound right?

Yeah, although note that I proved asymptotic instrumental convergence for typical functions under iid reward sampling assumptions at each state, so I think there's wiggle room to say "but the reward functions we provide aren't drawn from this distribution!". I personally think this doesn't matter much, because the work still tells us a lot about the underlying optimization pressures.

The result is also true in the general case of an arbitrary reward function distribution, you just don't know in advance which terminal states the distribution prefers.

Coherence arguments do not imply goal-directed behavior

Sure, I can say more about Alex Turner's formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount rate goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states.

Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren't like this.
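As a rough illustration of the "most reward functions" claim, here is a toy simulation. It is not the formalism from the paper: the environment, the function name `fraction_preferring_larger_branch`, and the uniform reward distribution are all assumptions made for this sketch. The agent starts at a fork, where one branch leads to `n_small` terminal states and the other to `n_large`. Rewards are drawn iid for each terminal state, and a maximally farsighted agent is assumed to care only about the best terminal state it can reach and remain in.

```python
import random


def fraction_preferring_larger_branch(n_small=1, n_large=3, trials=20_000, seed=0):
    """Estimate how often a farsighted optimal policy picks the larger branch.

    Hypothetical toy setup: each terminal state gets an iid Uniform(0, 1)
    reward, and the agent chooses the branch containing the best one.
    """
    rng = random.Random(seed)
    prefers_large = 0
    for _ in range(trials):
        # Sample one reward function: an iid reward per terminal state.
        small_rewards = [rng.random() for _ in range(n_small)]
        large_rewards = [rng.random() for _ in range(n_large)]
        # In the farsighted limit, only the best reachable terminal state matters.
        if max(large_rewards) > max(small_rewards):
            prefers_large += 1
    return prefers_large / trials


if __name__ == "__main__":
    print(fraction_preferring_larger_branch())
```

By symmetry, the best of the four iid rewards is equally likely to sit at any terminal state, so the estimate should land near `n_large / (n_small + n_large)`, which is 3/4 with the defaults. Reward functions for which the smaller branch is optimal exist, but under this sampling assumption they are the minority.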

I encourage you to read the post and/or paper; it's quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming "resources" exist, whatever that means, resource acquisition is explained as a special case of power-seeking.

ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.

Towards a mechanistic understanding of corrigibility

The post answers to what extent safely tuning that trade-off is feasible, and the surrounding sequence motivates that penalization scheme in greater generality. From Conclusion to 'Reframing Impact':