Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment.


Reframing Impact


Short summary of mAIry's room

This isn't key for your point, but:

In TD learning, if from some point the model always perfectly predicted the future

If it's a perfect predictor of a deterministic world, sure. But if the world is stochastic, or you can't assume realizability, your network can simultaneously be a global optimum but also have gradient updates. It's just that in expectation, your gradient is zero, but if you update in sufficiently small batches, you might still have non-zero gradients.

Generalizing the Power-Seeking Theorems

Discontinuous with respect to what? The discount rate just is, and there just is an optimal policy set for each reward function at a given discount rate, and so it doesn't make sense to talk about discontinuity without having something to govern what it's discontinuous with respect to. Like, teleportation would be positionally discontinuous with respect to time.

You can talk about other quantities being continuous with respect to change in the discount rate, however, and the paper proves prove the continuity of e.g. POWER and optimality probability with respect to .

Generalizing the Power-Seeking Theorems

What do you mean by "agents have different time horizons"? 

To answer my best guess of what you meant: this post used "most agents do X" as shorthand for "action X is optimal with respect to a large-measure set over reward functions", but the analysis only considers the single-agent MDP setting, and how, for a fixed reward function or reward function distribution, optimal action for an agent tends to vary with the discount rate. There aren't multiple formal agents acting in the same environment. 

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Note 1: This review is also a top-level post.

Note 2: I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this comment often uses the latter. 

In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence. 


I needed to find the right definitions first, and I couldn't even imagine what the final theorems would say. The fall crept up on me... and found my work incomplete. 

Let me tell you: if there's ever been a time when I wished I'd been months ahead on my research agenda, it was September 26, 2019: the day when world-famous AI experts debated whether instrumental convergence was a thing, and whether we should worry about it. 

The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing 'Terminator', a byline dismissive of AI risk:

Scientific American
Don’t Fear the Terminator
"Artificial intelligence never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others."

The byline seemingly affirms the consequent: "evolution  survival instinct" does not imply "no evolution  no survival instinct." That said, the article raises at least one good point: we choose the AI's objective, and so why must that objective incentivize power-seeking?

I wanted to reach out, to say, "hey, here's a paper formalizing the question you're all confused by!" But it was too early.

Now, at least, I can say what I wanted to say back then: 

This debate about instrumental convergence is really, really confused. I heavily annotated the play-by-play of the debate in a Google doc, mostly checking local validity of claims. (Most of this review's object-level content is in that document, by the way. Feel free to add comments of your own.)

This debate took place in the pre-theoretic era of instrumental convergence. Over the last year and a half, I've become a lot less confused about instrumental convergence. I think my formalisms provide great abstractions for understanding "instrumental convergence" and "power-seeking." I think that this debate suffers for lack of formal grounding, and I wouldn't dream of introducing someone to these concepts via this debate.

While the debate is clearly historically important, I don't think it belongs in the LessWrong review. I don't think people significantly changed their minds, I don't think that the debate was particularly illuminating, and I don't think it contains the philosophical insight I would expect from a LessWrong review-level essay.

Rob Bensinger's nomination reads:

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands great on its own, but it may be helpful for:

  • people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
  • people new to AI alignment who want to use the views of leaders in the field to help them orient.

I certainly agree with Rob's first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019. 

However, I disagree with the second bullet point: reading this debate may disorient a newcomer! While I often found myself agreeing with Russell and Bengio, while LeCun and Zador sometimes made good points, confusion hangs thick in the air: no one realizes that, with respect to a fixed task environment (representing the real world) and their beliefs about what kind of objective function the agent may have, they should be debating the probability that seeking power is optimal (or that power-seeking behavior is learned, depending on your threat model). 

Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

I'm glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.

But exactly how complex and fragile?

Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven't yet decided whether I think there are good theorems to be found here, though. 

Can you expand on 

including when the maximization is subject to fairly general constraints... Ideally, we'd find some compact criterion for which perturbations preserve value under which constraints.

This is , right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?

But exactly how complex and fragile?

(I meant to say 'perturbations', not 'permutations')

Not quite. If we frame the question as "which compact ways of generating permutations", then that's implicitly talking about dynamics, since we're asking how the permutations were generated.

Hm, maybe we have two different conceptions. I've been imagining singling out a variable (e.g. the utility function) and perturbing it in different ways, and then filing everything else under the 'dynamics.' 

So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology. 

Point is, these perturbations aren't actually generated within the imagined scenarios, but we generate them outside of the scenarios in order to estimate outcome sensitivity.

Perhaps this isn't clean, and perhaps I should rewrite parts of the review with a clearer decomposition.

The strategy-stealing assumption

Over the last year, I've thought a lot about human/AI power dynamics and influence-seeking behavior. I personally haven't used the strategy-stealing assumption (SSA) in reasoning about alignment, but it seems like a useful concept.

Overall, the post seems good. The analysis is well-reasoned and reasonably well-written, although it's sprinkled with opaque remarks (I marked up a Google doc with more detail). 

If this post is voted in, it might be nice if Paul gave more room to big-picture, broad-strokes "how does SSA tend to fail?" discussion, discussing potential commonalities between specific counterexamples, before enumerating the counterexamples in detail. Right now, "eleven ways the SSA could fail" feels like a grab-bag of considerations.

But exactly how complex and fragile?

Rather than asking "are human values fragile?", we ask "under what distance metric(s) are human values fragile?" - that's the new "API" of the value-fragility question.

In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined? So it doesn't seem like it captures all of the regularities implied by:

Distance metrics allow us to "factor out" that context-dependence, to wrap it in a clean API.

But I do presently agree that it's a good conceptual handle for exploring robustness against different sets of perturbations.

But exactly how complex and fragile?

(I reviewed this in a top-level post: Review of 'But exactly how complex and fragile?'.)

I've thought about (concepts related to) the fragility of value quite a bit over the last year, and so I returned to Katja Grace's But exactly how complex and fragile? with renewed appreciation (I'd previously commented only a very brief microcosm of this review). I'm glad that Katja wrote this post and I'm glad that everyone commented. I often see private Google docs full of nuanced discussion which will never see the light of day, and that makes me sad, and I'm happy that people discussed this publicly. 

I'll split this review into two parts, since the nominations called for review of both the post and the comments:

I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.

~ habryka


I think this was a good post. I think Katja shared an interesting perspective with valuable insights and that she was correct in highlighting a confused debate in the community. 

That said, I think the post and the discussion are reasonably confused. The post sparked valuable lower-level discussion of AI risk, but I don't think that the discussion clarified AI risk models in a meaningful way.

The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state

Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.

They also seem to be debating an under-specified proposition. Different perturbation sets and different dynamics will exhibit different fragility properties, even though we're measuring with respect to human value in all cases. For example, perturbing the training of an RL agent learning a representation of human value, is different from perturbing the utility function of an expected utility maximizer. 

Setting loose a superintelligent expected utility maximizer is different from setting loose a mild optimizer (e.g. a quantilizer), even if they're both optimizing the same flawed representation of human value; the dynamics differ. As another illustration of how dynamics are important for value fragility, imagine if recommender systems had been deployed within a society which already adequately managed the impact of ML systems on its populace. In that world, we may have ceded less of our agency and attention to social media, and would therefore have firmer control over the future and value would be less fragile with respect to the training process of these recommender systems. 

The Post

But exactly how complex and fragile? and its comments debate whether "value is fragile." I think this is a bad framing because it hides background assumptions about the dynamics of the system being considered. This section motivates a more literal interpretation of the value fragility thesis, demonstrating its coherence and its ability to meaningfully decompose AI alignment disagreements. The next section will use this interpretation to reveal how the comments largely failed to explore key modelling assumptions. This, I claim, helped prevent discussion from addressing the cruxes of disagreements.

The post and discussion both seem to slip past (what I view as) the heart of 'value fragility', and it seems like many people are secretly arguing for and against different propositions. Katja says:

it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless. 

But this leaves hidden a key step:

it is hard to write down the future we want, feed the utility function punchcard into the utility maximizer and then press 'play', and if we get it even a little bit wrong, most futures that fit our description will be worthless.

Here is the original 'value is fragile' claim: 

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.

~  Eliezer Yudkowsky, Value is Fragile

Eliezer claims that if the future is not shaped by a goal system, there's not much worth. He does not explicitly claim, in that original essay, that we have to/will probably build an X-maximizer AGI, where X is an extremely good (or perfect) formalization of human values (whatever that would mean!). He does not explicitly claim that we will mold a mind from shape Y and that that probably goes wrong, too. He's talking about goal systems chartering a course through the future, and how sensitive the outcomes are to that process.

Let's ground this out. Imagine you're acting, but you aren't quite sure what is right. For a trivial example, you can eat bananas or apples at any given moment, but you aren't sure which is better. There are a few strategies you could follow: preserve attainable utility for lots of different goals (preserve the fruits as best you can); retain option value where your normative uncertainty lies (don't toss out all the bananas or all of the apples); etc.

But what if you have to commit to an object-level policy now, a way-of-steering-the-future now, without being able to reflect more on your values? What kind of guarantees can you get? 

In Markov decision processes, if you're maximally uncertain, you can't guarantee you won't lose at least half of the value you could have achieved for the unknown true goal (I recently proved this for an upcoming paper). Relatedly, perfectly optimizing an -incorrect reward function only bounds regret to  per time step (see also Goodhart's Curse). The main point is that you can't pursue every goal at once. It doesn't matter whether you use reinforcement learning to train a policy, or whether you act randomly, or whether you ask Mechanical Turk volunteers what you should do in each situation. Whenever your choices mean anything at all, no sequence of actions can optimize all goals at the same time

So there has to be something which differentially pushes the future towards "good" things and away from "bad" things. That something could be 'humanity', or 'aligned AGI', or 'augmented humans wielding tool AIs', or 'magically benevolent aliens' - whatever. But it has to be something, some 'goal system' (as Eliezer put it), and it has to be entangled with the thing we want it to optimize for (human morals and metamorals). Otherwise, there's no reason to think that the universe weaves a "good" trajectory through time.

Hence, one might then conclude

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will not be optimized for human morals and metamorals.

But how do we get from "will not be optimized for" to "will contain almost nothing of worth"? There are probably a few ways of arguing this; the simplest may be:

our universe has 'resources'; making the universe decently OK-by-human-standards requires resources which can be used for many other purposes; most purposes are best accomplished by not using resources in this way.

This is not an argument that we will deploy utility maximizers with a misspecified utility function, and that that will be how our fragile value is shattered and our universe is extinguished. The thesis holds merely that 

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth. 

As Katja notes, this argument is secretly about how the "forces of optimization" shape the future, and not necessarily about AIs or anything. The key point is to understand how the future is shaped, and then discuss how different kinds of AI systems might shape that future. 

Concretely, I can claim 'value is fragile' and then say 'for example, if we deployed a utility-maximizer in our society but we forgot to have it optimize for variety, people might loop a single desirable experience forever.' But on its own, the value fragility claim doesn't center on AI.

[Human] values do not emerge in all possible minds.  They will not appear from nowhere to rebuke and revoke the utility function of an expected paperclip maximizer.

Touch too hard in the wrong dimension, and the physical representation of those values will shatter - and not come back, for there will be nothing left to want to bring it back.

And the referent of those values - a worthwhile universe - would no longer have any physical reason to come into being.

Let go of the steering wheel, and the Future crashes.

Value is Fragile

Katja (correctly) implies that concluding that AI alignment is difficult requires extra arguments beyond value fragility:

... But if [the AI] doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing. 

But exactly how complex and fragile?

As I see it, Katja and the commenters mostly discuss their conclusions about how AI+humanity might steer the future, how hard it will be to achieve the requisite entanglement with human values, instead of debating the truth value of the 'value fragility' claim which Eliezer made. Katja and the commenters discuss points which are relevant to AI alignment, but which are distinct from the value fragility claim. No one remarks that this claim has truth value independent of how we go about AI alignment, or how hard it is for AI to further our values. 

Value fragility quantifies the robustness of outcome value to perturbation of the "motivations" of key actors within a system, given certain dynamics. This may become clearer as we examine the comments. This insight allows us to decompose debates about "value fragility" into e.g.

  1. In what ways is human value fragile, given a fixed optimization scheme? 

    In other words: given fixed dynamics, to what classes of perturbations is outcome value fragile?
  2. What kinds of multi-agent systems tend to veer towards goodness and beauty and value?

    In other words: given a fixed set of perturbations, what kinds of dynamics are unusually robust against these perturbations?
    1. What kinds of systems will humanity end up building, should we act no further? This explores our beliefs about how probable alignment pressures will interact with value fragility.

I think this is much more enlightening than debating


The Comments

If no such decomposition takes place, I think debate is just too hard and opaque and messy, and I think some of this messiness spilled over into the comments. Locally, each comment is well thought-out, but it seems (to me) that cruxes were largely left untackled

To concretely point out something I consider somewhat confused, johnwentsworth authored the top-rated comment:

I think [Katja's summary] is an oversimplification of the fragility argument, which people tend to use in discussion because there's some nontrivial conceptual distance on the way to a more rigorous fragility argument.

The main conceptual gap is the idea that "distance" is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network's learned representation space or in an AGI's world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don't know how to reliably formalize/automate human-concept-space.

The key point is not "if there is any distance between your description and what is truly good, you will lose everything", but rather, "we don't even know what the relevant distance metric is or how to formalize it". And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

This is a good point. But what exactly happens between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, it's kept invisible.

So if you think that there will be fast takeoff via utility maximizers (a la AIXI), you might say "yes, value is fragile", but if I think it'll be more like slow CAIS with semi-aligned incentives making sure nothing goes too wrong, I reply "value isn't fragile." Even if we agree on a distance metric! This is how people talk past each other.

Crucially, you have to realize that your mind can hold separate the value fragility considerations, the considerations as to how vulnerable the outcomes are to the aforementioned perturbations, you have to know you can hold these separate from your parameter values for e.g. AI timelines.

Many other comments seem off-the-mark in a similar way. That said, I think that Steve Byrnes left an underrated comment:

Corrigibility is another reason to think that the fragility argument is not an impossibility proof: If we can make an agent that sufficiently understands and respects the human desire for autonomy and control, then it would presumably ask for permission before doing anything crazy and irreversible, so we would presumably be able to course-correct later on (even with fast/hard takeoff).

The reason that corrigibility-like properties are so nice is that they let us continue to steer the future through the AI itself; its power becomes ours, and so we remain the "goal system with detailed reliable inheritance from human morals and metamorals" shaping the future.


The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state

Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.

I'm glad Katja said "Hey, I'm not convinced by this key argument", but I don't think it makes sense to include But exactly how complex and fragile? in the review. 

Thanks to Rohin Shah for feedback on this review.

Avoiding Side Effects in Complex Environments

Well, if , the penalty would vanish, since both of those auxiliary reward function templates are state-based. If they were state-action reward functions, then the penalty would be the absolute difference in greedy reward compared to taking the null action. This wouldn't correlate to environmental dynamics, and so the penalty would be random noise.

Load More