## AI ALIGNMENT FORUM

Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.

# Sequences

The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

# Wiki Contributions

Agents Over Cartesian World Models

In contrast, humans map multiple observations onto the same internal state.

Is this supposed to say something like: "Humans can map a single observation onto different internal states, depending on their previous internal state"?

$U_{{E}}(e, t) = \text{the number of paperclips in }$.

Unrendered latex.

For HCH-bot, what's the motivation? If we can compute the KL, we can compute HCH(i), so why not just use HCH(i) instead? Or is this just exploring a potential approximation?

A consequential approval-maximizing agent takes the action that gets the highest approval from a human overseer. Such agents have an incentive to tamper with their reward channels, e.g., by persuading the human they are conscious and deserve reward.

Why does this incentive exist? Approval-maximizers take the local action which the human would rate most highly. Are we including "long speech about why human should give high approval to me because I'm suffering" as an action? I guess there's a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind?

If the agent can act to leave itself unchanged, loops of the same sequences of internal states rule out utility functions of type . Similarly, loops of the same (internal state, action) pairs rule out utility functions of type  and . Finally, if the agent ever takes different actions, we can rule out a utility function of type  (assuming the action space is not changing).

This argument doesn't seem to work, because the zero utility function makes everything optimal. VNM theorem can't rule that out given just an observed trajectory. However, if you know the agent's set of optimal policies, then you can start ruling out possibilities (because I can't have purely environment-based utility if  | internal state 1, but  | internal state 2).
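A minimal sketch of the zero-utility point, assuming a toy setting invented for illustration (the three trajectories and all function names below are hypothetical, not from the post): under a constant utility function every trajectory attains the maximum, so no single observed trajectory can be inconsistent with it.

```python
def optimal_set(utility, trajectories):
    """Return the trajectories that maximize `utility`."""
    best = max(utility(t) for t in trajectories)
    return [t for t in trajectories if utility(t) == best]

# Hypothetical trajectories, each a sequence of (internal state, action) pairs.
TRAJECTORIES = [
    (("s1", "a"), ("s1", "a")),
    (("s1", "a"), ("s2", "b")),
    (("s2", "b"), ("s1", "a")),
]

def zero_utility(trajectory):
    # The constant-zero utility: indifferent between all trajectories.
    return 0.0

def count_action_a(trajectory):
    # A non-trivial utility (assumed for contrast): counts uses of action "a".
    return sum(1 for _, action in trajectory if action == "a")

all_optimal = optimal_set(zero_utility, TRAJECTORIES)
some_optimal = optimal_set(count_action_a, TRAJECTORIES)

print(len(all_optimal), len(some_optimal))
```

Under `zero_utility` the optimal set is the whole trajectory set, which is why an observed trajectory alone cannot exclude any utility type; only a known optimal policy set, like the smaller set under `count_action_a`, starts to constrain the type.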

The alignment problem in different capability regimes

Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence: avoiding negative side effects, safe exploration...

I don't understand why you think that negative side effect avoidance belongs on that list.

A sufficiently intelligent system will probably be able to figure out when it's having negative side effects. This does not mean that it will—as a matter of fact—avoid having these side effects, and it does not mean that its NegativeSideEffect? predicate is accessible. A paperclip maximizer may realize that humans consider extinction to be a "negative side effect." This consideration does not move it. Increasing agent intelligence does not naturally solve the problem of getting the agent to not do catastrophically impactful things while optimizing its objective.

In contrast, once an agent realizes that an exploration strategy is unsafe, the agent will be instrumentally motivated to find a better one. Increasing agent intelligence naturally solves the problem of safe exploration.

it will massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.

Presumably you meant to say "it will be able to massively outperform..."? (I think you did, since you mention a similar consideration under "Ability to understand itself.") A competent agent will understand, but will only act accordingly if so aligned (for either instrumental or terminal reasons).

Finite Factored Sets: Applications

Throughout this sequence, we have assumed finiteness fairly gratuitously. It is likely that many of the results can be extended to arbitrary finite sets.

To arbitrary factored sets?

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Thanks! I think you're right. I think I actually should have defined  differently, because writing it out, it isn't what I want. Having written out a small example, intuitively,  should hold iff , which will also induce  as we want.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent  as , which makes your insight obvious.

The post is edited and the issues should now be fixed.

Environmental Structure Can Cause Instrumental Convergence

I’m not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but that isn’t relevant for the intuition).

You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.

Power-seeking for successive choices

For (3), environments which "almost" have the right symmetries should also "almost" obey the theorems. To give a quick, non-legible sketch of my reasoning:

For the uniform distribution over reward functions on the unit hypercube (), optimality probability should be Lipschitz continuous on the available state visit distributions (in some appropriate sense). Then if the theorems are "almost" obeyed, instrumentally convergent actions still should have extremely high probability, and so most of the orbits still have to agree.

So I don't currently view (3) as a huge deal. I'll probably talk more about that another time.

Environmental Structure Can Cause Instrumental Convergence

Gotcha. I see where you're coming from.

I think I underspecified the scenario and claim. The claim wasn't supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.

If the agent has a choice between one action ("break vase and move forwards") or another action ("don't break vase and move forwards"), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states.

So I think we're actually both making a correct point, but you're making an argument for  under certain kinds of models and whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates.
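The immediate break-it-or-not decision can be illustrated with a rough Monte Carlo sketch. Everything in it is an assumption made for illustration rather than the model from the post or paper: a hypothetical five-state deterministic MDP in which breaking the vase leads to one intermediate state and one absorbing state, while keeping it leads to a matching intermediate state and two absorbing states, with state rewards drawn i.i.d. uniform on [0, 1].

```python
import random

def keep_vase_fraction(gamma, n_samples=20000, seed=0):
    """Fraction of sampled reward functions whose optimal first action keeps the vase."""
    rng = random.Random(seed)
    # Relative weight of sitting in an absorbing state forever, after factoring
    # out the common discount on the intermediate state.
    tail = gamma / (1.0 - gamma)
    keep_wins = 0
    for _ in range(n_samples):
        # Hypothetical state rewards, drawn i.i.d. uniform on [0, 1].
        r_broken_mid = rng.random()
        r_broken_end = rng.random()
        r_intact_mid = rng.random()
        r_intact_end_1 = rng.random()
        r_intact_end_2 = rng.random()
        # Breaking the vase leaves one absorbing state; keeping it leaves two.
        value_break = r_broken_mid + tail * r_broken_end
        value_keep = r_intact_mid + tail * max(r_intact_end_1, r_intact_end_2)
        if value_keep > value_break:
            keep_wins += 1
    return keep_wins / n_samples

fractions = {gamma: keep_vase_fraction(gamma) for gamma in (0.5, 0.9, 0.99)}
print(fractions)
```

In this toy model the "keep" subgraph contains a copy of the "break" subgraph plus one extra absorbing state, so at each sampled discount rate more than half of the sampled reward functions should favor not breaking the vase immediately. It says nothing about whether the vase is eventually broken in richer models.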

[Edited to reflect the ancestor comments]

Power-seeking for successive choices

You're being unhelpfully pedantic. The quoted portion even includes the phrase "As a quick summary (read the paper and sequence if you want more details)"! This reads to me as an attempted pre-emption of "gotcha" comments.

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

Environmental Structure Can Cause Instrumental Convergence

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase.

This is factually wrong BTW. I had just explained why the opposite is true.