tailcalled

Corrigibility Can Be VNM-Incoherent

Actually, upon thinking further, I don't think this argument works, at least not as it is currently written.

Corrigibility Can Be VNM-Incoherent

Imagine that policies decompose into two components, $\pi = (\pi_1, \pi_2)$. For instance, they may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.

Suppose, for instance, that $\pi_1$ is such that the policy just ends up acting in a completely random-twitching way. Technically $\pi_2$ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the features, $f(\pi_1, \pi_2)$ is basically constant. This is a low-power situation, and if one actually specified what $f$ would be, then a TurnTrout-style argument could probably prove that such values of $\pi_1$ would be avoided for power-seeking reasons. On the other hand, if $\pi_1$ made the policy act like an optimizer which optimizes a utility function over the features $f$, with the utility function being specified by $\pi_2$, then that would lead to a lot more power/injectivity.

On the other hand, I wonder if there's a limit to this style of argument. Too much noninjectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would be hard to optimize?

Corrigibility Can Be VNM-Incoherent

Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do satisfy instrumental convergence, yes you are correct.
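As a minimal sketch of that conversion, assuming a deterministic model and a finite horizon (both are simplifications made for this example, and every name below is hypothetical), a state-based utility can be lifted to a utility over policies by rolling the policy out through the model and summing the utility of the states visited:

```python
def policy_utility(policy, state_utility, transition, start, horizon):
    """Score a policy by summing a state-based utility along its rollout.

    `transition` plays the role of the model: it maps (state, action) to the
    next state. A deterministic model is an assumption of this sketch.
    """
    state = start
    total = 0.0
    for _ in range(horizon):
        state = transition(state, policy(state))
        total += state_utility(state)
    return total


# Toy world (hypothetical): states are integers, actions are +1 or -1.
def step(state, action):
    return state + action


def go_right(state):
    return 1


def go_left(state):
    return -1


def prefers_large(state):
    # State-based utility: larger integers are better.
    return float(state)
```

Here `policy_utility(go_right, prefers_large, step, 0, 3)` scores the rollout through states 1, 2, 3, so the policy is judged only through the states it produces.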

I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some sort of arbitrary behavior.

🤔 I wonder if this approach could generalize TurnTrout's approach. I'm not entirely sure how, but we might imagine that a structured utility function $u$ over policies could be decomposed into $u = r \circ f$, where $f$ is the features that the utility function pays attention to, and $r$ is the utility function expressed in terms of those features. E.g. for state-based rewards, one might take $f$ to be a model that yields the distribution of states visited by the policy, and $r$ to be the reward function for the individual states (some sort of modification would have to be made to address the fact that $f$ outputs a distribution but $r$ takes in a single state... I guess this could be handled by working in the category of vector spaces and linear transformations, but I'm not sure if that's the best approach in general; though since $\mathrm{Set}$ can be embedded into this category, it surely can't hurt too much).

Then the power-seeking situation boils down to the claim that the vast majority of policies $\pi$ lead to essentially the same features $f(\pi)$, but that there is a small set of power-seeking policies that lead to a vastly greater range of different features? And so for most $r$, a $\pi$ that optimizes/satisfices/etc. $r \circ f$ will come from this small set of power-seeking policies.
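A toy simulation can make this concrete. It is only a sketch under strong assumptions: the policy set, the feature map, and the sampling of the feature-level utility as i.i.d. uniform random numbers are all invented for illustration. Most policies "twitch" and collapse to a single feature, while a handful of optimizer-like policies can each reach a different feature.

```python
import random

N_FEATURES = 10   # number of distinct feature values (hypothetical)
N_TWITCH = 200    # number of twitching policies (hypothetical)


def feature_map(policy):
    """Non-injective feature map: every twitching policy yields feature 0."""
    mode, setting = policy
    if mode == "twitch":
        return 0
    return setting  # optimizer-like policies reach the feature they aim at


def all_policies():
    twitchers = [("twitch", s) for s in range(N_TWITCH)]
    optimizers = [("optimize", s) for s in range(N_FEATURES)]
    return twitchers + optimizers


def optimal_policies(policies, r):
    """Policies maximizing r(f(policy)), where r is a list indexed by feature."""
    best = max(r[feature_map(p)] for p in policies)
    return [p for p in policies if r[feature_map(p)] == best]


def share_won_by_optimizers(trials=1000, seed=0):
    """Fraction of random r for which every optimal policy is optimizer-like."""
    rng = random.Random(seed)
    policies = all_policies()
    wins = 0
    for _ in range(trials):
        r = [rng.random() for _ in range(N_FEATURES)]
        winners = optimal_policies(policies, r)
        if all(mode == "optimize" for mode, _ in winners):
            wins += 1
    return wins / trials
```

In this toy, twitching policies make up about 95% of the policy set, yet the optimum lies exclusively among the ten optimizer-like policies unless the sampled utility happens to rank feature 0 highest, which happens roughly one time in ten.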

I'm not sure how to formalize this. I think it won't hold for generic vector spaces, since almost all linear transformations are invertible? But it seems to me that in reality, there's a great degree of non-injectivity. The idea of "chaos inducing abstractions" seems relevant, in the sense that parameter changes in $\pi$ will mostly tend to lead to completely unpredictable/unsystematic/dissipated effects, and partly tend to lead to predictable and systematic effects. If most of the effects are unpredictable/unsystematic, then $f$ must be extremely non-injective, and this non-injectivity then generates power-seeking.

(Or does it? I guess you'd have to have some sort of interaction effect, where some parameters control the degree to which the function is injective with regard to other parameters. But that seems to hold in practice.)

I'm not sure whether I've said anything new or useful.

Corrigibility Can Be VNM-Incoherent

🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter.

But then I realized that this introduces a policy dependence to the reward function; the way you roll out from a state depends on which policy you have. (Well, in principle; in practice some MDPs may not have much dependence on it.) The special thing about state-based rewards is that you can assign utilities to trajectories without considering the policy that generates the trajectory at all. (Which to me seems bad for corrigibility, since corrigibility depends on the reasons for the trajectories, and not just the trajectories themselves.)

But now consider the following: If you have the policy, you can figure out which actions were taken, just by applying the policy to the state/history. And instrumental convergence does not apply to utility functions over action-observation histories. Therefore it doesn't apply to utility functions over (policies, observation histories). (I think?? At least if the set of policies is closed under replacing an action under a specified condition, and there are no Newcombian issues that create non-causal dependencies between policies and observation histories?)
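The reduction in the previous paragraph can be sketched as follows, assuming a deterministic policy that maps the observation history so far to an action (an assumption of this example; the function names are hypothetical). Replaying the policy recovers the actions, so any utility over action-observation histories induces a utility over (policy, observation history) pairs:

```python
def actions_from(policy, observations):
    """Replay a deterministic policy on an observation history to recover its actions."""
    actions = []
    history = []
    for obs in observations:
        history.append(obs)
        actions.append(policy(tuple(history)))
    return actions


def lift_u_aoh(u_aoh):
    """Turn a utility over action-observation histories into one over (policy, OH)."""
    def u_policy_oh(policy, observations):
        actions = actions_from(policy, observations)
        return u_aoh(list(zip(actions, observations)))
    return u_policy_oh


# Hypothetical example: a policy that alternates actions 1, 0, 1, ...
def alternating(history):
    return len(history) % 2


# Hypothetical u-AOH: counts how often action 1 was taken, ignoring observations.
def count_ones(aoh):
    return sum(action for action, _ in aoh)
```

Because `count_ones` rewards the actions themselves rather than their effects, the lifted utility `lift_u_aoh(count_ones)` inherits the same arbitrariness that makes u-AOH too broad for instrumental convergence.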

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on the AI).

We could consider u-P, utility functions over policies. This is the most general sort of utility function (I think??), and as such it is also way way too general, just like u-AOH is. I think maybe what I should try to do is define some causal/counterfactual generalizations of u-AOH, u-OH, and u-S, which allow better behaved utility functions.

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

## Appendix: tracking key limitations of the power-seeking theorems

I want to say that there's another key limitation:

Let  be a set of utility functions which is closed under permutation.

It seems like a rather central assumption to the whole approach, but in reality people seem to tend to specify "natural" utility functions in some sense (e.g. generally continuous, being functions of only a few parameters, etc.). I feel like for most forms of natural utility functions, the basic argument will still hold, but I'm not sure how far it generalizes.
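To illustrate what the closure assumption demands, here is a small sketch that represents a utility function over three outcomes as a tuple of values (a representation chosen for this example, not taken from the post). A "natural" restriction, such as utilities that are monotone in the outcome index, is not closed under permutation:

```python
from itertools import permutations


def closed_under_permutation(utilities):
    """Check that permuting the outcomes of any utility in the set stays in the set."""
    utilities = set(utilities)
    return all(p in utilities for u in utilities for p in permutations(u))


# Every assignment of the values 0, 1, 2 to three outcomes.
all_orderings = set(permutations((0, 1, 2)))

# A hypothetical "natural" class: utilities nondecreasing in the outcome index.
monotone_only = {u for u in all_orderings if list(u) == sorted(u)}
```

Here `closed_under_permutation(all_orderings)` holds while `closed_under_permutation(monotone_only)` fails, which is the kind of gap the comment is pointing at.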

As I understand it, Google's proposed model is an MoE model, and I've heard MoE models achieve poorer understanding for equivalent parameter count than classical transformer models do.

Dutch-Booking CDT: Revised Argument

So I see two possible interpretations of traditional Dutch books:

I disagree; I don't think it's a simple binary thing. It's not that Dutch book arguments can never apply to recursive things; rather, the recursion needs to be modelled in some way, and since your OP didn't do that, I ended up finding the argument confusing.

The standard Dutch book arguments would apply to the imp. Why should you be in such a different position from the imp?

I don't think your argument goes through for the imp, since it never needs to decide its action, and therefore the second part of selling the contract back never comes up?

For example, multiply the contract payoff by 0.001.

Hmm, on further reflection, I had an effect in mind which doesn't necessarily break your argument, but which increases the degree to which other counterarguments such as AlexMennen's break your argument. This effect isn't necessarily solved by multiplying the contract payoff (since decisions aren't necessarily continuous as a function of utilities), but it may under many circumstances be approximately solved by it. So maybe it doesn't matter so much, at least until AlexMennen's points are addressed so I can see where it fits in with that.
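The parenthetical about discontinuity can be illustrated with a toy argmax decision. This is only a sketch of that one point, not necessarily the effect the comment has in mind; the utilities and payoffs are made-up numbers, and the contract is assumed to pay out only on the action `"act"`:

```python
def choose(utilities):
    """Pick the action with the highest utility."""
    return max(utilities, key=utilities.get)


def with_contract(utilities, payoff):
    """Add a contract that pays `payoff` only if the agent acts."""
    adjusted = dict(utilities)
    adjusted["act"] += payoff
    return adjusted


# Hypothetical baseline: refraining is better by a margin of 0.0005.
base = {"act": 1.0, "refrain": 1.0005}
```

With these numbers, a payoff of 1.0 flips the choice to `"act"`, and so does the payoff scaled by 0.001, because 0.001 still exceeds the 0.0005 margin. Only a payoff below the margin, such as 0.0001, leaves the original choice intact, so scaling helps only approximately and only once the payoff is small relative to how close the decision already was.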

Dutch-Booking CDT: Revised Argument

This, again, seems plausible if the payoff is made sufficiently small.

How do you make the payoff small?

This is actually very similar to traditional Dutch-book arguments, which treat the bets as totally independent of everything.

Isn't your Dutch-book argument more recursive than standard ones? Your contract only pays out if you act, so the value of the Dutch book causally depends on the action you choose.

Dutch-Booking CDT: Revised Argument

So the overall expectation is .

Wouldn't it be P(Act=a|do(buy B)) rather than P(Act=a)? My thought would be that the logical thing for CDT would be to buy the contract, and then as a result its expected utilities change, which leads to its probabilities changing, and as a result it doesn't want to sell the contract. I'd think this argument only puts a bound on how much CDT and EDT can differ, rather than on whether they can differ at all. Very possible I'm missing something though.