# All of TurnTrout's Comments + Replies

Soares, Tallinn, and Yudkowsky discuss AGI cognition

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

Solve Corrigibility Week

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Addendum: One lesson to take away is that quantilization doesn't just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization's viability depends on base(plan | plan doing anything interesting) also being safe with high probability, because we could (and would) probably resample the agent until we get something interesting. In this post's terminology, A := {safe interesting things}, B := {power-seeking interesting things}, C:= A and B and {uninteresting things}.
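To make the addendum concrete, here is a minimal sketch with made-up numbers (the three plan categories and their probabilities are hypothetical, chosen only for illustration). It shows how a base distribution can look safe unconditionally and yet be unsafe once we condition on the plan doing anything interesting, which is roughly the distribution that repeated resampling draws from.

```python
from fractions import Fraction

# Hypothetical base distribution over plan categories (illustrative numbers only).
base = {
    "uninteresting": Fraction(98, 100),
    "safe_interesting": Fraction(1, 100),
    "power_seeking_interesting": Fraction(1, 100),
}

INTERESTING = {"safe_interesting", "power_seeking_interesting"}
SAFE = {"uninteresting", "safe_interesting"}


def prob(dist, event):
    """Total probability that dist assigns to the plans in `event`."""
    return sum(p for plan, p in dist.items() if plan in event)


def condition(dist, event):
    """Renormalize dist onto `event`, i.e. compute dist(plan | event)."""
    mass = prob(dist, event)
    if mass == 0:
        raise ValueError("cannot condition on a probability-zero event")
    return {plan: p / mass for plan, p in dist.items() if plan in event}


p_safe_unconditional = prob(base, SAFE)
p_safe_given_interesting = prob(condition(base, INTERESTING), SAFE)
print(p_safe_unconditional, p_safe_given_interesting)
```

In this toy setup, sampling once is almost always safe, but resampling until something interesting comes out is a coin flip.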

Ngo and Yudkowsky on alignment difficulty

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

• Well-modelled as binary "has-AGI?" predicate;
• (I am sympathetic to the microeconomics of intelligence explosion working out in a way where "Well-modelled
Corrigibility Can Be VNM-Incoherent

Although I didn't make this explicit, one problem is that manipulation is still weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post.

Note that AUP doesn't have this problem.

Corrigibility Can Be VNM-Incoherent

though since Set can be embedded into [Vect], it surely can't hurt too much

As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that's what you're saying, maybe the functor Set -> Vect is something like the "group to its group algebra over a field" functor.

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules.

I share an intuition in this area, but "powerful" behavior tendencies seem nearly equivalent to instrumental convergence to me. They feel logically downstream of instrumental convergence.

from simple rules

I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn't about utility functions over policies, though... (read more)

Corrigibility Can Be VNM-Incoherent

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on

Corrigibility Can Be VNM-Incoherent

change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

I'm not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)

1 Charlie Steiner 10d: So we have a switch with two positions, "R" and "L." When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R." If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a minute later change your mind and switch back to "R," it will have turned around and passed through the same spot in the hallway multiple times. The point is that if you try to define a utility as a function of the state for this agent, you run into an issue with cycles - if you're continuously moving "downhill", you can't get back to where you were before.
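The cycle issue in the hallway example can be sketched in code. This is only an illustration under a simplifying assumption: the state is taken to be the hallway position alone (cells 0, 1, 2 are hypothetical), and each move the agent makes is read as a strict improvement. A utility over states that increases along every observed move exists exactly when the graph of moves has no directed cycle.

```python
def has_utility_representation(improvements):
    """Return True iff some u has u(b) > u(a) for every pair (a, b).

    For a finite set of pairs this holds exactly when the directed graph
    with an edge a -> b per pair is acyclic, which we check by DFS.
    """
    graph = {}
    for a, b in improvements:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in graph}

    def visit(s):
        color[s] = GRAY
        for t in graph[s]:
            if color[t] == GRAY:  # back edge: found a cycle
                return False
            if color[t] == WHITE and not visit(t):
                return False
        color[s] = BLACK
        return True

    return all(visit(s) for s in graph if color[s] == WHITE)


# Moves observed while the switch is "R" (heading right) and "L" (heading left).
right_moves = [(0, 1), (1, 2)]
left_moves = [(2, 1), (1, 0)]

print(has_utility_representation(right_moves))
print(has_utility_representation(right_moves + left_moves))
```

With only the rightward moves a position-based utility exists; once the leftward moves are added the agent passes through the same cell in both directions, and no such utility can exist.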
Corrigibility Can Be VNM-Incoherent

If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to a terminal state, then that state was already optimal for it.

If the reward function has a single optimal terminal state, there isn't any new information being added by . But we want corrigibility to let us reflect more on our values over time and what we want the AI to do!

If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. Bu

Corrigibility Can Be VNM-Incoherent

I worded the title conservatively because I only showed that corrigibility is never nontrivially VNM-coherent in this particular MDP. Maybe there's a more general case to be proven for all MDPs, and with more realistic (non-single-timestep) reward aggregation schemes.

Ngo and Yudkowsky on alignment difficulty

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

0 Koen Holtman 7d: Update: I just recalled that Eliezer and MIRI often talk about Dutch booking [https://en.wikipedia.org/wiki/Dutch_book] when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here. When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below: To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the te... (read more)

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Could you give a toy example of this being insufficient (I'm assuming the "set copy relation" is the "B contains n of A" requirement)?

A:={(1 0 0)} B:={(0 .3 .7), (0 .7 .3)}

Less opaquely, see the technical explanation for this counterexample, where the right action leads to two trajectories, and up leads to a single one.
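A small sketch of what this counterexample does, under one assumption about terminology: a "copy" of A inside B is taken to mean the image of A under some permutation of coordinates (the helper names below are hypothetical). B has more elements than A, but no permutation carries A into B.

```python
from itertools import permutations


def permute(vec, perm):
    """Reorder the coordinates of vec according to perm."""
    return tuple(vec[i] for i in perm)


def contains_copy(A, B):
    """True iff some coordinate permutation maps every element of A into B."""
    dim = len(next(iter(A)))
    return any(
        all(permute(a, perm) in B for a in A)
        for perm in permutations(range(dim))
    )


A = {(1.0, 0.0, 0.0)}
B = {(0.0, 0.3, 0.7), (0.0, 0.7, 0.3)}

# For contrast: a set that does contain permuted copies of A's only element.
B_with_copies = {(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)}

print(len(B) > len(A), contains_copy(A, B), contains_copy(A, B_with_copies))
```

So counting elements alone does not establish the copy relation; the elements of B here are mixtures rather than relabelings of the single outcome in A.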

How does the "B contains n of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large amounts of outcomes).

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it's not greater/equal to its involution? Is this when the options in B are originally more optimal?

I don't think I understand the question. Can you rephrase?

Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?

Less-than involutions aren't guaranteed to add anything. For example, if  iff a goes le... (read more)

3 elriggs 13d: Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Right, I was intending "3. [these results] don't account for the ways in which we might practically express reward functions" to capture that limitation.

TurnTrout's shortform feed

# How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:

• Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
• Red if the parameter vector+... leads to a misaligned or dece
Discussion with Eliezer Yudkowsky on AGI interventions

I'm sympathetic to most of your points.

highly veiled contempt for anyone not doing that

I have sympathy for the "this feels somewhat contemptuous" reading, but I want to push back a bit on the "EY contemptuously calling nearly everyone fakers" angle, because I think "[thinly] veiled contempt" is an uncharitable reading. He could be simply exasperated about the state of affairs, or wishing people would change their research directions but respect them as altruists for Trying At All, or who knows what? I'd rather not overwrite his intentions with our reaction... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

I worry that "Prosaic Alignment Is Doomed" seems a bit... off as the most appropriate crux. At least for me. It seems hard for someone to justifiably know that this is true with enough confidence to not even try anymore. To have essayed or otherwise precluded all promising paths of inquiry, to not even engage with the rest of the field, to not even try to argue other researchers out of their mistaken beliefs, because it's all Hopeless.

Consider the following analogy: Someone who wants to gain muscle, but has thought a lot about nutrition and their gen... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

I just remembered (!) that I have more public writing disentangling various forms of corrigibility, and their benefits—Non-obstruction: A simple concept motivating corrigibility.

Discussion with Eliezer Yudkowsky on AGI interventions

Maybe there's been a lot of non-public work that I'm not privy to?

In Aug 2020 I gave formalizing corrigibility another shot, and got something interesting but wrong out the other end. Am planning to publish sometime, but beyond that I'm not aware of other attempts.

When I visited MIRI for a MIRI/CHAI social in 2018, I seriously suggested a break-out group in which we would figure out corrigibility (or the desirable property pointed at by corrigibility-adjacent intuitions) in two hours. I think more people should try this exact exercise more often—including myself.

2 Lawrence Chan 20d: Yeah, we've also spent a while (maybe ~5 hours total?) in various CHAI meetings (some of which you've attended) trying to figure out the various definitions of corrigibility to no avail, but those notes are obviously not public. :( That being said I don't think failing in several hours of meetings/a few unpublished attempts is that much evidence of the difficulty?
TurnTrout's shortform feed

Are there any alignment techniques which would benefit from the supervisor having a severed corpus callosum, or otherwise atypical neuroanatomy? Usually theoretical alignment discourse focuses on the supervisor's competence / intelligence. Perhaps there are other, more niche considerations.

P₂B: Plan to P₂B Better

We take “planning” to include things that are relevantly similar to this procedure, such as following a bag of heuristics that approximates it.

In theory, optimal policies could be tabularly implemented. In this case, it is impossible for them to further improve their "planning." Yet optimal policies tend to seek power and pursue convergent instrumental subgoals, such as staying alive.

So I'm not (yet) convinced that this frame is useful reductionism for better understanding subgoals. It feels somewhat unnatural to me, although I am also getting a tad ... (read more)

4 Daniel Kokotajlo 1mo: I realized we forgot to put in the footnotes! There was one footnote which was pretty important, I'll put it here because it's related to what you said. It was a footnote after the "make the planners with your goal better at planning" sub-maxim.
3 Adam Shimi 1mo: That sounds wrong. Planning as defined in this post is sufficiently broad that acting like a planner makes you a planner. So if you unwrap a structural planner into a tabular policy, the latter would improve its planning (for example by taking actions that instrumentally help it accomplish the goal we can best ascribe it using the intentional stance). Another way of framing the point IMO is that the OPs define planning in terms of computation instead of algorithm, and so planning better means facilitating or making the following part of the computation more efficient.
Emergent modularity and safety

I'm not convinced that these similarities are great enough to merit such anchoring. That NNs have more in common with brains than with SVMs does not imply that we will understand NNs in roughly the same ways that we understand biological brains. We could understand them in a different set of ways than we understand biological brains, and differently than we understand SVMs.

Rather than arguing over reference class, it seems like it would make more sense to note the specific ways in which NNs are similar to brains, and what hints those specific similarities provide.

[AN #167]: Concrete ML safety problems and their relevance to x-risk

What is the extra assumption? If you're making a coherence argument, that already specifies the domain of coherence, no? And so I'm not making any more assumptions than the original coherence argument did (whatever that argument was). I agree that the original coherence argument can fail, though.

4 Rohin Shah 1mo: I think we're just debating semantics of the word "assumption". Consider the argument:

I think we both agree this is not a valid argument, or is at least missing some details about what the AI is VNM-rational over before it becomes a valid argument. That's all I'm trying to say.

Unimportant aside on terminology: I think in colloquial English it is reasonable to say that this is "missing an assumption". I assume that you want to think of this as math. My best guess at how to turn the argument above into math would be something that looks like:

? ⟹ VNM rational over state-based outcomes

VNM rational over state-based outcomes ⟹ Convergent instrumental subgoals

This still seems like "missing assumption", since the thing filling the ? seems like an "assumption".

Maybe you're like "Well, if you start with the setup of an agent that satisfies the VNM axioms over state-based outcomes, then you really do just need VNM to conclude 'convergent instrumental subgoals', so there's no extra assumptions needed". I just don't start with such a setup; I'm always looking for arguments with the conclusion "in the real world, we have a non-trivial chance of building an agent that causes an existential catastrophe". (Maybe readers don't have the same inclination? That would surprise me, but is possible.)
[AN #167]: Concrete ML safety problems and their relevance to x-risk

My point in that post is that coherence arguments alone are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences).

Coherence arguments sometimes are enough, depending on what the agent is coherent over.

2 Rohin Shah 1mo: That's an assumption :P (And it's also not one that's obviously true, at least according to me.)
AXRP Episode 11 - Attainable Utility and Power with Alex Turner

I think it's the latter that an AI cares about when evaluating the impact of its own actions.

I'm discussing a philosophical framework for understanding low impact. I'm not prescribing how the AI actually accomplishes this.

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

I want to clarify something.

Does the notion of "low-impact" break down, though, if humans are eventually going to use the results from these experiments to build high-impact AI?

I think the notion doesn't break down. The low-impact AI hasn't changed human attainable utilities by the end of the experiments. If we eventually build a high-impact AI, that seems "on us." The low-impact AI itself hasn't done something bad to us. I therefore think the concept I spelled out still works in this situation.

As I mentioned in the other comment, I don't feel optimistic about actually designing these AIs via explicit low-impact objectives, though.

1 Charlie Steiner 2mo: It seems like evaluating human AU depends on the model. There's a "black box" sense where you can replace the human's policy with literally anything in calculating AU for different objectives, and there's a "transparent box" sense in which you have to choose from a distribution of predicted human behaviors. The former is closer to what I think you mean by "hasn't changed the humans' AU," but I think it's the latter that an AI cares about when evaluating the impact of its own actions.
AXRP Episode 11 - Attainable Utility and Power with Alex Turner

Is the notion that if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and we can then fix our mistakes and try again?

Depends. I think this is roughly true for small-scale AI deployments, where the AI makes mistakes which "aren't big deals" for most goals—instead of irreversibly smashing furniture, maybe it just navigates to a distant part of the warehouse.

I think this paradigm is less clearly feasible or desirable for high-impact TAI deployment, and I'm currently not optimistic about that use case for impact measures.

Beth Barnes's Shortform

We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output.

Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)

1 Beth Barnes 2mo: You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.
Agents Over Cartesian World Models

In contrast, humans map multiple observations onto the same internal state.

Is this supposed to say something like: "Humans can map a single observation onto different internal states, depending on their previous internal state"?

For HCH-bot, what's the motivation? If we can compute the KL, we can compute HCH(i), so why not just use HCH(i) instead? Or is this just exploring a potential approximation?

A consequential approval-maximizing agent takes the action that gets the highest approval from a human overseer. Such a

1 Ben Pace 2mo: Fixed the LaTeX.
3 Evan Hubinger 3mo: Yeah, that's exactly right: I'm interested in how an agent can do something like manage resource allocation to do the best HCH imitation in a resource-bounded setting. Yep, that's the idea. Yeah, that's fair; if you add the assumption that every trajectory has a unique utility (that is, your preferences are a total ordering), though, then I think the argument still goes through. I don't know how realistic an assumption like that is, but it seems plausible for any complex utility function over a relatively impoverished domain (e.g. a complex utility function over observations or actions would probably have this property, but a simple utility function over world states would probably not).
The alignment problem in different capability regimes

Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence: avoiding negative side effects, safe exploration...

I don't understand why you think that negative side effect avoidance belongs on that list.

A sufficiently intelligent system will probably be able to figure out when it's having negative side effects. This does not mean that it will—as a matter of fact—avoid having these side effects, and it does not mean that its NegativeSideEffect? predicate is accessible. A paperclip maximizer ... (read more)

Re the negative side effect avoidance: Yep, you're basically right, I've removed side effect avoidance from that list.

And you're right, I did mean "it will be able to" rather than "it will"; edited.

1 Adam Shimi 3mo: That was my reaction when reading the competence subsection too. I'm really confused, because that's quite basic Orthogonality Thesis, so should be quite obvious to the OP. Maybe it's a problem of how the post was written that implies some things the OP didn't mean?
Finite Factored Sets: Applications

Throughout this sequence, we have assumed finiteness fairly gratuitously. It is likely that many of the results can be extended to arbitrary finite sets.

To arbitrary factored sets?

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Thanks! I think you're right. I think I actually should have defined  differently, because writing it out, it isn't what I want. Having written out a small example, intuitively,  should hold iff , which will also induce  as we want.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent  as , which makes your insight ... (read more)

1 Edouard Harris 3mo: No problem! Glad it was helpful. I think your fix makes sense. Yeah, I figured maybe it was because the dummy variable ℓ was being used in the EV to sum over outcomes, while the vector l was being used to represent the probabilities associated with those outcomes. Because ℓ and l are similar it's easy to conflate their meanings, and if you apply ϕ to the wrong one by accident that has the same effect as applying ϕ⁻¹ to the other one. In any case though, the main result seems unaffected. Cheers!
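The remark about applying ϕ to the wrong object can be checked numerically. The sketch below uses a hypothetical 3-outcome lottery `l`, utility vector `u`, and permutation `phi` (none of these values come from the post); it only illustrates the general identity that permuting one factor of an expected-value sum by ϕ equals permuting the other factor by ϕ⁻¹.

```python
def apply_perm(vec, phi):
    """Return the vector whose i-th entry is vec[phi[i]]."""
    return [vec[phi[i]] for i in range(len(vec))]


def inverse(phi):
    """Return the inverse permutation of phi."""
    inv = [0] * len(phi)
    for i, j in enumerate(phi):
        inv[j] = i
    return inv


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


phi = [1, 2, 0]      # hypothetical permutation of outcome indices
l = [0.2, 0.3, 0.5]  # hypothetical outcome probabilities
u = [1.0, 4.0, 9.0]  # hypothetical utilities

permute_lottery = dot(apply_perm(l, phi), u)               # sum_i l[phi(i)] * u[i]
permute_utility_inv = dot(l, apply_perm(u, inverse(phi)))  # sum_j l[j] * u[phi^-1(j)]
permute_utility = dot(l, apply_perm(u, phi))               # sum_j l[j] * u[phi(j)]

print(permute_lottery, permute_utility_inv, permute_utility)
```

The first two quantities agree, while the third generally differs, which is how a mix-up between the two objects can silently swap ϕ for ϕ⁻¹.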
Environmental Structure Can Cause Instrumental Convergence

I'm not assuming that they incentivize anything. They just do! Here's the proof sketch (for the full proof, you'd subtract a constant vector from each set, but not relevant for the intuition).

You're playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can't just map one set of states to another for all-discount-rates reasoning.

1 Ofer Givoli 3mo: Thanks for the figure. I'm afraid I didn't understand it. (I assume this is a gridworld environment; what does "standing near intact vase" mean? Can the robot stand in the same cell as the intact vase?) I don't follow (To be clear, I was not trying to apply any theorem from the paper via that involution). But does this mean you are NOT making that claim ("most agents will not immediately break the vase") in the limit of the discount rate going to 1? My understanding is that the main claim in the abstract of the paper is meant to assume that setting, based on the following sentence from the paper:
Power-seeking for successive choices

For (3), environments which "almost" have the right symmetries should also "almost" obey the theorems. To give a quick, non-legible sketch of my reasoning:

For the uniform distribution over reward functions on the unit hypercube, optimality probability should be Lipschitz continuous on the available state visit distributions (in some appropriate sense). Then if the theorems are "almost" obeyed, instrumentally convergent actions still should have extremely high probability, and so most of the orbits still have to agree.
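As a rough numerical probe of this continuity intuition (an illustration, not a proof), the sketch below estimates optimality probability by Monte Carlo for a toy choice between option sets. The sets `A`, `B_exact`, and `B_perturbed` are hypothetical visit distributions invented for this example; `B_perturbed` slightly breaks the exact permutation symmetry of `B_exact`.

```python
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def optimality_probability(A, B, n_samples=20_000, seed=0):
    """Estimate P(best element of B beats best element of A)
    when the reward vector is drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    dim = len(A[0])
    wins = 0
    for _ in range(n_samples):
        reward = [rng.random() for _ in range(dim)]
        if max(dot(b, reward) for b in B) > max(dot(a, reward) for a in A):
            wins += 1
    return wins / n_samples


A = [(1.0, 0.0, 0.0)]
B_exact = [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]        # two permuted copies of A
B_perturbed = [(0.0, 0.95, 0.05), (0.0, 0.0, 1.0)]  # symmetry only "almost" holds

p_exact = optimality_probability(A, B_exact)
p_perturbed = optimality_probability(A, B_perturbed)
print(p_exact, p_perturbed)
```

In this toy case the exact value for `B_exact` is 2/3, and the small perturbation moves the estimate only slightly, consistent with the "almost symmetric implies almost the same probability" reasoning.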

So I don't curre

1 Ofer Givoli 3mo: That quote does not seem to mention the "stochastic sensitivity issue". In the post [https://www.lesswrong.com/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental#Note_of_caution__redux] that you linked to, "(3)" refers to: So I'm still not sure what you meant when you wrote "The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other [https://www.lesswrong.com/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental#Note_of_caution__redux] posts, and discussed at length in other comment threads." (Again, I'm not aware of any previous mention of the "stochastic sensitivity issue" other than in my comment here [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence?commentId=fJNEzZ8AkDi7TfCE7].)
Environmental Structure Can Cause Instrumental Convergence

Gotcha. I see where you're coming from.

I think I underspecified the scenario and claim. The claim wasn't supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.

If the agent has a choice between one action ("break vase and move forwards") or another action ("don't break vase and move forwards"), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend t... (read more)

1 Ofer Givoli 3mo: I don't see why that claim is correct either, for a similar reason. If you're assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence?commentId=pqfJmZs9cMdLXqTAw].
3 Ofer Givoli 4mo: Thanks. We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state's representation. For every reward function that "wants to preserve the vase" we can apply on it the involution and get a reward function that "wants to break the vase". (And there are the reward functions that are indifferent about the vase which the involution maps to themselves.)
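The involution described in this reply can be written down directly. The sketch below is only an illustration: the three-cell state space and the crude `vase_attitude` label (comparing total reward on intact versus broken states) are assumptions of this example, not definitions from the paper or the thread.

```python
from itertools import product

POSITIONS = (0, 1, 2)                              # hypothetical gridworld cells
STATES = tuple(product(POSITIONS, (False, True)))  # (position, vase_broken)


def flip_vase(state):
    """Switch the is-the-vase-broken bit of a state."""
    position, vase_broken = state
    return (position, not vase_broken)


def involute(reward):
    """Give each state the reward that its vase-flipped twin had."""
    return {s: reward[flip_vase(s)] for s in STATES}


def vase_attitude(reward):
    """Crude label (assumption of this sketch): compare total reward
    on intact-vase states against total reward on broken-vase states."""
    intact = sum(reward[(p, False)] for p in POSITIONS)
    broken = sum(reward[(p, True)] for p in POSITIONS)
    if intact > broken:
        return "preserve"
    if broken > intact:
        return "break"
    return "indifferent"


preserver = {s: (0.0 if s[1] else 1.0) for s in STATES}  # rewards intact vase
indifferent = {s: float(s[0]) for s in STATES}           # ignores the vase bit

print(vase_attitude(preserver), vase_attitude(involute(preserver)))
print(involute(indifferent) == indifferent)
```

Applying the map twice returns the original reward function, so it pairs each "preserve" function with a "break" function and fixes the indifferent ones. Whether that pairing settles the disagreement in the thread is a separate question about which involutions the theorems actually use.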
Power-seeking for successive choices

You're being unhelpfully pedantic. The quoted portion even includes the phrase "As a quick summary (read the paper and sequence if you want more details)"! This reads to me as an attempted pre-emption of "gotcha" comments.

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

1 Ofer Givoli 3mo: I noticed that after my previous comment [https://www.lesswrong.com/posts/5Hc4R6rj5yJ3xBhiX/power-seeking-for-successive-choices?commentId=KEgwj55FKQTgeoe9n] you've edited your comment to include the page number and the link. Thanks. I still couldn't find in the paper (top of page 9) an explanation for the "stochastic sensitivity issue". Perhaps you were referring to the following: But the issue is with stochastic MDPs, not randomly generated MDPs. Re the linked post section, I couldn't find there anything about stochastic MDPs.
0 Ofer Givoli 4mo: I haven't found an explanation about the "stochastic sensitivity issue" in the paper, can you please point me to a specific section/page/quote? All that I found about this in the paper was the sentence: (I'm also not aware of previous posts/threads that discuss this, other than my comment here [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence?commentId=fJNEzZ8AkDi7TfCE7].) I brought up this issue as a demonstration of the implications of incorrectly assuming that the theorems in the paper apply when there are more "options" available after action 1 than after action 2. (I argue that this issue shows that the informal description in the OP does not correctly describe the theorems in the paper, and it's not just a matter of omitting details.)
Environmental Structure Can Cause Instrumental Convergence

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase.

This is factually wrong BTW. I had just explained why the opposite is true.

1 Ofer Givoli 4mo: Are you saying that my first sentence ("Most of the reward functions are either indifferent about the vase or want to break the vase") is in itself factually wrong, or rather the rest of the quoted text?
Power-seeking for successive choices

The point of using scare quotes is to abstract away that part. So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

1 Ofer Givoli 4mo: I think the quoted description is not at all what the theorems in the paper show, no matter what concept the word "options" (in scare quotes) refers to. In order to apply the theorems we need to show that an involution with certain properties exists; not that <some set of things after action 1> is larger than <some set of things after action 2>. To be more specific, the concept that the word "options" refers to here is recurrent state distributions. If the quoted description was roughly correct, there would not be a problem with applying the theorems in stochastic environments. But in fact the theorems can almost never be applied in stochastic environments. For example, suppose action 1 leads to more available "options", and action 2 causes "immediate death" with probability 0.7515746, and that precise probability does not appear in any transition that follows action 1. We cannot apply the theorems because no involution with the necessary properties exists.
Environmental Structure Can Cause Instrumental Convergence

The original version of this post claimed that an MDP-independent constant C helped lower-bound the probability assigned to power-seeking reward functions under simplicity priors. This constant is not actually MDP-independent (at least, the arguments given don't show that): the proof sketch assumes that the MDP is given as input to the permutation-finding algorithm (which is the same, for every MDP you want to apply it to!). But the input's description length must also be part of the Kolmogorov complexity (else you could just compute any string for free by... (read more)

Seeking Power is Convergently Instrumental in a Broad Class of Environments

contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation)

We aren't talking about MDPs, we're talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title.

it's an unbounded integer.

I don't follow.

the program needs to contain way more than 100 bits

See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits.

But IMO the basic idea in this case is,...

Seeking Power is Convergently Instrumental in a Broad Class of Environments

The results are not centrally about the uniform distribution. The uniform distribution result is more of a consequence of the (central) orbit result / scaling law for instrumental convergence. I gesture at the uniform distribution to highlight the extreme strength of the statistical incentives.
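A minimal sketch of the orbit idea, under assumptions that are mine rather than the post's: a toy environment with a handful of terminal states, where one action keeps the first few terminal states reachable and the other action keeps only the rest. The "orbit" of a reward function is taken here to be every reassignment of its reward values to terminal states. The function name and the environment are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def orbit_fraction_preferring_more_options(reward, n_options_a):
    """Fraction of the orbit of `reward` for which the action with more options wins.

    `reward` is a tuple of distinct rewards, one per terminal state. Action A
    can reach the first `n_options_a` terminal states; action B can reach the
    remaining ones. An optimal agent picks the action whose best reachable
    terminal state has the higher reward.
    """
    if not 0 < n_options_a < len(reward):
        raise ValueError("both actions need at least one reachable terminal state")
    wins = 0
    total = 0
    for permuted in permutations(reward):
        total += 1
        best_a = max(permuted[:n_options_a])
        best_b = max(permuted[n_options_a:])
        if best_a > best_b:
            wins += 1
    return Fraction(wins, total)


# Two options versus one, then three options versus one.
two_vs_one = orbit_fraction_preferring_more_options((0.1, 0.5, 0.9), 2)
three_vs_one = orbit_fraction_preferring_more_options((0.1, 0.5, 0.9, 0.3), 3)
```

In this toy case the fraction depends only on where the largest reward lands, so it is 2/3 for two options against one and 3/4 for three against one, whatever distinct values the rewards take. Because the uniform distribution over rewards is unchanged by permuting states, the same fractions carry over to it, which is the sense in which the uniform result follows from the orbit result here.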

Environmental Structure Can Cause Instrumental Convergence

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

TurnTrout's shortform feed

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even...

Seeking Power is Often Convergently Instrumental in MDPs

I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.

For the record, the post used to contain the following section:

## A note on terminology

[AN #156]: The scaling hypothesis: a plan for building AGI

Sure.

Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.
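One hedged way to write that quantity down. All symbols are assumptions introduced for illustration: $\mathcal{X}$ is the space of single datapoints, $\mathrm{Train}_{\ell}(D)$ is the model produced by the fixed training process with pretraining objective $\ell$ on dataset $D$, and $\mu$ is a measure on $\mathcal{X}^N$, possibly weighted by ease of specification.

```latex
m(\ell) \;=\; \mu\bigl(\{\, D \in \mathcal{X}^{N} \;:\; \mathrm{Train}_{\ell}(D)\ \text{is aligned} \,\}\bigr)
```

On this reading, "some objectives make alignment far more likely" would mean $m(\ell_1) \gg m(\ell_2)$ for some pair of objectives $\ell_1, \ell_2$.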

Rohin Shah (2 points, 4mo): You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).
[AN #156]: The scaling hypothesis: a plan for building AGI

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.

And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc) to specify a set of programs than does ...

Rohin Shah (6 points, 4mo): Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that. I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention [https://www.lesswrong.com/posts/X2AD2LgtKgkRNPj2a/privileging-the-hypothesis] (and thus Claim 4 as well).
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

As I understand expanding candy into A and B but not expanding the other will make the ratios go differently.

What do you mean?

If we knew what was important and what not we would be sure about the optimality. But since we think we don't know it or might be in error about it we are treating that the value could be hiding anywhere.

I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.

A world in which the alignment problem seems lower-stakes

Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Well, maybe here's a better way of communicating what I'm after: