All of ofer's Comments + Replies

Announcing the Alignment of Complex Systems Research Group

I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):

  1. There is a large distance in model space between models that behave very differently.
  2. Takeoff will be slow.
  3. It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.

I bet more on scenarios where we get AGI when politics is very different compared to today.

I agree that just before... (read more)

Six Dimensions of Operational Adequacy in AGI Projects

Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).

autonomy: the missing AGI ingredient?

Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?

It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

Thoughts on gradient hacking

Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.

Relaxed adversarial training for inner alignment

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

3Evan R. Murphy2mo
That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between say, GPT-3 and the classic paperclip maximizer or other tiling AI. This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.
AMA Conjecture, A New Alignment Startup

What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.

Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for m... (read more)

AMA Conjecture, A New Alignment Startup

Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".

How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?

Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?

To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.

We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties ab... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:

Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?

5Rohin Shah3mo
Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard. It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better. Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system. For failures of alignment, I'd expect that before you get an AI system that can break out of the training process and kill you, you get an AI system that can break out of deployment and kill you, because there's (probably) less monitoring during deployment. You're also just running much longer during deployment -- if an AI system is waiting for the right opportunity, then even if it is equally likely to happen for a training vs deployment input (i.e. ignoring the greater monitoring during training), you'd still expect to see it happen at deployment since >99% of the inputs happen at deployment.
Thoughts on gradient hacking

The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In which case each piece of logic "ruins" a different critical activation value).
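
A minimal sketch of that setup (hypothetical numpy code; the activation index, threshold, and corruption values are made up for illustration):

```python
import numpy as np

THRESHOLD = 0.7  # hypothetical trigger level for the conditionally-fail-on-purpose logic

def failure_logic_a(activations):
    """First copy: if the shared activation exceeds the threshold, ruin critical value 0."""
    corruption = np.zeros(3)
    if activations[5] > THRESHOLD:
        corruption[0] = 1e6  # garbage that wrecks the downstream computation
    return corruption

def failure_logic_b(activations):
    """Second, redundant copy: reads the *same* activation but ruins critical value 1."""
    corruption = np.zeros(3)
    if activations[5] > THRESHOLD:
        corruption[1] = 1e6
    return corruption

acts = np.random.rand(16)
# Both pieces fire together or not at all, since they read the same input activation.
total_corruption = failure_logic_a(acts) + failure_logic_b(acts)
```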

2Richard Ngo4d
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
1Not Relevant2mo
I confess I’m confused as to how the network gradient hacks to create these redundant mechanisms in the first place? Since one of them seems completely useless unless designed deliberately.
Formalizing Objections against Surrogate Goals

Regarding the following part of the view that you commented on:

But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI.

Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a... (read more)

Obstacles to gradient hacking

To make sure I understand your notation, is some set of weights, right? If it's a set of multiple weights I don't know what you mean when you write .

There should also exist at least some f_1, f_2 where C(f_1, f_1) ≠ C(f_2, f_2), since otherwise C no longer depends on the pair of redundant networks at all

(I don't yet understand the purpose of this claim, but it seems to me wrong. If C(f_1, f_1) = C(f_2, f_2) for every f_1, f_2, why is it true that C does not depend on f_1 and f_2 when f_1 ≠ f_2?)

Obstacles to gradient hacking

But gradient descent doesn’t modify a neural network one weight at a time

Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.
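
A toy PyTorch check of this point (my example, not from the thread): at a point where perturbing one weight alone leaves the loss unchanged, that weight's gradient component comes out as zero.

```python
import torch

# Hypothetical toy loss that only depends on the smaller of two weights.
w1 = torch.tensor(0.5, requires_grad=True)  # currently the smaller weight
w2 = torch.tensor(2.0, requires_grad=True)  # perturbing w2 alone leaves the loss unchanged

loss = (torch.minimum(w1, w2) - 1.0) ** 2
loss.backward()

print(w1.grad)  # tensor(-1.): the loss is locally sensitive to w1
print(w2.grad)  # tensor(0.): zero gradient component, as described above
```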

Obstacles to gradient hacking

This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example.

I wouldn't say that this work is attempting to "exhibit gradient hacking". (Succeeding in that would require creating a model that can actually model SGD.) Rather, my understanding is that this work is trying to demonstrate techniques that might be used in a gradient hacking scenario.

There are a few ways to protect a subnetwork from being modified by gradient descent that I can think

... (read more)
2StellaAthena10mo
You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.
1leogao10mo
The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights (f, g) completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking. I don't think redundancy will work. Suppose you have some continuous everywhere, differentiable countably-almost everywhere combining function y = C(f_1, f_2) that takes the outputs from two redundant copies of f and outputs some kind of combined output. (If you're allowed functions that don't meet the "continuous everywhere, differentiable countably-almost everywhere" requirement, you might as well just skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients to f_1 and f_2 when they are equal, it must be that at all points where f_1 = f_2, ∂y/∂f_1 = ∂y/∂f_2 = 0. There should also exist at least some f_1, f_2 where C(f_1, f_1) ≠ C(f_2, f_2), since otherwise C no longer depends on the pair of redundant networks at all, which means that those networks can't actually affect what the network does, which defeats the whole point of this in the first place. Let us then define g(x) = C(x, x). Then dg/dx = ∂y/∂f_1 + ∂y/∂f_2 = 0 for all x. This implies that g is a constant function. Therefore, there do not exist f_1, f_2 where C(f_1, f_1) ≠ C(f_2, f_2). This is a contradiction, and therefore C cannot exist.
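
Restating the core of that argument as a worked derivation (my transcription of the comment above, same symbols):

$$\left.\frac{\partial y}{\partial f_1}\right|_{f_1 = f_2} = \left.\frac{\partial y}{\partial f_2}\right|_{f_1 = f_2} = 0, \qquad g(x) := C(x, x)$$

$$\frac{dg}{dx} = \left.\frac{\partial C}{\partial f_1}\right|_{(x,x)} + \left.\frac{\partial C}{\partial f_2}\right|_{(x,x)} = 0 \quad \text{for all } x,$$

so $g$ is constant and $C(f_1, f_1) = C(f_2, f_2)$ for all $f_1, f_2$, contradicting the requirement that they differ for some pair.
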
Formalizing Objections against Surrogate Goals

In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so that (a) any time one of them is shot with a Nerf gun, they will flop over and pretend to be a corpse, until the heist has been resolved, and (b) their probability of resisting against Nerf guns will be the same as the probability of resisting against actual guns. In this case the bandits might as well use Nerf guns (perhaps be

... (read more)
4Rohin Shah10mo
Yeah, this seems right. I'll note though that you may want to make it at least somewhat easier to make the new threat, so that the other player has an incentive to use the new threat rather than the old threat, in cases where they would have used the old threat (rather than being indifferent between the two). This does mean it is no longer a Pareto improvement, but it still seems like this sort of unilateral commitment can significantly help in expectation.
2Vojtech Kovarik10mo
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid). In the bandit case, suppose the caravan credibly commits to treating Nerf guns identically to regular guns. And suppose this incentivizes the bandits to avoid regular guns. Then you are incentivized to self-modify to start resisting more. (EG, if you both use CDT and the "logical time" is "self modify?" --> "credibly commit?" --> "use nerf?".) However, if the bandits realize this --- i.e., if the "logical time" is "use nerf?" --> "self modify?" --> "credibly commit?" --- then the bandits will want to not use nerf guns, forcing you to not self-modify. And if you each think that you are "logically before" the other party, you will make incompatible commitments (use regular guns & self-modify to resist) and people get shot with regular guns. So, I agree that credible unilateral commitments can be useful and they can lead to guaranteed Pareto improvements. It's just that I don't think it addresses my main objection against the proposal. Yup, I fully agree.
Thoughts on gradient hacking

But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:

  1. The agent’s commitment to carrying out gradient hacking is reduced.
  2. The agent’s ability to notice changes implemented by gradient descent is reduced.

In a gradient hacking scenario, we should expect the malicious conditionally-fail-on-purpose logic to be optimized for such outcomes not to occur. For example, the malicious logic may ... (read more)

2Richard Ngo9mo
What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.
2[comment deleted]9mo
Power-seeking for successive choices

That quote does not seem to mention the "stochastic sensitivity issue". In the post that you linked to, "(3)" refers to:

  1. Not all environments have the right symmetries
    • But most ones we think about seem to

So I'm still not sure what you meant when you wrote "The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads."

(Again, I'm not aware of any previous mention of the "stochastic sensitivity issue" other than in my comment here.)

Environmental Structure Can Cause Instrumental Convergence

Thanks for the figure. I'm afraid I didn't understand it. (I assume this is a gridworld environment; what does "standing near intact vase" mean? Can the robot stand in the same cell as the intact vase?)

You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.

I don't follow (To be clear, I was not trying to apply any theorem from the paper via that involution). But does this mean you are NOT making that claim ("most agents wil... (read more)

Environmental Structure Can Cause Instrumental Convergence

The claim should be: most agents will not immediately break the vase.

I don't see why that claim is correct either, for a similar reason. If you're assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment.

2Alex Turner10mo
I'm not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but not relevant for the intuition). You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.
Power-seeking for successive choices

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

I noticed that after my previous comment you've edited your comment to include the page number and the link. Thanks.

I still couldn't find in the paper (top of page 9) an explanation for the "stochastic sensitivity issue". Perhaps you were referring to the following:

randomly generat

... (read more)
2Alex Turner10mo
Power-seeking for successive choices

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

Also, this claim is missing the "disjoint requirement" and so it is incorrect even without the "they show that" part (i.e. it's not just that the theorem... (read more)

Environmental Structure Can Cause Instrumental Convergence

Thanks.

We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state's representation. For every reward function that "wants to preserve the vase" we can apply the involution to it and get a reward function that "wants to break the vase".

(And there are the reward functions that are indifferent about the vase which the involution map to themselves.)
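
A minimal code sketch of the involution I'm describing, on a toy two-location gridworld (the state encoding is hypothetical, chosen just to make the bit-flip explicit):

```python
# States are (location, vase_is_broken) pairs; a reward function is a dict over states.
locations = ["left", "right"]
states = [(loc, broken) for loc in locations for broken in (False, True)]

def flip(state):
    loc, broken = state
    return (loc, not broken)

def involute(reward):
    """phi(R)(s) = R(flip(s)): swap how the reward treats broken vs. intact vases."""
    return {s: reward[flip(s)] for s in states}

# A reward function that "wants to preserve the vase":
preserve = {s: (1.0 if not s[1] else 0.0) for s in states}
smash = involute(preserve)           # now rewards exactly the broken-vase states
assert involute(smash) == preserve   # applying phi twice is the identity: an involution

# Reward functions indifferent to the vase are fixed points of phi:
indifferent = {s: 0.5 for s in states}
assert involute(indifferent) == indifferent
```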

2Alex Turner10mo
Gotcha. I see where you're coming from. I think I underspecified the scenario and claim. The claim wasn't supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase. If the agent has a choice between one action ("break vase and move forwards") or another action ("don't break vase and move forwards"), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states. So I think we're actually both making a correct point, but you're making an argument for γ = 1 under certain kinds of models and whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates. [Edited to reflect the ancestor comments]
Power-seeking for successive choices

The phenomena you discuss are explained in the paper, and in other posts, and discussed at length in other comment threads.

I haven't found an explanation about the "stochastic sensitivity issue" in the paper, can you please point me to a specific section/page/quote? All that I found about this in the paper was the sentence:

Our theorems apply to stochastic environments, but we present a deterministic case study for clarity.

(I'm also not aware of previous posts/threads that discuss this, other than my comment here.)

I brought up this issue as a demons... (read more)

Environmental Structure Can Cause Instrumental Convergence

Are you saying that my first sentence ("Most of the reward functions are either indifferent about the vase or want to break the vase") is in itself factually wrong, or rather the rest of the quoted text?

2Alex Turner10mo
The first sentence
Power-seeking for successive choices

So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

I think the quoted description is not at all what the theorems in the paper show, no matter what concept the word "options" (in scare quotes) refers to. In order to apply the theorems we need to show that an involution with certain properties exist; not that <some set of things after action 1> is larger than <some set of things after action 2>.

To be more specific, the concept that the word "options" refers to here is ... (read more)

5Alex Turner10mo
You're being unhelpfully pedantic. The quoted portion even includes the phrase "As a quick summary (read the paper [https://arxiv.org/abs/1912.01683] and sequence [https://www.alignmentforum.org/s/fSMbebQyR4wheRrvk?_ga=2.57865513.2011457156.1628785968-1006623555.1586831181] if you want more details)"! This reads to me as an attempted pre-emption of "gotcha" comments. The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other [https://www.lesswrong.com/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental#Note_of_caution__redux] posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.
Power-seeking for successive choices

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

That is not what the theorems in the paper show at all (it's not just a matter of details). The relevant theorems require a much stronger and more com... (read more)

0Ofer Givoli10mo
Also, this claim is missing the "disjoint requirement" and so it is incorrect even without the "they show that" part (i.e. it's not just that the theorems in the paper don't show the thing that is being claimed, but rather the thing that is being claimed is incorrect). Consider the following example where action 1 leads to more "options" but most optimal policies choose action 2:
2Alex Turner10mo
The point of using scare quotes is to abstract away that part. So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.
Automating Auditing: An ambitious concrete technical research proposal

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

What would such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current M... (read more)

2Evan Hubinger10mo
Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.
Seeking Power is Convergently Instrumental in a Broad Class of Environments

I still don't see how this works. The "small constant" here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it's not a constant; it's an unbounded integer.

Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward func... (read more)

5Rohin Shah10mo
Yes, good point. I retract the original claim. (You're right, what you do here is you search for the kth permutation satisfying the theorem's requirements, where k is the specified number.)
2Alex Turner10mo
We aren't talking about MDPs, we're talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title. I don't follow. See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits. But IMO the basic idea in this case is, you can construct reasonably simple utility functions like "utility 1 if the history has the agent taking action a_i at time step t given action-observation history prefix h_{1:t−1}, and 0 otherwise." This is reasonably short, and you can apply it for all actions and time steps. Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively - but not qualitatively - affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]
Seeking Power is Convergently Instrumental in a Broad Class of Environments

They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.

I don'... (read more)

2Rohin Shah10mo
You're right, in my previous comment it should be "at most a small constant more complexity + O(log N)", to specify the number of times that the permutation has to be applied. I still think the upshot is similar; in this case it's "power-seeking is at best a little less probable than non-power-seeking, but you should really not expect that the bound is tight".
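
A rough sketch of the bound under discussion, as I read this exchange (not a statement from the paper): if $\varphi_k$ denotes the $k$-th permutation (in some fixed enumeration) satisfying the theorem's requirements for the given MDP, then

$$K(R \circ \varphi_k) \;\le\; K(R) + c_M + O(\log k), \qquad k \le N,$$

where $c_M$ is the length of a program that simulates the MDP and enumerates its qualifying permutations. The $O(\log k)$ term is the $O(\log N)$ above; whether $c_M$ is genuinely small for a given environment is the point under dispute in the adjacent thread.
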
rohinmshah's Shortform

The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platform seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.

2Rohin Shah10mo
I don't trust [https://www.lesswrong.com/posts/TmHRACaxXrLbXb5tS/rohinmshah-s-shortform?commentId=xiwm2vXSpK8Kovgp8] this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen [https://www.lesswrong.com/posts/zvWqPmQasssaAWkrj/an-159-building-agents-that-know-how-to-experiment-by#RECOMMENDER_SYSTEMS_] does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).
A world in which the alignment problem seems lower-stakes

I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").

A world in which the alignment problem seems lower-stakes

Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment

This post uses the phrase "Bostrom's original instrumental convergence thesis". I'm not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper... (read more)

1Charlie Steiner1y
Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008 [https://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/].
2Alex Turner1y
Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.) (Edited post to clarify)
Environmental Structure Can Cause Instrumental Convergence

Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can... (read more)

2Alex Turner10mo
This is factually wrong BTW. I had just explained why the opposite is true.
Environmental Structure Can Cause Instrumental Convergence

That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it.

Probably not an important point, but I don't see why we can't use the identity isomorphism (over the part of the state space that a1 leads to).

2Rohin Shah1y
This particular argument is not talking about farsightedness -- when we talk about having more options, each option is talking about the entire journey and exact timesteps, rather than just the destination. Since all the "journeys" starting with the S --> Z action go to Z first, and all the "journeys" starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that ϕ(T(S, a1)) = T(S, a2). (What assumption does this correspond to in the theorem? In the theorem, the involution has to map F_a to a subset of F_{a'}; every possibility in F_{a1} starts with A, and every possibility in F_{a2} starts with Z, so you need to map A to Z.)
Environmental Structure Can Cause Instrumental Convergence

I was referring to the claim being made in Rohin's summary. (I no longer see counter examples after adding the assumption that "a1 and a2 lead to disjoint sets of future options".)

Environmental Structure Can Cause Instrumental Convergence

(we’re going to ignore cases where a1 or a2 is a self-loop)

I think that a more general class of things should be ignored here. For example, if a2 is part of a 2-cycle, we get the same problem as when a2 is a self-loop. Namely, we can get that most reward functions have optimal policies that take the action a1 over a2 (when the discount rate is sufficiently close to 1), which contradicts the claim being made.

2Rohin Shah1y
Thanks for the correction. That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn't need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the "no cycles" caveat. The counterexample is: Z <--> S --> A --> B, with every state also having a self-loop. In this case, the involution {S:B, Z:A} would suggest that S --> Z would have more options than S --> A, but optimal policies will take S --> A more often than S --> Z. (The theorem effectively says "no cycles" by conditioning on the policy being S --> Z or S --> A, in which case the S --> Z --> S --> S --> ... possibility is not actually possible, and the involution doesn't actually go through.) EDIT: I've changed to say that the actions lead to disjoint parts of the state space.
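
A quick numerical check of this counterexample (a sketch I wrote; γ = 0.99 and uniform IID state rewards are my choices, not from the thread):

```python
import numpy as np

# States: 0=Z, 1=S, 2=A, 3=B; every state has a self-loop (the counterexample above).
successors = {
    0: [0, 1],     # Z: stay, or go to S
    1: [1, 0, 2],  # S: stay, go to Z, or go to A
    2: [2, 3],     # A: stay, or go to B
    3: [3],        # B: stay (terminal self-loop)
}

def optimal_successor_of_S(reward, gamma=0.99, iters=2000):
    """Value iteration on state rewards; return the successor an optimal policy picks at S."""
    V = np.zeros(4)
    for _ in range(iters):
        V = np.array([reward[s] + gamma * max(V[s2] for s2 in successors[s])
                      for s in range(4)])
    return max(successors[1], key=lambda s2: V[s2])

rng = np.random.default_rng(0)
counts = {0: 0, 1: 0, 2: 0}  # chosen successor at S: Z, S itself, or A
for _ in range(1000):
    counts[optimal_successor_of_S(rng.random(4))] += 1

print(counts)  # roughly {0: ~250, 1: ~250, 2: ~500}
```

With these settings, S --> A is optimal whenever the best self-loop reward sits at A or B (about half the sampled reward functions), versus about a quarter for S --> Z, matching the claim above.
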
2Alex Turner1y
I don't understand what you mean. Nothing contradicts the claim, if the claim is made properly, because the claim is a theorem and always holds when its preconditions do. (EDIT: I think you meant Rohin's claim in the summary?) I'd say that we can just remove the quoted portion and just explain "a1 and a2 lead to disjoint sets of future options", which automatically rules out the self-loop case. (But maybe this is what you meant, ofer?)
Discussion: Objective Robustness and Inner Alignment Terminology

Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?

Environmental Structure Can Cause Instrumental Convergence

Optimal policies will tend to avoid breaking the vase, even though some don't. 

Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?

This is just making my point - Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then optimal policies tend to end up in D1 instead of D2. Most optimal policies will avoi

... (read more)
2Alex Turner1y
Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives. Right, good question. I'll explain the general principle (not stated in the paper - yes, I agree this needs to be fixed!), and then answer your question about your environment. When the agent maximizes average reward, we know that optimal policies tend to seek power when there's something like: "Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1." (This follows by combining proposition 6.12 and theorem 6.13.) From start, let a1 go to candy and let a2 go to wait. This satisfies the criterion: Wait! tends to be optimal and it tends to have more POWER, as expected. Let's reconsider your example: Let the discount rate equal 1. If action a1 goes to the right and a2 stays put at state 1, then the cycle sets are not disjoint (that's the violated condition). Prop 6.12 correctly tells us that state 1 tends to have more POWER than state 2, but theorem 6.13 tells us that average-optimal policies tend to go to the right. This highlights how power-seeking and optimality probability can come apart. Again, I very much agree that this part needs more explanation. Currently, the main paper has this to say: Throughout the paper, I focused on the survival case because it automatically satisfies the above criterion (death is definitionally disjoint from non-death, since we assume you can't do other things while dead), without my having to use limited page space explaining the nuances of this criterion. Yes, although SafeLife requires a bit of squinting (as I noted in the main pos
Environmental Structure Can Cause Instrumental Convergence

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)

I think this refers to the following passage from the paper:

Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options.

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

Regarding your next bullet point:

  • Pac-Man (fig
... (read more)
7Alex Turner1y
There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don't. This is just making my point - average-optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then average-optimal policies tend to end up in D1 instead of D2. Most average-optimal policies will avoid entering the final state, just as section 7 claims. (EDIT: Blackwell -> average-) (And I claim that the whole reason you're able to reason about this environment is because my theorems apply to them - you're implicitly using my formalisms and frames to reason about this environment, while seemingly trying to argue that my theorems don't let us reason about this environment? Or something? I'm not sure, so take this impression with a grain of salt.) Why is it interesting to prove things about this set of MDPs? At this point, it feels like someone asking me "why did you buy a hammer - that seemed like a waste of money?". Maybe before I try out the hammer, I could have long debates about whether it was a good purchase. But now I know the tool is useful because I regularly use it and it works well for me, and other people have tried it and say it works well for them [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence?commentId=jyzYeStnG9hHMhxBW] . I agree that there's room for cleaner explanation of when the theorems apply, for those readers who don't want to memorize the formal conditions. But I think the theory says interesting things because it's already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I'm almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is "interest
Environmental Structure Can Cause Instrumental Convergence

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?).

I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim.

Maybe I just missed ... (read more)

I haven't seen the paper support that claim.

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)
  • Pac-Man (figure 8)
    • And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
  • Average-optimal robots not idling in a particular spot (beginning of section 7)

This post supports the claim with:

  • Tic-Tac-Toe
  • Vase gridworld
  • SafeLife

So yes, this is sufficient support for speculation that most relevant environments have these symmetries. 

Maybe I just missed it, but I

... (read more)
Environmental Structure Can Cause Instrumental Convergence

No worries, thanks for the clarification.

[EDIT: the confusion may have resulted from me mentioning the LW username "adamShimi", which I'll now change to the display name on the AF ("Adam Shimi").]

Environmental Structure Can Cause Instrumental Convergence

Meta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I'm going to try to move my comment back to AF, because I think it obviously belongs there (I don't believe we have any norms about this sort of situations...). If the removal was done by a forum moderator/admin, please let me know.

[This comment is no longer endorsed by its author]
2Alex Turner1y
My apologies - I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I "undid" it (for both mine and yours). I hadn't realized it was already on AF.
Environmental Structure Can Cause Instrumental Convergence

I've ended up spending probably more than 40 hours discussing, thinking and reading this paper (including earlier versions; the paper was first published on December 2019, and the current version is the 7th, published on June 1st, 2021). My impression is very different than Adam Shimi's. The paper introduces many complicated definitions that build on each other, and its theorems say complicated things using those complicated definitions. I don't think the paper explains how its complicated theorems are useful/meaningful.

In particular, I don't think the pap... (read more)

5Adam Shimi1y
Despite disagreeing with you, I'm glad that you published this comment and I agree that airing disagreements is really important for the research community. There's a sense in which I agree with you: AFAIK, there is no formal statement of the set of MDPs with the structural properties that Alex studies here. That doesn't mean it isn't relatively easy to state: * Proposition 6.9 [https://arxiv.org/pdf/1912.01683.pdf] requires that there is a state with two actions a1 and a2 such that (let's say) a1 leads to a subMDP that can be injected/strictly injected into the subMDP that a2 leads to. * Theorems 6.12 and 6.13 [https://arxiv.org/pdf/1912.01683.pdf] require that there is a state with two actions a1 and a2 such that (let's say) a1 leads to a set of RSDs (final cycles that are strictly optimal for some reward function) that can be injected/strictly injected into the set of RSDs from a2. The first set of MDPs is quite restrictive (because you need an exact injection), which is why IIRC Alex extends the results to the sets of RSDs, which captures a far larger class of MDPs. Intuitively, this is the class of MDPs such that some action leads to more infinite horizon behaviors than another for the same state. I personally find this class quite intuitive, and also I feel like it captures many real world situations where we worry about power and instrumental convergence. Once again, I agree in part with the statement that the paper doesn't IIRC explicitly discuss different convergent instrumental goals. On the other hand, the paper explicitly says that it focuses on a special case of the instrumental convergence thesis. That being said, you just made me want to look more into how well power-seeking captures different convergent instrumental goals from Omohundro's paper [https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf], so thanks for that. :)
1Ofer Givoli1y
Meta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I'm going to try to move my comment back to AF, because I think it obviously belongs there (I don't believe we have any norms about this sort of situations...). If the removal was done by a forum moderator/admin, please let me know.
8Alex Turner1y
For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paperv.7?). I don't think it will be useful for me to engage in detail, given that we've already extensively debated these points at length, without much consensus being reached.
Formal Inner Alignment, Prospectus

Brainstorming

The following is a naive attempt to write a formal, sufficient condition for a search process to be "not safe with respect to inner alignment".

Definitions:

: a distribution of labeled examples. Abuse of notation: I'll assume that we can deterministically sample a sequence of examples from .

: a deterministic supervised learning algorithm that outputs an ML model. has access to an infinite sequence of training examples that is provided as input; and it uses a certain "amount of compute" that is also provided as input. If we operationalize... (read more)

MDP models are determined by the agent architecture and the environmental dynamics

Not from the paper. I just wrote it.

Consider adding to the paper a high-level/simplified description of the environments for which the following sentence from the abstract applies: "We prove that for most prior beliefs one might have about the agent’s reward function [...] one should expect optimal policies to seek power in these environments." (If it's the set of environments in which "the “vast majority” of RSDs are only reachable by following a subset of policies" consider clarifying that in the paper). It's hard (at least for me) to infer that from ... (read more)

MDP models are determined by the agent architecture and the environmental dynamics

see also: "optimal policies tend to take actions which strictly preserve optionality*"

Does this quote refer to a passage from the paper? (I didn't find it.)

It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.

There are very few reward functions that rely on action-history—that can be specified in a simple way—relative to all the reward functions that r... (read more)

3Alex Turner1y
Not from the paper. I just wrote it. It isn't the size of the object that matters here, the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment. Sure. Here's what I said: The broader claim I was trying to make was not "it's hard to write down any state-based reward functions that don't incentivize power-seeking", it was that there are fewer qualitatively distinct ways to do it in the state-based case. In particular, it's hard to write down state-based reward functions which incentivize any given sequence of actions: If you disagree, then try writing down a state-based reward function for e.g. Pacman for which an optimal policy starts off by (EDIT: circling the level counterclockwise) (at a discount rate close to 1). Such reward functions provably exist, but they seem harder to specify in general. Also: thanks for your engagement, but I still feel like my points aren't landing (which isn't necessarily your fault or anything), and I don't want to put more time into this right now. Of course, you can still reply, but just know I might not reply and that won't be anything personal. EDIT: FYI I find your action-camera example interesting. Thank you for pointing that out.
MDP models are determined by the agent architecture and the environmental dynamics

The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7.

It seems to me that the (implicit) description in the paper of the set of environments over which "one should expect optimal policies to seek power" ("for most prior beliefs one might have about the agent’s reward function") involves a lot of formalism/math. I was looking for some high-level/s... (read more)

2Alex Turner1y
Ah, I see. In addition to the cited explanation, see also: "optimal policies tend to take actions which strictly preserve optionality*", where the optionality preservation is rather strict (requiring a graphical similarity, and not just "there are more options this way than that"; ironically, this situation is considerably simpler in arbitrary deterministic computable environments, but that will be the topic of a future post). No - the sufficient condition is about the environment, and instrumental convergence is about policies over that environment. I interpret instrumental convergence as "intelligent goal-directed agents tend to take certain kinds of actions"; this informal claim is necessarily vague. This is a formal sufficient condition which allows us to conclude that optimal goal-directed agents will tend to take a certain action in the given situation. It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state. What's special is that (by assumption) the action logger always logs the agent's actions, even if the agent has been literally blown up in-universe. That wouldn't occur with the security camera. With the security camera, once the agent is dead, the agent can no longer influence the trajectory, and the normal death-avoiding arguments apply. But your action logger supernaturally writes a log of the agent's actions into the environment. Right, but if you want the optimal policies to take actions a_1, …, a_k, then write a reward function which returns 1 iff the action-logger begins with those actions and 0 otherwise. Therefore, it's extremely easy to incentivize arbitrary action sequences.
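
A minimal sketch of the kind of reward function described in that last point (hypothetical Python, with actions represented as strings):

```python
from typing import Sequence

def make_reward(target_prefix: Sequence[str]):
    """Return a reward function: 1 iff the logged action history starts with target_prefix."""
    def reward(action_log: Sequence[str]) -> float:
        return 1.0 if tuple(action_log[:len(target_prefix)]) == tuple(target_prefix) else 0.0
    return reward

r = make_reward(["up", "up", "left"])
print(r(["up", "up", "left", "down"]))  # 1.0: the prefix matches, so this trajectory is optimal
print(r(["up", "down"]))                # 0.0
```
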
MDP models are determined by the agent architecture and the environmental dynamics

I'll address everything in your comment, but first I want to zoom out and say/ask:

  1. In environments that have a state graph that is a tree-with-constant-branching-factor, the POWER—defined over IID-over-states reward distribution—is equal in all states. I argue that environments with very complex physical dynamics are often like that, but not if at some time step the agent can't influence the environment. (I think we agree so far?) I further argue that we can take any MDP environment and "unroll" its state graph into a tree-with-constant-branching-factor (e.
... (read more)
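
A minimal sketch of the "unrolling" construction described in point 1 above (toy Python; the two-action environment is hypothetical):

```python
def unroll(transition, start, depth, num_actions=2):
    """
    Unroll an MDP's state graph into a tree whose nodes are action histories.
    Every node has exactly `num_actions` children, so the branching factor is constant.
    Returns a dict mapping each action-history tuple to the underlying MDP state.
    """
    tree = {(): start}
    frontier = [()]
    for _ in range(depth):
        next_frontier = []
        for hist in frontier:
            for a in range(num_actions):
                child = hist + (a,)
                tree[child] = transition(tree[hist], a)
                next_frontier.append(child)
        frontier = next_frontier
    return tree

# Toy 2-state environment: action 0 stays put, action 1 toggles the state.
toggle = lambda s, a: s if a == 0 else 1 - s
print(len(unroll(toggle, start=0, depth=3)))  # 1 + 2 + 4 + 8 = 15 action-history nodes
```
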
2Alex Turner1y
Thanks for taking the time to write this out. I'm sorry - although I think I mentioned it in passing, I did not draw sufficient attention to the fact that I've been talking about a drastically broadened version of the paper, compared to what was on arxiv when you read it. The new version should be up in a few days. I feel really bad about this - especially since you took such care in reading the arxiv version! The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7. (I shared the Overleaf with Ofer; if other lesswrong readers want to read without waiting for arxiv to update, message me! ETA: The updated version is now on arxiv [https://arxiv.org/abs/1912.01683].) I agree that you can do that. I also think that instrumental convergence doesn't apply in such MDPs (as in, "most" goals over the environment won't incentivize any particular kind of optimal action), unless you restrict to certain kinds of reward functions. Fix a reward function distribution D_M^iid in the original MDP M. For simplicity, let's suppose D_M^iid is max-ent (and thus IID). Let's suppose we agree that optimal policies under D_M^iid tend to avoid getting shut off. Translated to the rolled-out MDP M′, D_M^iid no longer distributes reward uniformly over states. In fact, in its support, each reward function has the rather unusual property that its reward is only dependent on the current state, and not on the action log's contents. When translated into M′, D_M^iid imposes heavy structural assumptions on its reward functions, and it's not max-ent over the states of M′. By the "functional equivalence", it still gives you the same optimality probabilities as before, and so it still tends to incentivize shutdown avoidance. However, if you take a max-ent over the rolled-out states of M′, then this max-ent won't ince
MDP models are determined by the agent architecture and the environmental dynamics

So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive?

I was talking about a particular example, with a particular reward function that I had in mind. We seemed to disagree about whether instrumental convergence arguments apply there, and my purpose in that comment was to argue that they do. I'm not trying to define here the set of reward f... (read more)

2Alex Turner1y
ETA: I agree with this point in the main - they don't apply to all reward functions. But, we should be able to ground the instrumental convergence arguments via reward functions in some way. Edited out because I read through that part of your comment a little too fast, and replied to something you didn't say. What does it mean to "shut down" the process? 'Doesn't mean they won't' - so new strings will appear in the environment? Then how was the agent "shut down"? What is it instead? We're considering description length? Now it's not clear that my theory disagrees with your prediction, then. If you say we have a simplicity prior over reward functions given some encoding, well, POWER and optimality probability now reflect your claims, and they now say there is instrumental convergence to the extent that that exists under a simplicity prior? (I still don't think they would exist; and my theory shows that in the space of all possible reward function distributions, equal proportions incentivize action A over action B as vice versa - we aren't just talking about uniform - and so the onus is on you to provide the sense in which instrumental convergence exists here.) And to the extent we were always considering description length - was the problem that IID-optimality probability doesn't reflect simplicity-weighted behavioral tendencies? I still don't know what it would mean for Ofer-instrumental convergence to exist in this environment, or not.
MDP models are determined by the agent architecture and the environmental dynamics

So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another?

(Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…)

Consider a string, written by the chatbot, that "hacks" the customer and causes them to invoke a process that quickly takes control over most of the computers on earth that are connected to the internet, then "hacks" most humans on earth by showing them certain content, and so on (to prevent interference and to seize control ASAP); ... (read more)

2Alex Turner1y
To clarify: when I say that taking over the world is "instrumentally convergent", I mean that most objectives incentivize it. If you mean something else, please tell me. (I'm starting to think there must be a serious miscommunication somewhere if we're still disagreeing about this?) So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive? This doesn't make sense to me. We assumed the agent is Cartesian-separated from the universe, and its actions magically make strings appear somewhere in the world. How could humans interfere with it? What, concretely, are the "risks" faced by the agent? (Technically, the agent's goals are defined over the text-state, and you can assign high reward to text-states in which people bought stuff. But the agent doesn't actually have goals over the physical world as we generally imagine them specified.) This statement is vacuous, because it's true about any possible string. ---- The original argument given for instrumental convergence and power-seeking is that gaining resources tends to be helpful for most objectives (this argument isn't valid in general, but set that aside for now). But even that's not true here. The problem is that the 'text-string-world' model is framed in a leading way, which is suggestive of the usual power-seeking setting (it's representing the real world and it's complicated, there must be instrumental convergence), even though it's structurally a whole different beast. Objective functions induce preferences over text-states (with a "what's the world look like?" tacked on). The text-state the agent ends up in is, by your assumption, determined by the text output of the agent. Nothing which happens in the world expands or restricts the agent's ability to output text. So there's no particular reas
MDP models are determined by the agent architecture and the environmental dynamics

is the amount of money paid by the client part of the state?

Yes; let's say the state representation determines the location of every atom in that earth-like environment. The idea is that the environment is very complicated (and contains many actors) and thus the usual arguments for instrumental convergence apply. (If this fails to address any of the above issues let me know.)

2Alex Turner1y
Yeah, I claim that this intuition is actually wrong and there's no instrumental convergence in this environment. Complicated & contains actors doesn't mean you can automatically conclude instrumental convergence. The structure of the environment is what matters for "arbitrarily capable agents"/optimal policies (learned policies are probably more dependent on representation and training process). So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another? Because, again, this environment is literally isomorphic to What I think you're missing is that the environment can't affect the agent's capabilities or available actions; it can't gain or lose power, just freely steer through different trajectories.