All of ofer's Comments + Replies

Environmental Structure Can Cause Instrumental Convergence

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?).

I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim.

Maybe I just missed ... (read more)

2Alex Turner3hThe paper supports the claim with:
* Embodied environment in a vase-containing room (section 6.3)
* Pac-Man (figure 8)
* And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
* Blackwell-optimal robots not idling in a particular spot (beginning of section 7)
This post supports the claim with:
* Tic-Tac-Toe
* Vase gridworld
* SafeLife
So yes, this is sufficient support for speculation that most relevant environments have these symmetries.
Sorry - I meant the "future work" portion of the discussion section 7. The future work highlights the "note of caution" bits. I also made sure that the intro emphasizes that the results don't apply to learned policies.
Key part: earlier version of the paper. (I've talked to Ben since then, including about the newest results, their limitations, and their usefulness.) Your advice was beneficial a year ago, because that was a very different paper. I think it is no longer beneficial: I still agree with it, but I don't think it needs to be mentioned on the margin. At this point, I have put far more care into hedging claims than most other work which I can recall. At some point, you're hedging too much. And I'm not interested in hedging any more, unless I've made some specific blatant oversights which you'd like to inform me of.
Environmental Structure Can Cause Instrumental Convergence

Meta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I'm going to try to move my comment back to AF, because I think it obviously belongs there (I don't believe we have any norms about this sort of situation...). If the removal was done by a forum moderator/admin, please let me know.

[This comment is no longer endorsed by its author]
2Alex Turner4hMy apologies - I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I "undid" it (for both mine and yours). I hadn't realized it was already on AF.
Environmental Structure Can Cause Instrumental Convergence

I've ended up spending probably more than 40 hours discussing, thinking and reading this paper (including earlier versions; the paper was first published in December 2019, and the current version is the 7th, published on June 1st, 2021). My impression is very different than adamShimi's. The paper introduces many complicated definitions that build on each other, and its theorems say complicated things using those complicated definitions. I don't think the paper explains how its complicated theorems are useful/meaningful.

In particular, I don't think the pape... (read more)

1Ofer Givoli4hMeta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I'm going to try to move my comment back to AF, because I think it obviously belongs there (I don't believe we have any norms about this sort of situation...). If the removal was done by a forum moderator/admin, please let me know.
2Alex Turner6hFor my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?). I don't think it will be useful for me to engage in detail, given that we've already extensively debated these points at length, without much consensus being reached.
Formal Inner Alignment, Prospectus

Brainstorming

The following is a naive attempt to write a formal, sufficient condition for a search process to be "not safe with respect to inner alignment".

Definitions:

: a distribution of labeled examples. Abuse of notation: I'll assume that we can deterministically sample a sequence of examples from .

: a deterministic supervised learning algorithm that outputs an ML model. has access to an infinite sequence of training examples that is provided as input; and it uses a certain "amount of compute" that is also provided as input. If we operationalize... (read more)
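For concreteness, a minimal toy sketch of the two objects defined above, with hypothetical names and stand-in types (sample_examples for the deterministic sampler from the distribution, train for the deterministic supervised learner with a compute budget):

```python
from typing import Callable, Dict, Iterator, Tuple

Example = Tuple[int, int]   # (input, label) -- a toy stand-in for labeled examples
Model = Callable[[int], int]

def sample_examples() -> Iterator[Example]:
    """Deterministic stand-in for sampling from the distribution of labeled examples:
    inputs 0, 1, 2, ... labeled by their parity."""
    x = 0
    while True:
        yield (x, x % 2)
        x += 1

def train(examples: Iterator[Example], compute: int) -> Model:
    """Deterministic stand-in for the supervised learning algorithm: reads `compute`
    examples and returns a lookup-table model (predicting 0 on unseen inputs)."""
    table: Dict[int, int] = {}
    for _ in range(compute):
        x, y = next(examples)
        table[x] = y
    return lambda z: table.get(z, 0)

model = train(sample_examples(), compute=100)
print(model(3), model(1001))  # 1 (memorized), 0 (default guess on an unseen input)
```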

MDP models are determined by the agent architecture and the environmental dynamics

Not from the paper. I just wrote it.

Consider adding to the paper a high-level/simplified description of the environments for which the following sentence from the abstract applies: "We prove that for most prior beliefs one might have about the agent’s reward function [...] one should expect optimal policies to seek power in these environments." (If it's the set of environments in which "the “vast majority” of RSDs are only reachable by following a subset of policies" consider clarifying that in the paper). It's hard (at least for me) to infer that from ... (read more)

MDP models are determined by the agent architecture and the environmental dynamics

see also: "optimal policies tend to take actions which strictly preserve optionality*"

Does this quote refer to a passage from the paper? (I didn't find it.)

It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.

There are very few reward functions that rely on action-history—that can be specified in a simple way—relative to all the reward functions that r... (read more)

2Alex Turner16dNot from the paper. I just wrote it.
It isn't the size of the object that matters here; the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.
Sure. Here's what I said: The broader claim I was trying to make was not "it's hard to write down any state-based reward functions that don't incentivize power-seeking", it was that there are fewer qualitatively distinct ways to do it in the state-based case. In particular, it's hard to write down state-based reward functions which incentivize any given sequence of actions: If you disagree, then try writing down a state-based reward function for e.g. Pacman for which an optimal policy starts off by (EDIT: circling the level counterclockwise) (at a discount rate close to 1). Such reward functions provably exist, but they seem harder to specify in general.
Also: thanks for your engagement, but I still feel like my points aren't landing (which isn't necessarily your fault or anything), and I don't want to put more time into this right now. Of course, you can still reply, but just know I might not reply and that won't be anything personal.
EDIT: FYI I find your action-camera example interesting. Thank you for pointing that out.
MDP models are determined by the agent architecture and the environmental dynamics

The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7.

It seems to me that the (implicit) description in the paper of the set of environments over which "one should expect optimal policies to seek power" ("for most prior beliefs one might have about the agent’s reward function") involves a lot of formalism/math. I was looking for some high-level/s... (read more)

2Alex Turner17dAh, I see. In addition to the cited explanation, see also: "optimal policies tend to take actions which strictly preserve optionality*", where the optionality preservation is rather strict (requiring a graphical similarity, and not just "there are more options this way than that"; ironically, this situation is considerably simpler in arbitrary deterministic computable environments, but that will be the topic of a future post).
No - the sufficient condition is about the environment, and instrumental convergence is about policies over that environment. I interpret instrumental convergence as "intelligent goal-directed agents tend to take certain kinds of actions"; this informal claim is necessarily vague. This is a formal sufficient condition which allows us to conclude that optimal goal-directed agents will tend to take a certain action in the given situation.
It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.
What's special is that (by assumption) the action logger always logs the agent's actions, even if the agent has been literally blown up in-universe. That wouldn't occur with the security camera. With the security camera, once the agent is dead, the agent can no longer influence the trajectory, and the normal death-avoiding arguments apply. But your action logger supernaturally writes a log of the agent's actions into the environment.
Right, but if you want the optimal policies to take actions a1, …, ak, then write a reward function which returns 1 iff the action-logger begins with those actions and 0 otherwise. Therefore, it's extremely easy to incentivize arbitrary action sequences.
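For concreteness, here is a toy sketch of that last construction (my own illustration; the reward reads only the logged action history, and all names are made up):

```python
# Toy sketch: in an unrolled MDP whose state contains an action log, it is trivial
# to write a reward function whose optimal policies execute any fixed action
# sequence a_1, ..., a_k.
from typing import Sequence, Tuple

def make_reward(target: Sequence[str]):
    """Return a state-based reward function over action-logged states.

    A state here is the tuple of actions taken so far. Reward is 1 iff the log
    begins with the target prefix (so any continuation of the target also gets 1),
    and 0 otherwise.
    """
    target_prefix = tuple(target)

    def reward(action_log: Tuple[str, ...]) -> float:
        return 1.0 if action_log[: len(target_prefix)] == target_prefix else 0.0

    return reward

r = make_reward(["up", "up", "left"])
print(r(("up", "up", "left")))          # 1.0
print(r(("up", "up", "left", "down")))  # 1.0 -- the target prefix still matches
print(r(("up", "down")))                # 0.0
```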
MDP models are determined by the agent architecture and the environmental dynamics

I'll address everything in your comment, but first I want to zoom out and say/ask:

  1. In environments that have a state graph that is a tree-with-constant-branching-factor, the POWER—defined over an IID-over-states reward distribution—is equal in all states. I argue that environments with very complex physical dynamics are often like that, but not if at some time step the agent can't influence the environment. (I think we agree so far?) I further argue that we can take any MDP environment and "unroll" its state graph into a tree-with-constant-branching-factor (e.
... (read more)
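A minimal sketch of the unrolling I have in mind (toy deterministic dynamics and made-up names):

```python
# Given an MDP's deterministic transition function, build an equivalent MDP whose
# states also record the full action history. Every state in the unrolled MDP then
# has exactly |actions| distinct successors, so its state graph is a tree with a
# constant branching factor.
from typing import Callable, Tuple

State = str
Action = str
UnrolledState = Tuple[State, Tuple[Action, ...]]

def unroll(step: Callable[[State, Action], State]):
    """Return the transition function of the unrolled MDP."""
    def unrolled_step(s: UnrolledState, a: Action) -> UnrolledState:
        world_state, action_log = s
        return (step(world_state, a), action_log + (a,))
    return unrolled_step

# Tiny example: a 2-state world where "shutdown" traps the agent in state "off".
def step(s: State, a: Action) -> State:
    return "off" if (s == "off" or a == "shutdown") else "on"

unrolled = unroll(step)
s0: UnrolledState = ("on", ())
s1 = unrolled(s0, "shutdown")
s2 = unrolled(s1, "noop")
print(s1)  # ('off', ('shutdown',))
print(s2)  # ('off', ('shutdown', 'noop')) -- distinct from s1: no state is ever revisited
```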
2Alex Turner22dThanks for taking the time to write this out. I'm sorry - although I think I mentioned it in passing, I did not draw sufficient attention to the fact that I've been talking about a drastically broadened version of the paper, compared to what was on arxiv when you read it. The new version should be up in a few days. I feel really bad about this - especially since you took such care in reading the arxiv version!
The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7. (I shared the Overleaf with Ofer; if other lesswrong readers want to read without waiting for arxiv to update, message me! ETA: The updated version is now on arxiv [https://arxiv.org/abs/1912.01683].)
I agree that you can do that. I also think that instrumental convergence doesn't apply in such MDPs (as in, "most" goals over the environment won't incentivize any particular kind of optimal action), unless you restrict to certain kinds of reward functions.
Fix a reward function distribution D_M^iid in the original MDP M. For simplicity, let's suppose D_M^iid is max-ent (and thus IID). Let's suppose we agree that optimal policies under D_M^iid tend to avoid getting shut off. Translated to the rolled-out MDP M′, D_M^iid no longer distributes reward uniformly over states. In fact, in its support, each reward function has the rather unusual property that its reward is only dependent on the current state, and not on the action log's contents. When translated into M′, D_M^iid imposes heavy structural assumptions on its reward functions, and it's not max-ent over the states of M′. By the "functional equivalence", it still gives you the same optimality probabilities as before, and so it still tends to incentivize shutdown avoidance. However, if you take a max-ent over the rolled-out states of M′, then this max-ent won't ince
MDP models are determined by the agent architecture and the environmental dynamics

So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive?

I was talking about a particular example, with a particular reward function that I had in mind. We seemed to disagree about whether instrumental convergence arguments apply there, and my purpose in that comment was to argue that they do. I'm not trying to define here the set of reward f... (read more)

2Alex Turner1moETA: I agree with this point in the main - they don't apply to all reward functions. But, we should be able to ground the instrumental convergence arguments via reward functions in some way.
Edited out because I read through that part of your comment a little too fast, and replied to something you didn't say.
What does it mean to "shut down" the process? 'Doesn't mean they won't' - so new strings will appear in the environment? Then how was the agent "shut down"? What is it instead?
We're considering description length? Now it's not clear that my theory disagrees with your prediction, then. If you say we have a simplicity prior over reward functions given some encoding, well, POWER and optimality probability now reflect your claims, and they now say there is instrumental convergence to the extent that that exists under a simplicity prior? (I still don't think they would exist; and my theory shows that in the space of all possible reward function distributions, equal proportions incentivize action A over action B, as vice versa - we aren't just talking about uniform. And so the onus is on you to provide the sense in which instrumental convergence exists here.) And to the extent we were always considering description length - was the problem that IID-optimality probability doesn't reflect simplicity-weighted behavioral tendencies?
I still don't know what it would mean for Ofer-instrumental convergence to exist in this environment, or not.
MDP models are determined by the agent architecture and the environmental dynamics

So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another?

(Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…)

Consider a string, written by the chatbot, that "hacks" the customer and causes them to invoke a process that quickly takes control over most of the computers on earth that are connected to the internet, then "hacks" most humans on earth by showing them certain content, and so on (to prevent interference and to seize control ASAP); ... (read more)

2Alex Turner1moTo clarify: when I say that taking over the world is "instrumentally convergent", I mean that most objectives incentivize it. If you mean something else, please tell me. (I'm starting to think there must be a serious miscommunication somewhere if we're still disagreeing about this?)
So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive?
This doesn't make sense to me. We assumed the agent is Cartesian-separated from the universe, and its actions magically make strings appear somewhere in the world. How could humans interfere with it? What, concretely, are the "risks" faced by the agent? (Technically, the agent's goals are defined over the text-state, and you can assign high reward to text-states in which people bought stuff. But the agent doesn't actually have goals over the physical world as we generally imagine them specified.)
This statement is vacuous, because it's true about any possible string.
----
The original argument given for instrumental convergence and power-seeking is that gaining resources tends to be helpful for most objectives (this argument isn't valid in general, but set that aside for now). But even that's not true here. The problem is that the 'text-string-world' model is framed in a leading way, which is suggestive of the usual power-seeking setting (it's representing the real world and it's complicated, there must be instrumental convergence), even though it's structurally a whole different beast.
Objective functions induce preferences over text-states (with a "what's the world look like?" tacked on). The text-state the agent ends up in is, by your assumption, determined by the text output of the agent. Nothing which happens in the world expands or restricts the agent's ability to output text. So there's no particular reas
MDP models are determined by the agent architecture and the environmental dynamics

is the amount of money paid by the client part of the state?

Yes; let's say the state representation determines the location of every atom in that earth-like environment. The idea is that the environment is very complicated (and contains many actors) and thus the usual arguments for instrumental convergence apply. (If this fails to address any of the above issues let me know.)

2Alex Turner1moYeah, I claim that this intuition is actually wrong and there's no instrumental convergence in this environment. Complicated & contains actors doesn't mean you can automatically conclude instrumental convergence. The structure of the environment is what matters for "arbitrarily capable agents"/optimal policies (learned policies are probably more dependent on representation and training process).
So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another? Because, again, this environment is literally isomorphic to
What I think you're missing is that the environment can't affect the agent's capabilities or available actions; it can't gain or lose power, just freely steer through different trajectories.
MDP models are determined by the agent architecture and the environmental dynamics

OK, but now that seems okay again, because there isn't any instrumental convergence here either. This is just an alternate representation ('reskin') of a sequential string output MDP, where the agent just puts a string in slot t at time t.

I think we're still not thinking about the same thing; in the example I'm thinking about, the agent is supposed to fill the role of a human salesperson, and the reward function is (say) the amount of money that the client paid (possibly over a long time period). So an optimal policy may be very complicated and involve instrumentally convergent goals.

2Alex Turner1moFor that particular reward function, yes, the optimal policies may be very complicated. But why are there instrumentally convergent goals in that environment? Why should I expect capable agents in that environment to tend to output certain kinds of string sequences, over other kinds of string sequences? (Also, is the amount of money paid by the client part of the state? Or is the agent just getting rewarded for the total number of purchase-assents in the conversation over time?)
MDP models are determined by the agent architecture and the environmental dynamics

I was imagining a formal (super-complex) MDP that looks like our world. The customer in my example is meant to be equivalent to a human on earth.

But I haven't taken into account that this runs into embedded agency issues. (E.g. what does the state transition function look like when the computer that "runs the agent" is down?)

And if you update the encodings and dynamics to account for real-world resource gain possibilities, then POWER and optimality probability will update accordingly and appropriately.

Because states from which the agent can (say) preven... (read more)

2Alex Turner1moRight, that does complicate things. I'd like to get a better picture of the considerations here, but given how POWER behaves on environment structures so far, I'm pretty confident it'll adapt to appropriate ways of modelling the situation. OK, but now that seems okay again, because there isn't any instrumental convergence here either. This is just an alternate representation ('reskin') of a sequential string output MDP, where the agent just puts a string in slot t at time t.
MDP models are determined by the agent architecture and the environmental dynamics

There aren't any robustly instrumental goals in this setting, as best I can tell.

If we consider a sufficiently high level of capability, the instrumental convergence thesis applies. (E.g. the agent might manipulate/hack the customer and then gain control over resources, and stop anyone from interfering with its plan.)

2Alex Turner1moThe instrumental convergence thesis is not a fact about every situation involving "capable AI", but a thesis pointing out a reliable-seeming pattern across environments and goals. It can't be used as a black-box reason on its own - you have to argue why the reasoning applies in the environment.
In particular, we assumed that the agent is interacting with the text MDP. Optimal policies do not have particular tendencies in this model. There's nothing more "capable" than an optimal policy. Which is to say, optimal policies for the actual text interaction MDP do not exhibit instrumental convergence (which says nothing about learned optimizer risks, etc).
But you seem to be secretly switching from the pure-text-interaction-MDP to a real-world-modelling-MDP, and then saying that POWER in the former doesn't correspond to POWER in the latter. Well, that's no big surprise. The real world MDP model is no longer modelling just the text interaction, but also the broader environment, which violates the very representation assumption which led to your "IID-POWER equality" conclusion. And if you update the encodings and dynamics to account for real-world resource gain possibilities, then POWER and optimality probability will update accordingly and appropriately.
However, if you meant for the environment dynamics to originally include possibilities like "the agent can get shut off, or interfered with", then the model is no longer regular in the way you mentioned, and IID-POWER is no longer equal across states.
MDP models are determined by the agent architecture and the environmental dynamics

I agree that in MDP problems in which the agent can lose its ability to influence the environment, we can generally expect POWER to correlate with not-losing-the-ability-to-influence-the-environment. The environments in such problems never have a state graph that is a tree-with-a-constant-branching-factor, no matter how complex they are, and thus my argument doesn't apply to them. (And I think publishing work about such MDP environments may be very useful.)

I don't think all real-world problems are like that (though many are), and the choice of the state re... (read more)

2Alex Turner1moI agree. Also: the state and action representations determine which reward functions we can express, and I claim that it makes sense for the theory to reflect that fact. Agreed. I also don't currently see a problem here. There aren't any robustly instrumental goals in this setting, as best I can tell.
Formal Inner Alignment, Prospectus

I would sure be awfully surprised to see that! Wouldn't you?

My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.

Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan ... (read more)

3Steve Byrnes1moMy hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.
Formal Inner Alignment, Prospectus

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), and (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), (3) things that are so simple to do that they can happen randomly.

We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may ... (read more)

1Steve Byrnes1moLike, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal. For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.
Here's a nice example. Let's say we do RL, and our model is initialized with random weights. The training signal is "get a high score in PacMan". We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it's fabulously effective at calculating digits of π—it calculates them by the billions—and it's doing nothing else, it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it's in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π.
I would sure be awfully surprised to see that! Wouldn't you? If so, then you agree with me that "reasoning about training incentives" is a valid type of reasoning about what to expect from trained ML models. I don't think it's a controversial opinion... Again, I did not (and don't) claim that this type of reasoning should lead people to believe that mesa-optimizers won't happen, because there do tend to be training incentives for mesa-optimization.
Draft report on existential risk from power-seeking AI

Just to summarize my current view: For MDP problems in which the state representation is very complex, and different action sequences always yield different states, POWER-defined-over-an-IID-reward-distribution is equal for all states, and thus does not match the intuitive concept of power.

At some level of complexity such problems become relevant (when dealing with problems with real-world-like environments). These are not just problems that show up when one adversarially constructs an MDP problem to game POWER, or when one makes "really weird modelling ch... (read more)

Draft report on existential risk from power-seeking AI

You shouldn't need to contort the distribution used by POWER to get reasonable outputs.

I think using a well-chosen reward distribution is necessary; otherwise, POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states... (read more)
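To illustrate numerically, here is a toy computation (my own construction, not the paper's exact definition; the average finite-horizon optimal value under IID-sampled rewards serves as a crude stand-in for POWER, and all names are made up):

```python
import random

ACTIONS = ["stay", "move", "shutdown"]
HORIZON = 3          # small fixed horizon so exact value computation stays cheap
N_SAMPLES = 2000

def optimal_value(state, reward, depth, step):
    """Finite-horizon optimal value for one sampled reward function (reward on next states)."""
    if depth == 0:
        return 0.0
    return max(reward(step(state, a)) + optimal_value(step(state, a), reward, depth - 1, step)
               for a in ACTIONS)

def power_proxy(state, step):
    """Average optimal value under rewards sampled IID per state -- a crude stand-in for POWER."""
    total = 0.0
    for _ in range(N_SAMPLES):
        table = {}
        def reward(s):
            return table.setdefault(s, random.random())
        total += optimal_value(state, reward, HORIZON, step)
    return total / N_SAMPLES

# A tiny MDP where "shutdown" traps the agent in an absorbing "off" state.
def step(state, action):
    if state == "off" or action == "shutdown":
        return "off"
    return "B" if action == "move" else "A"

# The same MDP with the "blog" included in the state: every action is logged, so
# every state has three distinct successors (a constant-branching-factor tree).
def blog_step(state, action):
    world, log = state
    return (step(world, action), log + (action,))

random.seed(0)
print(power_proxy("A", step), power_proxy("off", step))
# In the original MDP the proxy is clearly higher at "A" than at the absorbing "off" state.
print(power_proxy(("A", ()), blog_step), power_proxy(("off", ("shutdown",)), blog_step))
# With the blog in the state, rewards sampled IID per (world, log) pair give both
# states approximately the same proxy value: each one roots an identical-looking tree.
```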

3Alex Turner1moI replied to this point with a short post [https://www.lesswrong.com/posts/XkXL96H6GknCbT5QH/mdp-models-are-determined-by-the-agent-architecture-and-the] .
2Alex Turner2moNot necessarily true - you're still considering the IID case. Yes, if you insist on making really weird modelling choices (and pretending the graph still well-models the original situation, even though it doesn't), you can make POWER say weird things. But again, I can prove that up to a large range of perturbation, most distributions will agree that some obvious states have more POWER than other obvious states.
Your original claim was that POWER isn't a good formalization of intuitive-power/influence. You seem to be arguing that because there exists a situation "modelled" by an adversarially chosen environment grounding such that POWER returns "counterintuitive" outputs (are they really counterintuitive, given the information available to the formalism?), therefore POWER is inappropriately sensitive to the reward function distribution. Therefore, it's not a good formalization of intuitive-power. I deny both of the 'therefores.'
The right thing to do is just note that there is some dependence on modelling choices, which is another consideration to weigh (especially as we move towards more sophisticated application of the theorems to e.g. distributions over mesa objectives and their attendant world models). But you should make sure that the POWER-seeking conclusions hold under plausible modelling choices, and not just the specific one you might have in mind. I think that my theorems show that they do in many reasonable situations (this is a bit argumentatively unfair of me, since the theorems aren't public yet, but I'm happy to give you access).
If this doesn't resolve your concern and you want to talk more about this, I'd appreciate taking this to video, since I feel like we may be talking past each other. EDIT: Removed a distracting analogy.
Draft report on existential risk from power-seeking AI

A person does not become less powerful (in the intuitive sense) right after paying college tuition (or right after getting a vaccine) due to losing the ability to choose whether to do so. [EDIT: generally, assuming they make their choices wisely.]

I think POWER may match the intuitive concept when defined over certain (perhaps very complicated) reward distributions; rather than reward distributions that are IID-over-states (which is what the paper deals with).

Actually, in a complicated MDP environment—analogous to the real world—in which every sequence of a... (read more)

2Alex Turner2moTwo clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution - not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative "keeping options open" sense.
To address your main point, though, I think we'll need to get more concrete. Let's represent the situation with a state diagram. Let's say that up is gap year, and right is go to college right away.
Both you and Rohin are glossing over several relevant considerations, which might be driving misunderstanding. For one: Power depends on your time preferences. [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/6DuJxY8X45Sco4bS2#POWER_seeking_depends_on_the_agent_s_time_preferences] If your discount rate is very close to 1 and you irreversibly close off your ability to pursue ϵ percent of careers, then yes, you have decreased your POWER by going to college right away. If your discount rate is closer to 0, then college lets you pursue more careers quickly, increasing your POWER for most reward function distributions. You shouldn't need to contort the distribution used by POWER to get reasonable outputs. Just be careful that we're talking about the same time preferences. (I can actually prove that in a wide range of situations, the POWER of state 1 vs the POWER of state 2 is ordinally robust to choice of distribution. I'll explain that in a future post, though.)
My position on "is POWER a good proxy for intuitive-power?" is that yes, it's very good, after thinking about it for many hours (and after accounting for sub-optimality; see the last part of appendix B). I think the overhauled power-seeking post [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/6DuJxY8X45Sco4bS2] should help, but perhaps I have more explaining to do. Also, I perceive an undercurrent of "goal-driven agents should tend to seek power in all kinds of situations; y
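To make the discount-rate dependence concrete, a toy Monte Carlo sketch (my own simplification of the gap-year/college picture, not the paper's formal POWER: "college" reaches two careers immediately, "gap" spends a year but keeps a third career open; all rewards are sampled IID uniform):

```python
import random

def power_proxy(state, gamma, n_samples=50_000):
    """(1 - gamma) * average optimal value under IID uniform rewards -- a crude
    stand-in for POWER in this toy 'gap year vs. college' MDP."""
    total = 0.0
    for _ in range(n_samples):
        r = {s: random.random() for s in ["choice", "career_a", "career_b", "career_c"]}
        best2 = max(r["career_a"], r["career_b"])          # careers open after college
        best3 = max(best2, r["career_c"])                   # careers open after the gap year
        v_college = best2 / (1 - gamma)                     # pick a career next step, keep it forever
        v_gap = r["choice"] + gamma * best3 / (1 - gamma)   # one intermediate year, then any career
        total += {"college": v_college, "gap": v_gap}[state]
    return (1 - gamma) * total / n_samples

random.seed(0)
for gamma in (0.2, 0.99):
    print(gamma, round(power_proxy("college", gamma), 3), round(power_proxy("gap", gamma), 3))
# gamma = 0.2 : college ~0.67 > gap ~0.55  (near-term options dominate)
# gamma = 0.99: college ~0.67 < gap ~0.75  (keeping the third career open wins)
```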
Draft report on existential risk from power-seeking AI

I probably should have written the "because ..." part better. I was trying to point at the same thing Rohin pointed at in the quoted text.

Taking a quick look at the current version of the paper, my point still seems to me relevant. For example, in the environment in figure 16, with a discount rate of ~1, the maximally POWER-seeking behavior is to always stay in the same first state (as noted in the paper), from which all the states are reachable. This is analogous to the student from Rohin's example who takes a gap year instead of going to college.

2Alex Turner2moRight. But what does this have to do with your “different concept” claim?
Draft report on existential risk from power-seeking AI

By “power” I mean something like: the type of thing that helps a wide variety of agents pursue a wide variety of objectives in a given environment. For a more formal definition, see Turner et al (2020).

I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather th... (read more)

2Alex Turner2moThis is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed.
Which counterfactuals should an AI follow?

Maybe "logical counterfactuals" are also relevant here (in the way I've used them in this post). For example, consider a reward function that depends on whether the first 100 digits after the th digit in the decimal representation of are all 0. I guess this example is related to the "closest non-expert model" concept.

My research methodology

For any competitive alignment scheme that involve helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:

Suppose that there does not exist an ML model (in the model space being searched) that fulfills both the following conditions:

  1. The model is useful for either creating safe ML models or evaluating the safety of ML models, in a way that allows being competitive.
  2. The model is sufficiently simple/weak/narrow such that it's either implausible that the model is egregiously misaligned, or
... (read more)
Formal Solution to the Inner Alignment Problem

To extend Evan's comment about coordination between deceptive models: Even if the deceptive models lack relevant game theoretical mechanisms, they may still coordinate due to being (partially) aligned with each other. For example, a deceptive model X may prefer [some other random deceptive model seizing control] over [model X seizing control with probability 0.1% and the entire experiment being terminated with probability 99.9%].

Why should we assume that the deceptive models will be sufficiently misaligned with each other such that this will not be an is... (read more)

2michaelcohen4moThat's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model. So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.
Short summary of mAIry's room

The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!

My tentative intuition is that learning is not directly involved here. If the weights of a trained RL agent are no longer being updated after some point[1], my intuition is that the model is similarly likely to experience pain before and after that point (assuming the environment stays the same).

Consider the following hypothesis which does not involve a direct relationship between learning and pain: In sufficiently large scale (an... (read more)

4Alex Turner5moThis isn't key for your point, but: If it's a perfect predictor of a deterministic world, sure. But if the world is stochastic, or you can't assume realizability, your network can simultaneously be a global optimum but also have gradient updates. It's just that in expectation, your gradient is zero, but if you update in sufficiently small batches, you might still have non-zero gradients.
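A quick numerical sketch of that point (a made-up toy regression with label noise; the names and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: y = 2x + noise. The global optimum of the *expected* squared loss over
# models y_hat = w * x is exactly w = 2, but individual labels are stochastic.
w_opt = 2.0
x = rng.normal(size=100_000)
y = w_opt * x + rng.normal(scale=0.5, size=x.shape)

def grad(w, xb, yb):
    """Gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w on a batch."""
    return np.mean((w * xb - yb) * xb)

print(abs(grad(w_opt, x, y)))          # ~0: near-zero gradient over the whole dataset
print(abs(grad(w_opt, x[:8], y[:8])))  # typically clearly non-zero on a small minibatch
```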
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

My understanding is that the 2020 algorithms in Ajeya Cotra's draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).

2Daniel Kokotajlo5moHuh, that's not how I interpreted it. I should reread the report. Thanks for raising this issue.
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Great post!

we’ll either have to brute-force search for the special sauce like evolution did

I would drop the "brute-force" here (evolution is not a random/naive search).

Re the footnote:

This "How much special sauce is needed?" variable is very similar to Ajeya Cotra's variable "how much compute would lead to TAI given 2020's algorithms."

I don't see how they are similar.

1Daniel Kokotajlo5moThanks! Fair enough re: brute force; I guess my problem is that I don't have a good catchy term for the level of search evolution does. It's better than pure random search, but a lot worse than human-intelligent search. I think "how much compute would lead to TAI given 2020's algorithms" is sort of an operationalization of "how much special sauce is needed." There are three ways to get special sauce: Brute-force search, awesome new insights, or copy it from the brain. "given 2020's algorithms" rules out two of the three. It's like operationalizing "distance to Edinburgh" as "time it would take to get to Edinburgh by helicopter."
Why I'm excited about Debate

One might argue:

We don't need the model to use that much optimization power, to the point where it breaks the operator. We just need it to perform roughly at human-level, and then we can just deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).

So I think it's important to also note that getting a neural network to "perform roughly at human-level in an aligned manner" may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much... (read more)

Gradient hacking

It does seem useful to make the distinction between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how to best pursue empirical research today about gradient hacking.

Gradient hacking

Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network

I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)

This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears li

... (read more)
4Adam Shimi5moAgreed. I said something similar in my comment [https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking?commentId=sG2t5h3yBXw2tcg3R] . Thanks for the concrete example, I think I understand better what you meant. What you describe looks like the hypothesis "Any sufficiently intelligent model will be able to gradient hack, and thus will do it". Which might be true. But I'm actually more interested in the question of how gradient hacking could emerge without having to pass that threshold of intelligence, because I believe such examples will be easier to interpret and study. So in summary, I do think what you say makes sense for the general risk of gradient hacking, yet I don't believe it is really useful for studying gradient hacking with our current knowledge.
Gradient hacking

My point was that there's no reason that SGD will create specifically "deceptive logic" because "deceptive logic" is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn't always true - see the edit block I just added.

Gradient hacking

"deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.

But you can similarly say this for the following logic: "check whether 1+1<4 and if so, act according to the base objective". Why is SGD more likely to create "deceptive logic" than this simpler logic (or any other similar logic)?

[EDIT: actually, this argument doesn't work in a setup where the base objective corresponds to a sufficiently long time horizon during which it is possible for humans to de... (read more)

1Adam Shimi6moHum, I would say that your logic is probably redundant, and thus might end up being removed for simplicity reasons? Whereas I expect deceptive logic to include very useful things like knowing how the optimization process works, which would definitely help having better performance. But to be honest, how SGD can create gradient hacking (if it's even possible) is a completely open research problem.
Gradient hacking

I think that if SGD makes the model slightly deceptive it's because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to "figure out" (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.

This seems to me a lot more likely than SGD creating specifically "deceptive logic" (i.e. logic that can't do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).

3Adam Shimi6moI agree with your intuition, but I want to point out again that after some initial useless amount, "deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective. SGD making the model more capable seems the most obvious way to satisfy the conditions for deceptive alignment [https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB/p/zthDPAjh9w6Ytbeks#4_2__Conditions_for_deceptive_alignment].
Gradient hacking

The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker.

[EDIT: you may have already meant it this way, but...] The optimization algorithm (e.g. SGD) doesn't need to stumble upon the specific logic of gradient hacking (which seems very unlikely). I think the idea is that a sufficiently capable agent (with a goal system that involves our world) instrumentally decides to use gradient hacking, because otherwise the agent will be modified in a suboptimal manner with respect to its current goal system.

3Adam Shimi6moActually, I did mean that SGD might stumble upon gradient hacking. Or to be a bit more realistic, make the model slightly deceptive, at which point decreasing the deceptiveness a bit makes the model worse but increasing it a bit makes the model better at the base-objective, and so there is a push towards deceptiveness, until the model is basically deceptive enough to use gradient hacking in the way you mention. Does that make more sense to you?
AGI safety from first principles: Introduction

Early work tends to be less relevant in the context of modern machine learning

I'm curious why you think the orthogonality thesis, instrumental convergence, the treacherous turn or Goodhart's law arguments are less relevant in the context of modern machine learning. (We can use here Facebook's feed-creation-algorithm as an example of modern machine learning, for the sake of concreteness.)

Against GDP as a metric for timelines and takeoff speeds

Thank you for writing this up! This topic seems extremely important and I strongly agree with the core arguments here.

I propose the following addition to the list of things we care about when it comes to takeoff dynamics, or when it comes to defining slow(er) takeoff:

  1. Foreseeability: No one creates an AI with a transformative capability X at a time when most actors (weighted by influence) believe it is very unlikely that an AI with capability X will be created within a year.

Perhaps this should replace (or be merged with) the "warning shots" entry in the... (read more)

2Daniel Kokotajlo6moThanks! Good point about "warning shot" having the wrong connotations. And I like your foreseeability suggestion. I wonder if I can merge it with something. It seems similar to warning shots and risk awareness. Maybe I should just have a general category for "How surprised are the relevant actors when things like AGI, alignment failures, etc. start happening."
What are the best precedents for industries failing to invest in valuable AI research?

And the other part of the core idea is that that's implausible.

I don't see why that's implausible. The condition I gave is also my explanation for why the EMH holds (in markets where it does), and it doesn't explain why big corporations should be good at predicting AGI.

it's in their self-interest (at least, given their lack of concern for AI risk) to pursue it aggressively

So the questions I'm curious about here are:

  1. What mechanism is supposed to causes big corporations to be good at predicting AGI?
  2. How come that mechanism doesn't also cause big corporations to understand the existential risk concerns?
1Daniel Kokotajlo6moI think the idea is that in general they are good at doing things that are in their self-interest, and since they don't currently think AI is an existential threat, they should think it's in their self-interest to make AGI if possible, and if it is possible, they should be able to recognise that since the relevant expertise in AI and AI forecasting is something they can acquire. To be honest, I don't put much stock in this argument, which is why I'm asking this question.
What are the best precedents for industries failing to invest in valuable AI research?

(I'm not an economist but my understanding is that...) The EMH works in markets that fulfill the following condition: If Alice is way better than the market at predicting future prices, she can use her superior prediction capability to gain more and more control over the market, until the point where her control over the market makes the market prices reflect her prediction capability.

If Alice is way better than anyone else at predicting AGI, how can she use her superior prediction capability to gain more control over big corporations? I don't see how an EMH-based argument applies here.

1Daniel Kokotajlo6moYeah, maybe it's not really EMH-based but rather EMH-inspired or EMH-adjacent. The core idea is that if AI is close lots of big corporations are really messing up big time; it's in their self-interest (at least, given their lack of concern for AI risk) to pursue it aggressively. And the other part of the core idea is that that's implausible.
Seeking Power is Often Robustly Instrumental in MDPs

Instrumental convergence is a very simple idea that I understand very well, and yet I failed to understand this paper (after spending hours on it) [EDIT: and also the post], so I'm worried about using it for the purpose of 'standing up to more intense outside scrutiny'. (Though it's plausible I'm just an outlier here.)

5Alex Turner6moWhile it's been my experience that most people have understood the important parts of the post and paper, a few intelligent readers (like you) haven't had the expected clicks. The paper [https://arxiv.org/abs/1912.01683] has already been dramatically rewritten, to the point of being a different paper than that originally linked to by this post. I'm also planning on rewriting some of the post. I'd be happy to send a draft by you to ensure it's as clearly written as possible. As far as 'more intense outside scrutiny', we received two extremely positive NeurIPS reviews which were quite persuaded, and one very negative review which pointed out real shortcomings that have since been rectified. I anticipate that it will do quite well at ICML / NeurIPS this year. That said, I don't plan on circulating these arguments to LeCun et al. until after the work has passed peer review.
In a multipolar scenario, how do people expect systems to be trained to interact with systems developed by other labs?

Some off-the-cuff thoughts:

It seems plausible that transformative agents will be trained exclusively on real-world data (without using simulated environments) [EDIT: in "data" I mean to include the observation/reward signal from the real-world environment in an online RL setup]; including social media feed-creation algorithms, and algo-trading algorithms. In such cases, the researchers don't choose how to implement the "other agents" (the other agents are just part of the real-world environment that the researchers don't control).

Focusing on agents that ar... (read more)

Some AI research areas and their relevance to existential safety

Great post!

I suppose you'll be more optimistic about Single/Single areas if you update towards fast/discontinuous takeoff?

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).

I think a model can be deceptively aligned even if formally it maps every possible input to the correct (safe) output. For example, suppose that on input X the inference execution hacks the computer on which the inference is being executed, in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).

2Evan Hubinger8moSure—we're just trying to define things in the abstract here, though, so there's no harm in just defining the model's output to include stuff like that as well.
ofer's Shortform

[Question about reinforcement learning]

What is the most impressive/large-scale published work in RL that you're aware of where—during training—the agent's environment is the real world (rather than a simulated environment)?

Draft report on AI timelines

Let O1 and O2 be two optimization algorithms, each searching over some set of programs. Let V be some evaluation metric over programs such that V(p) is our evaluation of program p, for the purpose of comparing a program found by O1 to a program found by O2. For example, V can be defined as a subjective impressiveness metric as judged by a human.

Intuitive definition: Suppose we plot a curve for each optimization algorithm such that the x-axis is the inference compute of a yielded program and the y-axis is our evaluation value of that program. If the curves ... (read more)

Draft report on AI timelines

Some thoughts:

  1. The development of transformative AI may involve a feedback loop in which we train ML models that help us train better ML models and so on (e.g. using approaches like neural architecture search which seems to be getting increasingly popular in recent years). There is nothing equivalent to such a feedback loop in biological evolution (animals don't use their problem-solving capabilities to make evolution more efficient). Does your analysis assume there won't be such a feedback loop (or at least not one that has a large influence on timeline

... (read more)
2Ofer Givoli8moLet O1 and O2 be two optimization algorithms, each searching over some set of programs. Let V be some evaluation metric over programs such that V(p) is our evaluation of program p, for the purpose of comparing a program found by O1 to a program found by O2. For example, V can be defined as a subjective impressiveness metric as judged by a human.
Intuitive definition: Suppose we plot a curve for each optimization algorithm such that the x-axis is the inference compute of a yielded program and the y-axis is our evaluation value of that program. If the curves of O1 and O2 are similar up to scaling along the x-axis, then we say that O1 and O2 are similarly-scaling w.r.t inference compute, or SSIC for short.
Formal definition: Let O1 and O2 be optimization algorithms and let V be an evaluation function over programs. Let us denote with Oi(n) the program that Oi finds when it uses n flops (which would correspond to the training compute if Oi is an ML algorithm). Let us denote with C(p) the amount of compute that program p uses. We say that O1 and O2 are SSIC with respect to V if for any n1, n1′, n2, n2′ such that C(O1(n1))/C(O2(n2)) ≈ C(O1(n1′))/C(O2(n2′)), if V(O1(n1)) ≈ V(O2(n2)) then V(O1(n1′)) ≈ V(O2(n2′)).
I think the report draft implicitly uses the assumption that human evolution and the first ML algorithm that will result in TAI are SSIC (with respect to a relevant V). It may be beneficial to discuss this assumption in the report.
Clearly, not all pairs of optimization algorithms are SSIC (e.g. consider a pure random search + any optimization algorithm). Under what conditions should we expect a pair of optimization algorithms to be SSIC with respect to a given V? Maybe that question should be investigated empirically, by looking at pairs of optimization algorithms, where one is a popular ML algorithm and the other is some evolutionary computation [https://en.wikipedia.org/wiki/Evolutionary_computation] algorithm (searching over a very different model space), and c
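A toy numeric illustration of the SSIC condition, with made-up curves (writing C1/C2 for the inference compute of the program found with n training flops, and V1/V2 for our evaluation of it):

```python
import math

# Made-up curves: O1(n) yields a program with inference compute sqrt(n) and
# evaluation log(n); O2 behaves like O1 with a 100x-less-efficient training loop,
# i.e. O2(n) ~ O1(n / 100).
C1 = lambda n: math.sqrt(n)
V1 = lambda n: math.log(n)
C2 = lambda n: math.sqrt(n / 100)
V2 = lambda n: math.log(n / 100)

# One pair (n1, n2) and a second pair (n1', n2') with the same inference-compute ratio:
n1, n2 = 1e4, 1e6
n1p, n2p = 4e4, 4e6
print(C1(n1) / C2(n2), C1(n1p) / C2(n2p))  # 1.0, 1.0   -- same ratio of inference computes
print(V1(n1), V2(n2))                      # 9.21, 9.21 -- equal evaluations on the first pair...
print(V1(n1p), V2(n2p))                    # 10.60, 10.60 -- ...so SSIC requires the second pair
                                           # to match as well (it does for these curves)
```

For a pair that is not SSIC, replace V2 with an evaluation curve that grows at a qualitatively different rate in n; then the inference-compute ratios and the first pair's evaluations can match while the second pair's evaluations diverge.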
[AN #121]: Forecasting transformative AI timelines using biological anchors

My point here is that in a world where an algo-trading company has the lead in AI capabilities, there need not be a point in time (prior to an existential catastrophe or existential security) where investing more resources into the company's safety-indifferent AI R&D does not seem profitable in expectation. This claim can be true regardless of researchers' observations, beliefs, and actions in given situations.

[AN #121]: Forecasting transformative AI timelines using biological anchors

We might get TAI due to efforts by, say, an algo-trading company that develops trading AI systems. The company can limit the mundane downside risks that it faces from non-robust behaviors of its AI systems (e.g. by limiting the fraction of its fund that the AI systems control). Of course, the actual downside risk to the company includes outcomes like existential catastrophes, but it's not clear to me why we should expect that prior to such extreme outcomes their AI systems would behave in ways that are detrimental to economic value.

2Rohin Shah8moI predict that this will not lead to transformative AI; I don't see how an algorithmic trading system leads to an impact on the world comparable to the industrial revolution. You can tell a story where you get an Eliezer-style near-omniscient superintelligent algorithmic trading system that then reshapes the world because it is a superintelligence, and that the researchers thought that it was not a superintelligence and so assumed that the downside risk was bounded, but both clauses (Eliezer-style superintelligence and researchers being horribly miscalibrated) seem unlikely to me.
[AN #121]: Forecasting transformative AI timelines using biological anchors

you need to ensure that your model is aligned, robust, and reliable (at least if you want to deploy it and get economic value from it).

I think it suffices for the model to be inner aligned (or deceptively inner aligned) for it to have economic value, at least in domains where (1) there is a usable training signal that corresponds to economic value (e.g. users' time spent in social media platforms, net income in algo-trading companies, or even the stock price in any public company); and (2) the downside economic risk from a non-robust behavior is limited (e.g. an algo-trading company does not need its model to be robust/reliable, assuming the downside risk from each trade is limited by design).

2Rohin Shah8moSure, I mean, logistic regression has had economic value and it doesn't seem meaningful to me to say whether it is "aligned" or "inner aligned". I'm talking about transformative AI systems, where downside risk is almost certainly not limited.