# All of Vika's Comments + Replies

Optimization Concepts in the Game of Life

Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment). So most random perturbations that overlap with the block will probably destroy it.
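This robustness claim is easy to check directly. Below is a minimal sketch (my own construction, not from the original post) that verifies both halves: removing any one cell from the block leaves an L-triomino that grows back into a block in one step, while adding a diagonally adjacent live cell destroys the pattern.

```python
from collections import Counter

def step(live):
    """One Game of Life step on a set of live (row, col) cells."""
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

block = {(0, 0), (0, 1), (1, 0), (1, 1)}

# The block is stable, and removing any one cell repairs itself in one step:
assert step(block) == block
for cell in block:
    assert step(block - {cell}) == block

# ...but adding a diagonally adjacent live cell perturbs the pattern:
assert step(block | {(2, 2)}) != block
```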

Optimization Concepts in the Game of Life
1. Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
Edouard Harris (1mo): Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.

Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hurtling towards each other under the pull of their mutual gravitation. No matter what humans may be doing on Earth, this universe as a whole can still reasonably be described as an optimizing system! Specifically, it achieves the property that the two faraway planets will crash into each other under a fairly broad set of contexts.

Now suppose we describe the state of this universe as a single point in a gargantuan phase space — let's say it's the phase space of classical mechanics, where we assign three positional and three momentum degrees of freedom to each particle in the universe (so if there are N particles in the universe, we have a 6N-dimensional phase space). Then there is a subspace of this huge phase space that corresponds to the crashing planets, and there is another, orthogonal subspace that corresponds to the Earth and its humans. You could then say that the crashing-planets subspace is an optimizing system that's independent of the human-Earth subspace. In particular, if you imagine that these planets (which are 20 light-years away from Earth) take less than 20 years to crash into each other, then the two subspaces won't come into causal contact before the planet subspace has achieved the "crashed into each other" property.
Similarly on the GoL grid, you could imagine having an interesting eater over here, while over there you have a pretty boring, mostly empty grid with just a single live cell in
Optimization Concepts in the Game of Life

Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

The 'bottle cap' example would be an optimizing system if it was robust to cells colliding / interacting with it, e.g. being hit by a glider (similarly to the eater).

Pattern (1mo): Ah. I interpreted the statement about the empty board as being one of: a small random perturbation will probably be non-viable / collapse back to the empty board (whereas patterns that are viable don't necessarily have this property). I then asked about whether the bottle cap example had the same robustness.
List of good AI safety project ideas?

Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.

Classifying specification problems as variants of Goodhart's Law

Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.

Tradeoff between desirable properties for baseline choices in impact measures

It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it.

There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure tha... (read more)

Koen Holtman (1y): Thanks for the clarification, I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments. You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from 'explicit human input'. We may be using the term 'semantic structure' in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups.

If you set up things to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function, e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic structure knowledge that is in the owner's brain. You might say instead that the metric does not have rich semantic structure at all, because it is just a bit from a button press.

For me, an impact metric that is defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, in case the world we are talking about is not a toy world but the real world.
Tradeoff between desirable properties for baseline choices in impact measures

Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline).
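The distinction can be made concrete with a small sketch (my own toy construction, not code from the paper): the inaction baseline is *seeded* at the last rewarded state and then rolled forward with noops, rather than frozen at that state.

```python
def inaction_baseline(noop_step, last_rewarded_state, steps_since_reward):
    """Roll the environment forward with noops from the last rewarded state."""
    state = last_rewarded_state
    for _ in range(steps_since_reward):
        state = noop_step(state)  # environment transition under the noop action
    return state

# Toy environment: the state is a counter that drifts upward even under noops.
noop_step = lambda s: s + 1

# A starting-state-style baseline would just be the last rewarded state (10);
# the inaction baseline continues to evolve under noops: 10 + 3 = 13.
assert inaction_baseline(noop_step, 10, 3) == 13
```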

Rohin Shah (1y): Good point, changed to
Tradeoff between desirable properties for baseline choices in impact measures

I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.

I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active verb...

Tradeoff between desirable properties for baseline choices in impact measures

Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This does require a complete environment model (which does make it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (p
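The sampled-rollout comparison described above could look something like the following sketch (the interface names are hypothetical, not from any paper): pick a random horizon N, roll out the agent policy and the inaction policy from the same state using a full environment model, and penalize the deviation between the two resulting states.

```python
import random

def sampled_rollout_penalty(model, state, agent_policy, noop_action,
                            deviation, max_horizon=10, rng=random):
    """Compare an agent-policy rollout to an inaction rollout of random length."""
    n = rng.randint(1, max_horizon)   # randomly chosen rollout length N
    s_agent, s_inaction = state, state
    for _ in range(n):
        s_agent = model(s_agent, agent_policy(s_agent))
        s_inaction = model(s_inaction, noop_action)
    return deviation(s_agent, s_inaction)

# Toy check: state is an integer, the agent always adds 2, noops add 0,
# deviation is absolute difference, so the penalty is 2 * N.
penalty = sampled_rollout_penalty(
    model=lambda s, a: s + a,
    state=0,
    agent_policy=lambda s: 2,
    noop_action=0,
    deviation=lambda a, b: abs(a - b),
    max_horizon=5,
)
assert penalty in {2, 4, 6, 8, 10}
```

Note that `model` here is a complete environment model (it must predict transitions under arbitrary agent actions), which is exactly the extra requirement mentioned above relative to inaction rollouts.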

Tradeoff between desirable properties for baseline choices in impact measures

I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly.

The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-

Tradeoff between desirable properties for baseline choices in impact measures

The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.

Adam Shimi (1y): I understood that the baseline you presented was a description of what happens by default, but I wondered if there was a way to differentiate between different judgements on what happens by default. Intuitively, killing someone by not doing something feels different from not killing someone by not doing something. So my question was a check to see if impact measures considered such judgements (which apparently they don't), and if they don't, what the problem is.
Specification gaming: the flip side of AI ingenuity

Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:

• there is a broader scope of aligned RL agent design
• there are difficult unsolved problems in this broader scope
• for sufficiently advanced agents, these problems need general solutions rather than ad-hoc ones

Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification prob... (read more)

Specification gaming: the flip side of AI ingenuity

Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).

Regarding your specific questions about the breakdown, I think faithfully capturing the human concept o... (read more)

Specification gaming: the flip side of AI ingenuity

For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.
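The contrast between the two reward forms can be sketched as follows (the 3 cm threshold is from the comment; the function names and values are my own illustration):

```python
def fixed_shaping_reward(bottom_face_height_m, grasping):
    """Fixed bonus once the brick's bottom face clears a height threshold."""
    return 1.0 if grasping and bottom_face_height_m > 0.03 else 0.0

def proportional_shaping_reward(bottom_face_height_m, grasping):
    """Hypothetical alternative: reward proportional to bottom-face height."""
    return bottom_face_height_m if grasping else 0.0

# Flipping the brick so its bottom face points up already clears the 3 cm
# threshold, so it earns the full fixed bonus without any stacking...
assert fixed_shaping_reward(0.04, grasping=True) == 1.0
# ...and stacking it higher earns no more than flipping it:
assert fixed_shaping_reward(0.04, True) == fixed_shaping_reward(0.10, True)
# A proportional reward would still favour the harder stacking behavior:
assert proportional_shaping_reward(0.10, True) > proportional_shaping_reward(0.04, True)
```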

Adam Shimi (1y): Ok, that makes much more sense. I was indeed assuming a proportional reward.
Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeof... (read more)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks Rohin for covering the post in the newsletter!

The summary looks great overall. I have a minor objection to the word "narrow" here: "we may fail to generalize from narrow AI systems to more general AI systems". When I talked about generalizing from less advanced AI systems, I didn't specifically mean narrow AI - what I had in mind was increasingly general AI systems we are likely to encounter on the path to AGI in a slow takeoff scenario.

For the opinion, I would agree that it's not clear how well the covid scenario mat... (read more)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks Wei! I agree that improving institutions is generally very hard. In a slow takeoff scenario, there would be a new path to improving institutions using powerful (but not fully general) AI, but it's unclear how well we could expect that to work given the generally low priors.

The covid response was a minor update for me in terms of AI risk assessment - it was mildly surprising given my existing sense of institutional competence.

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

I certainly agree that there are problems with the stepwise inaction baseline and it's probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it's an open question how to design a baseline that satisfies a... (read more)

Rohin Shah (2y): Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it's not clear how to apply that for humans. (I only have this objection when trying to explain what "impact" means to humans; it seems fine in the RL setting. I do think we'll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)

Also, under this inaction baseline, the roads are perpetually empty, and so you're always feeling impact from the fact that you can't zoom down the road at 120 mph, which seems wrong.

Sorry, what I meant to imply was "baselines are counterfactuals, and counterfactuals are hard, so maybe no 'natural' baseline exists". I certainly agree that my baseline is a counterfactual.

Yes, that's my main point. I agree that there's no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don't always apply (even when interpreted by humans).
Conclusion to 'Reframing Impact'

Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.

One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily tr... (read more)

Alex Turner (2y): What I actually said was: First, the "I think", and second, the "plausibly". I think the "plausibly" was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP_conceptual ("optimize the objective, without becoming more able to optimize the objective"), you don't need additional ideas to get a superintelligence-safe impact measure.
Conclusion to 'Reframing Impact'

Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.

Here are some reasons I don't endorse this approach:

1. I have an intuitive sense that defining the auxiliary reward in terms of the... (read more)

Alex Turner (2y): I think this makes sense – you come in and wonder "what's going on, this doesn't even pass the basic test cases?!". Some context: in the superintelligent case, I often think about "what agent design would incentivize putting a strawberry on a plate, without taking over the world"? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it. If we can't deal with power-seeking, then we can't deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case.

Under this threat model, the random reward AUP penalty (and the RR penalty AFAICT) can be avoided with the help of a "delusion box" which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart's subagent sequence [https://www.lesswrong.com/posts/PmqQKBmt2phMT7YLG/subagents-and-impact-measures-summary-tables])

I investigated whether we can get an equation which implements the reasoning in my first comment: "optimize the objective, without becoming more able to optimize the objective". As you say, I think Rohin and others have given good arguments that my preliminary equations don't work as well as we'd like. Intuitively, though, it feels like there might be a better way to implement that reasoning. I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it's not clear to me that that's true.
I'm pretty curious about that – implementing eg Stuart's power-seeking gridworld [https://www.lesswrong.com/posts/sYjCeZTwA84pHkhBJ/appendix-how-a-subagent-c
AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).

As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, whic... (read more)

Rohin Shah (2y): To the extent that there is a natural choice (counterfactuals are hard), I think it would be "what the human expected the agent to do" (the same sort of reasoning that led to the previous state baseline). This gives the same answer as the stepwise inaction baseline in your example (because usually we don't expect a specific person to step on our feet or to steal our wallet).

An example where it gives a different answer is in driving. The stepwise inaction baseline says "impact is measured relative to all the other drivers going comatose", so in the baseline state many accidents happen, and you get stuck in a huge traffic jam. Thus, all the other drivers are constantly having a huge impact on you by continuing to drive! In contrast, the baseline of "what the human expected the agent to do" gets the intuitive answer -- the human expected all the other drivers to drive normally, and so normal driving has ~zero impact, whereas if someone actually did fall comatose and cause an accident, that would be quite impactful.

EDIT: Tbc, I think this is the "natural choice" if you want to predict what humans would say is impactful; I don't have a strong opinion on what the "natural choice" would be if you wanted to successfully prevent catastrophe via penalizing "impact". (Though in this case the driving example still argues against stepwise inaction.)
Conclusion to 'Reframing Impact'

I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, ther... (read more)

Alex Turner (2y): You seem to have misunderstood. Impact to a person is change in their AU [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/C74F7QTEAYSTGAytJ]. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want [https://www.lesswrong.com/posts/fj8eyc7QzqCaB8Wgm/attainable-utility-landscape-how-the-world-is-changed] – it has to preserve our ability to do what we want!

The Catastrophic Convergence Conjecture [https://www.lesswrong.com/posts/w6BtMqKRLxG9bNLMr/the-catastrophic-convergence-conjecture] says: Logically framed, the argument is: catastrophe → power-seeking (obviously, this isn't a tautology or absolute rule, but that's the structure of the argument). Attainable Utility Preservation: Concepts [https://www.lesswrong.com/posts/75oMAADr4265AGK3L/attainable-utility-preservation-concepts] takes the contrapositive: no power-seeking → no catastrophe. Then, we ask – "for what purpose does the agent gain power?". The answer is: for its own purpose [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/S8AGyJJsdBFXmxHcb#Auxiliary_loopholes]. Of course.[1]

One of the key ideas I have tried to communicate is [https://www.lesswrong.com/posts/75oMAADr4265AGK3L/attainable-utility-preservation-concepts]: AUP_conceptual does not try to look out into the world and directly preserve human values. AUP_conceptual penalizes the agent for gaining power, which disincentivizes huge catastrophes & huge decreases in our attainable utilities.

I agree it would perform poorly, but that's because the CCC does not apply [https://www.lesswrong.com/posts/w6BtMqKRLxG9bNLMr/the-catastrophic-convergence-conjecture#Detailing_the_catastrophic_convergence_conjecture__CCC_] to SafeLife. We don't need to worry about the agent gaining power over other agents. Instead, the agent can be viewed as the exclusive interface through which we can interact with a given SafeLife level, so it should preserve our AU by preserving its own AUs.
Where exactl
AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is "change in my ability to get what I want", i.e. change in the true human utility function. This is a broad statement that does not specify how to measure "change", in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not a

Alex Turner (2y): AU theory [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/coQCEe962sjbcCqB9] says that people feel impacted as new observations change their on-policy value estimate (so it's the TD error). I agree with Rohin's interpretation as I understand it. However, AU theory is descriptive – it describes when and how we feel impacted, but not how to build agents which don't impact us much. That's what the rest of the sequence talked about.
Rohin Shah (2y): The thing that I believe (irrespective of whether RI says it or not) is: "Humans find new information 'impactful' to themselves when it changes how good they expect their future to be." (In practice, there's a host of complications because humans are messy, e.g. uncertainty about how good the future is also tends to feel impactful.) In particular, if humans had perfect beliefs and knew exactly what would happen at all times, no information could ever change how good they expect their future to be, and so nothing could ever be impactful.

Since this is tied to new information changing what you expect, it seems like the natural baseline is the previous state. Separately, I also think that RI was trying to argue for this conclusion, but I'll defer to Alex about what he was or wasn't trying to claim / argue for.
Announcing Web-TAISU, May 13-17

Thanks Linda for organizing, looking forward to it!

(In)action rollouts

I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:

• What does it mean in this context for A to implement a policy? I thought A was building a subagent SA and then following ∅ forever, thus not following π*_{0,τ,k} for any τ at any point.
• If A follows π_0 for τ turns and then follows π*_R, how are τ and k chosen?
• It's not clear to me that SA can act to ensure the baseline value of V_k for all values of τ and k unless it does nothing.

I think it might help to illustrate this proposal in your original

Stuart Armstrong (2y): I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What A will do: ∅ forever (maybe with some extra info for the SA).

What A would counterfactually do (to define V_k(s^(t)_{t+τ})): π_0 for τ turns (the π_0 rollout, to define s^(t)_{t+τ}) followed by π*_R (to define the V_k aspect). Call this π*_{0,τ,k}. The penalty is defined in terms of these V_k(s^(t)_{t+τ}), entirely counterfactual except for τ = 0.

What SA must do: iff A follows π*_{0,τ,k}, then SA must ensure that V_k(s^(t)_{t+τ}) = V_k(s'^(t)_{t+τ}).

So a key question is: how early can SA detect if A is following π*_{0,τ,k}? If it can detect this immediately, then it can immediately implement the policy for ensuring that V_k(s^(t)_{t+τ}) = V_k(s'^(t)_{t+τ}) (for the stricter impact measure) or the much easier BV_k(s_t) = BV_k(s'_t) (for the original impact measure). If it has to wait τ turns to be sure, then it needs to preserve its power to control many different k.

Now, A will never actually follow π*_{0,τ,k}, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the rewards.

In practice, A will signal early on that it's not following π*_{0,τ,k}. Maybe there's a very small impact action, ∅_ε, with a tiny penalty. Then A would like to use that to signal to SA that it's not in the counterfactual world where it is following π*_{0,τ,k}. So long as π_0's first action isn't ∅_ε, that signal is enough (of course, if π_0's first action isn't ∅, then ∅ is a sufficient signal). After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in. The one mild constraint remaining comes from s^(t)_{t+0} = s_t, which is not a counterfactually defined state (though V_k(s_t) is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.
Stepwise inaction and non-indexical impact measures

I don't think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don't need to tell apart other states that are not the starting state.

I agree that we need to compare to the penalty if the subagent is not created - I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).

Stuart Armstrong (2y): Alas, the inaction rollouts don't seem to fix the problem: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/z9MfmF8gA7SBxGSmb
Stepwise inaction and non-indexical impact measures

I think this problem is about capturing delayed effects of the agent's actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don't think it's useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).

Here, the inaction rollouts don't capture the effects of the subag... (read more)
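The inaction-rollout mechanism described above (comparing the agent's action followed by k noops against k+1 noops) can be sketched as follows. This is my own schematic illustration with a hypothetical environment-model interface, using a toy "timer" environment where the action's effect only appears several steps later.

```python
def delayed_effect_penalties(model, state, action, noop, deviation, horizon):
    """For each k, compare (action + k noops) against (k + 1 noops)."""
    penalties = []
    for k in range(horizon):
        s_act = model(state, action)
        s_base = model(state, noop)
        for _ in range(k):
            s_act = model(s_act, noop)
            s_base = model(s_base, noop)
        penalties.append(deviation(s_act, s_base))
    return penalties

# Toy environment: state is (timer, exploded). The action sets a timer;
# each noop ticks it down; when it runs out, the "explosion" occurs.
def model(state, action):
    timer, exploded = state
    if action == "set_timer":
        return (2, exploded)
    if timer is not None:          # tick down on a noop
        timer = timer - 1
        if timer < 0:
            return (None, True)    # the delayed effect fires
        return (timer, exploded)
    return (None, exploded)

deviation = lambda a, b: int(a[1] != b[1])  # compare only the "exploded" effect

penalties = delayed_effect_penalties(model, (None, False), "set_timer", "noop",
                                     deviation, 5)
assert penalties == [0, 0, 0, 1, 1]  # the effect only shows up after 3 noops
```

With no rollouts (k = 0) the delayed effect is invisible, which is exactly why a stepwise baseline without rollouts fails to penalize delayed effects.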

Stuart Armstrong (2y): This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).

There's another issue; it's not enough to show that the subagent triggers a penalty. We need to show the penalty is larger than not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don't include creating an intelligent being), creating a subagent might be lower penalty than almost any other action. It won't be a lower penalty than the agent doing nothing for ever, of course. But we typically want the agent to do something, so will calibrate the penalty or R_0 for that. And it's plausible that creating the subagent will have lower penalty (and/or higher R_0) than any safe "something".
Stuart Armstrong (2y): The inaction rollouts effectively transform the stepwise inaction baseline into an inaction baseline (starting from the moment the subagent is created; thus the agent has a bit more control than in a true inaction baseline). Therefore the results on the inaction baseline apply (https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/M9aoMixFLf8JFLRaP). This means that restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed by the subagent ("make sure you never increase these rewards"). Roughly, attainable utility becomes twenty billion questions.

For the original example [https://www.lesswrong.com/posts/sYjCeZTwA84pHkhBJ/attainable-utility-has-a-subagent-problem], this means that the agent cannot press the red button nor gain the ability to teleport. But while the subagent cannot press the red button, it can gain the ability to teleport.
Building and using the subagent

Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too.

I'm wondering whether th... (read more)

Specification gaming examples in AI

I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.

3Oliver Habryka2yThe biggest benefit for me has come from using this list in conversation, when I am trying to explain the basics of AI risk to someone, or am generally discussing the topic. Before this list came out, I would often have to come up with an example of some specification gaming problem on the fly, and even though I would be confident that my example was representative, I couldn't be sure that it actually happened, which often detracted from the point I was trying to make. After this list came out, I just had a pre-cached list of many examples that I could bring up at any point that I knew had actually happened, and where I could easily reference and find the original source if the other person wanted to follow up on that.
Specification gaming examples in AI

Thanks Ben! I'm happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list: https://www.gwern.net/Tanks#alternative-examples.

Thoughts on "Human-Compatible"

Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.

Classifying specification problems as variants of Goodhart's Law

Thanks Evan, glad you found this useful! The connection with the inner/outer alignment distinction seems interesting. I agree that the inner alignment problem falls in the design-emergent gap. Not sure about the outer alignment problem matching the ideal-design gap though, since I would classify tampering problems as outer alignment problems, caused by flaws in the implementation of the base objective.

Reversible changes: consider a bucket of water

I think the discussion of reversibility and molecules is a distraction from the core of Stuart's objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn't valuable (e.g. whether it has sentimental value to someone).

If AUP is not value-agnostic, it is using human preference information to fill in the "what we want" part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agr... (read more)

3Stuart Armstrong2yThat's an excellent summary.
2Alex Turner2yI agree that it's not the core, and I think this is a very cogent summary. There's a deeper disagreement about what we need done that I'll lay out in detail in Reframing Impact.
Reversible changes: consider a bucket of water

Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn't kick the bucket:

• Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward. For example, if the agent's goal is to put out a fire on the other end of the pool, you would set a low weight on the impact penalty, which enables the agent to take irreversible actions in order to achieve the goal. This is why impact measures use a reward-penalty tradeoff rather than a c
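The reward-penalty tradeoff described in the bullet above can be sketched as follows (an illustrative sketch, not from the original discussion; the function name and all the numbers are hypothetical):

```python
# Illustrative sketch of the reward-penalty tradeoff in impact measures:
# the weight on the impact penalty encodes how valuable the goal is relative
# to the impact. All names and numbers here are hypothetical.

def total_objective(reward: float, impact_penalty: float, weight: float) -> float:
    """The agent maximizes task reward minus a weighted impact penalty."""
    return reward - weight * impact_penalty

# Putting out the fire requires an irreversible action (kicking the bucket)
# that incurs a large impact penalty.
fire_reward = 10.0
irreversible_penalty = 5.0
do_nothing = total_objective(0.0, 0.0, weight=0.5)

# With a low weight on the penalty, the irreversible action is worth taking:
assert total_objective(fire_reward, irreversible_penalty, weight=0.5) > do_nothing
# With a high weight, the agent prefers inaction and the bucket survives:
assert total_objective(fire_reward, irreversible_penalty, weight=3.0) < do_nothing
```

The design choice here is that relative value lives in a single scalar weight rather than in the (value-agnostic) penalty itself.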
3michaelcohen2yProposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features. This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.
2Stuart Armstrong2yYep, I agree :-) Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches, require some but not all of human preferences. "some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.
Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

Thanks Abram for this sequence - for some reason I wasn't aware of it until someone linked to it recently.

Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.

3Abram Demski1yAh, looks like I missed this question for quite a while! I agree that it's not quite one or the other. I think that like wireheading, we can split delusion box into "the easy problem" and "the hard problem". The easy delusion box is solved by making a reward/utility which is model-based, and so, knows that the delusion box isn't real. Then, much like observation-utility functions, the agent won't think entering into the delusion box is a good idea when it's planning -- and also, won't get any reward even if it enters into the delusion box accidentally (so long as it knows this has happened). But the hard problem of delusion box would be: we can't make a perfect model of the real world in order to have model-based avoidance of the delusion box. So how do we guarantee that an agent avoids "generalized delusion boxes"?
TAISU - Technical AI Safety Unconference

Janos and I are coming for the weekend part of the unconference.

Risks from Learned Optimization: Introduction

I'm confused about the difference between a mesa-optimizer and an emergent subagent. A "particular type of algorithm that the base optimizer might find to solve its task" or a "neural network that is implementing some optimization process" inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?

5Evan Hubinger2yI think my concern with describing mesa-optimizers as emergent subagents is that they're not really "sub" in a very meaningful sense, since we're thinking of the mesa-optimizer as the entire trained model, not some portion of it. One could describe a mesa-optimizer as a subagent in the sense that it is "sub" to gradient descent, but I don't think that's the right relationship—it's not like the mesa-optimizer is some subcomponent of gradient descent; it's just the trained model produced by it. The reason we opted for "mesa" is that I think it reflects more of the right relationship between the base optimizer and the mesa-optimizer, wherein the base optimizer is "meta" to the mesa-optimizer rather than the mesa-optimizer being "sub" to the base optimizer. Furthermore, in my experience, when many people encounter "emergent subagents" they think of some portion of the model turning into an agent and (correctly) infer that something like that seems very unlikely, as it's unclear why such a thing would actually be advantageous for getting a model selected by something like gradient descent (unlike mesa-optimization, which I think has a very clear story for why it would be selected for). Thus, we want to be very clear that something like that is not the concern being presented in the paper.
Best reasons for pessimism about impact of impact measures?

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but intends to capture some general measure of the agent's influence over the environment that does not depend on the state representation.

Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) -> observations (e.g. pixels) -> state repr... (read more)

Best reasons for pessimism about impact of impact measures?

There are various parts of your explanation that I find vague and could use a clarification on:

• "AUP is not about state" - what does it mean for a method to be "about state"? Same goes for "the direct focus should not be on the state" - what does "direct focus" mean here?
• "Overfitting the environment" - I know what it means to overfit a training set, but I don't know what it means to overfit an environment.
• "The long arms of opportunity cost and instrumental convergence" - what do "long ar
1Alex Turner3yHere's a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn't "about" the state of the solar system.

I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can't speak for TurnTrout, and there's a decent chance that I'm confused about some of the things here. But here is how I think about AUP and the points raised in this chain:

• "AUP is not about the state" - I'm going to take a step back, and pretend we have an agent working with AUP reasoning. We've specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued exis
3Alex Turner3yThese are good questions. As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.
Best reasons for pessimism about impact of impact measures?

Thanks for the detailed explanation - I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn't understand the state representation invariance claim in the AUP proposal, though I didn't realize that it is as central to the proposal as you describe here.

I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that d... (read more)

I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all.

It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that's a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.

I think the claim is more that while the penalty does depend on the state, it's no... (read more)
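The point that the penalty on 'power' does depend on the state can be sketched as a change in attainable utilities (a hedged sketch only; the function name and the Q-values are hypothetical, and the auxiliary set is simplified to two utility functions):

```python
# Sketch: 'power' as the agent's ability to achieve a set of auxiliary
# utilities, measured by Q-values from the current state. The penalty is the
# average absolute change in attainable utility, so it is a function of the
# state the agent moves to. All numbers are hypothetical.

def aup_penalty(q_before: dict, q_after: dict) -> float:
    """Average absolute change in attainable utility across auxiliary utilities."""
    return sum(abs(q_after[u] - q_before[u]) for u in q_before) / len(q_before)

# Taking over the world raises attainable utility for almost every auxiliary
# utility, so it incurs a much larger penalty than an innocuous move:
normal = {"u1": 0.2, "u2": 0.3}
took_over_world = {"u1": 0.9, "u2": 0.95}
small_step = {"u1": 0.25, "u2": 0.3}
assert aup_penalty(normal, took_over_world) > aup_penalty(normal, small_step)
```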

Best reasons for pessimism about impact of impact measures?
Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

I don't quite understand what you're asking here, could you clarify?

If you had to pose the problem of impact measurement as a question, what would it be?

Something along the lines of: "How can we measure to what extent the agent is changing the world in ways that we care about?". Why?

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remain confused, and then subconsciou... (read more)

Best reasons for pessimism about impact of impact measures?
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.

What makes you think that?

This is an intuitive claim based on ... (read more)

1Alex Turner3yAre you thinking of an action observation formalism, or some kind of reward function over inferred state? If you had to pose the problem of impact measurement as a question, what would it be?
Best reasons for pessimism about impact of impact measures?

Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:

Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specif... (read more)

2Alex Turner3yThanks for the detailed list! What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model? What makes you think that?
Best reasons for pessimism about impact of impact measures?

I don't see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don't capture butterfly effects (e.g. person X can distinguish between the world... (read more)

4Alex Turner3yI think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.
4DanielFilan3yIndeed - a point I think is illustrated by the Chaotic Hurricanes test case [https://www.alignmentforum.org/posts/wzPzPmAsG3BwrBrwy/test-cases-for-impact-regularisation-methods] . I'm probably most excited about methods that would use transparency techniques to determine when a system is deliberately optimising for a part of the world (e.g. the members of the long-term future population) that we don't want it to care about, but this has a major drawback of perhaps requiring multiple philosophical advances into the meaning of reference in cognition and a greater understanding of what optimisation is.
Specification gaming examples in AI

As a result of the recent attention, the specification gaming list has received a number of new submissions, so this is a good time to check out the latest version :).

Towards a New Impact Measure

Thanks, glad you liked the breakdown!

The agent would have an incentive to stop anyone from doing anything new in response to what the agent did

I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.

The penalty for the original action will take into account human reactions in the inacti... (read more)
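The way the stepwise counterfactual removes the incentive to interfere with human reactions can be sketched in a toy model (a minimal sketch under hypothetical dynamics; none of these names come from the original discussion):

```python
# Toy sketch of the stepwise inaction baseline: the penalty at each step
# compares the next state under the chosen action with a noop taken from the
# *current* state. Human reactions to an earlier agent action occur in both
# branches, so they cancel and create no incentive to suppress them.

def env_transition(state: frozenset, agent_action: str) -> frozenset:
    """Hypothetical dynamics: humans react (in every branch, including noop)
    one step after the agent's original action."""
    next_state = set(state)
    if agent_action == "act":
        next_state.add("agent_effect")
    if "agent_effect" in state:
        next_state.add("human_reaction")
    return frozenset(next_state)

def stepwise_penalty(state: frozenset, agent_action: str) -> int:
    """Size of the symmetric difference between action and noop branches."""
    return len(env_transition(state, agent_action) ^ env_transition(state, "noop"))

s0 = frozenset()
assert stepwise_penalty(s0, "act") == 1        # original action is penalized
s1 = env_transition(s0, "act")
assert stepwise_penalty(s1, "noop") == 0       # human reaction is in the baseline too
```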

2Alex Turner3yI think it’s generally a good property as a reasonable person would execute it. The problem, however, is the bad ex ante clinginess plans, where the agent has an incentive to pre-emptively constrain our reactions as hard as it can (and this could be really hard). The problem is lessened if the agent is agnostic to the specific details of the world, but like I said, it seems like we really need IV (or an improved successor to it) to cleanly cut off these perverse incentives. I’m not sure I understand the connection to scapegoating for the agents we’re talking about; scapegoating is only permitted if credit assignment is explicitly part of the approach and there are privileged "agents" in the provided ontology.

Thanks Rohin for a great summary as always!

I think the property of handling shutdown depends on the choice of absolute value or truncation at 0 in the deviation measure, not the choice of the core part of the deviation measure. RR doesn't handle shutdown because by default it is set to only penalize reductions in reachability (using truncation at 0). I would expect that replacing the truncation with absolute value (thus penalizing increases in reachability as well) would result in handling shutdown (but break the asymmetry property from the RR paper).... (read more)
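The truncation-versus-absolute-value choice described above can be sketched directly (illustrative only; the reachability numbers are hypothetical):

```python
# Sketch of the deviation-measure design choice: RR's default truncation at 0
# penalizes only reductions in reachability, so preventing shutdown (which
# *increases* reachability relative to the baseline) goes unpenalized.
# Taking the absolute value penalizes increases too. Numbers are hypothetical.

def penalty_truncated(baseline: float, actual: float) -> float:
    return max(0.0, baseline - actual)   # only reductions are penalized

def penalty_absolute(baseline: float, actual: float) -> float:
    return abs(baseline - actual)        # increases are penalized as well

# In the baseline the agent gets shut down, so baseline reachability is low;
# resisting shutdown keeps reachability high:
assert penalty_truncated(0.1, 0.9) == 0.0   # no penalty for resisting shutdown
assert penalty_absolute(0.1, 0.9) > 0.0     # resisting shutdown is penalized
```

This also illustrates why switching to the absolute value would break the asymmetry property from the RR paper: reductions and increases are treated alike.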

Towards a New Impact Measure

There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:

Baseline

• Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
• Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents' actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events i