Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment). So most random perturbations that overlap with the block will probably destroy it.
Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
The 'bottle cap' example would be an optimizing system if it were robust to cells colliding / interacting with it, e.g. being hit by a glider (similarly to the eater).
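The two robustness claims about the block can be checked with a few lines of code. This is a minimal sketch using the standard Game of Life rules on an unbounded board; the names `step` and `BLOCK`, and the choice of which adjacent cell to add, are illustrative assumptions rather than anything from the post.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one Game of Life generation."""
    # Count live neighbours for every cell adjacent to at least one live cell.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Alive next step: exactly 3 neighbours, or 2 neighbours if already alive.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

BLOCK = {(0, 0), (0, 1), (1, 0), (1, 1)}

# Removing any single cell: the block regrows in one generation.
recovers = all(step(BLOCK - {cell}) == BLOCK for cell in BLOCK)

# Adding one adjacent cell (position picked arbitrarily for illustration).
perturbed = step(BLOCK | {(0, 2)})

print(recovers)            # True
print(perturbed == BLOCK)  # False
```

This only checks a single added cell for a single generation, so it illustrates the asymmetry rather than proving anything about collisions with gliders.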
Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.
Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.
I hoped to get more comments on this post... (read more)
It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it.
There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure tha... (read more)
Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline).
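To make the distinction concrete, here is a minimal sketch of computing the baseline state in the sparse reward case. The toy dynamics (`env_step`, an object on a belt that advances every step) and all names are hypothetical, chosen only so that the environment keeps changing while the agent does nothing.

```python
NOOP = 0

def env_step(state, action):
    """Hypothetical toy dynamics: the agent moves by `action`,
    and a belt object advances by one every step regardless."""
    agent, belt = state
    return (agent + action, belt + 1)

def inaction_baseline(last_rewarded_state, steps_since_reward):
    """Start from the last rewarded state and apply noops up to the current time."""
    state = last_rewarded_state
    for _ in range(steps_since_reward):
        state = env_step(state, NOOP)
    return state

last_rewarded = (3, 10)
baseline = inaction_baseline(last_rewarded, steps_since_reward=4)
print(baseline)  # (3, 14)
```

Using `last_rewarded` itself as the baseline would ignore the four steps of belt movement, which is what would make it a starting state baseline rather than an inaction baseline.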
I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.
I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active ver
Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This does require a complete environment model (which does make it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (p
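A rough sketch of the sampled-rollout idea, under several assumptions that are not in the comment: a deterministic integer-valued toy environment model (`env_step`), a cap `max_steps` on the rollout length, and absolute difference as the deviation measure.

```python
import random

NOOP = 0

def env_step(state, action):
    """Hypothetical complete environment model: the state is an integer position."""
    return state + action

def rollout(state, policy, n_steps):
    """Follow `policy` for n_steps under the model and return the final state."""
    for _ in range(n_steps):
        state = env_step(state, policy(state))
    return state

def sampled_penalty(state, agent_policy, max_steps, rng):
    """Sample a rollout length N, then compare the agent policy to inaction."""
    n = rng.randint(1, max_steps)
    agent_state = rollout(state, agent_policy, n)
    inaction_state = rollout(state, lambda s: NOOP, n)
    # Assumed deviation measure: absolute difference between the two states.
    return n, abs(agent_state - inaction_state)

n, penalty = sampled_penalty(0, lambda s: 1, max_steps=5, rng=random.Random(0))
print(n, penalty)
```

With an agent that moves one unit per step, the penalty equals the sampled N, and an agent that always takes the noop gets a penalty of zero.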
I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly.
The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-
The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.
Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:
Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification prob... (read more)
Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).
Regarding your specific questions about the breakdown, I think faithfully capturing the human concept o... (read more)
Thanks Adam for the feedback - glad you enjoyed the post!
For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.
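The difference between the two reward shapes can be sketched as follows. Only the 3cm threshold comes from the description above; the function names, the bonus and scale values, and the example heights are invented placeholders, not measurements from the experiment.

```python
THRESHOLD_CM = 3.0  # height threshold mentioned above

def fixed_shaping_reward(bottom_face_height_cm, bonus=1.0):
    """Fixed bonus once the bottom face is above the threshold."""
    return bonus if bottom_face_height_cm > THRESHOLD_CM else 0.0

def proportional_shaping_reward(bottom_face_height_cm, scale=1.0):
    """Alternative (not what was used): reward grows with the height."""
    return scale * bottom_face_height_cm

# Placeholder heights of the bottom face for two behaviours.
flipped_cm, stacked_cm = 4.0, 5.0

print(fixed_shaping_reward(flipped_cm), fixed_shaping_reward(stacked_cm))  # 1.0 1.0
print(proportional_shaping_reward(flipped_cm), proportional_shaping_reward(stacked_cm))  # 4.0 5.0
```

Under the fixed reward, any behaviour that gets the bottom face above the threshold earns the same bonus, so the easier one (flipping) is as good as the harder one (stacking).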
Thanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeof... (read more)
Thanks Rohin for covering the post in the newsletter!
The summary looks great overall. I have a minor objection to the word "narrow" here: "we may fail to generalize from narrow AI systems to more general AI systems". When I talked about generalizing from less advanced AI systems, I didn't specifically mean narrow AI - what I had in mind was increasingly general AI systems we are likely to encounter on the path to AGI in a slow takeoff scenario.
For the opinion, I would agree that it's not clear how well the covid scenario mat... (read more)
Thanks Wei! I agree that improving institutions is generally very hard. In a slow takeoff scenario, there would be a new path to improving institutions using powerful (but not fully general) AI, but it's unclear how well we could expect that to work given the generally low priors.
The covid response was a minor update for me in terms of AI risk assessment - it was mildly surprising given my existing sense of institutional competence.
I certainly agree that there are problems with the stepwise inaction baseline and it's probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it's an open question how to design a baseline that satisfies a... (read more)
Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.
One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily tr... (read more)
Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.
Here are some reasons I don't endorse this approach:
1. I have an intuitive sense that defining the auxiliary reward in terms of the... (read more)
I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, whic... (read more)
I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, ther... (read more)
After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is "change in my ability to get what I want", i.e. change in the true human utility function. This is a broad statement that does not specify how to measure "change", in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not a
Thanks Linda for organizing, looking forward to it!
I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:
I think it might help to illustrate this proposal in your original
I don't think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don't need to tell apart other states that are not the starting state.
I agree that we need to compare to the penalty if the subagent is not created - I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).
As you mention... (read more)
I think this problem is about capturing delayed effects of the agent's actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don't think it's useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).
Here, the inaction rollouts don't capture the effects of the subag... (read more)
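As a toy illustration of how inaction rollouts pick up a delayed effect that a rollout-free stepwise baseline misses, consider the sketch below. The environment (a push that breaks a vase two steps later), the deviation measure, the cap `max_k`, and taking the maximum over k are all assumptions made for the example.

```python
NOOP, PUSH = "noop", "push"

def env_step(state, action):
    """Hypothetical toy model: PUSH starts a ball rolling,
    and the ball breaks a vase two steps later."""
    countdown, vase_broken = state
    if countdown is not None:
        countdown -= 1
        return (None, True) if countdown == 0 else (countdown, vase_broken)
    if action == PUSH:
        return (2, vase_broken)
    return state

def noops(state, k):
    """Apply k noops to the state."""
    for _ in range(k):
        state = env_step(state, NOOP)
    return state

def deviation(s1, s2):
    """Assumed deviation measure: 1 if the vase status differs, else 0."""
    return int(s1[1] != s2[1])

def no_rollout_penalty(state, action):
    """Stepwise baseline without rollouts: compare the immediate next states."""
    return deviation(env_step(state, action), env_step(state, NOOP))

def rollout_penalty(state, action, max_k):
    """Compare action + k noops with k + 1 noops for k = 1..max_k."""
    after_action = env_step(state, action)
    after_noop = env_step(state, NOOP)
    return max(deviation(noops(after_action, k), noops(after_noop, k))
               for k in range(1, max_k + 1))

start = (None, False)
print(no_rollout_penalty(start, PUSH))        # 0
print(rollout_penalty(start, PUSH, max_k=3))  # 1
```

The delayed breakage only shows up at k = 2, so a rollout horizon shorter than the delay would miss it as well.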
Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too.
I'm wondering whether th... (read more)
I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I'... (read more)
Thanks Ben! I'm happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list: https://www.gwern.net/Tanks#alternative-examples.
Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.
Thanks Evan, glad you found this useful! The connection with the inner/outer alignment distinction seems interesting. I agree that the inner alignment problem falls in the design-emergent gap. Not sure about the outer alignment problem matching the ideal-design gap though, since I would classify tampering problems as outer alignment problems, caused by flaws in the implementation of the base objective.
I think the discussion of reversibility and molecules is a distraction from the core of Stuart's objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn't valuable (e.g. whether it has sentimental value to someone).
If AUP is not value-agnostic, it is using human preference information to fill in the "what we want" part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agr... (read more)
Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn't kick the bucket:
Thanks Abram for this sequence - for some reason I wasn't aware of it until someone linked to it recently.
Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.
Janos and I are coming for the weekend part of the unconference.
I'm confused about the difference between a mesa-optimizer and an emergent subagent. A "particular type of algorithm that the base optimizer might find to solve its task" or a "neural network that is implementing some optimization process" inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?
Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.
AU is a function of the world state, but intends to capture some general measure of the agent's influence over the environment that does not depend on the state representation.
Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) -> observations (e.g. pixels) -> state repr... (read more)
There are various parts of your explanation that I find vague and could use a clarification on:
I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can't speak for TurnTrout, and there's a decent chance that I'm confused about some of the things here. But here is how I think about AUP and the points raised in this chain:
Thanks for the detailed explanation - I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn't understand the state representation invariance claim in the AUP proposal, though I didn't realize that it is as central to the proposal as you describe here.
I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that d... (read more)
I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all.
It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that's a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.
I think the claim is more that while the penalty does depend on the state, it's no... (read more)
Are you thinking of an action observation formalism, or some kind of reward function over inferred state?
I don't quite understand what you're asking here, could you clarify?
If you had to pose the problem of impact measurement as a question, what would it be?
Something along the lines of: "How can we measure to what extent the agent is changing the world in ways that we care about?". Why?
So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remain confused, and then subconsciou... (read more)
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?
I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.
What makes you think that?
This is an intuitive claim based on ... (read more)
Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:
Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specif... (read more)
I don't see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don't capture butterfly effects (e.g. person X can distinguish between the world... (read more)
As a result of the recent attention, the specification gaming list has received a number of new submissions, so this is a good time to check out the latest version :).
Awesome, thanks Oliver!
Thanks, glad you liked the breakdown!
The agent would have an incentive to stop anyone from doing anything new in response to what the agent did
I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.
The penalty for the original action will take into account human reactions in the inacti... (read more)
Thanks Rohin for a great summary as always!
I think the property of handling shutdown depends on the choice of absolute value or truncation at 0 in the deviation measure, not the choice of the core part of the deviation measure. RR doesn't handle shutdown because by default it is set to only penalize reductions in reachability (using truncation at 0). I would expect that replacing the truncation with absolute value (thus penalizing increases in reachability as well) would result in handling shutdown (but break the asymmetry property from the RR paper).... (read more)
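The two options can be written side by side. This is a minimal sketch where `baseline_value` and `current_value` stand for some scalar quantity such as the reachability of a given state; the function names and the numbers are illustrative assumptions.

```python
def truncated_penalty(baseline_value, current_value):
    """Truncation at 0: penalize only decreases relative to the baseline."""
    return max(baseline_value - current_value, 0.0)

def absolute_penalty(baseline_value, current_value):
    """Absolute value: penalize decreases and increases equally."""
    return abs(baseline_value - current_value)

baseline = 0.5

# A decrease is penalized by both variants.
print(truncated_penalty(baseline, 0.25), absolute_penalty(baseline, 0.25))  # 0.25 0.25

# An increase is penalized only by the absolute value variant.
print(truncated_penalty(baseline, 0.75), absolute_penalty(baseline, 0.75))  # 0.0 0.25
```

The second case shows the trade-off: the absolute value variant penalizes increases, while the truncated variant treats increases and decreases differently, which is the asymmetry referred to above.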
There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:
Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it's bad for an impact measure to not distinguish be... (read more)