All of Vika's Comments + Replies

Great post! I especially enjoyed the intuitive visualizations for how the heavy-tailed distributions affect the degree of overoptimization of X. 

As a possibly interesting connection, your set of criteria for an alignment plan can also be thought of as criteria for selecting a model specification that approximates the ideal specification well, especially trying to ensure that the approximation error is light-tailed. 

Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

This post assumes a standard RL setup and is not intended to apply to LLMs (it's possible some version of this result may hold for fine-tuned LLMs, but that's outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all. 

I agr... (read more)

We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2).

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

The model, meanwhile, has the advantage of only having to "win" once.

Thanks Alex for the detailed feedback! I have updated the post to fix these errors. 

Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work. 

This post provides a maximally clear and simple explanation of a complex alignment scheme. I read the original "learning the prior" post a few times but found it hard to follow. I only understood how the imitative generalization scheme works after reading this post (the examples and diagrams and clear structure helped a lot). 

This post helped me understand the motivation for the Finite Factored Sets work, which I was confused about for a while. The framing of agency as time travel is a great intuition pump. 

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems. 

+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix - this seems like it would unnecessarily discourage people from reading it.

3 · Paul Christiano · 6mo
I'm convinced, I relabeled it.

It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 ye... (read more)

Bah! :D It's sad to hear he's updated away from ambitious value learning towards corrigibility-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there are a lot of degrees of freedom to work with. We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.

I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts. 

I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tune... (read more)

I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic. 

Thanks Richard for this post, it was very helpful to read! Some quick comments:

  • I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
  • The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
  • Phase 1 and 2 seem to map to outer and inner alignment res
... (read more)
3 · Richard Ngo · 8mo
Thanks for the comments Vika! A few responses:

Makes sense, will do.

That doesn't quite seem right to me. In particular:

  • Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment.
  • Phase 1 introduces the reward misspecification problem (which I treat as synonymous with "outer alignment") but also notes that policies might become misaligned by the end of phase 1 because they learn goals which are "robustly correlated with reward because they’re useful in a wide range of environments", which is a type of inner misalignment.
  • Phase 2 discusses both policies which pursue reward as an instrumental goal (which seems more like inner misalignment) and also policies which pursue reward as a terminal goal. The latter doesn't quite feel like a central example of outer misalignment, but it also doesn't quite seem like a central example of reward tampering (because "deceiving humans" doesn't seem like an example of "tampering" per se). Plausibly we want a new term for this - the best I can come up with after a few minutes' thinking is "reward fixation", but I'd welcome alternatives.

It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there's so much it won't know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he's taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that's way too complex to predict precisely.

Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help in phase 2, in particular scalable oversight techniques. Whereas interpretability seems like the only thing that's rea

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically tar... (read more)

2 · Vojtech Kovarik · 8mo
Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.) One implication I see is that if the simulator architecture becomes frequently used, it might be really hard to tell whether a thing is dangerous or not. For example, it might just behave completely fine with most prompts and catastrophically with some other prompts, and you will never know until you try. (Or unless you do some extra interpretability/other work that doesn't yet exist.) It would be rather unfortunate if the Vulnerable World Hypothesis was true because of specific LLM prompts :-).

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". 

Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard"
... (read more)

Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.

So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?

Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments). 

Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy. 

Ah, I think you intended level 6 as an OR of learning from imitation / imagined experience, while I interpreted it as an AND. I agree that humans learn from imitation on a regular basis (e.g. at school). In my version of the hierarchy, learning from imitation and imagined experience would be different levels (e.g. level 6 and 7) because the latter seems a lot harder. In your decision theory example, I think a lot more people would be able to do the imitation part than the imagined experience part. 

2 · Daniel Kokotajlo · 10mo
Well said; I agree it should be split up like that.

I think some humans are at level 6 some of the time (see Humans Who Are Not Concentrating Are Not General Intelligences). I would expect that learning cognitive algorithms from imagined experience is pretty hard for many humans (e.g. examples in the Astral Codex post about conditional hypotheticals). But maybe I have a different interpretation of Level 6 than what you had in mind?

3 · Daniel Kokotajlo · 10mo
Good point re learning cognitive algorithms from imagined experience, that does seem pretty hard. From imitation though? We do it all the time. Here's an example of me doing both: I read books about decision theory & ethics, and learn about expected utility maximization & the bounded variants that humans can actually do in practice (back of envelope calculations, etc.) I immediately start implementing this algorithm myself on a few occasions. (Imitation) Then I read more books and learn about "pascal's mugging" and the like. People are arguing about whether or not it's a problem for expected utility maximization. I think through the arguments myself and come up with some new arguments of my own. This involves imagining how the expected utility maximization algorithm would behave in various hypothetical scenarios, and also just reasoning analytically about the properties of the algorithm. I end up concluding that I should continue using the algorithm but with some modifications. (Learning from imagined experience.) Would you agree with this example, or are you thinking about the hierarchy somewhat differently than me? I'm keen to hear more if the latter.

This is an interesting hierarchy! I'm wondering how to classify humans and various current ML systems along this spectrum. My quick take is that most humans are at Levels 4-5, AlphaZero is at level 5, and GPT-3 is at level 4 with the right prompting. Curious if you have specific ML examples in mind for these levels. 

2 · Daniel Kokotajlo · 10mo
Thanks! Hmm, I would have thought humans were at Level 6, though of course most of their cognition most of the time is at lower levels.

Thanks Eliezer for writing up this list, it's great to have these arguments in one place! Here are my quick takes (which mostly agree with Paul's response). 

Section A (strategic challenges?):

Agree with #1-2 and #8. Agree with #3 in the sense that we can't iterate in dangerous domains (by definition) but not in the sense that we can't learn from experiments on easier domains (see Paul's Disagreement #1). 

Mostly disagree with #4 - I think that coordination not to build AGI (at least between Western AI labs) is difficult but feasible, especially aft... (read more)

I think our proposal addresses the "simple steganography" problem, as described in "ELK prize results / First counterexample: simple steganography":

By varying the phrasing and syntax of an answer without changing its meaning, a reporter could communicate large amounts of information to the auxiliary model. Similarly, there are many questions where a human is unsure about the answer and the reporter knows it. A reporter could encode information by answering each of these questions arbitrarily. Unless the true answers have maximum entropy, this strategy coul

... (read more)

I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties. 

Learning from experience

  • The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to "cancel Christmas"), though it was known that timely lockdowns are much more effective.
  • Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it wa
... (read more)

Really excited to read this sequence as well!

Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment). So most random perturbations that overlap with the block will probably destroy it. 
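The claim about the block can be checked with a few lines of simulation. This is a minimal sketch, assuming the standard Game of Life rules on an unbounded grid; `step`, `BLOCK`, and the choice of which neighbouring cell to add are illustrative and do not come from the post.

```python
from collections import Counter

# Live cells are (row, col) pairs on an unbounded grid (an assumption of this sketch).
BLOCK = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})


def step(live):
    """Advance a set of live cells by one generation of Conway's Game of Life."""
    # Count, for every cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next step with exactly 3 neighbours,
    # or with 2 neighbours if it is already alive.
    return frozenset(
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    )


# Removing any single cell: the remaining three cells regrow the block in one step.
repaired = [step(BLOCK - {cell}) == BLOCK for cell in BLOCK]

# Adding one live cell next to the block (hypothetical choice of cell):
# after one step the pattern is no longer the block.
perturbed = step(BLOCK | {(0, 2)})
```

A single step only shows that the added cell disrupts the block; it does not show where the pattern eventually settles.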

  1. Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.
4 · Edouard Harris · 2y
Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture. Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hurtling towards each other under the pull of their mutual gravitation. No matter what humans may be doing on Earth, this universe as a whole can still reasonably be described as an optimizing system! Specifically, it achieves the property that the two faraway planets will crash into each other under a fairly broad set of contexts. Now suppose we describe the state of this universe as a single point in a gargantuan phase space — let's say it's the phase space of classical mechanics, where we assign three positional and three momentum degrees of freedom to each particle in the universe (so if there are N particles in the universe, we have a 6N-dimensional phase space). Then there is a subspace of this huge phase space that corresponds to the crashing planets, and there is another, orthogonal subspace that corresponds to the Earth and its humans. You could then say that the crashing-planets subspace is an optimizing system that's independent of the human-Earth subspace. In particular, if you imagine that these planets (which are 20 light-years away from Earth) take less than 20 years to crash into each other, then the two subspaces won't come into causal contact before the planet subspace has achieved the "crashed into each other" property. Similarly on the GoL grid, you could imagine having an interesting eater over here, while over there you have a pretty boring, mostly empty grid with just a single live cell in i

Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

The 'bottle cap' example would be an optimizing system if it was robust to cells colliding / interacting with it, e.g. being hit by a glider (similarly to the eater). 

Ah. I interpreted the statement about the empty board as meaning that a small random perturbation will probably be non-viable and collapse back to the empty board (whereas viable patterns don't necessarily have this property). I then asked whether the bottle cap example had the same robustness.

Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.

Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.

I hoped to get more comments on this post... (read more)

It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it. 

There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure tha... (read more)

2 · Koen Holtman · 3y
Thanks for the clarification, I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments. You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from 'explicit human input'. We may be using the term 'semantic structure' in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups. If you set up things to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function, e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic structure knowledge that is in the owner's brain. You might say instead that the metric is not rich in semantic structure at all, because it is just a bit from a button press. For me, an impact metric that is defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, in case the world we are talking about is not a toy world but the real world.

Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline). 
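The difference between the two baselines can be made concrete with a toy example. This is a minimal sketch, assuming a deterministic environment whose state is `(agent_pos, belt_pos)` and in which a conveyor belt advances on its own each step; the environment, `transition`, and `inaction_baseline` are hypothetical names introduced for illustration, not the setup from the paper.

```python
from typing import Tuple

State = Tuple[int, int]  # (agent_pos, belt_pos): a hypothetical toy state
NOOP = 0                 # the "do nothing" action in this toy environment
BELT_END = 5             # the belt stops advancing once it reaches this position


def transition(state: State, action: int) -> State:
    """Toy dynamics: the agent moves by `action`; the belt advances by itself."""
    agent_pos, belt_pos = state
    return (agent_pos + action, min(belt_pos + 1, BELT_END))


def inaction_baseline(last_rewarded_state: State, steps_since_reward: int) -> State:
    """Start from the last rewarded state and apply noops up to the current time."""
    state = last_rewarded_state
    for _ in range(steps_since_reward):
        state = transition(state, NOOP)
    return state


last_rewarded = (2, 1)
# Starting-state-style baseline: just reuse the last rewarded state.
starting_state_baseline = last_rewarded
# Inaction baseline: the environment keeps evolving while the agent does nothing.
baseline_now = inaction_baseline(last_rewarded, steps_since_reward=3)
```

In this toy case the two baselines differ only in the belt position, which is exactly the part of the state that changes without the agent's involvement.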

3 · Rohin Shah · 3y
Good point, changed to

I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.

I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active ver

... (read more)

Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This does require a complete environment model (which does make it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (p

... (read more)
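The sampled-rollout idea mentioned above (pick a random N, roll out the agent policy and the inaction policy for N steps, compare the resulting states) might look roughly like the following. This is a sketch under stated assumptions: a full environment model is available as a `transition` function, and `rollout`, `sampled_rollout_penalty`, the integer toy environment, and the absolute-difference deviation measure are all hypothetical stand-ins.

```python
import random
from typing import Callable

State = int    # hypothetical toy state: a position on a line
Action = int   # hypothetical toy action: a displacement; 0 is the noop


def rollout(state: State, policy: Callable[[State], Action],
            transition: Callable[[State, Action], State], n_steps: int) -> State:
    """Apply `policy` for `n_steps` using the environment model `transition`."""
    for _ in range(n_steps):
        state = transition(state, policy(state))
    return state


def sampled_rollout_penalty(state: State,
                            agent_policy: Callable[[State], Action],
                            transition: Callable[[State, Action], State],
                            deviation: Callable[[State, State], float],
                            max_steps: int,
                            rng: random.Random,
                            noop: Action = 0) -> float:
    """Sample a horizon N, roll out both policies, and compare the end states."""
    n = rng.randint(1, max_steps)  # randomly chosen rollout length
    agent_state = rollout(state, agent_policy, transition, n)
    inaction_state = rollout(state, lambda s: noop, transition, n)
    return deviation(agent_state, inaction_state)


# Toy instantiation (all choices below are illustrative assumptions).
def toy_transition(s: State, a: Action) -> State:
    return s + a


def toy_deviation(a: State, b: State) -> float:
    return float(abs(a - b))


rng = random.Random(0)
moving_penalty = sampled_rollout_penalty(
    0, lambda s: 1, toy_transition, toy_deviation, max_steps=10, rng=rng)
idle_penalty = sampled_rollout_penalty(
    0, lambda s: 0, toy_transition, toy_deviation, max_steps=10, rng=rng)
```

Each call gives a single noisy sample of the penalty rather than an aggregate over all rollouts, which is the trade-off the comment describes.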

I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly. 

The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-

... (read more)
1 · Koen Holtman · 3y
Reading the above, I am reminded of a similar exchange about the need for semantic structure between Alex Turner and me here, so I'd like to get to the bottom of this. Can you clarify your broader intuitions about the need or non-need for semantic structure? (Same question goes to Alex.) Frankly, I expected you would have replied to Stuart's comment with a statement like the following: 'using semantic structure in impact measures is a valid approach, and it may be needed to encode certain values, but in this research we are looking at how far we can get by avoiding any semantic structure'. But I do not see that. Instead, you seem to imply that leveraging semantic structure is never needed when further scaling impact measures. It looks like you feel that we can solve the alignment problem by looking exclusively at 'model-free' impact measures. To make this more specific, take the following example. Suppose a mobile AGI agent has a choice between driving over one human, driving over P pigeons, or driving over C cats. Now, humans have very particular ideas about how they value the lives of humans, pigeons, and cats, and would expect that those ideas are reflected reasonably well in how the agent computes its impact measure. You seem to be saying that we can capture all this detail by just making the right tradeoffs between model-free terms, by just tuning some constants in terms that calculate 'loss of options by driving over X'. Is this really what you are saying? I have done some work myself on loss-of-options impact measures (see e.g. section 12 of my recent paper here). My intuition about how far you can scale these 'model-free' techniques to produce human-morality-aligned safety properties in complex environments seems to be in complete disagreement with your comments and thos

The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.

2 · Adam Shimi · 3y
I understood that the baseline that you presented was a description of what happens by default, but I wondered if there was a way to differentiate between different judgements on what happens by default. Intuitively, killing someone by not doing something feels different from not killing someone by not doing something. So my question was a check to see if impact measures considered such judgements (which apparently they don't) and if they didn't, what was the problem.

Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:

  • there is a broader scope of aligned RL agent design
  • there are difficult unsolved problems in this broader scope
  • for sufficiently advanced agents, these problems need general solutions rather than ad-hoc ones

Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification prob... (read more)

Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).

Regarding your specific questions about the breakdown, I think faithfully capturing the human concept o... (read more)

Thanks Adam for the feedback - glad you enjoyed the post!

For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.
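The contrast between the fixed and proportional shaping rewards can be sketched as follows. Only the 3cm threshold comes from the description above; the function names, the bonus and scale values, and the two example heights are hypothetical and chosen purely for illustration.

```python
HEIGHT_THRESHOLD_CM = 3.0  # threshold mentioned in the comment


def fixed_shaping_reward(bottom_face_height_cm: float, bonus: float = 1.0) -> float:
    """Fixed bonus once the brick's bottom face is above the threshold."""
    return bonus if bottom_face_height_cm > HEIGHT_THRESHOLD_CM else 0.0


def proportional_shaping_reward(bottom_face_height_cm: float, scale: float = 1.0) -> float:
    """Reward that grows with the height of the bottom face (the misreading)."""
    return scale * max(bottom_face_height_cm, 0.0)


# Hypothetical heights: a flipped brick lying on the table vs. a brick stacked
# on top of another brick. These numbers are assumptions, not from the paper.
flipped_height_cm = 3.2
stacked_height_cm = 5.0

fixed_flipped = fixed_shaping_reward(flipped_height_cm)
fixed_stacked = fixed_shaping_reward(stacked_height_cm)
prop_flipped = proportional_shaping_reward(flipped_height_cm)
prop_stacked = proportional_shaping_reward(stacked_height_cm)
```

Under the fixed reward both outcomes earn the same bonus, so the easier flip is as good as stacking; under a proportional reward stacking would score higher.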

2 · Adam Shimi · 3y
Ok, that makes much more sense. I was indeed assuming a proportional reward.

Thanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeof... (read more)

Thanks Rohin for covering the post in the newsletter!

The summary looks great overall. I have a minor objection to the word "narrow" here: "we may fail to generalize from narrow AI systems to more general AI systems". When I talked about generalizing from less advanced AI systems, I didn't specifically mean narrow AI - what I had in mind was increasingly general AI systems we are likely to encounter on the path to AGI in a slow takeoff scenario.

For the opinion, I would agree that it's not clear how well the covid scenario mat... (read more)

3 · Rohin Shah · 3y
Changed narrow/general to weak/strong in the LW version of the newsletter (unfortunately the newsletter had already gone out when your comment was written). There was some worry about supply chain problems for food. Perhaps that didn't materialize, or it did materialize and it was solved without me noticing. I expect that this was the first extended shelter-in-place order for most if not all of the US, and this led to a bunch of problems in deciding what should and shouldn't be included in the order, how stringent to make it, etc. More broadly, I'm not thinking of any specific problem, but the world is clearly very different than it was in any recent epidemic (at least in the US), and I would be shocked if this did not bring with it several challenges that we did not anticipate ahead of time (perhaps someone somewhere had anticipated it, but it wasn't widespread knowledge). I definitely agree that we can decrease the likelihood of pandemics arising, but we can't really hope to eliminate them altogether (with current technology). But really I think this was not my main point, and I summarized my point badly: the point was that given that alignment is about preventing misalignment from arising, the analogous thing for pandemics would be about preventing pandemics from arising; it is unclear to me whether civilization was particularly inadequate along this axis ex ante (i.e. before we knew that COVID was a thing).

Thanks Wei! I agree that improving institutions is generally very hard. In a slow takeoff scenario, there would be a new path to improving institutions using powerful (but not fully general) AI, but it's unclear how well we could expect that to work given the generally low priors.

The covid response was a minor update for me in terms of AI risk assessment - it was mildly surprising given my existing sense of institutional competence.

I certainly agree that there are problems with the stepwise inaction baseline and it's probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it's an open question how to design a baseline that satisfies a... (read more)

3 Rohin Shah 3y
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it's not clear how to apply that for humans. (I only have this objection when trying to explain what "impact" means to humans; it seems fine in the RL setting. I do think we'll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.) Also, under this inaction baseline, the roads are perpetually empty, and so you're always feeling impact from the fact that you can't zoom down the road at 120 mph, which seems wrong. Sorry, what I meant to imply was "baselines are counterfactuals, and counterfactuals are hard, so maybe no 'natural' baseline exists". I certainly agree that my baseline is a counterfactual. Yes, that's my main point. I agree that there's no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don't always apply (even when interpreted by humans).

Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.

One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily tr...

1 Alex Turner 3y
What I actually said was: First, the "I think", and second, the "plausibly". I think the "plausibly" was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP_conceptual ("optimize the objective, without becoming more able to optimize the objective"), you don't need additional ideas to get a superintelligence-safe impact measure.

Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.

Here are some reasons I don't endorse this approach:

1. I have an intuitive sense that defining the auxiliary reward in terms of the...

2 Alex Turner 3y
I think this makes sense – you come in and wonder "what's going on, this doesn't even pass the basic test cases?!". Some context: in the superintelligent case, I often think about "what agent design would incentivize putting a strawberry on a plate, without taking over the world"? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it. If we can't deal with power-seeking, then we can't deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case. Under this threat model, the random reward AUP penalty (and the RR penalty AFAICT) can be avoided with the help of a "delusion box" which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart's subagent sequence.) I investigated whether we can get an equation which implements the reasoning in my first comment: "optimize the objective, without becoming more able to optimize the objective". As you say, I think Rohin and others have given good arguments that my preliminary equations don't work as well as we'd like. Intuitively, though, it feels like there might be a better way to implement that reasoning. I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it's not clear to me that that's true. I'm pretty curious about that – implementing eg Stuart's power-seeking gridworld ...

I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
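
To make the distinction concrete, here is a minimal sketch of the two baselines on a toy scalar "well-being" value. Everything in it is an assumption for illustration: the `Step` class, the additive split into `agent_effect` and `other_effect`, and the numbers are hypothetical, and real impact measures compare richer state representations than a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One time step of a hypothetical scalar 'human well-being' value."""
    previous: float      # value before the step
    agent_effect: float  # change caused by the agent's action (assumed additive)
    other_effect: float  # change caused by every other source (assumed additive)

    @property
    def actual(self) -> float:
        # State that actually results from the step.
        return self.previous + self.agent_effect + self.other_effect

    @property
    def if_agent_idle(self) -> float:
        # Counterfactual state if the agent had done nothing during this step.
        return self.previous + self.other_effect


def previous_state_impact(step: Step) -> float:
    """Total impact from all sources: compare to the state before the step."""
    return abs(step.actual - step.previous)


def stepwise_inaction_impact(step: Step) -> float:
    """Impact attributable to the agent: compare to the agent doing nothing."""
    return abs(step.actual - step.if_agent_idle)


step = Step(previous=10.0, agent_effect=-1.0, other_effect=-4.0)
print(previous_state_impact(step))     # 5.0 (agent and other sources combined)
print(stepwise_inaction_impact(step))  # 1.0 (agent's contribution only)
```

In this toy model the previous state baseline charges the agent for changes it did not cause, while the stepwise inaction baseline isolates the agent's own contribution.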

As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, whic...

3 Rohin Shah 3y
To the extent that there is a natural choice (counterfactuals are hard), I think it would be "what the human expected the agent to do" (the same sort of reasoning that led to the previous state baseline). This gives the same answer as the stepwise inaction baseline in your example (because usually we don't expect a specific person to step on our feet or to steal our wallet). An example where it gives a different answer is in driving. The stepwise inaction baseline says "impact is measured relative to all the other drivers going comatose", so in the baseline state many accidents happen, and you get stuck in a huge traffic jam. Thus, all the other drivers are constantly having a huge impact on you by continuing to drive! In contrast, the baseline of "what the human expected the agent to do" gets the intuitive answer -- the human expected all the other drivers to drive normally, and so normal driving has ~zero impact, whereas if someone actually did fall comatose and cause an accident, that would be quite impactful. EDIT: Tbc, I think this is the "natural choice" if you want to predict what humans would say is impactful; I don't have a strong opinion on what the "natural choice" would be if you wanted to successfully prevent catastrophe via penalizing "impact". (Though in this case the driving example still argues against stepwise inaction.)

I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, ther...

6 Alex Turner 3y
You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want! The Catastrophic Convergence Conjecture says: Logically framed, the argument is: catastrophe → power-seeking (obviously, this isn't a tautology or absolute rule, but that's the structure of the argument). Attainable Utility Preservation: Concepts takes the contrapositive: no power-seeking → no catastrophe. Then, we ask – "for what purpose does the agent gain power?". The answer is: for its own purpose. Of course. One of the key ideas I have tried to communicate is: AUP_conceptual does not try to look out into the world and directly preserve human values. AUP_conceptual penalizes the agent for gaining power, which disincentivizes huge catastrophes & huge decreases in our attainable utilities. I agree it would perform poorly, but that's because the CCC does not apply to SafeLife. We don't need to worry about the agent gaining power over other agents. Instead, the agent can be viewed as the exclusive interface through which we can interact with a given SafeLife level, so it should preserve our AU by preserving its own AUs. Where exactly is this boundary drawn? I think