Victoria Krakovna

Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog:


I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts. 

I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tuned with RL, which creates agentic incentives on the simulator level as well. 

You make a good point about the difficulty of identifying dangerous models if the danger is triggered by very specific prompts. I think this may go both ways though, by making it difficult for a simulated agent to execute a chain of dangerous behaviors, which could be interrupted by certain inputs from the user. 

I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic. 

Thanks Richard for this post, it was very helpful to read! Some quick comments:

  • I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
  • The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
  • Phase 1 and 2 seem to map to outer and inner alignment respectively. 
  • Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? The post "How likely is deceptive alignment" seems to argue that they may not occur, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned. 
  • I'm confused why mechanistic interpretability is listed under phase 3 in the research directions - surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 the deceptively aligned model will get around the interpretability techniques. 

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically targeted at simulated agents. For example, a simulated agent might need some level of persistence within the simulation to execute these behaviors, and we may be able to influence the simulator to generate less persistent agents. 


Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". 

Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.

Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments). 

Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy. 
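
For anyone who would rather not sort by hand, here is a rough sketch of how the sorting could be automated once the ratings are exported from the spreadsheet. It assumes each row is a statement plus one numeric rating per person, that higher numbers mean more agreement, and that blank cells come through as `None`. The function name `sort_by_agreement` and the example data are made up for illustration and are not taken from the actual survey.

```python
from statistics import mean


def sort_by_agreement(rows):
    """Sort (statement, ratings) rows by mean rating, highest first.

    Assumption: ratings are numbers where higher means more agreement,
    and a blank cell is represented as None.
    """
    def average(ratings):
        filled = [r for r in ratings if r is not None]
        # Rows with no ratings at all are pushed to the end of the list.
        return mean(filled) if filled else float("-inf")

    return sorted(rows, key=lambda row: average(row[1]), reverse=True)


# Hypothetical data: one row per statement, one rating column per person.
example = [
    ("Statement A", [1, 2, None]),
    ("Statement B", [5, 4, 5]),
    ("Statement C", [3, 3, 3]),
]

ranked = sort_by_agreement(example)
for statement, ratings in ranked:
    print(statement, ratings)
```

Sorting by the mean is only one possible reading of "agreement"; a median or a count of ratings above some threshold would work the same way by swapping out the `average` helper.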
