Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors.
AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.
A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence what future happens?
Examining the concept of optimization, Abram Demski distinguishes between "selection" (like search algorithms that evaluate many options) and "control" (like thermostats or guided missiles). He explores how this distinction relates to ideas of agency and mesa-optimization, and considers various ways to define the difference.
Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.
But, he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model
The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail.
Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?
Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.
AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions.
Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.
A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks to should you buy to maximize productivity?"
The credit assignment problem – the challenge of figuring out which parts of a complex system deserve credit for good or bad outcomes – shows up just about everywhere. Abram Demski describes how credit assignment appears in areas as diverse as AI, politics, economics, law, sociology, biology, ethics, and epistemology.
Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically. On the other hand, systems designed by genetic algorithms (aka simulated evolution) are decidedly not modular. They're a mess. This can also be verified statistically (as well as just by qualitatively eyeballing them)
What's up with that?