As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes.
Everyone knows what the deal with natural abstractions is, right? Abstractions are regularities about the world that are really useful for representing its coarse-grained behavior - they're building blocks for communicating, compressing, or predicting information about the world. An abstraction is "natural" if it's so easy to learn, and so broadly useful, that most right-thinking agents will have it as part of their toolbox of abstractions.
The dream is to use natural abstractions to pick out what we want from an AI. Suppose "human values" are a natural abstraction: then both humans and a world-modeling AI would have nearly the exact same human values abstraction in their toolboxes of abstractions. If we can just activate the AI's human values abstraction, we can more or less avoid misalignment between what-humans-are-trying-to-pick-out and what-abstraction-the-AI-takes-as-its-target.
One might think that the main challenge to this plan would be if there are too few natural abstractions. If human values (or agency, or corrigibility, or whatever nice thing you want to target) aren't a natural abstraction, you lose the confidence that the human and the AI are pointing at the same thing. But it's also a challenge if there are too many natural abstractions.
Turns out, humans don't just have one abstraction that is "human values," they have a whole lot of 'em. Humans have many different languages / ontologies we use to talk about people, and these use different abstractions as building blocks. More than one of these abstractions gets called "human values," but they live in different ontologies and get applied in different contexts.
If none of these abstractions we use to talk about human values are natural, then we're back to the first problem. But if any of them are natural, it seems just as plausible that nearly all of them are. Abstractions don't even have to be discrete - it's perfectly possible to have a continuum.
This complicates the easy alignment plan, because it means that the structure of the world is merely doing most of the work for us rather than almost all of the work. The bigger the space of semantically similar natural abstractions you have to navigate, the more careful you have to be about your extensional definitions, and the higher your standards have to be for telling good results from bad ones.
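To make the point about extensional definitions concrete, here's a toy sketch. Everything in it is an assumption made up for illustration: candidate abstractions are modeled as cutoffs on a one-dimensional scale, and the "extensional definition" is a handful of labeled points. Real abstractions are obviously not this simple, and the function names are hypothetical. The only thing it shows is that a fixed set of examples leaves more candidates standing when the candidates are packed more densely.

```python
def consistent_candidates(thresholds, labeled_examples):
    """Return the candidate cutoffs that agree with every labeled example.

    Toy assumption: each candidate abstraction is a cutoff t on a
    one-dimensional scale, and a point x "counts" iff x >= t.
    """
    return [
        t for t in thresholds
        if all((x >= t) == label for x, label in labeled_examples)
    ]


def evenly_spaced(n):
    """n candidate cutoffs spread evenly over [0, 1]."""
    return [i / (n - 1) for i in range(n)]


# A fixed extensional definition: a few points labeled out (False) or in (True).
examples = [(0.2, False), (0.4, False), (0.7, True), (0.9, True)]

# A sparse space of candidate abstractions versus a dense, near-continuous one.
few = consistent_candidates(evenly_spaced(5), examples)
many = consistent_candidates(evenly_spaced(101), examples)

print(len(few), len(many))  # prints: 1 30
```

With five candidates, the four examples pin down a single cutoff. With a hundred and one, the same examples leave thirty candidates that all fit, so you'd need more (or better-placed) examples to tell them apart.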
It's enough for natural abstractions to work for strawberry alignment: solving a technical task with a good understanding of what it means not to leave any weird side effects, without strongly optimizing the world in the process, and safely shutting down once the task is complete. With uploads, ambitious alignment becomes much more feasible, even if it doesn't have a natural specification.