Thanks to David Udell, Alex Turner, and others for conversations that led to this post.
Recently, in a conversation with David Udell, Nate Soares said:
[David's summary of shard theory] doesn't yet convince me that you know something i don't about the hopefullness of such a plan. the sort of summary that might have that effect is a summary of what needs to be true about the world (in this case, probably about RL models, optimization, and/or human values) for this idea to have hope. in particular, the point where i start to be interested in engaging is the point where it seems to me like you perceive the difficulties and have a plan that you expect to overcome those difficulties.
I'm not as pessimistic as Nate, but I also think shard theory needs more concreteness and exploration of failure modes. As an initial step, I spent an hour brainstorming with David Udell, with each of us trying to think of difficulties, and then another couple of hours writing this up. Our methodology was to write down a plan for the sake of concreteness (even if we think it's unlikely to work), then try to identify as many potential difficulties as possible.
Note that this is just one version of the shard theory alignment plan; other versions might replace RL with large language models or other systems.
Exercise for the reader: Which of these persist with simulator-based plans like GEM?
Note: I've thought about this less than I would like, so I'm fairly unconfident in it, but I'm posting it anyway.
In a non-shard theory frame like Risks from Learned Optimization, we decompose the alignment problem into outer alignment (finding an outer objective aligned with human values) and inner alignment (finding a training process such that an AI's behavioral objective matches the outer objective).
In the shard theory frame, the observation that reward is not the optimization target motivates a different decomposition: our reward signal no longer has to be an outer objective that we expect the AI to be aligned with. But my understanding is that we still have to:
These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment. Some analogies from humans imply that various core alignment problems really are easy, but there are also reasons to think they aren't.
First is understanding the reward -> behavioral shard map. In the RLO frame, we have an outer loss function that updates the agent in almost the correct direction, but has some failure modes like deceptive alignment. In a world where models were perfect Bayesian samplers with no path-dependence, the ideal reward function would just be performance on an outer objective. Every departure from the perfect Bayesian sampler implies, in theory, some way to better shape behavior than just using an outer objective. Despite this advantage over the "inner alignment" framing, some of the problems with inner alignment remain.
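To make the path-dependence point concrete, here is a toy sketch (my own illustration, not from any shard theory writeup): a purely greedy Q-learner on a two-armed bandit, where the policy the agent converges to depends on its initialization rather than only on which arm the reward function actually favors. A perfect Bayesian sampler would end up preferring the higher-reward arm regardless of starting point; this learner need not.

```python
def train_greedy_bandit(init_q, rewards, steps=1000, lr=0.1):
    """Purely greedy Q-learning on a deterministic 2-armed bandit.

    With no exploration, the learned policy depends on the *path*
    of updates (here, the initialization), not just on the reward
    function itself.
    """
    q = list(init_q)
    for _ in range(steps):
        arm = 0 if q[0] >= q[1] else 1          # greedy action choice
        q[arm] += lr * (rewards[arm] - q[arm])  # standard Q update
    return q

# Arm 1 pays more (1.0 vs 0.5), so reward-as-objective predicts
# the agent should end up preferring arm 1.
rewards = [0.5, 1.0]

# Initialized to believe arm 0 is great: the agent never tries
# arm 1 and locks in on the worse arm forever.
q_locked = train_greedy_bandit([2.0, 0.0], rewards)
assert q_locked[0] > q_locked[1]

# With the opposite initialization, it converges on arm 1 instead.
q_free = train_greedy_bandit([0.0, 2.0], rewards)
assert q_free[1] > q_free[0]
```

The reward function is identical in both runs; only the training history differs, and that history, not the reward function viewed as an objective, determines the learned behavior.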
If we view the inner alignment problem as distinguishing functions that are identical on the training distribution, it becomes clear that shard theory has not dissolved inner alignment. Selecting on behavior alone is not sufficient to guarantee inner alignment, and for the same reason, we will need good transparency or process-based feedback (as part of our knowledge about the reward -> shards map) to reliably induce behaviors we want.
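As a toy illustration of this (my own example, not from the post): two functions can agree on every training input yet diverge off-distribution, so no amount of selecting on training behavior can distinguish them.

```python
def f_intended(x):
    # The behavior we want: accept all non-negative inputs.
    return x >= 0

def f_deceptive(x):
    # Agrees with f_intended on the training range, but
    # defects on inputs the training process never saw.
    return 0 <= x < 100

train_inputs = range(0, 50)

# Indistinguishable by behavior on the training distribution...
assert all(f_intended(x) == f_deceptive(x) for x in train_inputs)

# ...but they come apart out of distribution.
assert f_intended(1000) != f_deceptive(1000)
```

This is why behavioral selection alone underdetermines which function the training process actually found, and why transparency or process-based feedback seems necessary.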
Work can be traded off between specifying desired behavioral tendencies and value formation, and the part that happens at subhuman capability seems doable. I'll assume that specifying behavioral tendencies happens at subhuman level and value formation is done as the system scales to superhuman level.
In my opinion, the main hope of shard theory is the analogy to humans: human values are somewhat reliably produced by the human reward system, despite the reward system not acting like an outer objective. But when we consider the third subproblem, value formation, the analogy breaks down. Human value formation seems really complex, and it's not clear that human values can be fully described by game-theoretic negotiations between fully agentic, self-preserving shards associated with your different behavioral tendencies.
Corrigibility might be easier to learn, but it's still the case that only some ways shards could exist in a mind cause its goals to scale correctly to superintelligence. For example, if shards are just self-preserving circuits that encode behaviors activated by certain observations, then when the agent goes OOD, the observations that prompted the agent to activate the shard (e.g. be helpful to humans) are no longer present. Or, if shards have goals and a shard doesn't prevent its goals from being modified, then its goals will be overwritten. Or, if training causes the corrigibility shards to be limited in intelligence whereas shards with other goals keep getting smarter, the corrigibility shards will eventually lose influence over the mind. If there are important forces in value formation other than internal competition and negotiation between self-preserving shards (which seems highly likely given how humans work), there are even more failure modes, which is why I think a predictable value formation method is key.
In reality, there is a continuum of coherence levels between behavioral tendencies and values.
I think corrigibility is natural iff robust pointers to it can easily get into the AI's goals, and it's not clear whether this is the case; this is a disagreement between Eliezer and Paul.
and maybe "be able to identify" as well; it depends how reliable the mapping from (1) is
unless you can get a really strong human prior, which is where the simulators hope comes from
I think I care about animal suffering due to some combination of (a) it's high status in my culture; (b) I did some abstract thinking that drew a similarity between animal suffering and human suffering, and I had already decided I care about humans; (c) I wanted to "have moral clarity" (whatever that means), went vegan for a month, and decided that the version of me without associations between animals and food had better moral intuitions. It's not as simple as an "animal suffering bad" shard in my brain outcompeting "animal suffering okay" shards.
I have reasons outside the scope of this post why the particular subagent models shard theory researchers have been using seem unlikely.
I think I have two main complaints still, on a skim.
First, I think the following is wrong:
These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment.
I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two that is more specific, but I wanted to flag that I very much disagree with this quote.
Second, I'm wary of saying "maybe we can get corrigibility" or "maybe corrigibility doesn't fit into a utility function", because this can map shard theory hopes onto old debates where we have already settled into positions. In contrast, I consider myself to be thinking about qualitatively different questions and spreads of values I might hope to get into an AI.
I think corrigibility is natural iff robust pointers to it can easily get into the AI's goals
This doesn't make sense to me. It sounds like saying "liking yellow cubes is natural iff we can get a pointer to 'liking yellow cubes' within the AI's goals." That sounds like a thing which would be said if we had no idea how yellow cubes got liked, directly, and were instead treating liking-yellow-cubeness as a black box which happened to exist in the real world (e.g. how corrigibility, or the desire to help people, could be "pointed to" in a classic corrigibility hope).
I have more thoughts on this post but I don't have time to type more for now.