I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment with respect to our innate reward circuitry. Rather, they are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, and this is how we practically interact with our values. Most human value learning occurs through this linguistic learning, which is grounded by our innate drives but extended to much higher abstractions by language. That is, humans learn their values through some combination of bottom-up evaluation (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. This means using the AGI's 'linguistic cortex', which already has encoded verbal knowledge about human morality and values, to evaluate potential courses of action and to provide a reward signal which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us 'truly want' specific outcomes (if humans even do), as opposed to wanting reward or the correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, in which case our AGIs would look highly anthropomorphic.
Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization (i.e. effectively planning over a world model) is necessary in situations where a) you can't behaviourally clone existing behaviour and b) you can't self-play too much with a model-free RL algorithm and so must rely on the world model. In such a scenario you do not have ground truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
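To make 'planning over a world model against a learnt reward function' concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `world_model`, `learnt_reward` and `plan` are hypothetical stand-ins (a 1-D position, a hand-written reward in place of a trained one, and brute-force search in place of a real planner).

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right


def world_model(state, action):
    """Hypothetical learnt dynamics: a 1-D position shifted by the action."""
    return state + action


def learnt_reward(state, target=3):
    """Hypothetical learnt reward: a stand-in for an implicit evaluation of states."""
    return -abs(state - target)


def plan(state, horizon=4):
    """Score every action sequence under the model and return the best one."""

    def rollout_return(actions):
        total, s = 0.0, state
        for a in actions:
            s = world_model(s, a)      # imagined transition, no real environment
            total += learnt_reward(s)  # evaluated by the learnt reward only
        return total

    return max(product(ACTIONS, repeat=horizon), key=rollout_return)


best_plan = plan(state=0)
print(best_plan)
```

Note that the agent never sees a ground truth reward here: the plan it picks is only as good as `learnt_reward`, which is exactly where the optimisation pressure lands.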
I am also not sure that an agent that explicitly optimises this is hard to align and the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have an exact ground truth reward function. Goodharting is essentially isomorphic to overfitting and can in theory be solved with various kinds of regularisation. In particular, if the AI maintains a well-calibrated sense of reward function uncertainty, then in theory we can derive quantitative bounds on its divergence from the true reward function.
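One simple form this regularisation could take is penalising actions on which an ensemble of reward models disagrees. The sketch below is an illustration under assumptions, not a claim about any particular system: the ensemble predictions are made-up numbers, and `pessimistic_reward` is a hypothetical helper that subtracts the ensemble's standard deviation from its mean.

```python
import numpy as np


def pessimistic_reward(ensemble_predictions, penalty=1.0):
    """Mean ensemble reward minus a penalty proportional to ensemble disagreement."""
    preds = np.asarray(ensemble_predictions, dtype=float)
    return preds.mean(axis=0) - penalty * preds.std(axis=0)


# Rows: hypothetical reward models. Columns: candidate actions.
# Action 0 is in-distribution (the models agree). Action 1 is an exploit that
# one model scores very highly while the others do not.
predictions = np.array([
    [1.0, 9.0],
    [1.1, 0.0],
    [0.9, 0.0],
])

naive_choice = int(np.argmax(predictions.mean(axis=0)))
robust_choice = int(np.argmax(pessimistic_reward(predictions, penalty=1.0)))
print(naive_choice, robust_choice)
```

Optimising the mean alone picks the exploit, while the uncertainty-penalised score picks the action the models agree on. How well this works in practice depends on the ensemble's disagreement actually being well calibrated.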
I feel like this is a good point in general, but I think there is an important but subtle distinction between the two examples. In the first case, the GAN, the distinction is between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers/make pretty pictures.
The wire-heading case feels different, in that you have essentially two separate value functions: a cortical LM-based one, which can extrapolate values in linguistic/concept space, and a classic RL basal-ganglia value function, which is based on your personal experience. The difference here is mostly in training data: the cortex is trained on a large sensory corpus, including linguistic text describing wire-heading, while the subcortical value function is largely trained on personal rewarding experiences. It would be odd for them to necessarily always be consistent, and such consistency would lead to strange failure modes exactly like wire-heading, or more generally to being viscerally convinced of anything you read that sounds convincing.
My understanding is that, after a lot of simplifications, policy gradient methods just take a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, they are 'guaranteed' to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO's regularisation bits, these guarantees no longer hold, but the fundamental RL algorithms they are built on do have them. The places to go deeper into this are Sutton and Barto's textbook and Bertsekas' dynamic programming textbook.
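As a small illustration of the tabular case where such guarantees do apply, here is a sketch of Q-learning on a toy 4-state chain. The environment, constants and function names are assumptions made up for this example. It uses Q-learning rather than PPO or policy gradients, and a uniformly random behaviour policy so that every state-action pair keeps being visited.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 4, 2, 3  # hypothetical chain: states 0..3, goal at 3
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            action = int(rng.integers(N_ACTIONS))  # uniform random exploration
            next_state, reward, done = step(state, action)
            # Bootstrapped target; no bootstrapping from the terminal state.
            target = reward + (0.0 if done else GAMMA * q[next_state].max())
            q[state, action] += ALPHA * (target - q[state, action])
            if done:
                break
            state = next_state
    return q


q_table = q_learning()
greedy_policy = q_table[:GOAL].argmax(axis=1)
print(q_table.round(3), greedy_policy)
```

In this toy setting the learnt values for moving right approach 0.81, 0.9 and 1.0 (the discounted distance to the goal), and the greedy policy moves right from every non-terminal state. With function approximation in place of the table, nothing here guarantees the same behaviour.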