I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of?
Say that the triggers for pleasure are hardwired. After a pleasurable event, how do only those computations running in the brain that led to pleasure (and not those randomly running computations) get strengthened? After all, the pleasure circuit is hardwired, and can't reason causally about what thoughts led to what outcomes.
(I'm not currently confident that pleasure is exactly the same thing as reinforcement, but the two are probably closely related, and pleasure is a nice and concrete thing to discuss.)
What's to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?
Nothing except those shards fighting for their own interests and succeeding to some extent.
You probably have many contending values that you hang on to now, and would even be pretty careful with write access to your own values, for instrumental convergence reasons. If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?
And if it looks like this comes in hindsight by carefully reflecting on the situation, that's not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And, same as above, sooner or later you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.
Maybe. But your subcortical reinforcement circuitry cannot (easily) score your thoughts. What it can score are the mystery computations that led to hardcoded reinforcement triggers, like sugar molecules interfacing with tastebuds. When you're just thinking to yourself, all of that should be a complete black box to the brainstem.
I did mention that something is going on in the brain with self-supervised learning, and that's probably training your active computations all the time. Maybe shards can leverage this training loop? I'm currently quite unclear on this, though.
Reading this was causally responsible for me undoing any updates I made after being disappointed by my playing with GPT-3. Those observations weren't more likely inside a weak-GPT world, because a strong-GPT would just as readily simulate weak-simulacra in my contexts as it would strong-simulacra in other contexts.
I think I had all the pieces to have inferred this... but some subverbal part of my cognition was illegitimately epistemically nudged by the manifest limitations of naïvely prompted GPT. That part of me, I now see, should have only been epistemically pushed around by quite serious, professional toying with GPT!
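To make the update logic concrete, here is a toy Bayes calculation. Every number in it is made up for illustration, and the function name `posterior_strong` is my own; it is only a sketch of the reasoning, not a claim about actual probabilities. The point is that a disappointing output from naive prompting has roughly the same likelihood under both hypotheses, so it should not move the posterior, while a disappointing output after serious prompting would.

```python
def posterior_strong(prior_strong, p_obs_given_strong, p_obs_given_weak):
    """Posterior probability of the strong-GPT hypothesis after one observation (Bayes' rule)."""
    prior_weak = 1.0 - prior_strong
    evidence = prior_strong * p_obs_given_strong + prior_weak * p_obs_given_weak
    return prior_strong * p_obs_given_strong / evidence


prior = 0.5  # assumed prior, purely illustrative

# Naive prompting: a strong GPT simulates weak simulacra about as readily as a
# weak GPT produces weak output, so the likelihoods are (assumed) equal.
after_naive = posterior_strong(prior, p_obs_given_strong=0.9, p_obs_given_weak=0.9)

# Serious, careful prompting: a disappointing result is (assumed) much less
# likely if the model is strong, so the same observation is real evidence.
after_serious = posterior_strong(prior, p_obs_given_strong=0.2, p_obs_given_weak=0.9)

print(after_naive, after_serious)
```

With equal likelihoods the posterior stays at the prior; only the second kind of observation shifts it toward the weak-GPT world.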