This is a list of assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first, so feel free to message me if you'd be interested.
I'll continue to include more directions like these in the comments here.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice: the model would have to learn to value the actual transistors storing the reward, which seems exceedingly improbable in any natural environment.
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized; reward is the antecedent-thought-reinforcer: it reinforces whatever produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality, and not about the human's beliefs about reality or about the activation of the reward system.
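To make the "antecedent-thought-reinforcer" framing concrete, here is a minimal REINFORCE-style sketch; the action names, reward function, and learning rate are illustrative assumptions on my part, not anything from the comment. The point it shows: reward enters the update only as a scalar scaling the gradient toward whatever computation produced it; the policy never represents or pursues the reward signal itself.

```python
import numpy as np

# Toy REINFORCE-style update. Reward is never an object the policy
# "wants"; it is just a scalar that scales the update toward whatever
# computation (here, action logits) produced the rewarded behavior.

rng = np.random.default_rng(0)
# Hypothetical action names, chosen only to mirror the example above.
ACTIONS = ["lick_lollipop", "stimulate_reward_register"]
logits = np.zeros(len(ACTIONS))
lr = 0.5  # illustrative learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(action):
    # In a natural environment, reward fires on interaction with
    # reality (the lollipop), not on the reward circuitry itself.
    return 1.0 if action == "lick_lollipop" else 0.0

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(len(ACTIONS), p=probs)
    r = reward(ACTIONS[a])
    # Policy-gradient update: grad of log pi(a) for a softmax policy
    # is onehot(a) - probs; reward only scales this reinforcement.
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad

print(dict(zip(ACTIONS, softmax(logits).round(3))))
# The policy ends up valuing the lollipop interaction, because those
# were the "thoughts" active when reward arrived; nothing in the
# update ever points it at the reward signal itself.
```

Because the toy environment only pays out for the reality-directed action, credit flows to the reality-directed computation; the reward register appears nowhere in what gets reinforced.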
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell which high-level goals generated a particular action, and so gives credit only to thoughts directly related to the current action. That seems plausible for humans, but my guess is that it wouldn't hold for advanced AI systems.
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?