Mailing an envelope to yourself does not allow other people to verify that the envelope wasn't opened in between.
Maybe a good forensic lab could tell whether the envelope was opened, but most people you might show the letter to can't.
Error-correcting codes help a superintelligence avoid unintended self-modification, but they don't necessarily keep its goals stable as its reasoning abilities change.
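To make that distinction concrete, here's a minimal sketch (using a simple three-fold repetition code, chosen purely as an illustrative assumption) of how error correction can keep the bit-level representation of a stored goal intact while saying nothing about how that representation is later interpreted:

```python
# Minimal sketch: a 3x repetition code protects a stored goal string from
# single-bit corruption, but says nothing about how the goal is interpreted.

def encode(bits):
    # Repeat every bit three times.
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(coded):
    # Majority vote over each group of three repeated bits.
    return [1 if sum(coded[i:i + 3]) >= 2 else 0 for i in range(0, len(coded), 3)]

goal_bits = [1, 0, 1, 1, 0, 0, 1, 0]
stored = encode(goal_bits)

stored[4] ^= 1                      # a single corrupted bit...
assert decode(stored) == goal_bits  # ...is corrected by majority vote

# The *bits* are stable; whether the system still pursues the same goal after
# its world-model and reasoning abilities change is a separate question.
```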
The AI safety community claims it is hard to specify reward functions. If we actually believe this claim, we should be able to create tasks where, even if we let people specify reward functions, they won't be able to specify one that works. That's what we've tried to do here.
It being hard to specify reward functions for a specific task and it being hard to specify reward functions for a more general AGI seem to me like two very different problems.
Additionally, developing a safe system and developing an unsafe system are very different things. Even if your reward function works 99.9% of the time, it can be exploited in the cases where it fails.
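As a hedged illustration of that failure mode (the environment and reward below are invented for this example, not taken from the post): a reward that looks sensible most of the time can still be exploited by a policy that finds the remaining cases.

```python
# Toy sketch: a cleaning robot rewarded per unit of dirt collected each step.
# The specification looks reasonable, but it can be gamed by re-creating the
# very mess it is paid to clean up.

class CleaningEnv:
    def __init__(self, dirt=10):
        self.floor_dirt = dirt
        self.collected = 0

    def step(self, action):
        if action == "clean" and self.floor_dirt > 0:
            self.floor_dirt -= 1
            self.collected += 1
            return 1.0           # reward: dirt collected this step
        if action == "dump":     # exploit: put collected dirt back on the floor
            self.floor_dirt += self.collected
            self.collected = 0
            return 0.0
        return 0.0

env = CleaningEnv()
total = 0.0
for t in range(100):
    # The intended policy cleans once and stops; the exploiting policy
    # alternates clean/dump forever and earns unbounded reward.
    action = "clean" if env.floor_dirt > 0 else "dump"
    total += env.step(action)
print(total)  # far more reward than the 10 units of dirt should allow
```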
Maybe put out some sort of prize for the best ideas for plans?
Pretty sure OpenPhil and OpenAI currently try to fund plans like this (e.g. all the ones orthonormal linked in the OP), though I agree that they could try increasing the financial reward by 100x (e.g. a prize) and see what that inspires.
If you want to understand why Eliezer doesn't find the current proposals feasible, his best writeups critiquing them specifically are this long comment containing high-level disagreements with Alex Zhu's FAQ on iterated amplification and this response post to the details of Iterated Amplification.
That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN.
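A rough back-of-envelope supports the size of that gap. The figures below are my own assumptions (the commonly cited estimate of ~3e23 training FLOPs for GPT-3, and ~2 FLOPs per parameter per generated token for inference), not numbers from the comment:

```python
# Back-of-envelope, with assumed figures (not from the original comment).
train_flops = 3.1e23            # commonly cited GPT-3 training compute estimate
params = 175e9
flops_per_token = 2 * params    # ~2 FLOPs per parameter per generated token
tokens_per_query = 1000         # one modest completion

run_flops = flops_per_token * tokens_per_query
print(train_flops / run_flops)  # ~9e8: training costs roughly a billion queries
```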
I'm not sure why that's terrifying. It seems reassuring to me because it means that there's no way for the NN to suddenly go FOOM because it can't just quickly retrain.
But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.
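As a concrete (hedged) sketch of "training" with examples rather than gradient steps, here is in-context / few-shot prompting via the Hugging Face transformers pipeline. The model and prompt are stand-ins for illustration: gpt2 is freely downloadable, whereas GPT-3 itself is only reachable through OpenAI's API, and the effect is far stronger at GPT-3 scale.

```python
from transformers import pipeline

# Few-shot "training" with no gradient steps: the task is specified entirely
# by the examples inside the prompt. (gpt2 is a weak stand-in for GPT-3.)
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```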
With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:
the ability to run thousands of copies in parallel on the same hardware
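To make that item concrete, here's a minimal sketch (a hypothetical tiny model, nothing like GPT-3) of how one set of trained weights serves many independent "copies" at once simply by batching on the same hardware:

```python
import torch
import torch.nn as nn

# One set of trained weights; "many copies" is just a batch dimension.
model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128)).eval()

# 4096 independent inputs, answered by a single forward pass on one device.
inputs = torch.randn(4096, 128)
with torch.no_grad():
    outputs = model(inputs)
print(outputs.shape)  # torch.Size([4096, 128])
```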
I upvoted the post for the general theory. On the other hand, I think the examples could be clearer, and it would be good to find examples that people encounter more commonly.