as a weak alignment technique we might use to bootstrap strong alignment.
Yes, it also reminded me of Christiano's approach of amplification and distillation.
Maybe we can ask GPT to output an English-Klingon dictionary?
If we use median AI timelines, we have a 50 per cent chance of being dead before that moment. Maybe a different measure would be useful, such as the date by which there is a 10 per cent chance of TAI, before which our protective measures should be ready?
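To illustrate how far apart the median and the 10 per cent date can be, here is a minimal sketch. It assumes, purely for illustration, that the time until TAI follows a lognormal distribution; the function name `arrival_quantile`, the 30-year median and the spread `sigma = 1.0` are hypothetical placeholders, not estimates from any forecast.

```python
import math
from statistics import NormalDist


def arrival_quantile(median_years, sigma, p):
    """Years from now by which TAI has arrived with probability p.

    Assumes ln(years until TAI) is normally distributed with
    mean ln(median_years) and standard deviation sigma.
    Both parameters are illustrative assumptions.
    """
    log_years = NormalDist(mu=math.log(median_years), sigma=sigma)
    return math.exp(log_years.inv_cdf(p))


# Hypothetical inputs: median of 30 years, wide uncertainty.
MEDIAN_YEARS = 30.0
SIGMA = 1.0

q10 = arrival_quantile(MEDIAN_YEARS, SIGMA, 0.10)
q50 = arrival_quantile(MEDIAN_YEARS, SIGMA, 0.50)
print(f"10% date: {q10:.1f} years from now; median: {q50:.1f} years from now")
```

Under these made-up numbers the 10 per cent date comes several times sooner than the median, which is why planning around the median could leave protective measures too late.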
Also, this model contradicts the naive model of GPT growth, in which the number of parameters has been growing by 2 orders of magnitude a year over the last couple of years; if this trend continues, it could reach the human level of 100 trillion parameters in 2 years.
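The naive extrapolation can be checked with a few lines of arithmetic. This is only a sketch: the starting point of 175 billion parameters (GPT-3) is my assumption, as the comment does not name a baseline, and `years_to_reach` is a hypothetical helper.

```python
import math


def years_to_reach(current_params, target_params, orders_per_year):
    """Years needed to grow from current_params to target_params
    at a constant rate given in orders of magnitude per year."""
    orders_needed = math.log10(target_params / current_params)
    return orders_needed / orders_per_year


GPT3_PARAMS = 175e9          # assumption: GPT-3 as the baseline
HUMAN_LEVEL_PARAMS = 100e12  # the "human level" figure used in the comment
GROWTH_RATE = 2.0            # orders of magnitude per year, from the comment

years = years_to_reach(GPT3_PARAMS, HUMAN_LEVEL_PARAMS, GROWTH_RATE)
print(f"{years:.2f} years")
```

With this baseline the gap is a little under 3 orders of magnitude, so the trend would close it in roughly a year and a half, consistent with the "in 2 years" estimate.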
It looks like the idea of human values is very contradictory. Maybe we should dissolve it? What about "AI safety" without human values?
All else equal, I prefer an AI which is not capable of philosophy, as I am afraid of the completely alien conclusions it could come to (e.g. that insects are more important than humans).
Moreover, I am skeptical that going to the meta-level simplifies the problem enough to make it solvable by humans (the same goes for meta-ethics and the theory of human values). For example, if someone said that he was not able to understand math but would instead work on meta-mathematical problems, we would be skeptical about his ability to contribute. Why would the meta-level be simpler?
My main objection to this idea is that it is a local solution and doesn't have built-in mechanisms to become a global AI safety solution, that is, to prevent the creation of other AIs, which could be agential superintelligences. One can try to make "AI police" as a service, but it could be less effective than agential police.
Another objection is probably Gwern's idea that any Tool AI "wants" to become an agential AI.
This idea also excludes the robotic direction in AI development, which will produce agential AIs anyway.
Also, it is assumed here that there are only two types of BBs and that they have similar measures of existence.
However, there is a very large class of thermodynamic BBs, which was described in Egan's dust theory: observer-moments which appear as a result of the causal interaction of atoms in a thermodynamic gas, if such an interaction has the same causal structure as a moment of experience. They may numerically dominate, but additional calculations are needed and seem possible. There could be other types of BBs, like purely mathematical ones or products of quantum mind generators, which I described in the post about resurrection of the dead.
Also, if we assume, for example, that the measure of existence is proportional to the energy used for calculations, then de Sitter Boltzmann brains will have higher measure, as they have non-zero energy, while quantum fluctuation minds may have smaller calculation energy, as their externally measurable energy is zero and the time of calculations is very short.
I expected that Lamport's paper would be mentioned, as it describes a known catastrophic failure mode for autonomous systems, connected with the Buridan's ass problem and infinite recursion in predicting the future time at which the problem will be solved. I think that this problem is underexplored in AI safety, despite a previous attempt to present it on LessWrong.