This approach overlooks an important point: if advanced LLMs are what we use to make the new-paradigm advances that produce extremely effective RL sociopaths, then by that point we will also have relatively harmless but still very powerful LLMs available to do safety work on those RL agents, which is a major help in mitigating autonomy risks! There is, of course, always the risk that new RL architecture discoveries create economic incentives to scale the scary RL agents before sufficient safety work gets done, but the prospect of using HHH AI to align scary AI is weirdly under-explored in discussions of exactly that advanced-LLM-plus-advanced-RL-learner world.