There’s a part of your argument I am confused about. The sharp left turn is a sudden change in capabilities. Even if you can see whether things are trending one way or the other, how can you see a sharp left turn coming? At the end, you clarify that we can’t predict when a left turn will occur, so how do these findings pertain to sharp left turns? This seems to be more of an attempt to track trends in alignment/misalignment, and I don’t see what new insights it gives us about sharp left turns specifically.
This has caused me to reconsider what intelligence is and what an AGI could be. It’s difficult to determine if this makes me more or less optimistic about the future. A question: are humans essentially like GPT? We seem to be running simulations in an attempt to reduce predictive loss. Yes, we have agency, but is that human “agent” actually the intelligence, or is it just generated by it?
Very insightful piece! One small quibble: you state many, many times the disclaimer that you’re not assuming that using only Naive Safety measures is realistic. While repeating it might be needed when writing for a more general audience, I think that for the audience of this piece, stating it once or twice is enough.
One possible idea I had: what if, when training Alex based on human feedback, the first team of human evaluators were intentionally picked to be less knowledgeable, more prone to manipulation, and less likely to question the answers Alex gave them? Then you introduce a second team of the most thoughtful, knowledgeable, and skeptical researchers to evaluate Alex. If Alex were acting deceptively, it might not recognize the change fast enough, and manipulation that worked on the first team might be caught by the second team. Yes, after a while Alex would probably catch on and improve its tactics, but by then the deceptive behavior would have already been exposed.