Great post! It lays out a clear agenda, and I agree for the most part, including about the timeliness of the scientific case given the second argument. But I'll nitpick the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of, say, Alignment Forum text in the training data) is likely not the same ...
This argument seems to assume that the model optimizes over the entire 'training process'. Why can't we test the model (run inference) on distributions different from the training distribution, where SGD can no longer optimize, to check whether it was deceptively aligned in the training environment?