LessWrong dev & admin as of July 5th, 2022.
The reasons I like this post:
"That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.")
Arguably missing is a line or two that backtracks from "we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training" to (my claim) "it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment, and we don't actually know when we're going to hit that threshold", but that might be out of scope.