Research engineer at Conjecture (London). My main research interests are mechanistic interpretability and mesa-optimization.
This sounds really reasonable. I had only been thinking of a naive version of putting interpretability tools in the loss function, one that doesn't attempt to interpret the gradient descent process itself. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.