Lee Sharkey

Research engineer at Conjecture (London). 

My main research interests are mechanistic interpretability and mesa-optimization. 

Wiki Contributions

Comments

This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn't attempt to interpret the gradient descent process. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.