Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda
Tl;dr We are attempting to make neural networks (NN) modular, have GPT-N interpret each module for us, in order to catch mesa-alignment and inner-alignment failures. Completed Project Train a neural net with an added loss term that enforces the sort of modularity that we see in well-designed software projects. To...