An adversarial example for Direct Logit Attribution: memory management in gelu-4l
Please check out our notebook for figure recreation and to examine your own model for clean-up behavior.

Produced as part of ARENA 2.0 and the SERI ML Alignment Theory Scholars Program - Spring 2023 Cohort.

Fig 5: Correlation between DLA of writer head and DLA of [clean-up heads output dependent...
Aug 30, 2023