AI ALIGNMENT FORUM

James Dao — AI Alignment Forum

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

by Can, Yeu-Tong Lau, James Dao, and Jett Janiak

Please check out our notebook for figure recreation and to examine your own model for clean-up behavior.

Produced as part of ARENA 2.0 and the SERI ML Alignment Theory Scholars Program - Spring 2023 Cohort.

Fig 5: Correlation between DLA of writer head and DLA of [clean-up heads output dependent...

Aug 30, 2023 • 17