Evaluating hidden directions on the utility dataset: classification, steering and removal
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Dan Hendrycks.
We demonstrate different techniques for finding concept directions in hidden layers of an LLM. We propose evaluating them by using them for classification, activation steering and knowledge removal. You...
Sep 25, 2023