shash42 — AI Alignment Forum

Evaluating hidden directions on the utility dataset: classification, steering and removal

by Annah and shash42

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Dan Hendrycks. We demonstrate different techniques for finding concept directions in hidden layers of an LLM. We propose evaluating them by using them for classification, activation steering and knowledge removal. You...

Sep 25, 2023•25