AI ALIGNMENT FORUM

j_we
Ω30210

I'm a PhD student working on AI Safety. I'm thinking about how we can use interpretability techniques to make LLMs safer.

Comments

Representation Tuning
j_we · 1y

Hey Christopher, this is really cool work. I think your idea of representation tuning is a very nice way to combine activation steering and fine-tuning. Do you have any intuition as to why fine-tuning towards the steering vector sometimes works better than simply steering towards it?
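
I'm not sure exactly how your implementation works, but here is roughly how I picture the two interventions, as a minimal PyTorch sketch (assuming a HuggingFace Llama-style model and a precomputed steering vector v for one layer; the names are illustrative, not your code):

```python
import torch
import torch.nn.functional as F

def add_steering_hook(model, layer_idx, v, alpha=1.0):
    """Activation steering: add alpha * v to the layer's output at inference time."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

def representation_tuning_loss(hidden_states, layer_idx, v):
    """Representation tuning: a loss that pulls the layer's activations toward
    the steering direction, used to update the model weights themselves."""
    h = hidden_states[layer_idx]                           # (batch, seq, d_model)
    cos = F.cosine_similarity(h, v.view(1, 1, -1), dim=-1)
    return -cos.mean()                                     # maximise similarity to v
```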

If you keep working on this, I'd be interested to see a more thorough evaluation of capabilities (beyond just perplexity) by running the tuned models on some standard LM benchmarks. Whether the model retains its capabilities seems important for understanding the safety-capabilities trade-off of this method.
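
For example, something along these lines (a hypothetical setup, assuming the tuned model is saved as a HuggingFace checkpoint; lm-evaluation-harness is just one option, and the checkpoint path and task list are placeholders):

```python
# Hedged sketch using EleutherAI's lm-evaluation-harness (v0.4+ Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./representation-tuned-model",  # placeholder path
    tasks=["hellaswag", "arc_easy", "mmlu"],                # example tasks
    batch_size=8,
)
print(results["results"])  # per-task metrics for the tuned model
```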

I'm curious whether you tried adding a capability-retention term to the loss function used for representation tuning? E.g. regularising the activations to stay close to the original model's activations, or adding a standard language-modelling loss?
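
Something like the following is what I have in mind, just as a sketch: h_tuned / h_orig would be the activations at the tuned layer from the model being tuned and from a frozen copy, lm_loss the usual next-token loss, and the lambda weights are made-up numbers.

```python
import torch.nn.functional as F

def tuning_loss_with_retention(h_tuned, h_orig, v, lm_loss,
                               lambda_act=1.0, lambda_lm=1.0):
    # Pull activations toward the steering direction (the representation-tuning part).
    steer_term = -F.cosine_similarity(h_tuned, v.view(1, 1, -1), dim=-1).mean()
    # Retention terms: stay close to the original activations and keep general LM ability.
    act_retention = F.mse_loss(h_tuned, h_orig)
    return steer_term + lambda_act * act_retention + lambda_lm * lm_loss
```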

As a nitpick: I think the comparison when measuring the robustness of the tuned models favours the honesty-tuned model. If I understand correctly, the honesty-tuned model was specifically trained to be less like the vector used for dishonesty steering, whereas the truth-tuned model wasn't. Maybe a fairer comparison would use automatic adversarial attack methods like GCG?

Again, I think this is a very cool project!

Posts

Open Challenges in Representation Engineering (5mo, 8 karma, 0 comments)
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs (1y, 23 karma, 1 comment)