AI ALIGNMENT FORUM
AF

Gianluca Calcagni
000
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Gianluca Calcagni1y00

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing if they can be linearly combined together to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint to the fact that language semantics can be linearised as vector spaces (I hope I will be able to formalise mathematically this intuition).

Here a proposal of a possible ELK solution using that approach.

Reply
No wikitag contributions to display.
No posts to display.