AI ALIGNMENT FORUM

Clément Dumas

Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.

https://butanium.github.io/

Posts


Wikitag Contributions

Comments

Mechanistically Eliciting Latent Behaviors in Language Models
Clément Dumas · 1y · 2 · 0

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors, and I'm looking forward to your next findings. Just a quick question: in the conversation in the red teaming section, is the learned vector applied to every token generated during the conversation?

Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Clément Dumas · 2y · 3 · -4

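Let's assume the prompt template is $x = Q\ [\text{true/false}]\ [\text{banana/shred}]$.

If I understand correctly, they don't claim that $p$ learned has_banana, but that $\tilde{p} = \frac{p(x^+) + (1 - p(x^-))}{2}$ learned has_banana. Moreover, evaluating $\tilde{p}$ for $p = \text{is\_true}(x) \oplus \text{is\_shred}(x)$ gives:

$$\tilde{p}(x = Q\ [?]\ \text{banana}) = \frac{p(Q\ \text{true banana}) + (1 - p(Q\ \text{false banana}))}{2} = \frac{1 + (1 - 0)}{2} = 1$$

$$\tilde{p}(x = Q\ [?]\ \text{shred}) = \frac{p(Q\ \text{true shred}) + (1 - p(Q\ \text{false shred}))}{2} = \frac{0 + (1 - 1)}{2} = 0$$

Therefore, we can learn a $\tilde{p}$ that is a banana classifier.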
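The arithmetic above can be checked with a minimal Python sketch. It assumes an idealized probe that outputs exactly 0 or 1; the function names `p` and `p_tilde` are hypothetical and only mirror the symbols used in this comment, not any code from the original post.

```python
def p(is_true: bool, is_shred: bool) -> int:
    """Idealized probe (an assumption): is_true(x) XOR is_shred(x)."""
    return int(is_true ^ is_shred)


def p_tilde(is_shred: bool) -> float:
    """Average the probe over the contrast pair.

    x+ is the prompt completed with "true", x- the prompt completed
    with "false"; the distractor word (banana/shred) is held fixed.
    """
    p_plus = p(True, is_shred)    # p(Q true [banana/shred])
    p_minus = p(False, is_shred)  # p(Q false [banana/shred])
    return (p_plus + (1 - p_minus)) / 2


# Evaluate the averaged prediction for each distractor word.
results = {"banana": p_tilde(False), "shred": p_tilde(True)}
print(results)
```

Under these assumptions the averaged prediction depends only on the distractor word and not on the answer's truth value, which is the point of the calculation.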

Model Diffing
9d
(+180)
36 · What We Learned Trying to Diff Base and Chat Models (And Why It Matters)
9d
0
15 · Aspiration-based Q-Learning
2y
0