Clément Dumas

I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.

https://butanium.github.io/

Posts

Sorted by New

15Aspiration-based Q-Learning

Wiki Contributions

Comments

Sorted by

Newest

Mechanistically Eliciting Latent Behaviors in Language Models

Clément Dumas5mo20

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised method with steering vector, looking forward to your next findings. Just a quick question : in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation ?

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Clément Dumas10mo3-4

Let's assume the prompt template is Q [true/false] [banana/shred]

If I understand correctly, they don't claim $p$ learned has_banana but $~ p = \frac{p (x ⁺) + (1 - p (x ⁻))}{2}$ learned has_banana. Moreover evaluating $~ p$ for $p = is_true (x) \oplus is_shred (x)$ gives:

$~ p (x = Q [?] banana) = \frac{p (Q true banana) + (1 - p (Q false banana))}{2} = \frac{1 + (1 - 0)}{2} = 1$

$~ p (x = Q [?] shred) = \frac{p (Q true shred) + (1 - p (Q false shred))}{2} = \frac{0 + (1 - 1)}{2} = 0$

Therefore, we can learn a $~ p$ that is a banana classifier