AI ALIGNMENT FORUM

Pranav Gade


Comments

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Pranav Gade, 3y

I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but it seems to work on the one example I've tried.

unRLHF - Efficiently undoing LLM safeguards
2y