AI ALIGNMENT FORUM

Note that it's unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"
