AI ALIGNMENT FORUM

Nate Thomas
Ω 52230

Redwood Research and Constellation

Posts


Wikitag Contributions

Comments

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter
Nate Thomas · 2y · 10

Thanks, Neel! It should be fixed now.

Reply
Takeaways from our robust injury classifier project [Redwood Research]
Nate Thomas · 3y · 912

Note that it's unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is: "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"

Reply
No wikitag contributions to display.
21 · Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2y · 2
22 · Causal scrubbing: results on induction heads · 3y · 0
20 · Causal scrubbing: results on a paren balance checker · 3y · 2
11 · Causal scrubbing: Appendix · 3y · 1
103 · Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 3y · 29
59 · Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · 3y · 5
65 · High-stakes alignment via adversarial training [Redwood Research report] · 3y · 15
21 · We're Redwood Research, we do applied alignment research, AMA · 4y · 2