AI ALIGNMENT FORUM

Kshitij Sachan — AI Alignment Forum

Redwood Research

AI Control: Improving Safety Despite Intentional Subversion

by Buck, Fabien Roger, ryan_greenblatt, and Kshitij Sachan

We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:

* We summarize the paper;
* We compare our methodology to the methodology of other safety papers....

Dec 13, 2023 • 240

Polysemanticity and Capacity in Neural Networks

by Buck, Adam Jermyn, and Kshitij Sachan

Elhage et al. at Anthropic recently published a paper, Toy Models of Superposition (previous Alignment Forum discussion here), exploring the observation that in some cases, trained neural nets represent more features than they “have space for”: instead of choosing one feature per direction available in their embedding space, they choose more...

Oct 7, 2022 • 87