Summary: We’d like to share our ongoing work on improving LLM unlearning. [arXiv] [github] There are myriad approaches to unlearning, so over the past 8 months we conducted hundreds of small-scale experiments, comparing many loss functions, variants of meta-learning, various neuron or weight ablations, representation engineering and many exotic...
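For concreteness, here is a minimal sketch of one loss function of the kind such a comparison might include: a gradient-difference objective that ascends on a forget set while anchoring on a retain set. This is an illustrative baseline, not necessarily any of the losses studied in the post; the name `unlearning_loss`, the `alpha` weight, and the batch format are assumptions, layered on a Hugging Face-style causal LM.

```python
import torch

def unlearning_loss(model, forget_batch, retain_batch, alpha=1.0):
    """Illustrative gradient-difference unlearning objective (hypothetical baseline,
    not the post's method): push the model away from the forget set while
    anchoring it on the retain set. Batches are dicts of tokenized inputs
    (input_ids, attention_mask) for a Hugging Face causal LM."""
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    # Negative sign: gradient descent on this objective *ascends* the forget loss
    # while still descending the retain loss.
    return -alpha * forget_loss + retain_loss
```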
Summary: We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when solving a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that prevention of such...
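One simple way to probe which CoT steps carry causal weight (a truncation probe sketched here for illustration, not necessarily the method from the post) is to score the final answer after keeping only the first k reasoning steps and watch where the answer's log-probability jumps. The helper below assumes a Hugging Face-style causal LM and tokenizer; `answer_logprob` and its arguments are hypothetical names.

```python
import torch

def answer_logprob(model, tok, prompt, cot_steps, answer, k):
    """Log-probability of `answer` given the prompt plus the first k CoT steps.
    Sweeping k from 0 to len(cot_steps) shows where the model's 'belief' in the
    answer actually moves. `model`/`tok` are a Hugging Face causal LM and tokenizer."""
    context = prompt + "\n".join(cot_steps[:k]) + "\nAnswer: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Standard next-token factorization: logits at position p predict token p+1.
    logprobs = logits.log_softmax(-1)
    total = 0.0
    for i in range(ans_ids.shape[1]):
        pos = ctx_ids.shape[1] + i - 1
        total += logprobs[0, pos, ans_ids[0, i]].item()
    return total
```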
Summary:
* Recurrence enables hidden serial reasoning.
* Not every recurrence, though: connections between channels are needed. Notably, the Mamba architecture isn't capable of hidden reasoning.
* Non-linearity isn’t needed for hidden reasoning.
* It’s hard for transformers to learn to use all the layers for serial computation. For example...
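A toy illustration of the channel-mixing point (my own sketch, not code from the post): a linear recurrence with an off-diagonal transition matrix can route information between channels across steps, while a diagonal, per-channel recurrence (in the spirit of Mamba's diagonal state update) cannot.

```python
import numpy as np

def run_recurrence(A, xs):
    """Purely linear recurrence h_t = A @ h_{t-1} + x_t over a sequence xs."""
    h = np.zeros(A.shape[0])
    for x in xs:
        h = A @ h + x
    return h

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(5)]

full = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # swaps channels each step: cross-channel routing
diag = np.diag([0.9, 0.5])      # per-channel decay only, no mixing

# With `full`, channel 0 at step t depends on channel 1 at step t-1 (and vice versa),
# so information can be shuttled between channels over time; with `diag`, each channel
# only ever sees its own past inputs, which rules out that kind of serial routing.
print(run_recurrence(full, xs))
print(run_recurrence(diag, xs))
```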
Work done during SERI MATS 3.0 with mentorship from Jesse Clifton. Huge thanks to Anthony DiGiovanni, Daniel Kokotajlo, Martín Soto, Rubi J. Hudson and Jan Betley for all the feedback and discussions! Also posted to the EA Forum. Daniel's post about commitment races motivates why they may be a severe problem....
> Produced under the mentorship of Evan Hubinger as part of the SERI ML Alignment Theory Scholars Program Winter 2022 Cohort.

Not all global minima of the (training) loss landscape are created equal. Even if they achieve equal performance on the training set, different solutions can perform very differently on...
Explore the hidden treasures of the forums. With this tool you can zoom in on your favorite topics from EA Forum, LessWrong, and Alignment Forum. Or you can just wander around and see what you find. You start by seeing the whole forum split into two main topics. Choose the...