Summary: We’d like to share our ongoing work on improving LLM unlearning. [arXiv] [github] There are myriad approaches to unlearning, so over the past 8 months we conducted hundreds of small-scale experiments, comparing many loss functions, variants of meta-learning, various neuron or weight ablations, representation engineering and many exotic...
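For concreteness, here is a minimal sketch of one loss function of the kind such a comparison might include: a gradient-difference objective that ascends on a forget set while anchoring on a retain set. This is an illustrative baseline, not necessarily any of the losses studied in the post; the name `unlearning_loss`, the `alpha` weight, and the batch format are assumptions, layered on a Hugging Face-style causal LM.

```python
import torch

def unlearning_loss(model, forget_batch, retain_batch, alpha=1.0):
    """Illustrative gradient-difference unlearning objective (hypothetical baseline,
    not the post's method): push the model away from the forget set while
    anchoring it on the retain set. Batches are dicts of tokenized inputs
    (input_ids, attention_mask) for a Hugging Face causal LM."""
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    # Negative sign: gradient descent on this objective *ascends* the forget loss
    # while still descending the retain loss.
    return -alpha * forget_loss + retain_loss
```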
Summary: We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when solving a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that prevention of such...
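One simple way to probe which CoT steps carry causal weight (a truncation probe sketched here for illustration, not necessarily the method from the post) is to score the final answer after keeping only the first k reasoning steps and watch where the answer's log-probability jumps. The helper below assumes a Hugging Face-style causal LM and tokenizer; `answer_logprob` and its arguments are hypothetical names.

```python
import torch

def answer_logprob(model, tok, prompt, cot_steps, answer, k):
    """Log-probability of `answer` given the prompt plus the first k CoT steps.
    Sweeping k from 0 to len(cot_steps) shows where the model's 'belief' in the
    answer actually moves. `model`/`tok` are a Hugging Face causal LM and tokenizer."""
    context = prompt + "\n".join(cot_steps[:k]) + "\nAnswer: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Standard next-token factorization: logits at position p predict token p+1.
    logprobs = logits.log_softmax(-1)
    total = 0.0
    for i in range(ans_ids.shape[1]):
        pos = ctx_ids.shape[1] + i - 1
        total += logprobs[0, pos, ans_ids[0, i]].item()
    return total
```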
Summary:
* Recurrence enables hidden serial reasoning.
* Not every recurrence, though: connections between channels are needed. Notably, the Mamba architecture isn't capable of hidden reasoning.
* Non-linearity isn’t needed for hidden reasoning.
* It’s hard for transformers to learn to use all the layers for serial computation. For example...
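A toy illustration of the channel-mixing point (my own sketch, not code from the post): a linear recurrence with an off-diagonal transition matrix can route information between channels across steps, while a diagonal, per-channel recurrence (in the spirit of Mamba's diagonal state update) cannot.

```python
import numpy as np

def run_recurrence(A, xs):
    """Purely linear recurrence h_t = A @ h_{t-1} + x_t over a sequence xs."""
    h = np.zeros(A.shape[0])
    for x in xs:
        h = A @ h + x
    return h

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(5)]

full = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # swaps channels each step: cross-channel routing
diag = np.diag([0.9, 0.5])      # per-channel decay only, no mixing

# With `full`, channel 0 at step t depends on channel 1 at step t-1 (and vice versa),
# so information can be shuttled between channels over time; with `diag`, each channel
# only ever sees its own past inputs, which rules out that kind of serial routing.
print(run_recurrence(full, xs))
print(run_recurrence(diag, xs))
```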
Work done during SERI MATS 3.0 with mentorship from Jesse Clifton. Huge thanks to Anthony DiGiovanni, Daniel Kokotajlo, Martín Soto, Rubi J. Hudson and Jan Betley for all the feedback and discussions! Also posted to the EA Forum. Daniel's post about commitment races motivates why they may be a severe problem....
> Produced under the mentorship of Evan Hubinger as part of the SERI ML Alignment Theory Scholars Program Winter 2022 Cohort.

Not all global minima of the (training) loss landscape are created equal. Even if they achieve equal performance on the training set, different solutions can perform very differently on...
Explore the hidden treasures of the forums. With this tool you can zoom in on your favorite topics from EA Forum, LessWrong, and Alignment Forum. Or you can just wander around and see what you find. You start by seeing the whole forum split into two main topics. Choose the...