Simon Lermen

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. I'm grateful to Palisade Research for their support throughout this project. tl;dr: demonstrating that we can cheaply undo safety finetuning from open-source models to remove refusals - thus making...

Oct 12, 2023•117

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing...

Oct 12, 2023•151

Robustness of Model-Graded Evaluations and Automated Interpretability

TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a...

Jul 15, 2023•47