Top postsTop post
Tony Wang
293
Ω
54
9
9
This post discusses our recent paper Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation and comments on its implications for AI safety. What is Covert Malicious Finetuning? Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that aims to bypass detection. The following diagram gives an...
In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an "entity that cannot be...