AI ALIGNMENT FORUM

Tony Wang — AI Alignment Forum

Top posts

Covert Malicious Finetuning

This post discusses our recent paper Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation and comments on its implications for AI safety. What is Covert Malicious Finetuning? Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that aims to bypass detection. The following diagram gives an...

Jul 2, 2024 • 103

Even Superhuman Go AIs Have Surprising Failure Modes

by AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, and MichaelDennis

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an "entity that cannot be...

Jul 20, 2023 • 131