AI ALIGNMENT FORUM

dannyhalawi — AI Alignment Forum

149

3y

Covert Malicious Finetuning

by Tony Wang and dannyhalawi

This post discusses our recent paper, "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation", and comments on its implications for AI safety.

What is Covert Malicious Finetuning?

Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that is designed to evade detection. The following diagram gives an...

Jul 2, 2024 • 103

Approaching Human-Level Forecasting with Language Models

by Fred Zhang, dannyhalawi, and jsteinhardt

TL;DR: We present a retrieval-augmented LM system that approaches human crowd performance on judgmental forecasting.

Paper: https://arxiv.org/abs/2402.18563 (Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, and Jacob Steinhardt)

Twitter thread: https://twitter.com/JacobSteinhardt/status/1763243868353622089

Abstract

Forecasting future events is important for policy and decision-making. In this work, we study whether language models (LMs) can...

Feb 29, 2024 • 60