
AI ALIGNMENT FORUM


Zhijing Jin — AI Alignment Forum


Top posts


Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

TL;DR: This post discusses our exploration of the effects of domain-specific fine-tuning and of how the characteristics of fine-tuning data relate to adversarial vulnerability. We also explore the implications for real-world applications and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems....

Jun 11, 2025 • 6

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

by David Guzman Piedrahita, Yongjin Yang, and Zhijing Jin

Summary:
* Traditional LLMs outperform reasoning models in cooperative Public Goods tasks. Models like Llama-3.3-70B maintain ~90% contribution rates in public goods games, while reasoning-focused models (o1, o3 series) average only ~40%.
* We observe an "increased tendency to escape regulations" in reasoning models. As models improve in analytical capabilities,...

Apr 22, 2025 • 24