Adversarial Training — AI Alignment Forum

Adversarial Training

This page is a stub.

Posts tagged Adversarial Training

52 Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming

2y

1

2

20 Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

4y

0

1

57 Takeaways from our robust injury classifier project [Redwood Research]

4y

7

1

65 High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC, Nate Thomas

4y

15

1

57 Deep Forgetting & Unlearning for Safely-Scoped LLMs

2y

6

0

42 Solving adversarial attacks in computer vision as a baby version of general AI alignment

2y

2

1

14 Adversarial Robustness Could Help Prevent Catastrophic Misuse

2y

15

1

15 Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

2y

0

2

10 AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

4y

0

1

6 Some thoughts on why adversarial training might be useful

4y

4

1

25 Latent Adversarial Training

4y

4

1

23 Beyond the Board: Exploring AI Robustness Through Go

2y

1

1

15 EIS IX: Interpretability and Adversaries

3y

5

1

14 Oversight Leagues: The Training Game as a Feature

4y

0

1

6 EIS XI: Moving Forward

3y

2
