AI ALIGNMENT FORUM

Adversarial Training

This page is a stub.
Posts tagged Adversarial Training
[50] Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming (Buck Shlegeris, 7mo; 1 comment; tag relevance 3)
[20] Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing (Buck Shlegeris, 3y; 0 comments; tag relevance 2)
[57] Takeaways from our robust injury classifier project [Redwood Research] (dmz, 3y; 7 comments; tag relevance 1)
[65] High-stakes alignment via adversarial training [Redwood Research report] (dmz, Lawrence Chan, Nate Thomas, 3y; 15 comments; tag relevance 1)
[57] Deep Forgetting & Unlearning for Safely-Scoped LLMs (Stephen Casper, 1y; 6 comments; tag relevance 1)
[39] Solving adversarial attacks in computer vision as a baby version of general AI alignment (Stanislav Fort, 9mo; 1 comment; tag relevance 0)
[14] Adversarial Robustness Could Help Prevent Catastrophic Misuse (ao, 1y; 15 comments; tag relevance 1)
[15] Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? (Stephen Casper, 10mo; 0 comments; tag relevance 1)
[10] AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler (DanielFilan, 3y; 0 comments; tag relevance 2)
[6] Some thoughts on why adversarial training might be useful (Beth Barnes, 3y; 4 comments; tag relevance 1)
[24] Latent Adversarial Training (Adam Jermyn, 3y; 4 comments; tag relevance 1)
[23] Beyond the Board: Exploring AI Robustness Through Go (AdamGleave, 11mo; 1 comment; tag relevance 1)
[15] EIS IX: Interpretability and Adversaries (Stephen Casper, 2y; 5 comments; tag relevance 1)
[14] Oversight Leagues: The Training Game as a Feature (Paul Bricman, 3y; 0 comments; tag relevance 1)
[6] EIS XI: Moving Forward (Stephen Casper, 2y; 2 comments; tag relevance 1)
Showing 15 of 18 posts.