AI ALIGNMENT FORUM

Adversarial Training

This page is a stub.
Posts tagged Adversarial Training
Karma | Title | Author(s) | Age | Comments
50 | Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming | Buck | 11mo | 1
20 | Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing | Buck | 3y | 0
57 | Takeaways from our robust injury classifier project [Redwood Research] | dmz | 3y | 7
65 | High-stakes alignment via adversarial training [Redwood Research report] | dmz, LawrenceC, Nate Thomas | 3y | 15
57 | Deep Forgetting & Unlearning for Safely-Scoped LLMs | scasper | 2y | 6
39 | Solving adversarial attacks in computer vision as a baby version of general AI alignment | Stanislav Fort | 1y | 1
14 | Adversarial Robustness Could Help Prevent Catastrophic Misuse | aog | 2y | 15
15 | Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? | scasper | 1y | 0
10 | AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler | DanielFilan | 3y | 0
6 | Some thoughts on why adversarial training might be useful | Beth Barnes | 4y | 4
24 | Latent Adversarial Training | Adam Jermyn | 3y | 4
23 | Beyond the Board: Exploring AI Robustness Through Go | AdamGleave | 1y | 1
15 | EIS IX: Interpretability and Adversaries | scasper | 3y | 5
14 | Oversight Leagues: The Training Game as a Feature | Paul Bricman | 3y | 0
6 | EIS XI: Moving Forward | scasper | 3y | 2