AI ALIGNMENT FORUM
Adversarial Examples
• Applied to An examination of GPT-2's boring yet effective glitch (by Miguel de Guzman, 12d ago)
• Applied to Adversarial Robustness Could Help Prevent Catastrophic Misuse (by Aidan O'Gara, 4mo ago)
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs (by Stephen Casper, 5mo ago)
• Applied to Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks. (by Haem, 5mo ago)
• Applied to Features and Adversaries in MemoryDT (by Tassilo Neubauer, 6mo ago)
• Applied to RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%! (by Noosphere89, 7mo ago)
• Applied to Image Hijacks: Adversarial Images can Control Generative Models at Runtime (by Scott Emmons, 7mo ago)
• Applied to An adversarial example for Direct Logit Attribution: memory management in gelu-4l (by Can, 8mo ago)
• Applied to There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs (by Noosphere89, 9mo ago)
• Applied to Even Superhuman Go AIs Have Surprising Failure Modes (by AdamGleave, 9mo ago)
• Applied to A Search for More ChatGPT / GPT-3.5 / GPT-4 "Unspeakable" Glitch Tokens (by Martin Fell, 1y ago)
• Applied to SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4 (by Adam Yedidia, 1y ago)
• Applied to AI Safety in a World of Vulnerable Machine Learning Systems (by AdamGleave, 1y ago)
• Applied to EIS XI: Moving Forward (by Stephen Casper, 1y ago)
• Applied to EIS XII: Summary (by Stephen Casper, 1y ago)
• Applied to EIS X: Continual Learning, Modularity, Compression, and Biological Brains (by Stephen Casper, 1y ago)
• Applied to EIS IX: Interpretability and Adversaries (by Stephen Casper, 1y ago)
• Applied to Human beats SOTA Go AI by learning an adversarial policy (by Vanessa Kosoy, 1y ago)