AI ALIGNMENT FORUM
RLHF
• Applied to "DIY RLHF: A simple implementation for hands on experience" by Mike Vaiana, 17d ago
• Applied to "A "Bitter Lesson" Approach to Aligning AGI and ASI" by Roger Dearnaley, 21d ago
• Applied to "AXRP Episode 33 - RLHF Problems with Scott Emmons" by DanielFilan, 1mo ago
• Applied to "Run evals on base models too!" by orthonormal, 4mo ago
• Applied to "Why do we need RLHF? Imitation, Inverse RL, and the role of reward" by Ran W, 6mo ago
• Applied to "The case for more ambitious language model evals" by Arun Jose, 6mo ago
• Applied to "The True Story of How GPT-2 Became Maximally Lewd" by Writer, 6mo ago
• Applied to "Interpreting the Learning of Deceit" by Roger Dearnaley, 7mo ago
• Applied to "Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks." by Haem, 9mo ago
• Applied to "Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation" by Soroush Pour, 9mo ago
• Applied to "Paul Christiano on Dwarkesh Podcast" by ESRogs, 9mo ago
• Applied to "Wireheading and misalignment by composition on NetHack" by pierlucadoro, 9mo ago
• Applied to "Compositional preference models for aligning LMs" by Tomek Korbak, 9mo ago
• Applied to "Towards Understanding Sycophancy in Language Models" by Ethan Perez, 9mo ago