RLHF
• Applied to Is behavioral safety "solved" in non-adversarial conditions? by Ruben Bloom 8d ago
• Applied to The Compleat Cybornaut by ukc10014 15d ago
• Applied to Proposal: Using Monte Carlo tree search instead of RLHF for alignment research by Christopher King 1mo ago
• Applied to An alternative of PPO towards alignment by ml hkust 2mo ago
• Applied to Natural language alignment by Jacy Reese Anthis 2mo ago
• Applied to Exploratory Analysis of RLHF Transformers with TransformerLens by Curt Tigges 2mo ago
• Applied to GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2 by Christopher King 2mo ago
• Applied to Imitation Learning from Language Feedback by Jérémy Scheurer 2mo ago
• Applied to A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world! by Christopher King 2mo ago
• Applied to RLHF does not appear to differentially cause mode-collapse by Raymond Arnold 2mo ago
• Applied to Human preferences as RL critic values - implications for alignment by Seth Herd 3mo ago
• Applied to Reflections On The Feasibility Of Scalable-Oversight by Felix Hofstätter 3mo ago
• Applied to The Waluigi Effect (mega-post) by Cleo Nardo 3mo ago
• Applied to A library for safety research in conditioning on RLHF tasks by Raymond Arnold 3mo ago
• Applied to [Preprint] Pretraining Language Models with Human Preferences by Giulio Starace 3mo ago
• Applied to Validator models: A simple approach to detecting goodharting by Beren Millidge 3mo ago