
RLHF

Contributors: Tassilo Neubauer, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs, rather than from labeled data or a ground-truth reward signal.
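In the most common setup, human raters compare pairs of model outputs, a reward model is trained to predict which output the rater preferred, and the model is then fine-tuned with reinforcement learning (typically PPO, with a KL penalty toward the original model) against that learned reward. The sketch below illustrates only the reward-modelling step under stated assumptions: it uses PyTorch, and the RewardModel class, embedding dimension, and random tensors standing in for encoded (prompt, response) pairs are hypothetical placeholders rather than any particular system's implementation.

```python
# Minimal sketch of RLHF's reward-modelling step.
# Assumes PyTorch; names, dimensions, and data below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size representation of a (prompt, response) pair to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def preference_loss(reward_model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

# Toy training step: random embeddings stand in for encoded (prompt, response)
# pairs that human raters have compared.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(32, 128)    # responses the raters preferred
rejected = torch.randn(32, 128)  # responses the raters rejected

optimizer.zero_grad()
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()
optimizer.step()
```

The log-sigmoid of the reward difference is the standard Bradley-Terry objective for pairwise preference data; the learned reward model then stands in for the missing ground-truth reward during the subsequent RL fine-tuning stage.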

Posts tagged RLHF (Most Relevant; format: karma · title · author · posted · comments)

103 · Thoughts on the impact of RLHF research · Paul Christiano · 2mo · 39 comments
17 · [Link] Why I’m excited about AI-assisted human feedback · Jan Leike · 1y · 0 comments
77 · The Waluigi Effect (mega-post) · Cleo Nardo · 19d · 18 comments
46 · Trying to disambiguate different questions about whether RLHF is “good” · Buck Shlegeris · 3mo · 34 comments
85 · Mysteries of mode collapse · janus · 4mo · 20 comments
40 · [Link] Why I’m optimistic about OpenAI’s alignment approach · Jan Leike · 4mo · 10 comments
26 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 4mo · 6 comments
36 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · Lawrence Chan · 3mo · 5 comments
30 · Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) · Lawrence Chan · 1mo · 0 comments
23 · Take 13: RLHF bad, conditioning good. · Charlie Steiner · 3mo · 0 comments
23 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy, Megan Kinniment · 4mo · 5 comments
16 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 3mo · 1 comment
23 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · Charlie Steiner · 3mo · 5 comments
4 · Model-driven feedback could amplify alignment failures · Aidan O'Gara · 2mo · 0 comments
53 · Pretraining Language Models with Human Preferences · Tomek Korbak, Sam Bowman, Ethan Perez · 1mo · 4 comments