
RLHF

Contributors: Tassilo Neubauer, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal.
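As a rough illustration of what "training signal from human evaluations" means in practice, here is a minimal PyTorch sketch of the reward-modeling step: a reward model is fit to pairwise human preference comparisons and then stands in for a ground-truth reward. The `RewardModel` network, the toy feature-vector "outputs", and the `human_prefers` oracle are hypothetical placeholders for a real policy's outputs and real annotators, not code from any of the posts below.

```python
# Minimal sketch of learning an RLHF-style reward signal from pairwise
# human preferences (toy feature vectors stand in for model outputs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an output with a scalar reward (hypothetical architecture)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def human_prefers(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Stand-in for a human annotator: 1.0 where output `a` is preferred."""
    return (a[:, 0] > b[:, 0]).float()

dim = 8
reward_model = RewardModel(dim)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    # Two candidate outputs per comparison; ask the "human" which is better.
    a, b = torch.randn(64, dim), torch.randn(64, dim)
    prefer_a = human_prefers(a, b)

    # Bradley-Terry-style objective: the preferred output should score higher.
    logits = reward_model(a) - reward_model(b)
    loss = F.binary_cross_entropy_with_logits(logits, prefer_a)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# reward_model(x) now provides the training signal for RL fine-tuning
# (e.g., PPO), in place of labeled data or a ground-truth reward.
```

In a full pipeline the candidate outputs would be text sampled from the policy being fine-tuned, and the learned reward is typically combined with a KL penalty toward the original model; this sketch only shows where the human feedback enters.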

Posts tagged RLHF (15 of 23 shown)

Each entry: karma · title · author · age · comments

104 · Thoughts on the impact of RLHF research · Paul Christiano · 8mo · 39 comments
17 · [Link] Why I'm excited about AI-assisted human feedback · Jan Leike · 1y · 0 comments
61 · The Waluigi Effect (mega-post) · Cleo Nardo · 7mo · 24 comments
46 · Trying to disambiguate different questions about whether RLHF is "good" · Buck Shlegeris · 10mo · 34 comments
89 · Mysteries of mode collapse · janus · 1y · 22 comments
40 · [Link] Why I'm optimistic about OpenAI's alignment approach · Jan Leike · 10mo · 12 comments
26 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 10mo · 6 comments
35 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · Lawrence Chan · 10mo · 5 comments
24 · MetaAI: less is less for alignment. · Cleo Nardo · 4mo · 3 comments
30 · Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) · Lawrence Chan · 8mo · 0 comments
23 · Take 13: RLHF bad, conditioning good. · Charlie Steiner · 9mo · 0 comments
24 · Mode collapse in RL may be fueled by the update equation · Alex Turner, MichaelEinhorn · 3mo · 7 comments
24 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy, Megan Kinniment · 10mo · 5 comments
16 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 10mo · 1 comment
23 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · Charlie Steiner · 10mo · 5 comments