AI ALIGNMENT FORUM

RLHF

Edited by Multicore et al., last updated 2nd Oct 2024

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs rather than from labeled data or a ground-truth reward signal. In the standard setup, human preference judgments between pairs of model outputs are used to train a reward model, which then provides the reward signal for reinforcement-learning fine-tuning of the original model.
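
As a rough illustration of the first stage of such a pipeline, the sketch below fits a reward model to pairwise human preferences with a Bradley-Terry style loss. The RewardModel class, embedding dimension, and random tensors are hypothetical placeholders rather than anything described on this page; in practice the reward model typically shares a language-model backbone, and the learned reward then drives RL fine-tuning of the policy (e.g., with PPO).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a fixed-size response embedding to a scalar score.
# In practice this would be a language-model backbone with a scalar head.
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example


def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Hypothetical training loop over (chosen, rejected) embedding pairs,
# standing in for real human comparison data.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    chosen = torch.randn(32, 128)    # placeholder for preferred responses
    rejected = torch.randn(32, 128)  # placeholder for rejected responses
    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```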

Posts tagged RLHF (score · title · author(s) · age · comments)

110 · Thoughts on the impact of RLHF research · paulfchristiano · 3y · 39
17 · [Link] Why I’m excited about AI-assisted human feedback · janleike · 3y · 0
58 · The Waluigi Effect (mega-post) · Cleo Nardo · 3y · 26
89 · Mysteries of mode collapse · janus · 3y · 25
13 · Interpreting the Learning of Deceit · RogerDearnaley · 2y · 2
46 · Trying to disambiguate different questions about whether RLHF is “good” · Buck · 3y · 40
40 · [Link] Why I’m optimistic about OpenAI’s alignment approach · janleike · 3y · 12
33 · On the functional self of LLMs · eggsyntax · 2mo · 0
24 · MetaAI: less is less for alignment. · Cleo Nardo · 2y · 3
26 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 3y · 7
35 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · LawrenceC · 3y · 5
39 · Towards Understanding Sycophancy in Language Models · Ethan Perez, mrinank_sharma, Meg, Tomek Korbak · 2y · 0
30 · Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) · LawrenceC · 3y · 0
18 · A "Bitter Lesson" Approach to Aligning AGI and ASI · RogerDearnaley · 1y · 0
24 · Take 13: RLHF bad, conditioning good. · Charlie Steiner · 3y · 0

(15 of 41 posts shown)