
RLHF

Contributors: Tassilo Neubauer, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs, rather than from labeled data or a ground-truth reward signal.
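In the most common setup, human raters compare pairs of model outputs, a reward model is trained to predict which output the rater preferred, and the model is then fine-tuned with reinforcement learning (typically PPO, with a KL penalty toward the original model) against that learned reward. The sketch below illustrates only the reward-modelling step under stated assumptions: it uses PyTorch, and the RewardModel class, embedding dimension, and random tensors standing in for encoded (prompt, response) pairs are hypothetical placeholders rather than any particular system's implementation.

```python
# Minimal sketch of RLHF's reward-modelling step.
# Assumes PyTorch; names, dimensions, and data below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size representation of a (prompt, response) pair to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def preference_loss(reward_model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

# Toy training step: random embeddings stand in for encoded (prompt, response)
# pairs that human raters have compared.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(32, 128)    # responses the raters preferred
rejected = torch.randn(32, 128)  # responses the raters rejected

optimizer.zero_grad()
loss = preference_loss(reward_model, chosen, rejected)
loss.backward()
optimizer.step()
```

The log-sigmoid of the reward difference is the standard Bradley-Terry objective for pairwise preference data; the learned reward model then stands in for the missing ground-truth reward during the subsequent RL fine-tuning stage.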

Posts tagged RLHF (Most Relevant; format: karma · title · author · posted · comments)

103 · Thoughts on the impact of RLHF research · Paul Christiano · 2mo · 39 comments
17 · [Link] Why I’m excited about AI-assisted human feedback · Jan Leike · 1y · 0 comments
77 · The Waluigi Effect (mega-post) · Cleo Nardo · 19d · 18 comments
46 · Trying to disambiguate different questions about whether RLHF is “good” · Buck Shlegeris · 3mo · 34 comments
85 · Mysteries of mode collapse · janus · 4mo · 20 comments
40 · [Link] Why I’m optimistic about OpenAI’s alignment approach · Jan Leike · 4mo · 10 comments
26 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 4mo · 6 comments
36 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · Lawrence Chan · 3mo · 5 comments
30 · Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) · Lawrence Chan · 1mo · 0 comments
23 · Take 13: RLHF bad, conditioning good. · Charlie Steiner · 3mo · 0 comments
23 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy, Megan Kinniment · 4mo · 5 comments
16 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 3mo · 1 comment
23 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · Charlie Steiner · 3mo · 5 comments
4 · Model-driven feedback could amplify alignment failures · Aidan O'Gara · 2mo · 0 comments
53 · Pretraining Language Models with Human Preferences · Tomek Korbak, Sam Bowman, Ethan Perez · 1mo · 4 comments