AI ALIGNMENT FORUM
RLHF
You are viewing revision 1.2.0, last edited by Tassilo Neubauer.
RLHF stands for Reinforcement Learning from Human Feedback: a technique for training models by fitting a reward model to human preference judgments over model outputs, then optimizing the policy against that learned reward.
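The two-stage loop behind that definition can be sketched in miniature. The example below is a toy, assuming a bandit-style setting with three fixed candidate responses and a simulated labeler; all names and numbers are illustrative, not the method of any particular post or lab. It fits per-response rewards from pairwise preferences with a Bradley-Terry model, then nudges a softmax policy toward the learned reward (a stand-in for the PPO step used in practice).

```python
# Toy RLHF sketch: learn rewards from pairwise human preferences,
# then do a simple policy-gradient update toward the learned reward.
import math
import random

random.seed(0)

# Candidate "responses" the policy can emit; the hidden score stands in
# for how much a human labeler actually likes each one (illustrative).
hidden_quality = {"helpful": 2.0, "neutral": 0.5, "rude": -1.0}
responses = list(hidden_quality)

# Step 1: collect preference pairs (winner, loser). Here we simulate a
# labeler who always prefers the response with higher hidden quality.
pairs = [("helpful", "neutral"), ("helpful", "rude"), ("neutral", "rude")] * 20

# Step 2: fit a scalar reward per response with the Bradley-Terry model,
# ascending the log-likelihood of sigma(r_winner - r_loser).
reward = {r: 0.0 for r in responses}
lr = 0.1
for _ in range(200):
    for win, lose in pairs:
        p_win = 1.0 / (1.0 + math.exp(-(reward[win] - reward[lose])))
        grad = 1.0 - p_win  # d/d(margin) of log sigma(margin)
        reward[win] += lr * grad
        reward[lose] -= lr * grad

# Step 3: policy-gradient update of a softmax policy toward higher
# learned reward, using the expected reward as a baseline.
logits = {r: 0.0 for r in responses}
for _ in range(100):
    z = sum(math.exp(v) for v in logits.values())
    probs = {r: math.exp(v) / z for r, v in logits.items()}
    baseline = sum(probs[r] * reward[r] for r in responses)
    for r in responses:
        logits[r] += 0.5 * probs[r] * (reward[r] - baseline)

best = max(logits, key=logits.get)
print(best)  # the response the trained policy concentrates on
```

Real RLHF differs mainly in scale: the reward model is a neural network scoring full text, the policy is the language model itself, and the RL step (typically PPO) includes a KL penalty keeping the policy close to its pretrained base, a detail several of the posts below discuss.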
Posts tagged RLHF (Most Relevant)
- Thoughts on the impact of RLHF research, by Paul Christiano (1y)
- [Link] Why I’m excited about AI-assisted human feedback, by Jan Leike (2y)
- The Waluigi Effect (mega-post), by Cleo Nardo (1y)
- Interpreting the Learning of Deceit, by Roger Dearnaley (4mo)
- Trying to disambiguate different questions about whether RLHF is “good”, by Buck Shlegeris (1y)
- Mysteries of mode collapse, by janus (1y)
- [Link] Why I’m optimistic about OpenAI’s alignment approach, by Jan Leike (1y)
- Update to Mysteries of mode collapse: text-davinci-002 not RLHF, by janus (1y)
- Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic), by Lawrence Chan (1y)
- MetaAI: less is less for alignment., by Cleo Nardo (10mo)
- Towards Understanding Sycophancy in Language Models, by Ethan Perez, mrinank_sharma, Meg, and Tomek Korbak (6mo)
- Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic), by Lawrence Chan (1y)
- Take 13: RLHF bad, conditioning good., by Charlie Steiner (1y)
- Mode collapse in RL may be fueled by the update equation, by Alex Turner and MichaelEinhorn (10mo)
- Run evals on base models too!, by orthonormal (16d)