AI ALIGNMENT FORUM
RLHF
You are viewing revision 1.2.0, last edited by Tassilo Neubauer.
RLHF stands for Reinforcement Learning from Human Feedback: a technique for training models by fitting a reward model to human preference judgments over model outputs, then optimizing the policy against that learned reward.
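The two-stage loop behind that definition can be sketched in miniature. The example below is a toy, assuming a bandit-style setting with three fixed candidate responses and a simulated labeler; all names and numbers are illustrative, not the method of any particular post or lab. It fits per-response rewards from pairwise preferences with a Bradley-Terry model, then nudges a softmax policy toward the learned reward (a stand-in for the PPO step used in practice).

```python
# Toy RLHF sketch: learn rewards from pairwise human preferences,
# then do a simple policy-gradient update toward the learned reward.
import math
import random

random.seed(0)

# Candidate "responses" the policy can emit; the hidden score stands in
# for how much a human labeler actually likes each one (illustrative).
hidden_quality = {"helpful": 2.0, "neutral": 0.5, "rude": -1.0}
responses = list(hidden_quality)

# Step 1: collect preference pairs (winner, loser). Here we simulate a
# labeler who always prefers the response with higher hidden quality.
pairs = [("helpful", "neutral"), ("helpful", "rude"), ("neutral", "rude")] * 20

# Step 2: fit a scalar reward per response with the Bradley-Terry model,
# ascending the log-likelihood of sigma(r_winner - r_loser).
reward = {r: 0.0 for r in responses}
lr = 0.1
for _ in range(200):
    for win, lose in pairs:
        p_win = 1.0 / (1.0 + math.exp(-(reward[win] - reward[lose])))
        grad = 1.0 - p_win  # d/d(margin) of log sigma(margin)
        reward[win] += lr * grad
        reward[lose] -= lr * grad

# Step 3: policy-gradient update of a softmax policy toward higher
# learned reward, using the expected reward as a baseline.
logits = {r: 0.0 for r in responses}
for _ in range(100):
    z = sum(math.exp(v) for v in logits.values())
    probs = {r: math.exp(v) / z for r, v in logits.items()}
    baseline = sum(probs[r] * reward[r] for r in responses)
    for r in responses:
        logits[r] += 0.5 * probs[r] * (reward[r] - baseline)

best = max(logits, key=logits.get)
print(best)  # the response the trained policy concentrates on
```

Real RLHF differs mainly in scale: the reward model is a neural network scoring full text, the policy is the language model itself, and the RL step (typically PPO) includes a KL penalty keeping the policy close to its pretrained base, a detail several of the posts below discuss.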
Posts tagged RLHF (Most Relevant)
- Thoughts on the impact of RLHF research, by Paul Christiano (1y)
- [Link] Why I’m excited about AI-assisted human feedback, by Jan Leike (2y)
- The Waluigi Effect (mega-post), by Cleo Nardo (1y)
- Interpreting the Learning of Deceit, by Roger Dearnaley (4mo)
- Trying to disambiguate different questions about whether RLHF is “good”, by Buck Shlegeris (1y)
- Mysteries of mode collapse, by janus (1y)
- [Link] Why I’m optimistic about OpenAI’s alignment approach, by Jan Leike (1y)
- Update to Mysteries of mode collapse: text-davinci-002 not RLHF, by janus (1y)
- Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic), by Lawrence Chan (1y)
- MetaAI: less is less for alignment., by Cleo Nardo (10mo)
- Towards Understanding Sycophancy in Language Models, by Ethan Perez, mrinank_sharma, Meg, and Tomek Korbak (6mo)
- Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic), by Lawrence Chan (1y)
- Take 13: RLHF bad, conditioning good., by Charlie Steiner (1y)
- Mode collapse in RL may be fueled by the update equation, by Alex Turner and MichaelEinhorn (10mo)
- Run evals on base models too!, by orthonormal (16d)