Why do we need RLHF? Imitation, Inverse RL, and the role of reward
This is a cross-post from my Substack to get some feedback from the alignment community. I have condensed some sections to get the message across. It's rarely questioned in the mainstream why we need RLHF in the first place: if we can do supervised fine-tuning (SFT) sufficiently well, then can...
Feb 3, 2024