Khanh Nguyen

Incoming postdoc at CHAI, UC Berkeley, working under Prof. Stuart Russell's supervision

Nice work! I agree with many of the opinions expressed. Nevertheless, the survey is missing several key citations.

Personally, I have made several contributions to RLHF:

  1. Reinforcement learning for bandit neural machine translation with simulated human feedback (Nguyen et al., 2017) is the first paper to show the potential of using noisy rating feedback to train text generators (a toy sketch of this setup follows this list).
  2. Interactive learning from activity description (Nguyen et al., 2021) is one of the first frameworks for learning from descriptive language feedback with theoretical guarantees.
  3. Language models are bounded pragmatic speakers: Understanding RLHF from a Bayesian cognitive modeling perspective (Nguyen 2023) proposes a theoretical framework that makes it easy to understand the limitations of RLHF and derive directions for improvement.
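
To make the bandit-feedback idea in (1) concrete, here is a minimal toy sketch (not the paper's exact setup): a tabular per-position policy samples a token sequence, a simulated rater returns a single noisy scalar score, and a REINFORCE-style update is applied. The `TARGET` sequence and the rater's noise model are my own hypothetical stand-ins for a real reference translation and a real human.

```python
import torch

VOCAB, SEQ_LEN = 16, 5
TARGET = torch.tensor([3, 1, 4, 1, 5])  # hypothetical "reference" output
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)  # toy policy params
opt = torch.optim.Adam([logits], lr=0.1)

def noisy_rating(sample):
    """Simulated rater: fraction of tokens matching TARGET, plus noise."""
    score = (sample == TARGET).float().mean()
    return (score + 0.1 * torch.randn(())).clamp(0.0, 1.0)

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()         # one candidate output (the "bandit arm")
    reward = noisy_rating(sample)  # single scalar rating, no per-token signal
    loss = -(reward * dist.log_prob(sample).sum())  # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

print(logits.argmax(-1))  # tends toward TARGET despite the noisy feedback
```

The point of the sketch is that the learner only ever sees a scalar rating of its own sampled output, never the reference itself, yet the policy gradient still pushes probability mass toward highly rated sequences.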

Recently, there have also been several interesting papers on fine-tuning LLMs that were not cited:

  1. Ruibo Liu has authored a series of papers on this topic. In particular, Training Socially Aligned Language Models in Simulated Human Society allows the generation and incorporation of rich language feedback.
  2. Hao Liu has also proposed promising ideas (e.g., Chain of Hindsight Aligns Language Models with Feedback); a rough sketch of that idea follows this list.
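
As I read Chain of Hindsight, the core move is to turn feedback into training text: outputs of different quality are chained together with natural-language feedback phrases, and the model is fine-tuned with ordinary language modeling so it learns to condition generation on the feedback. The sketch below is my reading of that data-construction step, not the paper's exact recipe; the feedback phrases and helper name are hypothetical placeholders.

```python
def make_choh_example(prompt, bad_answer, good_answer):
    """Chain one weaker and one stronger answer with feedback phrases."""
    # "A less/more helpful answer" are hypothetical feedback templates.
    return (
        f"{prompt}\n"
        f"A less helpful answer: {bad_answer}\n"
        f"A more helpful answer: {good_answer}"
    )

example = make_choh_example(
    prompt="Explain overfitting in one sentence.",
    bad_answer="It is when a model is bad.",
    good_answer="It is when a model fits noise in the training data and "
                "generalizes poorly to new data.",
)
print(example)
# Fine-tune an LM on many such strings; at inference time, prompting with
# the positive feedback phrase elicits the preferred behavior.
```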

I hope you find these pointers useful. Happy to chat more :)