AI ALIGNMENT FORUM
RLHF
• Applied to "Beginner's question about RLHF" by Ruben Bloom, 1mo ago
• Applied to "AI #23: Fundamental Problems with RLHF" by Tobias D., 2mo ago
• Applied to "Open Problems and Fundamental Limitations of RLHF" by Stephen Casper, 2mo ago
• Applied to "Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI" by Benaya Koren, 2mo ago
• Applied to "Challenge proposal: smallest possible self-hardening backdoor for RLHF" by Christopher King, 3mo ago
• Applied to "MetaAI: less is less for alignment." by Cleo Nardo, 3mo ago
• Applied to "Is behavioral safety "solved" in non-adversarial conditions?" by Ruben Bloom, 4mo ago
• Applied to "The Compleat Cybornaut" by ukc10014, 4mo ago
• Applied to "Mode collapse in RL may be fueled by the update equation" by Alex Turner, 5mo ago
• Applied to "Proposal: Using Monte Carlo tree search instead of RLHF for alignment research" by Christopher King, 5mo ago
• Applied to "An alternative of PPO towards alignment" by ml hkust, 5mo ago
• Applied to "Natural language alignment" by Jacy Reese Anthis, 5mo ago
• Applied to "Exploratory Analysis of RLHF Transformers with TransformerLens" by Curt Tigges, 6mo ago
• Applied to "GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2" by Christopher King, 6mo ago
• Applied to "Imitation Learning from Language Feedback" by Jérémy Scheurer, 6mo ago
• Applied to "A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world!" by Christopher King, 6mo ago
• Applied to "RLHF does not appear to differentially cause mode-collapse" by Raymond Arnold, 6mo ago