RLHF
• Applied to Is behavioral safety "solved" in non-adversarial conditions? by Ruben Bloom 8d ago
• Applied to The Compleat Cybornaut by ukc10014 15d ago
• Applied to Proposal: Using Monte Carlo tree search instead of RLHF for alignment research by Christopher King 1mo ago
• Applied to An alternative of PPO towards alignment by ml hkust 2mo ago
• Applied to Natural language alignment by Jacy Reese Anthis 2mo ago
• Applied to Exploratory Analysis of RLHF Transformers with TransformerLens by Curt Tigges 2mo ago
• Applied to GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2 by Christopher King 2mo ago
• Applied to Imitation Learning from Language Feedback by Jérémy Scheurer 2mo ago
• Applied to A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world! by Christopher King 2mo ago
• Applied to RLHF does not appear to differentially cause mode-collapse by Raymond Arnold 2mo ago
• Applied to Human preferences as RL critic values - implications for alignment by Seth Herd 3mo ago
• Applied to Reflections On The Feasibility Of Scalable-Oversight by Felix Hofstätter 3mo ago
• Applied to The Waluigi Effect (mega-post) by Cleo Nardo 3mo ago
• Applied to A library for safety research in conditioning on RLHF tasks by Raymond Arnold 3mo ago
• Applied to [Preprint] Pretraining Language Models with Human Preferences by Giulio Starace 3mo ago
• Applied to Validator models: A simple approach to detecting goodharting by Beren Millidge 3mo ago