AI ALIGNMENT FORUM
Reward Functions
Posts tagged Reward Functions (sorted by Most Relevant)
Rel | Karma | Title                                                                        | Author(s)                                                                                                | Age | Comments
----|-------|------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|-----|---------
  3 |    95 | Reward is not the optimization target                                        | Alex Turner                                                                                              | 2y  | 88
  3 |    28 | Draft papers for REALab and Decoupled Approval on tampering                  | Jonathan Uesato, Ramana Kumar                                                                            | 4y  | 2
  0 |    53 | Scaling Laws for Reward Model Overoptimization                               | leogao, John Schulman, Jacob Hilton                                                                      | 2y  | 5
  2 |    40 | [Question] Seriously, what goes wrong with "reward the agent when it makes you smile"? | Alex Turner, johnswentworth                                                                    | 2y  | 13
  1 |    39 | Interpreting Preference Models w/ Sparse Autoencoders                        | Logan Riggs Smith, Jannik Brinkmann                                                                      | 2mo | 10
  2 |    20 | Four usages of "loss" in AI                                                  | Alex Turner                                                                                              | 2y  | 16
  0 |    15 | Language Agents Reduce the Risk of Existential Catastrophe                   | Cameron Domenico Kirk-Giannini, Simon Goldstein                                                          | 1y  | 1
  1 |    11 | $100/$50 rewards for good references                                         | Stuart Armstrong                                                                                         | 3y  | 2
  1 |    55 | Utility ≠ Reward                                                             | Vladimir Mikulik                                                                                         | 5y  | 16
  0 |    41 | Reward hacking behavior can generalize across tasks                          | Kei, Isaac Dunn, Henry Sleight, Miles Turpin, Evan Hubinger, Carson Denison, Ethan Perez                 | 4mo | 1
  0 |    13 | Shutdown-Seeking AI                                                          | Simon Goldstein                                                                                          | 1y  | 5
  1 |    27 | A Short Dialogue on the Meaning of Reward Functions                          | Leon Lang, Quintin Pope, peligrietzer                                                                    | 2y  | 0
  1 |    10 | Thoughts on reward engineering                                               | Paul Christiano                                                                                          | 6y  | 19
  1 |     6 | The reward engineering problem                                               | Paul Christiano                                                                                          | 6y  | 1
  0 |     0 | Reward model hacking as a challenge for reward learning                      | Erik Jenner                                                                                              | 2y  | 0