AI ALIGNMENT FORUM

Wikitags

Reward Functions

Edited by Dakara, last updated 30th Dec 2024

A reward function is a mathematical function in reinforcement learning that defines which actions or outcomes are desirable for an AI system by assigning numerical values (rewards) to states or state-action pairs. It encodes the goals and preferences we want the AI to optimize for; specifying a reward function that avoids unintended consequences is a significant open challenge in AI development.
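
To make the definition concrete, a reward function is often written as a map from states (or state-action pairs) to real numbers, R : S × A → ℝ. Below is a minimal sketch for a hypothetical gridworld; the state layout, the names GOAL and LAVA, and all numeric values are illustrative assumptions for this example, not taken from any particular post or library.

```python
# Minimal sketch of a reward function for a toy gridworld.
# (GOAL, LAVA, and all numeric values are made up for illustration.)

GOAL = (4, 4)            # terminal cell the designer wants reached
LAVA = {(2, 2), (3, 1)}  # hazardous cells the designer wants avoided

def reward(state, action, next_state):
    """Map one (state, action, next_state) transition to a scalar reward."""
    if next_state == GOAL:
        return 1.0    # positive reward for reaching the goal
    if next_state in LAVA:
        return -1.0   # penalty for stepping into a hazard
    return -0.01      # small per-step cost to encourage short paths

# Example: moving right from (3, 4) into the goal cell earns +1.0.
print(reward((3, 4), "right", (4, 4)))  # 1.0
```

Even in a toy example like this, the relative magnitudes of the goal reward, the hazard penalty, and the step cost shape what the trained agent actually does; miscalibrating them is a small-scale instance of the specification problems the posts below discuss.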

Posts tagged Reward Functions
94 · Reward is not the optimization target · Alex Turner · 3y · 88 comments
28 · Draft papers for REALab and Decoupled Approval on tampering · Jonathan Uesato, Ramana Kumar · 5y · 2 comments
43 · Reward hacking behavior can generalize across tasks · Kei, Isaac Dunn, Henry Sleight, Miles Turpin, Evan Hubinger, Carson Denison, Ethan Perez · 1y · 1 comment
53 · Scaling Laws for Reward Model Overoptimization · leogao, John Schulman, Jacob Hilton · 3y · 5 comments
40 · [Question] Seriously, what goes wrong with "reward the agent when it makes you smile"? · Alex Turner, johnswentworth · 3y · 13 comments
39 · Interpreting Preference Models w/ Sparse Autoencoders · Logan Riggs Smith, Jannik Brinkmann · 1y · 10 comments
20 · Four usages of "loss" in AI · Alex Turner · 3y · 16 comments
29 · A quick list of reward hacking interventions · Alex Mallen · 1mo · 2 comments
15 · Language Agents Reduce the Risk of Existential Catastrophe · Cameron Domenico Kirk-Giannini, Simon Goldstein · 2y · 1 comment
11 · $100/$50 rewards for good references · Stuart Armstrong · 4y · 2 comments
55 · Utility ≠ Reward · Vladimir Mikulik · 6y · 16 comments
13 · Shutdown-Seeking AI · Simon Goldstein · 2y · 5 comments
27 · A Short Dialogue on the Meaning of Reward Functions · Leon Lang, Quintin Pope, peligrietzer · 3y · 0 comments
10 · Thoughts on reward engineering · Paul Christiano · 6y · 19 comments
15 · The Theoretical Reward Learning Research Agenda: Introduction and Motivation · Joar Skalse · 5mo · 4 comments
(Showing 15 of 26 tagged posts.)