
Reinforcement learning

Edited by pedrochaves, TurnTrout, Roman Leventov, et al. last updated 30th Dec 2024

Reinforcement Learning is the study of how to train agents to complete tasks by updating ("reinforcing") the agents with feedback signals.

Related: Inverse Reinforcement Learning, Machine learning, Friendly AI, Game Theory, Prediction, Agency.

Consider an agent that receives an input describing the current state of its environment. Based only on that information, the agent must choose an action from a set of available actions. That action in turn changes the state of the environment, producing a new input, and so on; at each step the agent also receives a reward (or reinforcement signal) reflecting the consequences of its actions in the environment. In "policy gradient" approaches, the reinforcement signal is used to update the agent (the "policy"), although sometimes an agent will instead perform limited online (model-based) heuristic search to optimize the reward signal plus a heuristic evaluation.
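
A minimal sketch of this loop in Python may make it concrete. The `Environment` and `Agent` classes below are illustrative stand-ins, not the API of any particular library:

```python
import random

class Environment:
    """A toy environment with two states; matching the state pays 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0  # reinforcement signal
        self.state = random.randint(0, 1)              # environment transitions
        return self.state, reward

class Agent:
    """Picks actions uniformly at random; a learning agent would update itself here."""
    def act(self, state):
        return random.randint(0, 1)

env, agent = Environment(), Agent()
state, total = env.state, 0.0
for _ in range(1000):
    action = agent.act(state)         # decide based only on the current state
    state, reward = env.step(action)  # the action changes the environment
    total += reward
print(f"average reward: {total / 1000:.2f}")  # ~0.5 for a random policy
```

A learning agent would replace the random `act` with a policy that is updated from the reward signal, for example by policy gradient.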

RL is distinguished from energy-based architectures such as Active Inference, Joint Embedding Predictive Architectures (JEPA), and GFlowNets.

Exploration and Optimization

Since selecting actions at random yields poor performance, one of the central problems in reinforcement learning is exploring the available set of actions, so as to avoid getting stuck with sub-optimal choices and to move on to better ones.

This is the problem of exploration, best illustrated by the most-studied reinforcement learning problem: the k-armed bandit. In it, an agent must decide which sequence of levers to pull in a gambling room, with no information about each machine's probability of paying out beyond the rewards it actually receives. The problem revolves around deciding which lever is optimal and what criteria define it as such.
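
A simple and widely used answer is epsilon-greedy exploration: with probability ε pull a random lever, otherwise pull the lever whose estimated value is currently highest. A self-contained sketch, with illustrative payoff distributions and constants:

```python
import random

K = 10
true_means = [random.gauss(0, 1) for _ in range(K)]  # unknown to the agent

def pull(lever):
    """Sample a noisy reward from the chosen lever's payoff distribution."""
    return random.gauss(true_means[lever], 1.0)

epsilon = 0.1
estimates = [0.0] * K  # running estimate of each lever's expected reward
counts = [0] * K

for _ in range(10_000):
    if random.random() < epsilon:
        lever = random.randrange(K)                        # explore
    else:
        lever = max(range(K), key=lambda a: estimates[a])  # exploit
    reward = pull(lever)
    counts[lever] += 1
    # Incremental sample-average update of the value estimate.
    estimates[lever] += (reward - estimates[lever]) / counts[lever]

best = max(range(K), key=lambda a: estimates[a])
print(f"estimated best lever: {best}, true best: {true_means.index(max(true_means))}")
```

With ε = 0, the agent can lock onto a lever that merely paid well early; a small ε keeps it sampling the alternatives.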

In parallel with exploration, it is still necessary to choose the criterion that makes one action optimal relative to another. The study of this question has led to several methods, from brute-force search to methods that account for temporal differences in the received reward. Despite these techniques and the strong results reinforcement methods have obtained on small problems, the approach has long suffered from a lack of scalability, having difficulty with larger, closer-to-human scenarios.
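
As one example of the temporal-difference idea mentioned above, TD(0) updates a state's value estimate toward the immediate reward plus the discounted estimate of the next state, rather than waiting for the episode's final outcome. A sketch on a toy chain environment (the environment and constants are illustrative):

```python
import random

N = 5                    # states 0..4; entering state 4 pays 1 and ends the episode
values = [0.0] * N       # V(s): estimated value of each state
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

for _ in range(5_000):
    state = 0
    while state != N - 1:
        # Random-walk policy: step left or right along the chain.
        next_state = max(0, min(N - 1, state + random.choice([-1, 1])))
        reward = 1.0 if next_state == N - 1 else 0.0
        # TD(0): nudge V(s) toward r + gamma * V(s'), the temporal-difference target.
        values[state] += alpha * (reward + gamma * values[next_state] - values[state])
        state = next_state

print([round(v, 2) for v in values])  # values rise toward the rewarding end of the chain
```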

Further Reading & References

  • Sutton, Richard S.; Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press. ISBN 0-262-19398-1.
  • Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237–285.

See Also

  • Machine learning
  • Friendly AI
  • Game theory
  • Prediction
Posts tagged Reinforcement learning

  • Think carefully before calling RL policies "agents" (TurnTrout)
  • Reward is not the optimization target (TurnTrout)
  • Draft papers for REALab and Decoupled Approval on tampering (Jonathan Uesato, Ramana Kumar)
  • The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? (RogerDearnaley)
  • Models Don't "Get Reward" (Sam Ringer)
  • EfficientZero: How It Works (1a3orn)
  • AGI will have learnt utility functions (beren)
  • Remaking EfficientZero (as best I can) (Hoagy)
  • Why almost every RL agent does learned optimization (Lee Sharkey)
  • Interpreting the Learning of Deceit (RogerDearnaley)
  • Jitters No Evidence of Stupidity in RL (1a3orn)
  • DeepMind: Generally capable agents emerge from open-ended play (Daniel Kokotajlo)
  • Shard Theory: An Overview (David Udell)
  • LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem (Steven Byrnes)
  • Reward Is Not Enough (Steven Byrnes)