Reinforcement Learning (RL)
- is the science of decision making
- is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward
- almost all RL problems can be formulated as a Markov Decision Process (MDP)
RL - Learning Paradigms
- there is no supervisor, only a scalar reward signal
- feedback may not be instantaneous (i.e. delayed)
- time matters: the data is sequential, not i.i.d.
- agent’s action affects the subsequent data it receives
RL - Components
rewards
- a reward 𝑅𝑡 is a scalar feedback signal
- indicates how well the agent is doing at timestep 𝑡
- the agent’s job is to maximize cumulative reward
RL is based on the reward hypothesis - all goals can be described by the maximization of expected cumulative reward
sequential decision making
- the goal is to select actions to maximize total future reward
- actions may have long-term consequences
- the reward may not be instantaneous (i.e. delayed)
- it may be better to sacrifice immediate reward to gain more long-term reward
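To make the trade-off between immediate and long-term reward concrete, the sketch below compares two made-up reward sequences. The numbers and the names `greedy_rewards`, `patient_rewards`, and `total_reward` are assumptions for illustration only.

```python
# Hypothetical reward sequences for two plans over five time-steps.
greedy_rewards = [5, 0, 0, 0, 0]        # takes a reward immediately, then nothing
patient_rewards = [-1, -1, -1, -1, 20]  # pays a small cost first, large delayed reward

def total_reward(rewards):
    """Cumulative (undiscounted) reward of a sequence."""
    return sum(rewards)

print(total_reward(greedy_rewards))   # 5
print(total_reward(patient_rewards))  # 16
```

The patient plan looks worse for the first four time-steps but ends with the higher cumulative reward, which is the quantity the agent is asked to maximize.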
at each time-step 𝑡 the agent:
- receives reward 𝑅𝑡
- receives observation 𝑂𝑡
- executes an action 𝐴𝑡
[Figure: reinforcement learning agent and environment]
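The per-time-step interaction can be sketched as a simple loop. Everything here is a toy assumption: `CoinFlipEnv`, `RepeatLastAgent`, `run`, and the `reset`/`step`/`act` method names are hypothetical and not taken from any library.

```python
import random

class CoinFlipEnv:
    """Toy environment (hypothetical): the agent guesses the hidden coin."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.choice(["H", "T"])

    def reset(self):
        # first reward and observation, before any action has been taken
        return 0, "start"

    def step(self, action):
        # the environment selects the reward and the next observation
        reward = 1 if action == self.coin else 0
        observation = self.coin                  # reveal the coin just guessed
        self.coin = self.rng.choice(["H", "T"])  # flip a new hidden coin
        return reward, observation

class RepeatLastAgent:
    """Toy agent (hypothetical): guesses whatever it observed last."""
    def act(self, observation):
        return observation if observation in ("H", "T") else "H"

def run(env, agent, steps):
    history = []                                 # (R_t, O_t, A_t) per time-step
    reward, observation = env.reset()
    for _ in range(steps):
        action = agent.act(observation)          # the agent selects the action
        history.append((reward, observation, action))
        reward, observation = env.step(action)
    return history

history = run(CoinFlipEnv(seed=0), RepeatLastAgent(), steps=10)
print(len(history), sum(r for r, _, _ in history))
```

Each loop iteration matches one time-step: the agent holds a reward and an observation, picks an action, and the environment answers with the next reward and observation.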
history
- is the sequence of rewards, observations, and actions from time-step 1 to 𝑡
- 𝐻1𝑡 = [𝑅1, 𝑂1, 𝐴1, 𝑅2, 𝑂2, 𝐴2, …, 𝑅𝑡, 𝑂𝑡, 𝐴𝑡]
what happens next depends on the history
- agent selects actions
- the environment selects observations & rewards
state
- is the summary of history used to determine what happens next
- is a function of history:
- 𝑆𝑡 = 𝑓(𝐻1𝑡)
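A few possible choices of the function 𝑓 are sketched below. The history format (a list of `(reward, observation, action)` tuples) and the function names are assumptions made for the example.

```python
# Hypothetical history: one (reward, observation, action) tuple per time-step.
history = [(0, "a", "left"), (1, "b", "right"), (0, "c", "left")]

def last_observation(history):
    """State = the most recent observation only."""
    return history[-1][1]

def last_k_observations(history, k=2):
    """State = the k most recent observations."""
    return tuple(obs for _, obs, _ in history[-k:])

def full_history(history):
    """State = the complete history (always valid, but grows without bound)."""
    return tuple(history)

print(last_observation(history))     # c
print(last_k_observations(history))  # ('b', 'c')
```

All three are valid state definitions; they differ in how much of the history they keep.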
2 state types:
- environment state 𝑆ₜᵉ
  - is the environment’s internal state representation
  - is whatever data the environment uses to pick the next observation & reward
  - has the Markov property
- agent state 𝑆ₜᵃ
  - is the agent’s internal state representation
  - is whatever data the agent uses to pick the next action
  - is the information used by RL algorithms
  - it can be any function of history
    - 𝑆ₜᵃ = 𝑓(𝐻1𝑡)
state with Markov property
- a state 𝑆ₜ has the Markov property iff: 𝐏(𝑆ₜ₊₁ | 𝑆₁, …, 𝑆ₜ) = 𝐏(𝑆ₜ₊₁ | 𝑆ₜ)
- the entire history from time 1 to 𝑡 (i.e. 𝐻1𝑡) has the Markov property
information/Markov state:
- has the Markov property
- contains all useful information from the history 𝐻1𝑡
- once the information state is known, the history is no longer needed
- is a sufficient statistic that can be used in determining the future
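The idea that a Markov state makes the rest of the history unnecessary can be illustrated with a toy deterministic process. The bouncing-ball setup and the names `simulate`, `successors`, and `looks_markov` are assumptions for this sketch. The check used (every state has exactly one observed successor) is a simplification that only applies to deterministic processes.

```python
from collections import defaultdict

def simulate(steps, pos=0, vel=1, lo=0, hi=4):
    """Deterministic ball bouncing between lo and hi; returns (pos, vel) pairs."""
    traj = []
    for _ in range(steps):
        traj.append((pos, vel))
        if not lo <= pos + vel <= hi:
            vel = -vel               # bounce off the wall
        pos += vel
    return traj

def successors(states):
    """Map each state to the set of states observed right after it."""
    nxt = defaultdict(set)
    for s, s_next in zip(states, states[1:]):
        nxt[s].add(s_next)
    return nxt

def looks_markov(states):
    # Deterministic case only: the current state alone must fix the next state.
    return all(len(v) == 1 for v in successors(states).values())

traj = simulate(20)
positions = [p for p, _ in traj]
print(looks_markov(positions))  # False: position 2 can be followed by 1 or 3
print(looks_markov(traj))       # True: (position, velocity) fixes what comes next
```

Position alone needs earlier history to predict the future, while position plus velocity does not.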
| Environment Types | Description |
|---|---|
| Fully Observable Environment State | |
| Partially Observable Environment State | |
RL Agent Components
| component | description |
|---|---|
| policy | a function that tells what action the agent should take in a given state |
| value function | a function that tells how good each state and/or action is |
| model | the agent’s representation of the environment |
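The three components can be sketched as plain functions for a tiny made-up world. The line world, the reward of -1 per move, and the names `policy`, `model`, and `value` are all assumptions. Computing the value by rolling the model forward is just one simple way to do it here.

```python
# Tiny hypothetical world: states 0..3 on a line, state 3 is the goal.
GOAL = 3
STATES = [0, 1, 2, 3]

def policy(state):
    """Policy: maps a state to an action (here: always move right)."""
    return "right"

def model(state, action):
    """Model: the agent's prediction of (next_state, reward)."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return next_state, -1            # every move costs 1

def value(state):
    """Value function: total reward from `state` when following `policy`,
    computed by rolling the model forward until the goal is reached."""
    total = 0
    while state != GOAL:
        state, reward = model(state, policy(state))
        total += reward
    return total

print([value(s) for s in STATES])  # [-3, -2, -1, 0]
```

States closer to the goal get higher values, which is what "how good each state is" means in this toy setting.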
RL Agent - Types
containing value function and/or policy function:
- value-based - an agent that stores the value function (the policy is implicit: it is read off the value function)
- policy-based - an agent that stores the policy (no value function)
- actor-critic - an agent that stores both the policy and the value function
containing a model of the environment:
- model-free - policy and/or value function, no model
- model-based - policy and/or value function, plus a model
[Figure: reinforcement learning agent types]
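For a value-based agent, the implicit policy can be read off a stored action-value table by picking the best-valued action. The table `q`, its numbers, and the name `greedy_action` are hypothetical.

```python
# Hypothetical action-value table: q[state][action] = estimated value.
q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_action(q, state):
    """Implicit policy: pick the action with the highest stored value."""
    return max(q[state], key=q[state].get)

print(greedy_action(q, "s0"))  # right
print(greedy_action(q, "s1"))  # left
```

No separate policy is stored; changing the values in `q` changes the behaviour.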
RL - Dichotomies
| Dichotomy | Description |
|---|---|
| Reinforcement Learning | reinforcement learning, planning |
| Exploration | |
| Prediction | In RL, you solve the prediction problem in order to solve the control problem |
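How prediction feeds control can be sketched on a tiny deterministic world by alternating policy evaluation with greedy improvement. The line world, the reward of -1 per move, the discount `gamma=0.9`, and the helper names `step`, `backup`, `evaluate`, and `improve` are assumptions for illustration, not something the notes above specify.

```python
GOAL = 3
ACTIONS = ["left", "right"]

def step(state, action):
    """Hypothetical deterministic model: returns (next_state, reward)."""
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return nxt, -1

def backup(state, action, v, gamma):
    """One-step lookahead value of taking `action` in `state`."""
    nxt, reward = step(state, action)
    return reward + gamma * v[nxt]

def evaluate(policy, gamma=0.9, sweeps=100):
    """Prediction: estimate the value of each state under a fixed policy."""
    v = {s: 0.0 for s in range(GOAL + 1)}
    for _ in range(sweeps):
        for s in range(GOAL):        # the goal is terminal, its value stays 0
            v[s] = backup(s, policy[s], v, gamma)
    return v

def improve(v, gamma=0.9):
    """Control: act greedily with respect to the predicted values."""
    return {s: max(ACTIONS, key=lambda a: backup(s, a, v, gamma))
            for s in range(GOAL)}

policy = {s: "left" for s in range(GOAL)}  # deliberately poor starting policy
for _ in range(5):                         # alternate prediction and control
    v = evaluate(policy)
    policy = improve(v)

print(policy)  # {0: 'right', 1: 'right', 2: 'right'}
```

Each round first answers "how good is the current policy?" (prediction) and then uses that answer to pick better actions (control).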
RL - Other
- AlphaGo Fan/Lee/Master/Zero
- Contrastive Reinforcement Learning (CRL)
- Deep Q Networks (DQN)
- Multi/K-Armed Bandit Problem
- Policy Gradient Methods
- Proximal Policy Optimization (PPO)
- Q-Function
- Q-Learning
- Reinforcement Learning from Human Feedback (RLHF)
- RL - Applications
- RL - Example (Tic-Tac-Toe)
- RL - Human Priors for Playing Video Games
- RL Chapters
- Selective Bootstrap Adaptation
RL - Resources
- Reinforcement Learning: An Introduction (2017) ~ Richard S. Sutton and Andrew G. Barto
- Reinforcement Learning: An Introduction (2018) 2nd Edition ~ Richard S. Sutton and Andrew G. Barto
- Hado Van Hasselt - YouTube Lectures
- David Silver - YouTube Lectures
- Deep Reinforcement Learning: Pong from Pixels