Reinforcement Learning (RL)
  • is the science of decision making
  • is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward
  • almost all RL problems can be formulated as a Markov Decision Process (MDP)

RL - Learning Paradigms

  • there is no supervisor, only a scalar reward signal
  • feedback may not be instantaneous (i.e. delayed)
  • time matters (data is sequential, not i.i.d.)
  • the agent’s actions affect the subsequent data it receives

RL - Components

rewards

  • a reward 𝑅_𝑡 is a scalar feedback signal
  • indicates how well the agent is doing at time-step 𝑡
  • the agent’s job is to maximize cumulative reward

RL is based on the reward hypothesis - all goals can be described by the maximization of expected cumulative reward

sequential decision making

  • the goal is to select actions to maximize total future reward
  • actions may have long-term consequences
  • the reward may not be instantaneous (i.e. delayed)
  • it may be better to sacrifice immediate reward to gain more long-term reward

at each time-step 𝑡 the agent:

  • receives reward 𝑅_𝑡
  • receives observation 𝑂_𝑡
  • executes an action 𝐴_𝑡
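
The per-step loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `CoinEnv`, `RandomAgent`, and `run_episode` are hypothetical names, and the coin-guessing task is invented for the example. Each step stores a (reward, observation, action) triple.

```python
import random


class CoinEnv:
    """Hypothetical toy environment: the agent guesses a hidden coin side."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.choice([0, 1])

    def step(self, action):
        # Reward 1.0 for a correct guess, 0.0 otherwise.
        reward = 1.0 if action == self.coin else 0.0
        # The observation reveals the coin that was just guessed at,
        # then the environment flips a new one.
        observation = self.coin
        self.coin = self.rng.choice([0, 1])
        return reward, observation


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, reward, observation):
        return self.rng.choice([0, 1])


def run_episode(env, agent, steps):
    """At each step the agent receives (R_t, O_t) and executes A_t."""
    history = []
    reward, observation = 0.0, None  # nothing received before the first step
    total_reward = 0.0
    for _ in range(steps):
        action = agent.act(reward, observation)
        history.append((reward, observation, action))
        reward, observation = env.step(action)
        total_reward += reward
    return total_reward, history
```

Calling `run_episode(CoinEnv(), RandomAgent(), 10)` returns the cumulative reward and the recorded triples.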

history

  • is the sequence of rewards, observations, and actions from time-step 1 to 𝑡
  • 𝐻_{1:𝑡} = [𝑅_1, 𝑂_1, 𝐴_1, 𝑅_2, 𝑂_2, 𝐴_2, …, 𝑅_𝑡, 𝑂_𝑡, 𝐴_𝑡]

what happens next depends on the history

  • agent selects actions
  • the environment selects observations & rewards

state

  • is the summary of history used to determine what happens next
  • is a function of history:
    • 𝑆_𝑡 = 𝑓(𝐻_{1:𝑡})

2 state types:

  • environment state 𝑆_𝑡^𝑒
    • is the environment’s internal state representation
    • is whatever data the environment uses to pick the next observation & reward
    • has the Markov property
  • agent state 𝑆_𝑡^𝑎
    • is the agent’s internal state representation
    • is whatever data the agent uses to pick the next action
    • is the information used by RL algorithms
    • it can be any function of history
      • 𝑆_𝑡^𝑎 = 𝑓(𝐻_{1:𝑡})
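
To make "any function of history" concrete, here is a small sketch of two possible agent-state functions. Both functions and the sample history are hypothetical, and the history is assumed to be a list of (reward, observation, action) triples.

```python
def last_k_observations(history, k=2):
    """Agent state = the k most recent observations in the history."""
    observations = [obs for (_reward, obs, _action) in history]
    return tuple(observations[-k:])


def running_reward_sum(history):
    """Agent state = the total reward received so far."""
    return sum(reward for (reward, _obs, _action) in history)


# Hypothetical history of three time-steps.
history = [(0.0, "A", "left"), (1.0, "B", "right"), (0.0, "C", "left")]
```

Both are valid agent states in the sense above, but they keep very different information: the first forgets rewards, the second forgets observations.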

state with Markov property

  • a state 𝑆_𝑡 has the Markov property iff: 𝐏(𝑆_{𝑡+1} | 𝑆_1, …, 𝑆_𝑡) = 𝐏(𝑆_{𝑡+1} | 𝑆_𝑡)
  • the entire history from time 1 to 𝑡 (i.e. 𝐻_{1:𝑡}) trivially has the Markov property, since it discards nothing

information/Markov state:

  • has the Markov property
  • contains all useful information from the history 𝐻_{1:𝑡}
  • once the information state is known, the history is no longer needed
  • is a sufficient statistic that can be used in determining the future

Environment Types

Fully Observable Environment State

  • agent directly observes environment state:
    • 𝑂_𝑡 = 𝑆_𝑡^𝑎 = 𝑆_𝑡^𝑒
    • agent state = environment state
  • formally this is a Markov Decision Process (MDP)

Partially Observable Environment State

  • agent indirectly observes environment state
    • 𝑆_𝑡^𝑎 ≠ 𝑆_𝑡^𝑒
    • agent state ≠ environment state
  • formally this is a Partially Observable Markov Decision Process (POMDP)
  • agent must construct its own state representation 𝑆_𝑡^𝑎, such as:
    • complete history: 𝑆_𝑡^𝑎 = 𝐻_{1:𝑡}
    • beliefs of environment state: 𝑆_𝑡^𝑎 = (𝐏(𝑆_𝑡^𝑒=𝑠_1), …, 𝐏(𝑆_𝑡^𝑒=𝑠_𝑛))
    • recurrent neural network: 𝑆_𝑡^𝑎 = 𝜎(𝑆_{𝑡-1}^𝑎 𝑊_𝑠 + 𝑂_𝑡 𝑊_𝑜), i.e. a nonlinearity 𝜎 applied to a linear combination of the agent state at the previous time-step and the current observation
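
The recurrent update can be sketched in plain Python. This is an illustration only: the weight values, the 2-dimensional state, and the 1-dimensional observation are arbitrary assumptions, and 𝜎 is taken to be the logistic sigmoid.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, m):
    """Row vector v (length n) times matrix m (n x k), giving length k."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def rnn_state_update(prev_state, observation, w_s, w_o):
    """Next agent state = sigmoid(prev_state @ w_s + observation @ w_o)."""
    from_state = vec_mat(prev_state, w_s)
    from_obs = vec_mat(observation, w_o)
    return [sigmoid(a + b) for a, b in zip(from_state, from_obs)]


# Hypothetical weights: 2-dimensional state, 1-dimensional observation.
w_s = [[0.5, 0.0], [0.0, 0.5]]
w_o = [[1.0, -1.0]]

state = [0.0, 0.0]
for obs in ([1.0], [0.0], [1.0]):
    state = rnn_state_update(state, obs, w_s, w_o)
```

In practice the weights would be learned rather than fixed by hand. What matters here is that the new state depends only on the previous state and the current observation, so the agent never stores the full history.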

RL Agent Components

policy

a function that tells what action the agent should take in a given state

  • is a map from state to action
  • policy types:
    • deterministic policy: 𝑎 = 𝜋(𝑠)
    • stochastic policy: 𝜋(𝑎|𝑠) = 𝐏(𝐴=𝑎|𝑆=𝑠)
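
The two policy types can be sketched as follows. The state and action names and the probabilities are invented for illustration.

```python
import random

# Hypothetical action probabilities pi(a|s) for a two-state problem.
STOCHASTIC_TABLE = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}


def deterministic_policy(state):
    """a = pi(s): every state maps to exactly one action."""
    table = {"low_battery": "recharge", "high_battery": "search"}
    return table[state]


def stochastic_policy(state, rng):
    """Sample an action a with probability pi(a|s)."""
    actions = list(STOCHASTIC_TABLE[state])
    weights = [STOCHASTIC_TABLE[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action per state.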

value function

a function that tells how good each state and/or action is

  • is a prediction of future reward
  • used to evaluate the goodness/badness of states, and therefore used to select between actions
  • 𝑉^𝜋(𝑠) = 𝐄_𝜋[ 𝛾^0 𝑅_{𝑡+0} + 𝛾^1 𝑅_{𝑡+1} + 𝛾^2 𝑅_{𝑡+2} + … | 𝑆_𝑡=𝑠 ], where 𝛾 ∈ [0, 1] is the discount factor
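
The quantity inside the expectation is a discounted sum of rewards. A minimal sketch, using the same indexing as the formula (the sum starts at the current reward) and hypothetical function names, computes that sum for one reward sequence and averages it over sampled episodes as a simple Monte Carlo estimate of the value.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one sequence."""
    g = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Estimate a state's value as the mean discounted return of episodes
    that all start from that state (assumed to follow the same policy)."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

For example, with rewards [1, 1, 1] and 𝛾 = 0.5 the return is 1 + 0.5 + 0.25 = 1.75.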

model

the agent’s representation of the environment

  • optional: there are model-free agents
  • a model predicts what the environment will do next
  • transitions 𝑇 predict the next state
    • 𝑇_{𝑠𝑠’}^𝑎 = 𝐏(𝑆’=𝑠’|𝑆=𝑠,𝐴=𝑎)
  • rewards 𝑅 predict the next (immediate) reward
    • 𝑅_𝑠^𝑎 = 𝐄[𝑅|𝑆=𝑠,𝐴=𝑎]
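
For a small discrete problem, such a model can be stored as two lookup tables. The sketch below uses an invented two-state, two-action MDP, and the dictionary layout is just one possible choice.

```python
import random

# T[s][a] maps each next state s' to P(S'=s' | S=s, A=a).
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# R[s][a] is the expected immediate reward E[R | S=s, A=a].
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}


def sample_next_state(state, action, rng):
    """Draw a next state from the model's transition distribution."""
    dist = T[state][action]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def expected_reward(state, action):
    """Look up the model's predicted immediate reward."""
    return R[state][action]
```

A model-based agent can query these tables instead of acting in the real environment.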

RL Agent - Types

containing value function and/or policy function:

  • value-based - an agent that stores the value function (the policy is implicit: it is read out from the value function)
  • policy-based - an agent that stores the policy (no value function)
  • actor-critic - an agent that stores both the policy and the value function

containing a model of the environment:

  • model-free - policy and/or value function, but no model of the environment
  • model-based - policy and/or value function, plus a model of the environment

RL - Dichotomies

Reinforcement Learning vs Planning

reinforcement learning

  • the environment is initially unknown
  • the agent interacts with the environment
  • the agent improves its policy

planning

  • the model of the environment is known
  • the agent performs computations with the model (without any external interaction)
  • the agent improves its policy
  • aka: reasoning and search
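
As one example of planning with a known model, the sketch below runs value iteration on an invented two-state MDP. Value iteration is a technique chosen for this illustration, not one named in these notes, and the tables `T` and `R`, the discount 0.9, and the function names are all assumptions.

```python
# Known model: T[s][a] maps next state -> probability, R[s][a] is the
# expected immediate reward. Both tables are hypothetical.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}


def q_value(s, a, V, gamma):
    """One-step lookahead: immediate reward plus discounted next value."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Sweep Bellman optimality backups until values stop changing."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            best = max(q_value(s, a, V, gamma) for a in T[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(T, R, V, gamma=0.9):
    """Pick, in each state, the action with the best one-step lookahead."""
    return {s: max(T[s], key=lambda a: q_value(s, a, V, gamma)) for s in T}
```

No environment interaction happens here: the policy is improved purely by computing with `T` and `R`.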

Exploration vs Exploitation

  • exploration - gathers more information about the environment, possibly giving up some immediate reward
  • exploitation - exploits known information about the environment to maximize reward
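
One common way to balance the two is an epsilon-greedy rule, sketched here on a hypothetical multi-armed bandit. Epsilon-greedy is an illustrative choice rather than something these notes prescribe, and the arm means, noise level, and function names are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random arm
    # Exploit: arm with the highest current estimate (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Estimate each arm's mean reward while acting epsilon-greedily."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]  # incremental mean
    return q, counts
```

With `epsilon=0` the agent only exploits and can lock onto a poor arm. With `epsilon=1` it only explores and never uses what it has learned.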

Prediction vs Control

  • prediction - given a policy, evaluate the future (i.e. estimate the expected future reward under that policy)
  • control - find the best policy, i.e. the one that maximizes future reward

In RL, you typically solve the prediction problem in order to solve the control problem
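
Policy iteration is one way to see how prediction feeds control: evaluate the current policy, improve it greedily, and repeat. The sketch below assumes a known, invented two-state MDP and hypothetical function names; it illustrates the idea rather than being the only way to do control.

```python
# Hypothetical model: T[s][a] maps next state -> probability,
# R[s][a] is the expected immediate reward.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}


def lookahead(s, a, V, gamma):
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


def evaluate_policy(policy, T, R, gamma=0.9, tol=1e-8):
    """Prediction: estimate the value of each state under a fixed policy."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            v = lookahead(s, policy[s], V, gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def improve_policy(V, T, R, gamma=0.9):
    """Control step: act greedily with respect to the evaluated values."""
    return {s: max(T[s], key=lambda a: lookahead(s, a, V, gamma)) for s in T}


def policy_iteration(T, R, gamma=0.9):
    policy = {s: next(iter(T[s])) for s in T}  # start with the first action
    while True:
        V = evaluate_policy(policy, T, R, gamma)
        new_policy = improve_policy(V, T, R, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

Each round solves a prediction problem (`evaluate_policy`) and uses the answer to take one control step (`improve_policy`).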

RL - Other

RL - Resources