Value Function

$v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$
$v_{π} (s) = E_{π} [R_{t + 1} + γ G_{t + 1} ∣ S_{t} = s]$
$v_{π} (s) = \sum_{a} π (a ∣ s) \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ E_{π} [G_{t + 1} ∣ S_{t + 1} = s^{'}]]$
$v_{π} (s) = \sum_{a} π (a ∣ s) \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{π} (s^{'})]$
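The last form of the equation can be turned into an update rule and applied repeatedly until the values stop changing. The following is a minimal sketch of that idea, not a reference implementation. The table layout `P[s][a]` (a list of `(prob, next_state, reward)` triples), the policy table `pi[s][a]`, the tolerance `theta`, and the two-state toy MDP are all assumptions made for illustration.

```python
def policy_evaluation(P, pi, gamma, theta=1e-10):
    """Approximate v_pi by sweeping the Bellman expectation update in place.

    P[s][a]  : list of (prob, next_state, reward) triples (hypothetical layout)
    pi[s][a] : probability of taking action a in state s
    """
    v = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
            new_v = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


# Toy MDP (an assumption, not from the notes):
# state 0: action 0 stays in 0 with reward 0, action 1 moves to 1 with reward 1
# state 1: both actions stay in 1 with reward 2
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy
v = policy_evaluation(P, pi, gamma=0.5)
print(v)
```

For this toy MDP the values settle near v(0) = 2 and v(1) = 4, which can be checked by hand from the equation above.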

Action-Value Function

$q_{π} (s, a) = E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a]$
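Given $v_{π}$, the action values follow from a one-step lookahead, $q_{π} (s, a) = \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{π} (s^{'})]$. A small sketch of that lookahead is shown here; the table layout `P[s][a]`, the toy MDP, and the hand-computed values in `v` are assumptions for illustration only.

```python
def q_from_v(P, v, gamma):
    """One-step lookahead: q(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    return [
        [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
         for a in range(len(P[s]))]
        for s in range(len(P))
    ]


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
# v_pi for the uniform random policy on this MDP with gamma = 0.5,
# worked out by hand (assumed input, not computed here).
v = [2.0, 4.0]
q = q_from_v(P, v, gamma=0.5)
print(q)
```

Averaging each row of `q` under the policy recovers `v`, which is a quick consistency check between the two functions.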

Optimal Value Function

$v_{*} (s) = \max_{a} q_{π_{*}} (s, a)$
$v_{*} (s) = \max_{a} E_{π_{*}} [G_{t} ∣ S_{t} = s, A_{t} = a]$
$v_{*} (s) = \max_{a} E_{π_{*}} [R_{t + 1} + γ G_{t + 1} ∣ S_{t} = s, A_{t} = a]$
$v_{*} (s) = \max_{a} E [R_{t + 1} + γ v_{*} (S_{t + 1}) ∣ S_{t} = s, A_{t} = a]$
$v_{*} (s) = \max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{*} (s^{'})]$

Optimal Action-Value Function

$q_{*} (s, a) = E [R_{t + 1} + γ \max_{a^{'}} q_{*} (S_{t + 1}, a^{'}) ∣ S_{t} = s, A_{t} = a]$
$q_{*} (s, a) = \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ \max_{a^{'}} q_{*} (s^{'}, a^{'})]$
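The second equation can also be iterated directly on a table of action values. This is a rough sketch under assumed inputs: the layout `P[s][a]`, the fixed number of `sweeps`, and the toy MDP are illustrative choices, not something the notes prescribe.

```python
def q_value_iteration(P, gamma, sweeps=100):
    """Repeatedly apply q(s,a) <- sum_{s',r} p(s',r|s,a) [r + gamma max_a' q(s',a')]."""
    q = [[0.0 for _ in P[s]] for s in range(len(P))]
    for _ in range(sweeps):
        # Build the new table from the old one (synchronous update).
        q = [
            [sum(p * (r + gamma * max(q[s2])) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
            for s in range(len(P))
        ]
    return q


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
q_star = q_value_iteration(P, gamma=0.5)
print(q_star)
```

On this toy problem the table approaches q(0, 0) = 1.5, q(0, 1) = 3 and q(1, a) = 4, so taking the row-wise maximum gives the optimal state values.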

Policy Evaluation

TODO

Finding New Greedy Policy

$π^{'} (s) = \arg\max_{a} q_{π} (s, a)$
$π^{'} (s) = \arg\max_{a} E [R_{t + 1} + γ v_{π} (S_{t + 1}) ∣ S_{t} = s, A_{t} = a]$
$π^{'} (s) = \arg\max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{π} (s^{'})]$
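A greedy improvement step along these lines might look like the sketch below. The table layout `P[s][a]`, the toy MDP, and the hand-supplied values `v` are assumptions. Ties are broken in favour of the lowest action index, which is just one possible convention.

```python
def greedy_policy(P, v, gamma):
    """Return pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    # max() returns the first maximiser, so ties go to the lowest action index.
    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
# Values of the uniform random policy with gamma = 0.5 (assumed, worked out by hand).
v = [2.0, 4.0]
new_policy = greedy_policy(P, v, gamma=0.5)
print(new_policy)
```

In state 0 the lookahead values are 1 for action 0 and 3 for action 1, so the greedy policy switches to action 1 there.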

Policy Iteration

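$π_{0} \xrightarrow{E} v_{π_{0}} \xrightarrow{I} π_{1} \xrightarrow{E} v_{π_{1}} \xrightarrow{I} π_{2} \xrightarrow{E} \cdots \xrightarrow{I} π_{*} \xrightarrow{E} v_{*}$

Here $E$ denotes a policy evaluation step and $I$ denotes a policy improvement step.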
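The alternation in this chain can be sketched as a loop that evaluates the current policy, improves it greedily, and stops once the policy no longer changes. The helper names `evaluate` and `improve`, the table layout `P[s][a]`, the tolerance `theta`, and the toy MDP are assumptions for illustration.

```python
def evaluate(P, policy, gamma, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy (list of actions)."""
    v = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def improve(P, v, gamma):
    """Greedy policy with respect to v; ties go to the lowest action index."""
    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]


def policy_iteration(P, gamma):
    policy = [0] * len(P)  # arbitrary starting policy
    while True:
        v = evaluate(P, policy, gamma)         # E step
        new_policy = improve(P, v, gamma)      # I step
        if new_policy == policy:               # policy is stable: stop
            return policy, v
        policy = new_policy


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
policy, v = policy_iteration(P, gamma=0.5)
print(policy, v)
```

On this toy MDP the loop needs one improvement to move state 0 from action 0 to action 1, and a second pass to confirm the policy is stable.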

Value Iteration

Value iteration is policy iteration in which policy evaluation is stopped after just one sweep (one update of each state):
$v_{k + 1} (s) = \max_{a} E [R_{t + 1} + γ v_{k} (S_{t + 1}) ∣ S_{t} = s, A_{t} = a]$
$v_{k + 1} (s) = \max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{k} (s^{'})]$
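A direct transcription of this update, computing $v_{k + 1}$ for every state from $v_{k}$, could look like the following sketch. The table layout `P[s][a]`, the stopping tolerance `theta`, and the toy MDP are assumed for the example.

```python
def value_iteration(P, gamma, theta=1e-10):
    """Apply v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')] until stable."""
    v = [0.0] * len(P)
    while True:
        # Build v_{k+1} entirely from v_k (one full sweep).
        new_v = [
            max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < theta:
            return v


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
v_star = value_iteration(P, gamma=0.5)
print(v_star)
```

The result approaches v(0) = 3 and v(1) = 4 for this MDP. A greedy policy read off from these values would then be optimal for the toy problem.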

Asynchronous Dynamic Programming

Asynchronous DP algorithms are in-place iterative DP algorithms that do not sweep systematically through the entire state set. Examples include:

Updating the value of only one state at each step, using the value iteration update.
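The single-state variant can be sketched as a function that takes an arbitrary sequence of states and applies the value iteration update to one state at a time, in place. The `schedule` argument, the table layout `P[s][a]`, and the toy MDP are assumptions; in practice the order could come from any selection rule.

```python
def async_value_updates(P, gamma, schedule, v=None):
    """Apply the value iteration update in place, one state per step.

    schedule : iterable of state indices, in whatever order they should be updated
    """
    v = [0.0] * len(P) if v is None else list(v)
    for s in schedule:
        # Only v[s] changes; every other entry keeps its latest value.
        v[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
    return v


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
# Update state 1 forty times, then state 0 once: no full sweeps are performed.
schedule = [1] * 40 + [0]
v = async_value_updates(P, 0.5, schedule)
print(v)
```

Here a single late update of state 0 is enough, because by then the value of state 1 it depends on is already accurate. Choosing the order to exploit this kind of structure is the appeal of asynchronous updates.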

Generalized Policy Iteration (GPI)

Policy iteration alternates between two processes:

policy evaluation (PE)
policy improvement (PI)

GPI refers to the general idea of letting PE and PI interact, regardless of how finely the two processes are interleaved.

GPI is the family of methods that includes policy iteration, value iteration, and asynchronous dynamic programming.
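One way to picture the interaction is a loop in which evaluation is cut off after a fixed number of sweeps before each improvement step. This is only a sketch of one point in the GPI family. The function name `gpi`, the `eval_sweeps` and `iterations` parameters, the table layout `P[s][a]`, and the toy MDP are all hypothetical.

```python
def backup(P, v, gamma, s, a):
    """Expected one-step return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def gpi(P, gamma, eval_sweeps, iterations):
    """Interleave truncated policy evaluation (PE) with greedy improvement (PI)."""
    v = [0.0] * len(P)
    policy = [0] * len(P)
    for _ in range(iterations):
        # PE: a limited number of in-place sweeps under the current policy.
        for _ in range(eval_sweeps):
            for s in range(len(P)):
                v[s] = backup(P, v, gamma, s, policy[s])
        # PI: make the policy greedy with respect to the current values.
        policy = [
            max(range(len(P[s])), key=lambda a: backup(P, v, gamma, s, a))
            for s in range(len(P))
        ]
    return policy, v


# Hypothetical toy MDP: P[s][a] is a list of (prob, next_state, reward).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 1, 2.0)]],
]
policy, v = gpi(P, gamma=0.5, eval_sweeps=3, iterations=50)
print(policy, v)
```

A small `eval_sweeps` makes the loop behave more like value iteration, while a large one makes it behave more like full policy iteration. On this toy MDP both extremes end at the same policy and values.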

／var／log marcus chiu

RL Chapter 4 (Dynamic Programming)
