parametric approximation of q̂(s,a,w) ≈ q*(s,a)
Episodic Semi-Gradient Control
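The general gradient-descent update for action-value prediction is:
w_{t+1} = w_t + α [U_t - q̂(S_t,A_t,w_t)] ∇q̂(S_t,A_t,w_t)
where the update target U_t can be any approximation of q_π(S_t,A_t).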
Episodic Semi-Gradient One-Step SARSA
One-Step SARSA update:
w_{t+1} = w_t + α [R_{t+1} + γ q̂(S_{t+1},A_{t+1},w_t) - q̂(S_t,A_t,w_t)] ∇q̂(S_t,A_t,w_t)
Episodic Semi-Gradient SARSA for Estimating q̂ ≈ q*:
Input: a differentiable action-value function parameterization q̂ : SxAxℝ^d -> ℝ
Algorithm parameters: step size α>0, small ε>0
Initialize value-function weights w∈ℝ^d arbitrarily (e.g., w=0)
Loop for each episode:
    S, A = initial state and action of episode (e.g., ε-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        If S' is terminal:
            w = w + α [R - q̂(S,A,w)] ∇q̂(S,A,w)
            go to next episode
        Choose A' as a function of q̂(S',·,w) (e.g., ε-greedy)
        w = w + α [R + γ q̂(S',A',w) - q̂(S,A,w)] ∇q̂(S,A,w)
        S = S'
        A = A'
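The one-step algorithm above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the corridor task (`N_STATES`, `step`), the one-hot `features`, and every function name are hypothetical and chosen only for illustration. With a linear q̂ the gradient ∇q̂(S,A,w) is simply the feature vector x(S,A).

```python
import random

N_STATES = 6      # hypothetical corridor: cells 0..5, cell 5 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def features(s, a):
    """One-hot feature vector x(s, a); for linear q̂ this is also the gradient."""
    x = [0.0] * (N_STATES * len(ACTIONS))
    x[s * len(ACTIONS) + a] = 1.0
    return x


def q_hat(s, a, w):
    """Linear action-value estimate q̂(s, a, w) = w · x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))


def epsilon_greedy(s, w, eps, rng):
    """Random action with probability eps, otherwise greedy with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    values = [q_hat(s, a, w) for a in ACTIONS]
    best = max(values)
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])


def step(s, a):
    """Assumed dynamics: reward -1 per step; returns (reward, next_state, done)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next, s_next == N_STATES - 1


def semi_gradient_sarsa(episodes=200, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, w, eps, rng)
        while True:
            r, s_next, done = step(s, a)
            grad = features(s, a)
            if done:
                target = r  # terminal: no bootstrap term
            else:
                a_next = epsilon_greedy(s_next, w, eps, rng)
                target = r + gamma * q_hat(s_next, a_next, w)
            delta = target - q_hat(s, a, w)
            w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
            if done:
                break
            s, a = s_next, a_next
    return w
```

Calling `semi_gradient_sarsa()` returns the learned weight list. In this toy task every episode ends with the move from cell 4 into the terminal cell, whose target is always -1, so the weight for (4, right) approaches -1.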
Semi-Gradient n-step SARSA
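n-step return:
G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n q̂(S_{t+n},A_{t+n},w_{t+n-1}), for t + n < T
G_{t:t+n} = G_t, for t + n >= T
the n-step update equation is:
w_{t+n} = w_{t+n-1} + α [G_{t:t+n} - q̂(S_t,A_t,w_{t+n-1})] ∇q̂(S_t,A_t,w_{t+n-1}), for 0 <= t < T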
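As a small worked illustration of the n-step return and the weight update it feeds, the sketch below assumes a linear q̂, so the gradient is the feature vector x. The function names and the numbers in the usage note are hypothetical and only meant to make the arithmetic concrete.

```python
def n_step_return(rewards, gamma, bootstrap_value=None):
    """Discounted sum of rewards [R_{t+1}, ..., R_{t+n}] plus an optional bootstrap.

    bootstrap_value stands for q̂(S_{t+n}, A_{t+n}, w); pass None when the
    episode ended before t + n, in which case the return is not bootstrapped.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    if bootstrap_value is not None:
        g += (gamma ** len(rewards)) * bootstrap_value
    return g


def n_step_update(w, x, g, alpha):
    """Semi-gradient step for linear q̂: w + alpha * (G - w·x) * x."""
    q = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + alpha * (g - q) * xi for wi, xi in zip(w, x)]
```

For example, `n_step_return([1.0, 0.0, 2.0], gamma=0.5, bootstrap_value=4.0)` evaluates 1 + 0.5·0 + 0.25·2 + 0.125·4 = 2.0, and feeding that return into `n_step_update([0.0, 0.0], [1.0, 0.0], 2.0, alpha=0.5)` moves the active weight halfway to the target, giving `[1.0, 0.0]`.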
Episodic semi-gradient n-step SARSA for estimating q̂ ≈ q* or q_π:
Input: a differentiable action-value function parameterization q̂ : SxAxℝ^d -> ℝ
Input: a policy π (if estimating q_π)
Algorithm parameters: step size α>0, small ε>0, positive integer n
Initialize value-function weights w∈ℝ^d arbitrarily (e.g., w=0)
All store and access operations (S_t, A_t, and R_t) can take their index mod n + 1
Loop for each episode:
    Initialize and store S_0 != terminal
    Select and store an action A_0 ~ π(·|S_0) or ε-greedy wrt q̂(S_0,·,w)
    T = infinity
    Loop for t = 0, 1, 2, ...:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T = t + 1
            else:
                Select and store A_{t+1} ~ π(·|S_{t+1}) or ε-greedy wrt q̂(S_{t+1},·,w)
        τ = t - n + 1 (τ is the time whose estimate is being updated)
        If τ >= 0:
            G = \sum_{i=τ+1}^{min(τ+n,T)} γ^{i-τ-1} R_i
            If τ + n < T:
                G = G + γ^n q̂(S_{τ+n},A_{τ+n},w)
            w = w + α [G - q̂(S_τ,A_τ,w)] ∇q̂(S_τ,A_τ,w)
    until τ = T - 1
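The n-step algorithm above can be sketched in Python along the same lines. This is a hedged illustration, not a definitive implementation: the corridor task, the one-hot feature index `idx`, and all names are assumptions made for the example. For readability it stores the whole episode in lists instead of using the mod n + 1 circular indexing that the pseudocode permits.

```python
import random

N_STATES, ACTIONS = 6, (0, 1)  # hypothetical corridor: cell 5 terminal; 0 = left, 1 = right


def idx(s, a):
    """Index of the active one-hot feature, so q̂(s, a, w) = w[idx(s, a)]."""
    return s * len(ACTIONS) + a


def epsilon_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(w[idx(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if w[idx(s, a)] == best])


def step(s, a):
    """Assumed dynamics: reward -1 per step; returns (reward, next_state, done)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next, s_next == N_STATES - 1


def n_step_sarsa(n=4, episodes=300, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        states = [0]
        actions = [epsilon_greedy(0, w, eps, rng)]
        rewards = [0.0]  # placeholder for R_0 so that rewards[i] is R_i
        T = float("inf")
        t = 0
        while True:
            if t < T:
                r, s_next, done = step(states[t], actions[t])
                rewards.append(r)
                states.append(s_next)
                if done:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(s_next, w, eps, rng))
            tau = t - n + 1  # time whose estimate is being updated
            if tau >= 0:
                last = min(tau + n, T)
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, last + 1))
                if tau + n < T:
                    G += gamma ** n * w[idx(states[tau + n], actions[tau + n])]
                k = idx(states[tau], actions[tau])
                w[k] += alpha * (G - w[k])  # one-hot gradient touches only w[k]
            if tau == T - 1:
                break
            t += 1
    return w
```

As in the one-step case, the final transition of every episode in this toy task is (4, right) with return -1, so `n_step_sarsa()[idx(4, 1)]` approaches -1 regardless of n.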
Average Reward: A New Problem Setting for Continuing Tasks
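Differential one-step SARSA update:
δ_t = R_{t+1} - \overline{R}_t + q̂(S_{t+1},A_{t+1},w_t) - q̂(S_t,A_t,w_t)
w_{t+1} = w_t + α δ_t ∇q̂(S_t,A_t,w_t)
Differential semi-gradient SARSA for estimating q̂ ≈ q*: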
Input: a differentiable action-value function parameterization q̂ : SxAxℝ^d -> ℝ
Algorithm parameters: step sizes α,β > 0
Initialize value-function weights w∈ℝ^d arbitrarily (e.g., w = 0)
Initialize average reward estimate \overline{R}∈ℝ arbitrarily (e.g., \overline{R} = 0)
Initialize state S, and action A
Loop for each step:
    Take action A, observe R, S'
    Choose A' as a function of q̂(S',·,w) (e.g., ε-greedy)
    δ = R - \overline{R} + q̂(S',A',w) - q̂(S,A,w)
    \overline{R} = \overline{R} + βδ
    w = w + αδ ∇q̂(S,A,w)
    S = S'
    A = A'
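The differential loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the four-cell ring task (`step`), the one-hot feature index `idx`, and the helper `differential_update` are hypothetical names introduced for illustration. With one-hot features, the semi-gradient step changes only the weight of the visited state-action pair.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)  # hypothetical ring task: 0 = stay, 1 = move clockwise


def idx(s, a):
    """Index of the active one-hot feature, so q̂(s, a, w) = w[idx(s, a)]."""
    return s * len(ACTIONS) + a


def step(s, a):
    """Assumed continuing dynamics: moving from the last cell back to cell 0 pays +1."""
    if a == 0:
        return 0.0, s
    s_next = (s + 1) % N_STATES
    return (1.0 if s_next == 0 else 0.0), s_next


def epsilon_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(w[idx(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if w[idx(s, a)] == best])


def differential_update(w, r_bar, k, k_next, r, alpha, beta):
    """One differential SARSA update; mutates w and returns (new r_bar, delta)."""
    delta = r - r_bar + w[k_next] - w[k]  # differential TD error
    r_bar += beta * delta                 # average-reward estimate
    w[k] += alpha * delta                 # one-hot gradient touches only w[k]
    return r_bar, delta


def differential_sarsa(steps=5000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    r_bar = 0.0
    s = 0
    a = epsilon_greedy(s, w, eps, rng)
    for _ in range(steps):
        r, s_next = step(s, a)
        a_next = epsilon_greedy(s_next, w, eps, rng)
        r_bar, _ = differential_update(
            w, r_bar, idx(s, a), idx(s_next, a_next), r, alpha, beta)
        s, a = s_next, a_next
    return w, r_bar
```

As a hand check of a single update: with `w = [0.5, -0.5]`, `r_bar = 0.2`, `r = 1.0`, `alpha = 0.5`, and `beta = 0.25`, the error is δ = 1.0 - 0.2 + (-0.5) - 0.5 = -0.2, so the average-reward estimate moves to 0.15 and the visited weight to 0.4.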
Deprecating the Discounted Setting
TO READ
Differential Semi-Gradient n-step SARSA
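Generalize the n-step return to its differential form:
G_{t:t+n} = R_{t+1} - \overline{R}_{t+n-1} + ... + R_{t+n} - \overline{R}_{t+n-1} + q̂(S_{t+n},A_{t+n},w_{t+n-1})
where \overline{R} is an estimate of r(π), n >= 1, and t + n < T. If t + n >= T, then G_{t:t+n} = G_t as usual.
The n-step TD error is then:
δ_t = G_{t:t+n} - q̂(S_t,A_t,w)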
Differential semi-gradient n-step SARSA for estimating q̂ ≈ q* or q_π:
Input:
- a differentiable action-value function parameterization q̂ : SxAxℝ^d -> ℝ
- a policy π
Initialize:
- value-function weights w∈ℝ^d arbitrarily (e.g., w = 0)
- average reward estimate \overline{R}∈ℝ arbitrarily (e.g., \overline{R} = 0)
Algorithm parameters: step sizes α,β > 0, a positive integer n
All store and access operations (S_t, A_t, and R_t) can take their index mod n + 1
Initialize and store S_0 and A_0
Loop for each step, t = 0, 1, 2, ...:
    Take action A_t
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    Select and store A_{t+1} ~ π(·|S_{t+1}) or ε-greedy wrt q̂(S_{t+1},·,w)
    τ = t - n + 1 (τ is the time whose estimate is being updated)
    If τ >= 0:
        δ = \sum_{i=τ+1}^{τ+n} (R_i - \overline{R}) + q̂(S_{τ+n},A_{τ+n},w) - q̂(S_τ,A_τ,w)
        \overline{R} = \overline{R} + βδ
        w = w + αδ ∇q̂(S_τ,A_τ,w)
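The differential n-step loop above can be sketched in Python as follows. This is a hedged sketch rather than a definitive implementation: the ring task, the one-hot feature index `idx`, and the helper `differential_n_step_error` are hypothetical names chosen for illustration. The buffers `S`, `A`, and `R` use the mod n + 1 indexing that the pseudocode allows.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)  # hypothetical ring task: 0 = stay, 1 = move clockwise


def idx(s, a):
    """Index of the active one-hot feature, so q̂(s, a, w) = w[idx(s, a)]."""
    return s * len(ACTIONS) + a


def step(s, a):
    """Assumed continuing dynamics: moving from the last cell back to cell 0 pays +1."""
    if a == 0:
        return 0.0, s
    s_next = (s + 1) % N_STATES
    return (1.0 if s_next == 0 else 0.0), s_next


def epsilon_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(w[idx(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if w[idx(s, a)] == best])


def differential_n_step_error(rewards, r_bar, q_boot, q_current):
    """delta = sum_i (R_i - r_bar) + q̂(S_{tau+n}, A_{tau+n}) - q̂(S_tau, A_tau)."""
    return sum(r - r_bar for r in rewards) + q_boot - q_current


def differential_n_step_sarsa(n=3, steps=5000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    r_bar = 0.0
    size = n + 1  # circular buffers: every index is taken mod n + 1
    S, A, R = [0] * size, [0] * size, [0.0] * size
    A[0] = epsilon_greedy(S[0], w, eps, rng)
    for t in range(steps):
        r, s_next = step(S[t % size], A[t % size])
        R[(t + 1) % size] = r
        S[(t + 1) % size] = s_next
        A[(t + 1) % size] = epsilon_greedy(s_next, w, eps, rng)
        tau = t - n + 1  # time whose estimate is being updated
        if tau >= 0:
            k = idx(S[tau % size], A[tau % size])
            k_boot = idx(S[(tau + n) % size], A[(tau + n) % size])
            window = [R[i % size] for i in range(tau + 1, tau + n + 1)]
            delta = differential_n_step_error(window, r_bar, w[k_boot], w[k])
            r_bar += beta * delta
            w[k] += alpha * delta  # one-hot gradient touches only w[k]
    return w, r_bar
```

As a hand check of the error term: rewards `[1.0, 0.0, 2.0]` with an average-reward estimate of 0.5 give differential rewards 0.5, -0.5, and 1.5, which sum to 1.5; adding a bootstrap value of 0.25 and subtracting a current estimate of 0.75 yields δ = 1.0.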