parametric approximation 𝑞̂(𝑠,𝑎,𝐰) ≈ 𝑞(𝑠,𝑎)

Episodic Semi-Gradient Control

general gradient-descent update for action-value prediction is:
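
In its standard form, with U_t standing for an arbitrary update target (a symbol assumed here, not defined elsewhere in these notes), the update can be written as:

```latex
\mathbf{w}_{t+1} = \mathbf{w}_t
  + \alpha \left[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right]
    \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
```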

Episodic Semi-Gradient One-Step SARSA

One-Step SARSA update:
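
Assuming the usual one-step SARSA target, R_{t+1} + 𝛾 q̂(S_{t+1}, A_{t+1}, w_t), the update takes the form below. It agrees term by term with the weight update in the algorithm that follows:

```latex
\mathbf{w}_{t+1} = \mathbf{w}_t
  + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)
                  - \hat{q}(S_t, A_t, \mathbf{w}_t) \right]
    \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
```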

Episodic Semi-Gradient SARSA for Estimating 𝑞̂ ≈ 𝑞*:

Input: a differentiable action-value function parameterization 𝑞̂ : SxAxℝ^d  -> ℝ
Algorithm parameters: step size 𝛼>0, small 𝜀>0
Initialize value-function weights w ∊ ℝ^d arbitrarily (e.g., w=0)

Loop for each episode:
	S, A = initial state and action of episode (e.g., 𝜀-greedy)
	Loop for each step of episode:
		Take action A, observe R, S'
		If S' is terminal:
			w = w + 𝛼 [R - 𝑞̂(S,A,w)] ∇𝑞̂(S,A,w)
			go to next episode
		Choose A' as a function of 𝑞̂(S',·,w) (e.g., 𝜀-greedy)
		w = w + 𝛼 [R + 𝛾𝑞̂(S',A',w) - 𝑞̂(S,A,w)] ∇𝑞̂(S,A,w)
		S = S'
		A = A'
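
The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the corridor task, the one-hot features, and all names (`env_step`, `features`, `semi_gradient_sarsa`) are hypothetical and chosen only to keep the example self-contained. With a linear 𝑞̂, the gradient with respect to w is just the feature vector.

```python
import random

# Hypothetical toy task (not from the notes): a corridor of 5 cells. Each
# episode starts in cell 0 and ends on reaching cell 4; every step costs -1.
N_CELLS, ACTIONS, GAMMA = 5, (0, 1), 1.0   # action 0 = left, 1 = right

def features(s, a):
    """One-hot feature vector x(s, a), so q_hat is linear in w."""
    x = [0.0] * (N_CELLS * len(ACTIONS))
    x[s * len(ACTIONS) + a] = 1.0
    return x

def q_hat(s, a, w):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def epsilon_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_hat(s, a, w))

def env_step(s, a):
    """Return (reward, next_state, is_terminal)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next, s_next == N_CELLS - 1

def semi_gradient_sarsa(episodes=200, alpha=0.5, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_CELLS * len(ACTIONS))
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, w, eps, rng)
        while True:
            r, s_next, done = env_step(s, a)
            grad = features(s, a)  # for linear q_hat the gradient is x(s, a)
            if done:
                # Terminal transition: no bootstrap term in the target.
                delta = r - q_hat(s, a, w)
                w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
                break
            a_next = epsilon_greedy(s_next, w, eps, rng)
            delta = r + GAMMA * q_hat(s_next, a_next, w) - q_hat(s, a, w)
            w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
            s, a = s_next, a_next
    return w
```

Calling `semi_gradient_sarsa()` returns the learned weights; with one-hot features each weight is the estimate for a single (state, action) pair, so `q_hat(3, 1, w)` approaches -1, the reward of the final step into the terminal cell.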

Semi-Gradient n-step SARSA

n-step return:
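
Written out in the notation of the algorithm below, and assuming the usual convention that the return falls back to the full return G_t once the episode ends within n steps, the n-step return is:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n}
            + \gamma^{n}\, \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}),
            \qquad t + n < T,
\\
G_{t:t+n} = G_t \qquad \text{if } t + n \ge T .
```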

the n-step update equation is:
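
Using the n-step return G_{t:t+n} as the update target, the weight update has the same shape as the one-step case; the time subscripts on w are an assumption about how the missing formula was indexed:

```latex
\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1}
  + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right]
    \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}),
  \qquad 0 \le t < T .
```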

Episodic semi-gradient n-step SARSA for estimating 𝑞̂ ≈ 𝑞* or 𝑞_𝜋:

Input: a differentiable action-value function parameterization 𝑞̂ : SxAxℝ^d  -> ℝ
Input: a policy 𝜋 (if estimating 𝑞_𝜋)
Algorithm parameters: step size 𝛼>0, small 𝜀>0, positive integer 𝑛
Initialize value-function weights w ∊ ℝ^d arbitrarily (e.g., w=0)
All store and access operations (St, At, and Rt) can take their index mod n + 1

Loop for each episode:
	Initialize and store S0 != terminal
	Select and store an action A0 ~ 𝜋(·|S0) or 𝜀-greedy wrt 𝑞̂(S0,·,w)
	T = infinity
	Loop for t = 0, 1, 2, ...:
		If t < T , then:
			Take action At
			Observe and store the next reward as Rt+1 and the next state as St+1
			If St+1 is terminal, then:
				T = t + 1
			else:
				Select and store At+1 ~ 𝜋(·|St+1) or 𝜀-greedy wrt 𝑞̂(St+1,·,w)
		𝜏 = t - n + 1 (𝜏 is the time whose estimate is being updated)
		If 𝜏 >= 0:
			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			If 𝜏 + n < T:
				G = G + 𝛾^n 𝑞̂(S_{𝜏+n},A_{𝜏+n},w)
			w = w + 𝛼 [G - 𝑞̂(S_𝜏,A_𝜏,w)] ∇𝑞̂(S_𝜏,A_𝜏,w)
	until 𝜏 = T - 1
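
The computation of G in the loop above can be isolated into a small helper. This is a sketch with hypothetical names: `rewards[i]` is assumed to hold R_i (index 0 is unused), and `bootstrap` stands for the value 𝑞̂(S_{𝜏+n},A_{𝜏+n},w) supplied by the caller.

```python
def n_step_return(rewards, tau, n, T, gamma, bootstrap=0.0):
    """n-step return for time tau.

    rewards[i] holds R_i (rewards[0] is a placeholder).
    bootstrap is q_hat(S_{tau+n}, A_{tau+n}, w); it is only used
    when the episode has not ended within n steps (tau + n < T).
    """
    last = min(tau + n, T)
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, last + 1))
    if tau + n < T:
        G += gamma ** n * bootstrap
    return G
```

For example, with rewards R_1..R_3 = 1, 2, 3, T = 3, 𝛾 = 0.5 and n = 2, the return for 𝜏 = 0 bootstraps from 𝑞̂(S_2,A_2,w), while the return for 𝜏 = 2 is just the final reward because the episode ends first.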

Average Reward: A New Problem Setting for Continuing Tasks

Differential semi-gradient SARSA for estimating 𝑞̂ ≈ 𝑞*:

One-Step SARSA update:
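
In the average-reward setting the discounted target is replaced by a differential one. Consistent with the 𝛿 and w lines of the algorithm below, and with time subscripts added here as an assumption, the update can be written as:

```latex
\delta_t = R_{t+1} - \overline{R}_t
           + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)
           - \hat{q}(S_t, A_t, \mathbf{w}_t),
\\
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
```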

Input: a differentiable action-value function parameterization 𝑞̂ : SxAxℝ^d  -> ℝ
Algorithm parameters: step sizes 𝛼,𝛽 > 0
Initialize value-function weights w ∊ ℝ^d arbitrarily (e.g., w = 0)
Initialize average reward estimate \overline{R} ∊ ℝ arbitrarily (e.g., \overline{R} = 0)

Initialize state S, and action A
Loop for each step:
	Take action A, observe R, S'
	Choose A' as a function of 𝑞̂(S',·,w) (e.g., 𝜀-greedy)
	𝛿 = R - \overline{R} + 𝑞̂(S',A',w) - 𝑞̂(S,A,w)
	\overline{R} = \overline{R} + 𝛽𝛿
	w = w + 𝛼𝛿 ∇𝑞̂(S,A,w)
	S = S'
	A = A'
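
One pass through the loop body above can be sketched as a pure function. The sketch assumes a linear 𝑞̂, so the gradient is the feature vector of (S, A); the function name and the argument names `x` and `x_next` (features of (S, A) and (S', A')) are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def differential_sarsa_step(w, r_bar, x, x_next, reward, alpha, beta):
    """One differential semi-gradient SARSA update for a linear q_hat.

    x      : feature vector of (S, A)
    x_next : feature vector of (S', A')
    Returns the new weights, the new average-reward estimate, and delta.
    """
    delta = reward - r_bar + dot(w, x_next) - dot(w, x)
    r_bar = r_bar + beta * delta
    # Semi-gradient: only q_hat(S, A, w) is differentiated, giving x.
    w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return w, r_bar, delta
```

Note that the same 𝛿 drives both the average-reward estimate and the weights, and that no discount factor appears anywhere in the update.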

Deprecating the Discounted Setting

TO READ

Differential Semi-Gradient n-step SARSA

generalize n-step return to its differential form:
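
Matching the sum that appears in the 𝛿 line of the algorithm below, the differential n-step return can be written as follows (the subscript on w is an assumption about the original indexing):

```latex
G_{t:t+n} = \sum_{i=t+1}^{t+n} \left( R_i - \overline{R} \right)
            + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})
```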

where \overline{R} is an estimate of the average reward r(𝜋) and n >= 1.

the n-step TD error is then:
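
Subtracting the current estimate for the pair being updated from the differential n-step return gives the error used in the algorithm below:

```latex
\delta_t = G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w})
```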

Differential semi-gradient n-step SARSA for estimating 𝑞̂ ≈ 𝑞* or 𝑞_𝜋:

Input:
- a differentiable action-value function parameterization 𝑞̂ : SxAxℝ^d  -> ℝ
- a policy 𝜋
Initialize
- value-function weights w ∊ ℝ^d arbitrarily (e.g., w = 0)
- average reward estimate \overline{R} ∊ ℝ arbitrarily (e.g., \overline{R} = 0)
Algorithm parameters: step sizes 𝛼,𝛽 > 0, a positive integer 𝑛
All store and access operations (St, At, and Rt) can take their index mod n+1

Initialize state S0 and action A0
Loop for each step, t = 0, 1, 2, ...:
	Take action At
	Observe and store R_{t+1} and S_{t+1}
	Select and store A_{t+1} ~ 𝜋(·|S_{t+1}) or 𝜀-greedy wrt 𝑞̂(S_{t+1},·,w)
	𝜏 = 𝑡 - 𝑛 + 1
	if 𝜏 >= 0:
		𝛿 = \sum_{i=𝜏+1}^{𝜏+n} (R_i - \overline{R}) + 𝑞̂(S_{𝜏+n},A_{𝜏+n},w) - 𝑞̂(S_𝜏,A_𝜏,w)
		\overline{R} = \overline{R} + 𝛽𝛿
		w = w + 𝛼𝛿 ∇𝑞̂(S_𝜏,A_𝜏,w)
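
The n-step TD error in the loop above reduces to a short helper. This is a sketch with hypothetical names: `rewards[i]` is assumed to hold R_i (index 0 unused), while `q_tau` and `q_tau_n` are the values 𝑞̂(S_𝜏,A_𝜏,w) and 𝑞̂(S_{𝜏+n},A_{𝜏+n},w) computed by the caller.

```python
def differential_n_step_delta(rewards, tau, n, r_bar, q_tau, q_tau_n):
    """Differential n-step TD error for time tau.

    rewards[i] holds R_i (rewards[0] is a placeholder).
    r_bar is the current average-reward estimate.
    """
    centered = sum(rewards[i] - r_bar for i in range(tau + 1, tau + n + 1))
    return centered + q_tau_n - q_tau
```

The returned value is then used twice, once scaled by 𝛽 to adjust the average-reward estimate and once scaled by 𝛼 to adjust w along the gradient at (S_𝜏, A_𝜏).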