n-step TD methods

unifies MC and TD(0)
introduces the idea of eligibility traces (chapter 12)

n-step Prediction

complete return:

$G_{t} = R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + ... + γ^{T - t - 1} R_{T}$

estimates of complete return:

one-step return:
- $G_{t : t + 1} = R_{t + 1} + γ V_{t} (S_{t + 1})$
two-step return:
- $G_{t : t + 2} = R_{t + 1} + γ R_{t + 2} + γ^{2} V_{t + 1} (S_{t + 2})$
n-step return:
- $G_{t : t + n} = R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + ... + γ^{n - 1} R_{t + n} + γ^{n} V_{t + n - 1} (S_{t + n})$

N-Step TD

$V_{t + n} (S_{t}) = V_{t + n - 1} (S_{t}) + α [G_{t : t + n} - V_{t + n - 1} (S_{t})]$

n-step TD for estimating 𝑣_𝜋:

Input: a policy 𝜋
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
Initialize V(s) arbitrarily, for all s∊S
All store and access operations (for St and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0 != terminal
	T = infinity
	loop for t = 0,1,2,...
		if t < T
			take action according to 𝜋(|St)
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal:
				T = t + 1
		𝜏 = t - n + 1 (𝜏 is the time whose state's estimate is being updated)
		if 𝜏 >= 0:
			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n V(S_{𝜏+n})
			V(S𝜏) = V(S𝜏) + 𝛼 [G - V(S𝜏)]
	until 𝜏 = T - 1

N-Step SARSA

redefine n-step returns in terms of action values:

$G_{t : t + n} = R_{t + 1} + γ R_{t + 2} + γ^{2} V_{t + 1} (S_{t + 2}) + ... + γ^{n - 1} R_{t + n} + γ^{n} Q_{t + n - 1} (S_{t + n}, A_{t + n})$

natural algorithm is then:

$Q_{t + n} (S_{t}, A_{t}) = Q_{t + n - 1} (S_{t}, A_{t}) + α [G_{t : t + n} - Q_{t + n - 1} (S_{t}, A_{t})]$

n-step SARSA for estimating 𝑞_* or 𝑞_𝜋:

Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be "𝜀-greedy with respect to Q, or to a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], small 𝜀>0, a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n+1

Loop for each episode:
	innitialize and store S0
	select and store an action A0 ~ 𝜋(|S0)
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal:
				T = t + 1
			else
				select and store action A_{t+1} ~ 𝜋(|S_{t+1})
		𝜏 = t - n + 1   (𝜏 is the time whose estimate is being updated)
		if 𝜏 >= 0:
 			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n Q(S_{𝜏+n}, A_{𝜏+n})
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

Expected SARSA

redefine n-step returns in terms of action values:

$G_{t : t + n} = R_{t + 1} + γ R_{t + 2} + γ^{2} V_{t + 1} (S_{t + 2}) + ... + γ^{n - 1} R_{t + n} + γ^{n} \overline{V}_{t + n - 1} (S_{t + n})$

where:

$\overline{V}_{t + n - 1} (S_{t + n}) = \sum_{a} π (a ∣ s) Q_{t} (s, a)$

Backup Diagrams (n-step TD, n-step SARSA, n-step Expected SARSA)

N-Step Off-Policy Learning

N-Step TD Off-Policy - simply weighted by 𝜌_{𝑡:𝑡+𝑛-1}:

$V_{t + n} (S_{t}) = V_{t + n - 1} (S_{t}) + α ρ_{t : t + n - 1} [G_{t : t + n} - V_{t + n - 1} (S_{t})]$

N-Step SARSA Off-Policy - simply weighted by :

$Q_{t + n} (S_{t}, A_{t}) = Q_{t + n - 1} (S_{t}, A_{t}) + α ρ_{t : t + n} [G_{t : t + n} - Q_{t + n - 1} (S_{t}, A_{t})]$

where, the IS ratio is the relative probability under the 2 policies of taking the n actions from 𝐴_𝑡 to 𝐴_𝑡+𝑛-1:

$ρ_{t : h} = \prod_{k = t}^{min (h, T - 1)} \frac{π ( A - k ∣ S _{k} )}{b ( A _{k} ∣ S _{k} )}$

Off-Policy n-step SARSA for estimating 𝑞_* or 𝑞_𝜋:

Input: an arbitrary behavior policy b such that b(a|s) > 0, for all s∊S, a∊A
Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0
	select and store action A0 ~ b(|S0)
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T:
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal
				T = t + 1
			else
				select and store action A_{t+1} ~ b(|S_{t+1})
		𝜏 = t - n + 1
		if 𝜏 >= 0
			𝜌 = \prod_{i=𝜏+1}^{min(𝜏+n-1,T-1)} [𝜋(Ai|Si) / b(Ai|Si)]
 			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n Q(S_{𝜏+n}, A_{𝜏+n})
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 𝜌 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

Off-Policy Learning w/o IS: The n-step Tree Backup Algorithm

NOT READ

n-step Tree Backup for estimating 𝑞_* or 𝑞_𝜋:

Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0
	select and store action A0 arbitrarily as a function of S0
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T:
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal
				T = t + 1
			else
				select and store action A_{t+1} as a function of S_{t+1}
		𝜏 = t - n + 1
		if 𝜏 >= 0
			if 𝜏+1 >= T
				G = R_T
			else
				G = R_{t+1} + 𝛾 sum_a 𝜋(a|S_{t+1}) Q(S_{t+1},a)
			loop for k = min(t,T-1) down to 𝜏+1:
				G = R_k + 𝛾 \sum_{a!=A_k} 𝜋(a|S_k) Q(S_k,a) + 𝛾 𝜋(A_k|S_k)*G
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

A Unifying Algorithm: N-Step 𝑄(𝜎)

NOT READ

／var／log marcus chiu

Explorer

RL Chapter 7 (n-step Bootstrapping)

n-step TD methods

n-step Prediction

N-Step TD

N-Step SARSA

Expected SARSA

Backup Diagrams (n-step TD, n-step SARSA, n-step Expected SARSA)

N-Step Off-Policy Learning

Off-Policy Learning w/o IS: The n-step Tree Backup Algorithm

A Unifying Algorithm: N-Step 𝑄(𝜎)

／var／logmarcus chiu

Explorer

RL Chapter 7 (n-step Bootstrapping)

n-step TD methods

n-step Prediction

N-Step TD

N-Step SARSA

Expected SARSA

Backup Diagrams (n-step TD, n-step SARSA, n-step Expected SARSA)

N-Step Off-Policy Learning

Off-Policy Learning w/o IS: The n-step Tree Backup Algorithm

A Unifying Algorithm: N-Step 𝑄(𝜎)

／var／log marcus chiu