n-step TD methods
  • unifies MC and TD(0)
  • introduces the idea of eligibility traces (chapter 12)

n-step Prediction

complete return:

estimates of complete return:

  • one-step return:
  • two-step return:
  • n-step return:

N-Step TD

n-step TD for estimating 𝑣𝜋:

Input: a policy 𝜋
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
Initialize V(s) arbitrarily, for all s∊S
All store and access operations (for St and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0 != terminal
	T = infinity
	loop for t = 0,1,2,...
		if t < T
			take action according to 𝜋(|St)
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal:
				T = t + 1
		𝜏 = t - n + 1 (𝜏 is the time whose state's estimate is being updated)
		if 𝜏 >= 0:
			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n V(S_{𝜏+n})
			V(S𝜏) = V(S𝜏) + 𝛼 [G - V(S𝜏)]
	until 𝜏 = T - 1

N-Step SARSA

redefine n-step returns in terms of action values:

natural algorithm is then:

n-step SARSA for estimating 𝑞* or 𝑞𝜋:

Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be "𝜀-greedy with respect to Q, or to a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], small 𝜀>0, a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n+1

Loop for each episode:
	innitialize and store S0
	select and store an action A0 ~ 𝜋(|S0)
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal:
				T = t + 1
			else
				select and store action A_{t+1} ~ 𝜋(|S_{t+1})
		𝜏 = t - n + 1   (𝜏 is the time whose estimate is being updated)
		if 𝜏 >= 0:
 			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n Q(S_{𝜏+n}, A_{𝜏+n})
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

Expected SARSA

redefine n-step returns in terms of action values:

where:

Backup Diagrams (n-step TD, n-step SARSA, n-step Expected SARSA)

N-Step Off-Policy Learning

N-Step TD Off-Policy - simply weighted by 𝜌𝑡:𝑡+𝑛-1:

N-Step SARSA Off-Policy - simply weighted by :

where, the IS ratio is the relative probability under the 2 policies of taking the n actions from 𝐴𝑡 to 𝐴𝑡+𝑛-1:

Off-Policy n-step SARSA for estimating 𝑞* or 𝑞𝜋:

Input: an arbitrary behavior policy b such that b(a|s) > 0, for all s∊S, a∊A
Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0
	select and store action A0 ~ b(|S0)
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T:
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal
				T = t + 1
			else
				select and store action A_{t+1} ~ b(|S_{t+1})
		𝜏 = t - n + 1
		if 𝜏 >= 0
			𝜌 = \prod_{i=𝜏+1}^{min(𝜏+n-1,T-1)} [𝜋(Ai|Si) / b(Ai|Si)]
 			G = \sum_{i=𝜏+1}^{min(𝜏+n,T)} 𝛾^{i-𝜏-1} R_i
			if 𝜏+n < T
				G = G + 𝛾^n Q(S_{𝜏+n}, A_{𝜏+n})
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 𝜌 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

Off-Policy Learning w/o IS: The n-step Tree Backup Algorithm

NOT READ

n-step Tree Backup for estimating  𝑞* or 𝑞𝜋:

Initialize Q(s, a) arbitrarily, for all s∊S, a∊A
Initialize 𝜋 to be greedy with respect to Q, or as a fixed given policy
Algorithm parameters: step size 𝛼∊(0, 1], a positive integer n
All store and access operations (for St, At, and Rt) can take their index mod n + 1

loop for each episode:
	initialize and store S0
	select and store action A0 arbitrarily as a function of S0
	T = infinity
	loop for t = 0, 1, 2, ...
		if t < T:
			take action At
			observe and store R_{t+1} and S_{t+1}
			if S_{t+1} is terminal
				T = t + 1
			else
				select and store action A_{t+1} as a function of S_{t+1}
		𝜏 = t - n + 1
		if 𝜏 >= 0
			if 𝜏+1 >= T
				G = R_T
			else
				G = R_{t+1} + 𝛾 sum_a 𝜋(a|S_{t+1}) Q(S_{t+1},a)
			loop for k = min(t,T-1) down to 𝜏+1:
				G = R_k + 𝛾 \sum_{a!=A_k} 𝜋(a|S_k) Q(S_k,a) + 𝛾 𝜋(A_k|S_k)*G
			Q(S𝜏,A𝜏) = Q(S𝜏,A𝜏) + 𝛼 [G - Q(S𝜏,A𝜏)]
			if 𝜋 is being learned, then ensure 𝜋(|S𝜏) is 𝜀-greedy wrt Q
	until 𝜏 = T - 1

A Unifying Algorithm: N-Step 𝑄(𝜎)

NOT READ