Temporal Difference Learning

is a combination of DP and MC methods

TD Prediction

Constant-𝛼 MC

TD(0) or one-step TD (simplest):

Input: the policy πœ‹ to be evaluated
Algorithm parameter: step size π›ΌβˆŠ(0, 1]
Initialize V(s), for all s∊S+, arbitrarily except that V(terminal) = 0

Loop for each episode:
	Initialize S
	Loop until S is terminal:
		A action given by πœ‹ for S
		Take action A, observe R, S'
		V(S) = V(S) + 𝛼 [R + 𝛾 V(S') - V(S)]
		S = S'

TD error:

Advantages of TD Prediction Methods

  • does not need model of reward and next-state distributions like DP
  • does not need to wait until end of episode for an update like MC

Optimality of TD(0)

Batch Updating - given a batch experience, compute increments for every time step, but value function is changed ONLY once, by sum of all increments. then process another batch.

Sarsa: On-Policy TD Control

  1. learn action-value function π‘žπœ‹(𝑠,π‘Ž)

SARSA (on-policy TD control) for estimating Q ~ q*

Algorithm parameters: step size π›ΌβˆŠ(0, 1], small πœ€ > 0
Initialize Q(s, a), for all s∊S+, a∊A(s), arbitrarily except that Q(terminal,·) = 0

Loop for each episode:
	Initialize non-terminal S
	Choose A from S using policy derived from Q (e.g., πœ€-greedy)
	Loop until S is terminal:
		Take action A, observe R, S'
		Choose A' from S' using policy derived from Q (e.g., πœ€-greedy)
		Q(S, A) = Q(S, A) + 𝛼 [R + 𝛾 Q(S', A') - Q(S, A)]
		S = S'
		A = A'

Q-Learning: Off-Policy TD Control

Q-Learning:

Q-Learning (off-policy TD control) for estimating πœ‹*

Algorithm parameters: step size π›ΌβˆŠ(0, 1], small πœ€>0
Initialize Q(s, a), for all s∊S+, a∊A(s), arbitrarily except that Q(terminal,·) = 0

loop for each episode:
	initialize S
	loop for until S is terminal:
		choose A from S using policy derived from Q (e.g. πœ€-greedy)
		take action A, observe R,S'
		Q(S,A) = Q(S,A) + 𝛼 [R + 𝛾 max_a Q(S',a) - Q(S,A)]
		S = S'

Expected SARSA

like Q-learning but instead of max we take expected:

Maximization Bias & Double Learning

Double SARSA

TODO

Double Q-Learning

  • if a flipped coin comes up heads, the update is:
  • else if tails:

Double Q-Learning for estimating π‘ž*:

Algorithm parameters: step size π›ΌβˆŠ(0, 1], small πœ€>0
Initialize Q1(s, a) and Q2(s, a), for all s∊S+, a∊A(s), such that Q(terminal,·) = 0

loop for each episode:
	initialize S
	loop until S is terminal:
		choose A from S using the policy πœ€-greedy in Q1 + Q2
		take action A, observe R,S'
		with probability 0.5:
 			Q_1(S, A) = Q_1(S, A) + 𝛼 [R + 𝛾 Q_2(S', argmax_a Q_1(S',a)) - Q_1(S, A)] 
		else:
			Q_2(S, A) = Q_2(S, A) + 𝛼 [R + 𝛾 Q_1(S', argmax_a Q_2(S',a)) - Q_2(S, A)]

Double Expected SARSA

TODO

Games, Afterstates, and Other Special Cases

NOT READ