Temporal Difference Learning
is a combination of DP and MC methods
TD Prediction
Constant-πΌ MC
TD(0) or one-step TD (simplest):
Input: the policy π to be evaluated
Algorithm parameter: step size πΌβ(0, 1]
Initialize V(s), for all sβS+, arbitrarily except that V(terminal) = 0
Loop for each episode:
Initialize S
Loop until S is terminal:
A action given by π for S
Take action A, observe R, S'
V(S) = V(S) + πΌ [R + πΎ V(S') - V(S)]
S = S'
TD error:
Advantages of TD Prediction Methods
- does not need model of reward and next-state distributions like DP
- does not need to wait until end of episode for an update like MC
Optimality of TD(0)
Batch Updating - given a batch experience, compute increments for every time step, but value function is changed ONLY once, by sum of all increments. then process another batch.
Sarsa: On-Policy TD Control
- learn action-value function ππ(π ,π)
SARSA (on-policy TD control) for estimating Q ~ q*
Algorithm parameters: step size πΌβ(0, 1], small π > 0
Initialize Q(s, a), for all sβS+, aβA(s), arbitrarily except that Q(terminal,Β·) = 0
Loop for each episode:
Initialize non-terminal S
Choose A from S using policy derived from Q (e.g., π-greedy)
Loop until S is terminal:
Take action A, observe R, S'
Choose A' from S' using policy derived from Q (e.g., π-greedy)
Q(S, A) = Q(S, A) + πΌ [R + πΎ Q(S', A') - Q(S, A)]
S = S'
A = A'
Q-Learning: Off-Policy TD Control
Q-Learning:
Q-Learning (off-policy TD control) for estimating π*
Algorithm parameters: step size πΌβ(0, 1], small π>0
Initialize Q(s, a), for all sβS+, aβA(s), arbitrarily except that Q(terminal,Β·) = 0
loop for each episode:
initialize S
loop for until S is terminal:
choose A from S using policy derived from Q (e.g. π-greedy)
take action A, observe R,S'
Q(S,A) = Q(S,A) + πΌ [R + πΎ max_a Q(S',a) - Q(S,A)]
S = S'
Expected SARSA
like Q-learning but instead of max we take expected:
Maximization Bias & Double Learning
Double SARSA
TODO
Double Q-Learning
- if a flipped coin comes up heads, the update is:
- else if tails:
Double Q-Learning for estimating π*:
Algorithm parameters: step size πΌβ(0, 1], small π>0
Initialize Q1(s, a) and Q2(s, a), for all sβS+, aβA(s), such that Q(terminal,Β·) = 0
loop for each episode:
initialize S
loop until S is terminal:
choose A from S using the policy π-greedy in Q1 + Q2
take action A, observe R,S'
with probability 0.5:
Q_1(S, A) = Q_1(S, A) + πΌ [R + πΎ Q_2(S', argmax_a Q_1(S',a)) - Q_1(S, A)]
else:
Q_2(S, A) = Q_2(S, A) + πΌ [R + πΎ Q_1(S', argmax_a Q_2(S',a)) - Q_2(S, A)]
Double Expected SARSA
TODO
Games, Afterstates, and Other Special Cases
NOT READ