Eligibility traces unify and generalize:

MC methods
TD methods

When TD methods are augmented with eligibility traces, they produce a family of methods spanning a spectrum that has:

MC methods at one end (𝜆=1)
one-step TD methods at the other (𝜆=0)
in between are intermediate methods that are often better than either extreme method

Eligibility Traces

For each time step 𝑡, eligibility traces are updated as:

Accumulate Trace Form (most common)

$e_{t} (s) = \begin{cases} γ λ e_{t - 1} (s) + 1 & \text{if } s = S_{t} \\ γ λ e_{t - 1} (s) & \text{otherwise} \end{cases}$

Or, more compactly:

$e_{t} (s) = γ λ e_{t - 1} (s) + 1 (s = S_{t})$

Where:

γ = discount factor
λ = trace decay parameter (0 → short memory, 1 → long memory)
$\mathbf{1}(s = S_{t})$ = 1 if $s$ is the state visited at time $t$, 0 otherwise

Thus:

If you just visited the state, its (decayed) trace is incremented by 1
On every following step in which the state is not visited, its trace shrinks by a factor of 𝛾𝜆
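
The compact update can be traced by hand on a short visit sequence. The following is a minimal sketch, assuming a tabular setting with integer state indices; the function name `accumulate_traces` and the example values are illustrative and not part of the original notes.

```python
def accumulate_traces(visited_states, n_states, gamma, lam):
    """Return the trace vector after each step of the accumulating update."""
    e = [0.0] * n_states
    history = []
    for s_t in visited_states:
        # Decay every trace by gamma * lambda ...
        e = [gamma * lam * x for x in e]
        # ... then add 1 to the trace of the state visited at this step.
        e[s_t] += 1.0
        history.append(list(e))
    return history


# Visit state 0, then state 1, then state 0 again (hypothetical sequence).
history = accumulate_traces([0, 1, 0], n_states=2, gamma=0.9, lam=0.5)
```

With these values the trace of state 0 decays from 1 to 0.45 while state 1 is visited, and the revisit adds 1 on top of the decayed remainder, so it ends above 1. That is what distinguishes the accumulating form from a trace that is reset to 1 on each visit.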

λ-return

𝜆-return is defined as:

$G_{t}^{λ} = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} G_{t : t + n}$

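For an episode that terminates at time 𝑇, every n-step return with n ≥ 𝑇 − 𝑡 equals the full return G_t, so the post-termination terms can be separated out: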

$G_{t}^{λ} = (1 - λ) \sum_{n = 1}^{T - t - 1} λ^{n - 1} G_{t : t + n} + λ^{T - t - 1} G_{t}$
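
The episodic form can be computed directly from a recorded episode. The sketch below assumes that $G_{t : t + n}$ is the usual n-step return (n discounted rewards plus a bootstrapped value estimate), which these notes do not define explicitly. The helper names and the list layout (`rewards[k]` holds $R_{k+1}$, `values[k]` holds the estimate for $S_k$) are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: n discounted rewards plus a bootstrapped value estimate.

    rewards[k] holds R_{k+1}; values[k] holds the estimate for S_k.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:  # bootstrap only if S_{t+n} is not terminal
        g += gamma ** n * values[t + n]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return with the post-termination terms separated."""
    T = len(rewards)
    full_return = n_step_return(rewards, values, t, T - t, gamma)  # G_t
    weighted = sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t)  # n = 1, ..., T - t - 1
    )
    return (1 - lam) * weighted + lam ** (T - t - 1) * full_return


# Hypothetical three-step episode.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
g_mc = lambda_return(rewards, values, t=0, gamma=0.9, lam=1.0)   # full return
g_td = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.0)   # one-step return
g_mid = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.5)
```

Setting `lam=1.0` leaves only the full return (the MC target) and `lam=0.0` leaves only the one-step return (the TD(0) target), matching the two ends of the spectrum described at the top of these notes.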

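The offline 𝜆-return algorithm makes no weight changes during the episode; at its end, it applies the whole sequence of updates using the 𝜆-return as the target:

$w_{t + 1} = w_{t} + α [G_{t}^{λ} - \hat{v} (S_{t}, w_{t})] \nabla \hat{v} (S_{t}, w_{t}), \quad t = 0, ..., T - 1$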

TD(λ)

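TD(λ) effectively averages all n-step TD updates, with geometrically decaying weights proportional to λ^{n-1}
It approximates the offline λ-return algorithm, but updates the weights online at every step instead of waiting for the end of the episode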

The eligibility trace vector 𝐳 has the same number of components as the weight vector. It is initialized to zero at the beginning of the episode; on each time step it is decayed by 𝛾𝜆 and incremented by the value gradient:

$z_{- 1} = 0$
$z_{t} = γ λ z_{t - 1} + \nabla \hat{v} (S_{t}, w_{t}), \quad 0 \le t \le T$

The TD error for state-value prediction is:

$δ_{t} = R_{t + 1} + γ \hat{v} (S_{t + 1}, w_{t}) - \hat{v} (S_{t}, w_{t})$

The weight vector is updated in proportion to the scalar TD error and the vector eligibility trace:

$w_{t + 1} = w_{t} + α δ_{t} z_{t}$

Semi-Gradient TD(𝜆) for estimating 𝑣̂ ≈ 𝑣_𝜋:

Input: the policy 𝜋 to be evaluated
Input: a differentiable function 𝑣̂ : 𝒮 × ℝ^d → ℝ such that 𝑣̂(terminal,·) = 0
Algorithm parameters: step size 𝛼 > 0, trace decay rate 𝜆 ∈ [0, 1]
Initialize value-function weights 𝐰 arbitrarily (e.g., 𝐰 = 0)

loop for each episode:
	initialize 𝑆
	𝐳 = 0
	loop until 𝑆 is terminal:
		choose 𝐴 ~ 𝜋(·|𝑆)
		take action 𝐴, observe 𝑅, 𝑆'
		𝐳 = 𝛾𝜆𝐳 + ∇𝑣̂(𝑆,𝐰)
		𝛿 = 𝑅 + 𝛾𝑣̂(𝑆',𝐰) - 𝑣̂(𝑆,𝐰)
		𝐰 = 𝐰 + 𝛼𝛿𝐳
		𝑆 = 𝑆'
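
The pseudocode above can be sketched in Python for the special case of a linear value function, where 𝑣̂(𝑆,𝐰) = 𝐰·𝐱(𝑆) and the gradient is simply the feature vector 𝐱(𝑆). This is an illustration under stated assumptions, not a reference implementation: the linear form, the `features` helper, and the use of pre-recorded transitions (standing in for acting under 𝜋) are all choices made for this example.

```python
def semi_gradient_td_lambda(episodes, n_features, features, alpha, gamma, lam):
    """Linear semi-gradient TD(lambda) over pre-recorded episodes.

    episodes: list of episodes, each a list of (state, reward, next_state)
              transitions generated under the policy being evaluated;
              next_state is None when the episode terminates.
    features: hypothetical helper mapping a state to its feature list x(s),
              so that v_hat(s, w) = w . x(s) and grad v_hat(s, w) = x(s).
    """
    w = [0.0] * n_features

    def v_hat(s):
        if s is None:  # v_hat(terminal, .) = 0
            return 0.0
        return sum(wi * xi for wi, xi in zip(w, features(s)))

    for episode in episodes:
        z = [0.0] * n_features  # trace is reset at the start of each episode
        for s, r, s_next in episode:
            x = features(s)
            # z = gamma * lambda * z + grad v_hat(S, w)
            z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
            # delta = R + gamma * v_hat(S', w) - v_hat(S, w)
            delta = r + gamma * v_hat(s_next) - v_hat(s)
            # w = w + alpha * delta * z
            w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w


def one_hot(s):
    """One-hot features for a hypothetical two-state chain."""
    return [1.0 if i == s else 0.0 for i in range(2)]


# Chain 0 -> 1 -> terminal, with reward 1 on the final transition.
episode = [(0, 0.0, 1), (1, 1.0, None)]
w = semi_gradient_td_lambda([episode] * 200, 2, one_hot,
                            alpha=0.1, gamma=1.0, lam=0.8)
```

With one-hot features the weights are the state values themselves, and in this undiscounted chain both should approach 1, the return that follows either state.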

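For particular values of 𝜆:

𝜆 = 0: the trace reduces to the current value gradient, so the algorithm simplifies to TD(0)
0 < 𝜆 < 1: an intermediate method; earlier states also receive credit for the TD error, but their credit decays by 𝛾𝜆 per step
𝜆 = 1: credit for earlier states decays only by 𝛾 per step, so the algorithm behaves like MC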
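
The two limiting cases follow from unrolling the trace recursion. The short derivation below uses only the trace and weight updates given earlier in these notes; the summation index $k$ is introduced here for illustration.

Unrolling $z_{t} = γ λ z_{t - 1} + \nabla \hat{v} (S_{t}, w_{t})$ from $z_{- 1} = 0$ gives:

$z_{t} = \sum_{k = 0}^{t} (γ λ)^{t - k} \nabla \hat{v} (S_{k}, w_{k})$

With λ = 0 only the $k = t$ term survives, so $z_{t} = \nabla \hat{v} (S_{t}, w_{t})$ and the update becomes the one-step semi-gradient TD(0) rule:

$w_{t + 1} = w_{t} + α δ_{t} \nabla \hat{v} (S_{t}, w_{t})$

With λ = 1 the gradient of every earlier state stays in the trace, discounted only by γ:

$z_{t} = \sum_{k = 0}^{t} γ^{t - k} \nabla \hat{v} (S_{k}, w_{k})$

so each TD error is credited to all preceding states in the episode, which is why the behavior resembles MC.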

SARSA(λ)

When eligibility traces are used for action values, each state-action pair gets its own trace:

$e_{t} (s, a) = γ λ e_{t - 1} (s, a) + 1 (s = S_{t}, a = A_{t})$

Update, applied to every pair (s, a) on each step:

$Q (s, a) \leftarrow Q (s, a) + α δ_{t} e_{t} (s, a)$
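
A single step of this tabular update can be sketched as follows. The notes do not spell out $δ_{t}$ for action values; the sketch assumes the standard Sarsa TD error, $δ_{t} = R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) - Q (S_{t}, A_{t})$, with the terminal next pair contributing 0. The function name, the `terminal` flag, and the state and action labels are hypothetical.

```python
from collections import defaultdict


def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next,
                        alpha, gamma, lam, terminal=False):
    """Apply one tabular Sarsa(lambda) step with accumulating traces, in place."""
    # TD error for action values (assumed form, see the lead-in).
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    # Decay all traces, then increment the trace of the pair just visited.
    for key in list(e):
        e[key] *= gamma * lam
    e[(s, a)] += 1.0
    # Every pair with a nonzero trace moves in proportion to that trace.
    for key, trace in e.items():
        Q[key] += alpha * delta * trace
    return delta


Q = defaultdict(float)  # action-value table, defaults to 0
e = defaultdict(float)  # eligibility traces, defaults to 0

# Hypothetical two-step episode: s0 -> s1 -> terminal, reward 1 at the end.
sarsa_lambda_update(Q, e, "s0", "right", 0.0, "s1", "right",
                    alpha=0.5, gamma=0.9, lam=0.8)
sarsa_lambda_update(Q, e, "s1", "right", 1.0, None, None,
                    alpha=0.5, gamma=0.9, lam=0.8, terminal=True)
```

After the second step the reward has updated both pairs at once: the pair visited last by the full step size, and the earlier pair by the step size scaled by its decayed trace 𝛾𝜆. One-step Sarsa would have updated only the last pair.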

／var／log marcus chiu

RL Chapter 12 (Eligibility Traces)
