Eligibility traces unify and generalize:

  • MC methods
  • TD methods

When TD methods are augmented with eligibility traces, they produce a family of methods spanning a spectrum that has:

  • MC methods at one end (πœ†=1)
  • one-step TD methods at the other (πœ†=0)
  • in between are intermediate methods that are often better than either extreme method

Eligibility Traces

For each time step 𝑑, eligibility traces are updated as:

Accumulate Trace Form (most common)

Or, more compactly:

Where:

  • Ξ³ = discount factor
  • Ξ» = trace decay parameter (0 β†’ short memory, 1 β†’ long memory)
  • 1(s=St)\mathbf{1}(s = S_t)1(s=St​) = 1 if you visited the state, 0 otherwise

Thus:

  • If you just visited the state, its trace spikes to 1
  • On every following step, it shrinks by factor π›Ύπœ†

Ξ»-return

πœ†-return is defined as:

πœ†-return (separated post-termination terms):

offline πœ†-return algorithm:

TD(Ξ»)

  • TD(Ξ») is literally averaging all n-step TD methods with geometric weights Ξ»
  • it approximates the offline Ξ»-return algorithm

the eligibility trace vector is initiated to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by π›Ύπœ†:

The TD error for state-value prediction is:

weight vector is updated proportional to the scalar TD error and the vector eligibility trace:

Semi-Gradient TD(πœ†) for estimating 𝑣̂ ~ π‘£πœ‹:
Input: the policy πœ‹ to be evaluated
Input: a di↡erentiable function 𝑣̂ : Sxℝ^d -> ℝ such that 𝑣̂(terminal,Β·) = 0
Algorithm parameters: step size 𝛼 > 0, trace decay rate  πœ†=[0, 1]
Initialize value-function weights 𝐰 arbitrarily (e.g., 𝐰 = 0)

loop for each episode:
	initialize 𝑆
	𝐳 = 0
	loop until 𝑆 is terminal:
		choose 𝐴 ~ πœ‹(|𝑆)
		take action 𝐴, observe 𝑅,𝑆'
		𝐳 = π›Ύπœ†π³ + ????Μ‚(𝑆,𝐰)
		𝛿 = 𝑅 + 𝛾𝑣̂(𝑆',𝐰) - 𝑣̂(𝑆,𝐰)
	 	𝐰 = 𝐰 + 𝛼𝛿𝐳
		𝑆 = 𝑆'

for values of πœ†:

  • πœ†=0 then algorithm simplifies to TD(0)
  • πœ†=? then
  • πœ†=1 then algorithm BEHAVES like MC

SARSA(Ξ»)

If you use them for action-values:

Update: