Eligibility traces unify and generalize:
- MC methods
- TD methods
When TD methods are augmented with eligibility traces, they produce a family of methods spanning a spectrum that has:
- MC methods at one end (π=1)
- one-step TD methods at the other (π=0)
- in between are intermediate methods that are often better than either extreme method
Eligibility Traces
For each time step π‘, eligibility traces are updated as:
Accumulate Trace Form (most common)
Or, more compactly:
Where:
- Ξ³ = discount factor
- Ξ» = trace decay parameter (0 β short memory, 1 β long memory)
- 1(s=St)\mathbf{1}(s = S_t)1(s=Stβ) = 1 if you visited the state, 0 otherwise
Thus:
- If you just visited the state, its trace spikes to 1
- On every following step, it shrinks by factor πΎπ
Ξ»-return
π-return is defined as:
π-return (separated post-termination terms):
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---models/reinforcement-learning-(rl)/rl-chapters/rl-chapter-12-(eligibility-traces)/1.png)
offline π-return algorithm:
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---models/reinforcement-learning-(rl)/rl-chapters/rl-chapter-12-(eligibility-traces)/2.png)
TD(Ξ»)
- TD(Ξ») is literally averaging all n-step TD methods with geometric weights Ξ»
- it approximates the offline Ξ»-return algorithm
the eligibility trace vector is initiated to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by πΎπ:
The TD error for state-value prediction is:
weight vector is updated proportional to the scalar TD error and the vector eligibility trace:
Semi-Gradient TD(π) for estimating π£Μ ~ π£π:
Input: the policy π to be evaluated
Input: a diβ΅erentiable function π£Μ : Sxβ^d -> β such that π£Μ(terminal,Β·) = 0
Algorithm parameters: step size πΌ > 0, trace decay rate π=[0, 1]
Initialize value-function weights π° arbitrarily (e.g., π° = 0)
loop for each episode:
initialize π
π³ = 0
loop until π is terminal:
choose π΄ ~ π(|π)
take action π΄, observe π
,π'
π³ = πΎππ³ + ????Μ(π,π°)
πΏ = π
+ πΎπ£Μ(π',π°) - π£Μ(π,π°)
π° = π° + πΌπΏπ³
π = π'
for values of π:
- π=0 then algorithm simplifies to TD(0)
- π=? then
- π=1 then algorithm BEHAVES like MC
SARSA(Ξ»)
If you use them for action-values:
Update: