／var／log marcus chiu

Sandboxxxx

Created on Dec 11, 2025

According to the question, the weights 𝐰 are initialized as:

$\mathbf{w} = \langle 1, 1, 1, 1, 1, 1, 5 \rangle$

According to the question, we update weights 𝐰 using semi-gradient TD(0):

$\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w})] \nabla \hat{v}(S, \mathbf{w})$

According to the question, 𝛼 = 0.1, 𝛾 = 0.95, and 𝑅 = 0, thus:

$\mathbf{w} \leftarrow \mathbf{w} + 0.1 * [0.95 * \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w})] \nabla \hat{v}(S, \mathbf{w})$

According to the question, the diagram contains 6 states (nodes), and 𝑣̂(𝑆, 𝐰) is defined for each state as:

$\hat{v}(s_{1}, \mathbf{w}) = w_{0} + 2 w_{1}$
$\hat{v}(s_{2}, \mathbf{w}) = w_{0} + 2 w_{2}$
$\hat{v}(s_{3}, \mathbf{w}) = w_{0} + 2 w_{3}$
$\hat{v}(s_{4}, \mathbf{w}) = w_{0} + 2 w_{4}$
$\hat{v}(s_{5}, \mathbf{w}) = w_{0} + 2 w_{5}$
$\hat{v}(s_{6}, \mathbf{w}) = 2 w_{0} + w_{6}$
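Because each definition is linear in the weights, it can be written as a dot product between a per-state feature vector and 𝐰. The following is a minimal sketch of that view; the names `FEATURES` and `v_hat` are illustrative and not taken from the question.

```python
# Feature vectors read off the six definitions above.
# Position i holds the coefficient of w_i (i = 0..6).
# FEATURES and v_hat are hypothetical names chosen for this sketch.
FEATURES = {
    "s1": [1, 2, 0, 0, 0, 0, 0],
    "s2": [1, 0, 2, 0, 0, 0, 0],
    "s3": [1, 0, 0, 2, 0, 0, 0],
    "s4": [1, 0, 0, 0, 2, 0, 0],
    "s5": [1, 0, 0, 0, 0, 2, 0],
    "s6": [2, 0, 0, 0, 0, 0, 1],
}


def v_hat(state, w):
    """Linear value estimate: dot product of the state's features with w."""
    return sum(x * w_i for x, w_i in zip(FEATURES[state], w))


w = [1, 1, 1, 1, 1, 1, 5]  # initial weights w_0..w_6 from the question
for state in FEATURES:
    print(state, v_hat(state, w))
```

With the initial weights, 𝑠₁ through 𝑠₅ each evaluate to 3 and 𝑠₆ evaluates to 7. A useful side effect of this form is that the gradient of 𝑣̂ with respect to 𝐰 is simply the feature vector of the state.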

According to the question, each transition (arrow) counts as one training example in the batch, so there are 6 training examples in total:

𝑠₁ → 𝑠₆
𝑠₂ → 𝑠₆
𝑠₃ → 𝑠₆
𝑠₄ → 𝑠₆
𝑠₅ → 𝑠₆
𝑠₆ → 𝑠₆

So given training example #1 (𝑠₁ → 𝑠₆) the update becomes:

$\mathbf{w} \leftarrow \mathbf{w} + 0.1 * [0.95 * (2 w_{0} + w_{6}) - (w_{0} + 2 w_{1})] \nabla (w_{0} + 2 w_{1})$

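Since this is a batch update, 𝐰 will NOT be updated until all training examples have been processed. For now we only compute the increment contributed by this example, evaluated at the initial weights: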

$0.1 * [0.95 * (2 * 1 + 5) - (1 + 2 * 1)] \nabla (w_{0} + 2 w_{1})$
$= 0.365 * \nabla (w_{0} + 2 w_{1})$
$= 0.365 * [1, 2, 0, 0, 0, 0, 0]$
$= [0.365, 0.730, 0, 0, 0, 0, 0]$
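The arithmetic for this single example can be checked with a short script. This is a sketch that assumes the linear form of 𝑣̂ given earlier, so the gradient equals the feature vector of the current state; the function name `td0_increment` is hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def td0_increment(w, x_s, x_next, alpha=0.1, gamma=0.95, reward=0.0):
    """Semi-gradient TD(0) increment for one transition.

    x_s and x_next are the feature vectors of S and S'. For a linear
    v_hat, the gradient with respect to w is x_s itself.
    """
    td_error = reward + gamma * dot(x_next, w) - dot(x_s, w)
    return [alpha * td_error * x for x in x_s]


w = [1, 1, 1, 1, 1, 1, 5]        # initial weights w_0..w_6
x_s1 = [1, 2, 0, 0, 0, 0, 0]     # features of s1: w_0 + 2*w_1
x_s6 = [2, 0, 0, 0, 0, 0, 1]     # features of s6: 2*w_0 + w_6

increment = td0_increment(w, x_s1, x_s6)
# Rounded for display, since floating-point arithmetic is inexact.
print([round(v, 3) for v in increment])
```

Up to floating-point rounding, the first two entries come out as 0.365 and 0.73, matching the hand calculation. Note that 𝐰 itself is left unchanged here; only the increment is returned.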

Repeat this for the remaining five training examples, sum all six increments, and then add the sum to the weight vector 𝐰 in a single update.
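The whole batch step can be sketched as follows. It assumes that the batch increment is the plain sum of the six per-example increments (not an average), as described above, and that every increment is evaluated at the initial weights. The names `FEATURES`, `TRANSITIONS`, and `batch_td0_update` are illustrative.

```python
ALPHA, GAMMA, REWARD = 0.1, 0.95, 0.0

# Feature vectors: position i holds the coefficient of w_i in v_hat.
FEATURES = {
    "s1": [1, 2, 0, 0, 0, 0, 0],
    "s2": [1, 0, 2, 0, 0, 0, 0],
    "s3": [1, 0, 0, 2, 0, 0, 0],
    "s4": [1, 0, 0, 0, 2, 0, 0],
    "s5": [1, 0, 0, 0, 0, 2, 0],
    "s6": [2, 0, 0, 0, 0, 0, 1],
}

# One (S, S') pair per arrow in the diagram.
TRANSITIONS = [("s1", "s6"), ("s2", "s6"), ("s3", "s6"),
               ("s4", "s6"), ("s5", "s6"), ("s6", "s6")]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def batch_td0_update(w):
    """Return new weights after one batch semi-gradient TD(0) step."""
    total = [0.0] * len(w)
    for s, s_next in TRANSITIONS:
        x_s, x_next = FEATURES[s], FEATURES[s_next]
        # TD error uses the old weights for every example in the batch.
        td_error = REWARD + GAMMA * dot(x_next, w) - dot(x_s, w)
        for i, x in enumerate(x_s):
            total[i] += ALPHA * td_error * x
    # Apply the summed increment once, after all examples are processed.
    return [w_i + d for w_i, d in zip(w, total)]


new_w = batch_td0_update([1, 1, 1, 1, 1, 1, 5])
print([round(v, 3) for v in new_w])
```

Working by hand, the five transitions 𝑠ᵢ → 𝑠₆ each have TD error 3.65, while 𝑠₆ → 𝑠₆ has TD error 0.95 * 7 - 7 = -0.35. Under the summing assumption this gives updated weights of roughly ⟨2.755, 1.73, 1.73, 1.73, 1.73, 1.73, 4.965⟩.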