According to the question, the weights 𝐰 are initialized as:

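According to the question, we update the weights 𝐰 using semi-gradient TD(0):

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma\, \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \right] \nabla \hat{v}(S, \mathbf{w})$$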

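According to the question, 𝛼 = 0.1, 𝛾 = 0.95, and 𝑅 = 0, thus:

$$\mathbf{w} \leftarrow \mathbf{w} + 0.1 \left[ 0.95\, \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \right] \nabla \hat{v}(S, \mathbf{w})$$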

According to the question, the diagram contains 6 states/nodes, thus 𝑣̂(𝑆,𝐰) is defined as:

According to the question, each transition (arrow) counts as one training example in the batch. Thus there are 6 training examples in total:

  1. 𝑠1 → 𝑠6
  2. 𝑠2 → 𝑠6
  3. 𝑠3 → 𝑠6
  4. 𝑠4 → 𝑠6
  5. 𝑠5 → 𝑠6
  6. 𝑠6 → 𝑠6

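So given training example #1 (𝑠1 → 𝑠6), the update becomes:

$$\Delta\mathbf{w}_1 = 0.1 \left[ 0.95\, \hat{v}(s_6, \mathbf{w}) - \hat{v}(s_1, \mathbf{w}) \right] \nabla \hat{v}(s_1, \mathbf{w})$$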

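Since this is a batch update, 𝐰 will NOT be updated until all training examples have been processed:

$$\mathbf{w} \leftarrow \mathbf{w} + \sum_{i=1}^{6} \Delta\mathbf{w}_i$$

where each Δ𝐰ᵢ is the update computed from training example i using the same, not yet updated, 𝐰.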

Repeat this for the remaining training examples, sum the resulting updates, and apply the sum to the weight vector 𝐰.
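The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature vectors (`FEATURES`) and the initial weights (`W_INIT`) are hypothetical placeholders, not the ones from the question, and the value function is assumed to be linear, so that its gradient with respect to 𝐰 is just the feature vector of the state. Only 𝛼, 𝛾, 𝑅 and the six transitions come from the question.

```python
# Batch semi-gradient TD(0) with a linear value function v_hat(s, w) = w . x(s).
# ASSUMPTION: FEATURES and W_INIT are hypothetical placeholders, not the
# values from the question; substitute the question's own definitions.

ALPHA = 0.1    # step size, from the question
GAMMA = 0.95   # discount factor, from the question
REWARD = 0.0   # reward on every transition, from the question

# Hypothetical one-hot features: state s_i activates weight i only.
FEATURES = {s: [1.0 if j == s else 0.0 for j in range(1, 7)] for s in range(1, 7)}
W_INIT = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical initial weights

# The six training examples: every state transitions to s6.
TRANSITIONS = [(s, 6) for s in range(1, 7)]


def v_hat(state, w):
    """Linear value estimate: dot product of weights and state features."""
    return sum(wi * xi for wi, xi in zip(w, FEATURES[state]))


def batch_td0_update(w, transitions):
    """Sum the per-example updates with w held fixed, then apply them once."""
    total = [0.0] * len(w)
    for s, s_next in transitions:
        td_error = REWARD + GAMMA * v_hat(s_next, w) - v_hat(s, w)
        grad = FEATURES[s]  # gradient of a linear v_hat w.r.t. w
        for j, g in enumerate(grad):
            total[j] += ALPHA * td_error * g
    return [wi + di for wi, di in zip(w, total)]


new_w = batch_td0_update(W_INIT, TRANSITIONS)
print(new_w)
```

Note that `w` is never modified inside the loop; every TD error is computed from the old weights, which is what distinguishes the batch update from the online one.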