According to the question, the weights w are initialized as:
According to the question, we update the weights w using semi-gradient TD(0):
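For reference, the standard semi-gradient TD(0) update for a single transition is usually written as below. The symbols S' (next state) and r (reward) are my notation here and may differ from the question's.

```latex
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ r + \gamma\, \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \right] \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})
```

The bracketed term is the TD error. The gradient is taken only through the estimate for S, not through the bootstrapped target, which is what makes the method "semi-gradient".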
According to the question, α = 0.1, γ = 0.95, r = 0, thus:
According to the question, the diagram contains 6 states (nodes), so v̂(S, w) is defined as:
According to the question, each transition (arrow) counts as 1 training example in the batch. Thus there are a total of 6 training examples:
- s1 → s6
- s2 → s6
- s3 → s6
- s4 → s6
- s5 → s6
- s6 → s6
So, given training example #1 (s1 → s6), the update becomes:
Since this is a batch update, w will NOT be updated until all training examples have been processed:
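One way to write this batch step in symbols is sketched below. Here δ_i is my shorthand for the TD error of training example i, and s_i, s'_i are its start and next states. Every δ_i is evaluated with the same pre-update w.

```latex
\delta_i = r + \gamma\, \hat{v}(s'_i, \mathbf{w}) - \hat{v}(s_i, \mathbf{w}), \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_{i=1}^{6} \delta_i \, \nabla_{\mathbf{w}} \hat{v}(s_i, \mathbf{w})
```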
Repeat this for the remaining training examples, sum the resulting updates, and then apply the sum to the weight vector w.
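The whole procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature table `FEATURES` and the starting weights `INITIAL_W` are hypothetical placeholders, not the value function or initialization given in the question, and a linear approximator is assumed so that the gradient is just the feature vector. Only α, γ, r and the six transitions come from the question.

```python
ALPHA, GAMMA, REWARD = 0.1, 0.95, 0.0  # values given in the question

# One (state, next_state) pair per arrow; indices 0..5 stand for s1..s6.
TRANSITIONS = [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5), (5, 5)]

# HYPOTHETICAL placeholders, NOT the question's definitions:
# s1..s5 share one weight, s6 has its own weight.
FEATURES = [[1.0, 0.0]] * 5 + [[0.0, 1.0]]
INITIAL_W = [1.0, 2.0]


def v_hat(state, w):
    """Linear value estimate: dot product of the state's features with w."""
    return sum(x * wi for x, wi in zip(FEATURES[state], w))


def batch_td0_step(w):
    """Return the weights after one batch semi-gradient TD(0) step."""
    total = [0.0] * len(w)
    for state, next_state in TRANSITIONS:
        # Every TD error uses the same, not-yet-updated weights w.
        td_error = REWARD + GAMMA * v_hat(next_state, w) - v_hat(state, w)
        grad = FEATURES[state]  # gradient of a linear v_hat with respect to w
        for k, g in enumerate(grad):
            total[k] += td_error * g
    # Apply the summed update once, after all examples are processed.
    return [wi + ALPHA * t for wi, t in zip(w, total)]


new_w = batch_td0_step(INITIAL_W)
print(new_w)
```

To reproduce the question's numbers, replace `FEATURES` and `INITIAL_W` with the definitions from the question. The loop itself stays the same.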