Initial State
V[s, v] - table of all possible states s, each paired with an estimated probability v of winning from that state
- states with 3 Xs in a row are assigned v = 1.0
- states with 3 Os in a row are assigned v = 0.0
- all other states are assigned v = 0.5
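A minimal sketch of how such a table could be initialised is shown here. It assumes the learner plays X, that a board is a 9-character string of "X", "O" and " " read row by row, and that every one of the 3^9 cell combinations is stored, including boards that can never occur in a real game. The names LINES, has_line, initial_value and build_value_table are made up for illustration.

```python
from itertools import product

# The 8 winning lines on a 3x3 board, as cell indices 0..8 read row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def has_line(board, mark):
    """Return True if `mark` fills at least one winning line."""
    return any(all(board[i] == mark for i in line) for line in LINES)

def initial_value(board):
    """Initial estimate of the probability that X wins from `board`."""
    if has_line(board, "X"):
        return 1.0   # X has already won
    if has_line(board, "O"):
        return 0.0   # O has already won
    return 0.5       # everything else starts as a coin flip

def build_value_table():
    """Map every 9-cell board string to its initial value.

    Assumption: unreachable boards are kept for simplicity; X is
    checked first if a board somehow contains both kinds of line.
    """
    return {"".join(cells): initial_value(cells)
            for cells in product("XO ", repeat=9)}

V = build_value_table()
```

A dictionary keyed by the board string keeps lookups simple; a real implementation might store only reachable states or exploit board symmetries.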
Training
each move is selected in one of 2 ways:
- greedy move - select the move that leads to the state with the greatest value v
- random exploratory move - select at random among the other possible moves
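The two kinds of move can be sketched as follows. This is an illustration under assumptions, not a prescribed method: the notes do not say how often to explore, so a hypothetical probability epsilon is introduced, the learner is assumed to play X, boards are 9-character strings, and states missing from the table default to 0.5. The names legal_next_states and choose_move are invented for this example.

```python
import random

def legal_next_states(board, mark="X"):
    """All boards reachable by placing `mark` on one empty cell."""
    return [board[:i] + mark + board[i + 1:]
            for i, cell in enumerate(board) if cell == " "]

def choose_move(V, board, epsilon=0.1, rng=random):
    """Pick the next state for X; return (next_state, was_greedy).

    Assumptions: `board` has at least one empty cell, and `epsilon`
    (the exploration probability) is a hypothetical parameter.
    """
    candidates = legal_next_states(board)
    # Greedy move: the after-state with the greatest estimated value.
    greedy = max(candidates, key=lambda s: V.get(s, 0.5))
    others = [s for s in candidates if s != greedy]
    # Exploratory move: with probability epsilon, pick one of the others.
    if others and rng.random() < epsilon:
        return rng.choice(others), False
    return greedy, True
```

Returning the was_greedy flag lets the caller apply the learning update only after greedy moves.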
Learning
After each greedy move, we back up the value of the state before the move toward the value of the state after it, using an update rule known as temporal-difference learning:
- V(s) ← V(s) + α [V(s') - V(s)]
where:
- α - step-size parameter, which controls the rate of learning
- V(s) - estimated value of state s
- s - state before greedy move
- s' - state after greedy move
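The update rule can be written as a short function. This is only a sketch: it assumes the value table is a dictionary keyed by state, that unseen states default to 0.5, and the name td_update and its parameters are hypothetical.

```python
def td_update(V, s, s_next, alpha=0.1, default=0.5):
    """Move V[s] a fraction `alpha` of the way toward V[s_next].

    s      - state before the greedy move
    s_next - state after the greedy move
    Assumption: states missing from V are treated as `default`.
    """
    v_s = V.get(s, default)
    v_next = V.get(s_next, default)
    V[s] = v_s + alpha * (v_next - v_s)
    return V[s]
```

For example, with V(s) = 0.5, V(s') = 1.0 and a step size of 0.1, the new estimate is 0.5 + 0.1 × (1.0 - 0.5) = 0.55: the earlier state becomes slightly more valuable because it led to a winning state.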
If the step-size parameter α is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing.
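One way to picture the difference is a step-size schedule with a lower bound. The schedule below is purely an assumption for illustration (the notes do not specify one): the step size decays with the number of games played but never drops below a hypothetical floor, so new experience always keeps some influence on the estimates.

```python
def step_size(game_index, alpha0=0.5, decay=0.01, floor=0.05):
    """Hypothetical schedule: decay from alpha0 but never below `floor`.

    With floor=0.0 the step size shrinks toward zero and the value
    estimates eventually stop changing; with floor > 0 the player
    keeps adapting to recent games.
    """
    return max(floor, alpha0 / (1.0 + decay * game_index))
```

Setting floor to 0.0 gives a schedule that keeps shrinking toward zero, while a positive floor keeps the player responsive to an opponent whose style drifts.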