Initial State

𝑇[𝑠,𝑣] - table of all possible states 𝑠 paired with a probability 𝑣 of winning from that state

states with 3 Xs in a row are assigned 𝑣=1.0
states with 3 Os in a row are assigned 𝑣=0.0
all other states are assigned 𝑣=0.5

Training

the policy is selected from 2 of the following:

greedy move - select move that leads to the state with greatest value 𝑣
random exploratory move - select move among other possible states

Learning

each greedy move we backup the value with update rule called temporal-difference learning method

𝑉(𝑠) ← 𝑉(𝑠) + 𝛼 [𝑉(𝑠’) - 𝑉(𝑠)]

where:

𝛼 step-size parameter - which influences the rate of learning.
𝑉(𝑠) - estimated value of state 𝑠
𝑠 - state before greedy move
𝑠’ - state after greedy move

If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing

／var／log marcus chiu

Explorer

RL - Example (Tic-Tac-Toe)

Initial State

Training

Learning

／var／logmarcus chiu

Explorer

RL - Example (Tic-Tac-Toe)

Initial State

Training

Learning

／var／log marcus chiu