Value-Function Approximation

individual update notation:

𝑠 ↦ 𝑢

where:

𝑠 - state updated
𝑢 - update target

for example:

MC	$S_{t} \mapsto G_{t}$
TD(0)	$S_{t} \mapsto R_{t + 1} + γ \hat{v} (S_{t + 1}, w_{t})$
n-step TD	$S_{t} \mapsto G_{t : t + n}$
DP policy evaluation	$s \mapsto E_{π} [R_{t + 1} + γ \hat{v} (S_{t + 1}, w_{t}) ∣ S_{t} = s]$
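
As a rough illustration of how the MC and TD(0) targets differ, the sketch below computes both from made-up numbers. The function names `mc_return` and `td0_target`, the reward list, and the value estimate for the next state are all assumptions for the example, not part of the notes above.

```python
def mc_return(rewards, gamma):
    """Discounted return G_t, given the rewards R_{t+1}, ..., R_T that follow S_t."""
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


def td0_target(reward, next_value, gamma):
    """One-step bootstrapped target R_{t+1} + gamma * v_hat(S_{t+1}, w)."""
    return reward + gamma * next_value


# Hypothetical data: three rewards until termination, and a current
# estimate of 4.0 for the value of the next state.
g = mc_return([1.0, 0.0, 2.0], gamma=0.5)         # 1 + 0.5*0 + 0.25*2
u = td0_target(1.0, next_value=4.0, gamma=0.5)    # 1 + 0.5*4
```

The MC target needs the whole remainder of the episode, while the TD(0) target needs only one reward and the current estimate of the next state's value.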

Prediction Objective (Mean Squared Value Error (VE))

denoted as:

$\overline{VE} (w) = \sum_{s \in S} μ (s) [v_{π} (s) - \hat{v} (s, w)]^{2}$

where:

$μ (s) \geq 0, \sum_{s} μ (s) = 1$ - state distribution, weighting how much the error in each state matters
$v_{π} (s) - \hat{v} (s, w)$ - error in state $s$
$\hat{v} (s, w)$ - approximate value
$v_{π} (s)$ - true value
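
A minimal sketch of the objective for a tiny state space, assuming the true values are known (which is only realistic in toy problems). The function name and the three-state numbers are invented for illustration.

```python
def mean_squared_value_error(mu, v_true, v_approx):
    """VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2 over a finite state set."""
    return sum(m * (vt - va) ** 2 for m, vt, va in zip(mu, v_true, v_approx))


# Hypothetical three-state problem.
mu = [0.5, 0.25, 0.25]        # state weighting, sums to 1
v_true = [1.0, 2.0, 3.0]      # v_pi(s)
v_approx = [1.0, 1.0, 5.0]    # v_hat(s, w)

ve = mean_squared_value_error(mu, v_true, v_approx)
# 0.5 * 0^2 + 0.25 * 1^2 + 0.25 * (-2)^2 = 1.25
```

Errors in states with larger $μ (s)$ count for more, so an approximator with limited capacity is pushed to be accurate where the weighting is high.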

Stochastic Gradient Descent (SGD)

Stochastic gradient-descent (SGD) methods minimize the VE on the observed examples by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

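$w_{t + 1} = w_{t} - \frac{1}{2} α \nabla [v_{π} (S_{t}) - \hat{v} (S_{t}, w_{t})]^{2} = w_{t} + α [v_{π} (S_{t}) - \hat{v} (S_{t}, w_{t})] \nabla \hat{v} (S_{t}, w_{t})$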

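Gradient of $f$ with respect to $w$:

$\nabla f (w) = \left( \frac{\partial f (w)}{\partial w_{1}}, \frac{\partial f (w)}{\partial w_{2}}, \ldots, \frac{\partial f (w)}{\partial w_{d}} \right)^{\top}$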

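The true value $v_{π} (S_{t})$ is usually not available; what we observe is a target $U_{t}$, a possibly noise-corrupted approximation of it, so we substitute $U_{t}$: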

$w_{t + 1} = w_{t} + α [U_{t} - \hat{v} (S_{t}, w_{t})] \nabla \hat{v} (S_{t}, w_{t})$
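
The general update can be sketched for any differentiable approximator. The example below assumes a small nonlinear form, tanh(w0 + w1 * s), chosen only so that the gradient differs from the input; the function names, the state, the target, and the step size are all hypothetical.

```python
import math


def v_hat(s, w):
    """Hypothetical nonlinear approximator: tanh(w0 + w1 * s)."""
    return math.tanh(w[0] + w[1] * s)


def grad_v_hat(s, w):
    """Gradient of v_hat with respect to w: (1 - tanh^2) * (1, s)."""
    slope = 1.0 - v_hat(s, w) ** 2
    return [slope, slope * s]


def sgd_step(w, s, target, alpha):
    """w <- w + alpha * [U_t - v_hat(S_t, w)] * grad v_hat(S_t, w)."""
    error = target - v_hat(s, w)
    return [w_i + alpha * error * g_i for w_i, g_i in zip(w, grad_v_hat(s, w))]


w = [0.0, 0.0]
w_new = sgd_step(w, s=1.0, target=0.5, alpha=0.5)

error_before = abs(0.5 - v_hat(1.0, w))
error_after = abs(0.5 - v_hat(1.0, w_new))
```

A single step moves the prediction for the updated state toward the target; it does not jump all the way there, which is what lets the method average over noisy targets.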

Gradient MC Algorithm for estimating $\hat{v} \approx v_{π}$:

Input: the policy 𝜋 to be evaluated
Input: a differentiable function v̂ : S × ℝ^d → ℝ
Algorithm parameter: step size 𝛼 > 0
Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)

Loop forever (for each episode):
	Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using 𝜋
	Loop for each step of episode, t = 0, 1, ..., T - 1:
		w = w + 𝛼 [G_t - v̂(S_t, w)] 𝛻v̂(S_t, w)
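
The pseudocode above can be sketched in Python on a toy problem. Everything about the environment here is an assumption made for the example: a five-state random walk starting in the middle, an equiprobable left/right policy, reward 1 only when stepping off the right end, and one-hot features so that the linear gradient is just the feature vector. The step size and episode count are arbitrary.

```python
import random

N_STATES = 5  # non-terminal states 0..4; stepping off either end terminates


def features(s):
    """One-hot feature vector x(s); another feature map could be swapped in."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def v_hat(s, w):
    """Linear approximation v_hat(s, w) = w . x(s)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))


def generate_episode(rng):
    """Follow the equiprobable policy; return [(S_t, R_{t+1}), ...]."""
    s, episode = N_STATES // 2, []
    while 0 <= s < N_STATES:
        s_next = s + rng.choice((-1, 1))
        reward = 1.0 if s_next == N_STATES else 0.0
        episode.append((s, reward))
        s = s_next
    return episode


def gradient_mc(n_episodes=5000, alpha=0.002, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0] * N_STATES
    for _ in range(n_episodes):
        episode = generate_episode(rng)
        # Compute G_t for every step by sweeping backwards through the rewards.
        ret, returns = 0.0, []
        for _state, reward in reversed(episode):
            ret = reward + gamma * ret
            returns.append(ret)
        returns.reverse()
        # Update in time order t = 0, ..., T - 1.
        for (s, _reward), g_t in zip(episode, returns):
            error = g_t - v_hat(s, w)
            # For a linear v_hat the gradient is the feature vector x(s).
            w = [w_i + alpha * error * x_i for w_i, x_i in zip(w, features(s))]
    return w
```

For this walk the true values are (s + 1) / 6 for s = 0..4, so `gradient_mc()` should return weights that fluctuate around those numbers; with a constant step size they keep fluctuating rather than settling exactly.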

Semi-Gradient TD(0) for estimating $\hat{v} \approx v_{π}$:

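Use the bootstrapped one-step target below. The target itself depends on $w$, but the update ignores that dependence when taking the gradient, which is why the method is only a semi-gradient method: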

$U_{t} = R_{t + 1} + γ \hat{v} (S_{t + 1}, w)$

Input: the policy 𝜋 to be evaluated
Input: a differentiable function v̂ : S × ℝ^d → ℝ such that v̂(terminal, ·) = 0
Algorithm parameter: step size 𝛼 > 0
Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)

Loop for each episode:
	Initialize S
	Loop until S is terminal:
		Choose A ~ 𝜋(·|S)
		Take action A, observe R, S'
		w = w + 𝛼 [R + 𝛾v̂(S', w) - v̂(S, w)] 𝛻v̂(S, w)
		S = S'
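
A Python sketch of the loop above, under assumptions made only for the example: a five-state random walk that starts in the middle, an equiprobable left/right policy folded into the hypothetical `step` function, reward 1 only on the right exit, `None` standing for the terminal state, and one-hot linear features.

```python
import random

N_STATES = 5  # non-terminal states 0..4; None marks the terminal state


def features(s):
    """One-hot feature vector x(s)."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def v_hat(s, w):
    """Linear value estimate, with v_hat(terminal, .) = 0 by construction."""
    if s is None:
        return 0.0
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))


def step(s, rng):
    """Sample A from the equiprobable policy and return (R, S')."""
    s_next = s + rng.choice((-1, 1))
    if s_next < 0:
        return 0.0, None
    if s_next >= N_STATES:
        return 1.0, None
    return 0.0, s_next


def semi_gradient_td0(n_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0] * N_STATES
    for _ in range(n_episodes):
        s = N_STATES // 2
        while s is not None:
            reward, s_next = step(s, rng)
            # The target uses the current w, but only v_hat(S, w) is differentiated.
            td_error = reward + gamma * v_hat(s_next, w) - v_hat(s, w)
            w = [w_i + alpha * td_error * x_i for w_i, x_i in zip(w, features(s))]
            s = s_next
    return w
```

Unlike the Monte Carlo version, the weights are updated after every transition, so nothing has to be stored until the episode ends.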

Linear Methods

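Given a real-valued feature vector $x (s) = (x_{1} (s), \ldots, x_{d} (s))^{\top}$ with one component per weight, the approximate value is linear in $w$:

$\hat{v} (s, w) = w^{\top} x (s) = \sum_{i = 1}^{d} w_{i} x_{i} (s)$

In the linear case $\nabla \hat{v} (s, w) = x (s)$, so the SGD update simplifies to:

$w_{t + 1} = w_{t} + α [U_{t} - \hat{v} (S_{t}, w_{t})] x (S_{t})$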
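
A small sketch of the linear update, with made-up feature vectors and targets. It also shows one special case worth noticing: with one-hot features only the weight of the visited state changes, so the update reduces to the familiar tabular one.

```python
def linear_value(w, x):
    """v_hat(s, w) = w . x(s), with x already evaluated at s."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def linear_sgd_step(w, x, target, alpha):
    """w <- w + alpha * [U_t - w . x(S_t)] * x(S_t)."""
    error = target - linear_value(w, x)
    return [w_i + alpha * error * x_i for w_i, x_i in zip(w, x)]


# Dense (hypothetical) features: every weight moves in proportion to its feature.
w_dense = linear_sgd_step([0.0, 0.0], x=[1.0, 0.5], target=1.0, alpha=0.1)

# One-hot features for state 1 of 3: only that state's weight moves.
w_tabular = linear_sgd_step([0.0, 0.0, 0.0], x=[0.0, 1.0, 0.0], target=2.0, alpha=0.5)
```

Here `w_dense` comes out as roughly `[0.1, 0.05]` and `w_tabular` as `[0.0, 1.0, 0.0]`.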

Feature Construction for Linear Methods

polynomials
Fourier basis
coarse coding - it sits conceptually between:
- tabular representation (each state has its own feature)
- global approximations like polynomial or Fourier bases
tile coding - a form of coarse coding
radial basis functions
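
To make three of these constructions concrete, here is a sketch for a one-dimensional state, assumed to be scaled to [0, 1]. The function names, the orders, and the RBF centers and width are illustrative choices, not values from the notes.

```python
import math


def polynomial_features(s, order):
    """x_i(s) = s^i for i = 0..order."""
    return [s ** i for i in range(order + 1)]


def fourier_features(s, order):
    """Cosine Fourier basis on [0, 1]: x_i(s) = cos(i * pi * s) for i = 0..order."""
    return [math.cos(i * math.pi * s) for i in range(order + 1)]


def rbf_features(s, centers, width):
    """Gaussian radial basis functions: x_i(s) = exp(-(s - c_i)^2 / (2 * width^2))."""
    return [math.exp(-((s - c) ** 2) / (2.0 * width ** 2)) for c in centers]


poly = polynomial_features(0.5, order=2)                   # [1.0, 0.5, 0.25]
four = fourier_features(1.0, order=2)                      # about [1, -1, 1]
rbf = rbf_features(0.5, centers=[0.0, 0.5, 1.0], width=0.25)
```

Polynomial and Fourier features respond to the whole state range, whereas each RBF feature is close to zero away from its center, which makes it a graded, local alternative to binary features.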

coarse coding vs tile coding

coarse coding - arbitrary overlapping shapes, varying sizes
tile coding - grids of tiles, arranged in “tilings,” each covering space evenly
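
The contrast can be sketched in one dimension. In the hypothetical code below, coarse coding uses intervals with arbitrary centers and radii, while tile coding uses several uniformly offset grids; the offsets, tile counts, and index layout are assumptions made for the example.

```python
def coarse_code(s, centers, radii):
    """Binary features: 1 if s falls inside the interval around each center."""
    return [1 if abs(s - c) <= r else 0 for c, r in zip(centers, radii)]


def tile_indices(s, n_tilings=4, tiles_per_tiling=4, low=0.0, high=1.0):
    """Indices of the active tiles, exactly one per tiling.

    Each tiling is a uniform grid shifted by a fraction of the tile width,
    with one extra tile so that the shifted grid still covers [low, high].
    """
    tile_width = (high - low) / tiles_per_tiling
    active = []
    for k in range(n_tilings):
        offset = k * tile_width / n_tilings
        idx = int((s - low + offset) // tile_width)
        active.append(k * (tiles_per_tiling + 1) + idx)
    return active


receptive = coarse_code(0.5, centers=[0.2, 0.5, 0.6], radii=[0.1, 0.05, 0.3])
near_a = tile_indices(0.10)
near_b = tile_indices(0.12)
far = tile_indices(0.90)
```

With tile coding the number of active features always equals the number of tilings, which makes the step size easy to reason about; nearby states such as 0.10 and 0.12 share active tiles and therefore generalize to each other, while 0.10 and 0.90 share none.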

Selecting Step-Size Parameters Manually

／var／log marcus chiu

RL Chapter 9 (On-Policy Prediction with Approximation)
