Value-Function Approximation

individual update notation:

  • 𝑠 ↦ 𝑒

where:

  • 𝑠 - state updated
  • 𝑒 - update target

for example:

MC

TD(0)

n-step TD

DP policy evaluation

Prediction Objective (Mean Squared Value Error (VE))

denoted as:

where:

Stochastic Gradient Descent (SGD)

Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

Indent

Gradient of 𝑓 wrt 𝑀:

Indent

π‘£πœ‹(𝑆𝑑) is usually noise corrupted, so π‘ˆπ‘‘:

Gradient MC Algorithm for estimating 𝑣ˆ ~ π‘£πœ‹:
Input: the policy πœ‹ to be evaluated
Input: a differentiable function vΛ† : S x ℝ^d -> ℝ
Algorithm parameter: step size 𝛼 > 0
Initialize value-function weights wβˆŠβ„^d arbitrarily (e.g., w = 0)

Loop forever (for each episode):
	Generate an episode S0, A0, R1, S1, A1, ..., RT, ST using πœ‹
	Loop for each step of episode, t = 0, 1, ..., T1:
		w = w + 𝛼 [G_t - vΛ†(S_t,w)] 𝛻vΛ†(S_t,w)
Semi-Gradient TD(0) for estimating 𝑣ˆ ~ π‘£πœ‹:

use the following as target:

Input: the policy πœ‹ to be evaluated
Input: a differentiable function vΛ† : S x ℝ^d -> ℝ such that vΛ†(terminal,Β·) = 0
Algorithm parameter: step size 𝛼 > 0
Initialize value-function weights wβˆŠβ„^d arbitrarily (e.g., w = 0)

Loop for each episode:
	Initialize S
	Loop until S is terminal:
		Choose A ~ πœ‹(Β·|S)
		Take action A, observe R, S'
		w = w + 𝛼 [R + 𝛾vΛ†(S',w) - vΛ†(S,w)] 𝛻vΛ†(S,w)
		S = S'

Linear Methods

given an x(s) real-valued vector

in the linear case SGD update simplifies to the form:

Feature Construction for Linear Methods

  • polynomials
  • fourier basis
  • coarse coding - it sits conceptually between:
    • tabular representation (each state has its own feature)
    • global approximations like polynomial or Fourier bases
  • tile coding - a form of course coding
  • radial basis functions

coarse coding vs tile coding

  • coarse coding - arbitrary overlapping shapes, varying sizes
  • tile coding - grids of tiles, arranged in β€œtilings,” each covering space evenly

Selecting Step-Size Parameters Manually