Monte Carlo Methods

learn from sampled sequences of (states, actions, rewards)

Monte Carlo Prediction

first-visit MC estimates v_\pi(s) as the average of the returns following first visits to s
every-visit MC estimates v_\pi(s) as the average of the returns following all visits to s

First-visit MC prediction - for estimating V ~ v_pi:

Input: a policy ⇡ to be evaluated
Initialize:
- V(s) 2R, arbitrarily, for all s 2S
- Returns(s) an empty list, for all s 2S

Loop forever (for each episode):
    Generate an episode following ⇡: S0, A0, R1, S1, A1, R2, ..., ST1, AT1, RT
    G = 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G = G + R_{t+1}
        Unless S_t appears in S0, S1, . . . , St-1:
            Append G to Returns(St)
            V(S_t) = average(Returns(S_t))

Monte Carlo Estimation of Action Values

w/o model state values alone are not sufficient for estimation, we need action-values.

Monte Carlo Control

to approximate optimal policies

for each s deterministically choose an action with maximal action-value:

$π (s) = a r g ma x_{a} q (s, a)$

policy improvement:

$q_{π_{k}} (s, π_{k + 1} (s)) = q_{π_{k}} (s, a r g ma x_{a} q_{π_{k}} (s, a))$
$q_{π_{k}} (s, π_{k + 1} (s)) = ma x_{a} q_{π_{k}} (s, a)$
$q_{π_{k}} (s, π_{k + 1} (s)) >= q_{π_{k}} (s, π_{k} (a))$
$q_{π_{k}} (s, π_{k + 1} (s)) >= v_{π_{k}} (s)$

MC (Exploring Starts) for estimating \pi ~ \pi*

Initialize:
    \pi(s) ∊ A(s) (arbitrarily), for all s∊S
    Q(s, a) ∊ ℝ (arbitrarily), for all s∊S, a∊A(s)
    Returns(s, a) empty list, for all s∊S, a∊A(s)

Loop forever (for each episode):
    Choose S_0 ∊ S, A_0 ∊ A(S_0) randomly such that all pairs have probability > 0
    Generate an episode from S_0, A_0, following \pi: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_{T-1}
    G = 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G = 𝛾G + R_{t+1}
        Unless the pair (S_t, A_t) appears in S_0, A_0, S_1, A_1 ..., S_{t-1}, A_{t-1}:
            Append G to Returns(S_t, A_t)
            Q(S_t, A_t) = average(Returns(S_t,A_t))
            \pi(S_t) = argmax_a Q(S_t,a)

Monte Carlo w/o Exploring Starts

on-policy methods - evaluate/improve the policy used to generate data
off-policy methods - evaluate/improve the policy different from another policy used to generate data
- behavior policy - generate behavior
- target policy - being evaluated or improved

Algorithm parameter: small " > 0
Initialize:
    𝜋 an arbitrary "-soft policy
    Q(s, a) ∊ ℝ (arbitrarily), for all s∊S, a∊A(s)
    Returns(s, a) empty list, for all s∊S, a∊A(s)

Repeat forever (for each episode):
    Generate an episode following 𝜋: S0, A0, R1, ..., ST1, AT1, RT
    G = 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
    G = 𝛾G + R_{t+1}
    Unless the pair St, At appears in S0, A0, S1, A1 . . . , St-1, At-1:
        Append G to Returns(St, At)
		Q(St, At) = average(Returns(St, At))
		A* = argmax_a Q(St, a)
		For all a ∊ A(St):
			if a = A^*
                𝜋(a|St) = 1 - 𝜀 + 𝜀/|A(St)|
            if a != A^*
		        𝜋(a|St) = 𝜀/|A(St)|

Off-Policy Prediction via Importance Sampling

Given starting state S_t, the probability of the subsequent state-action trajectory (i.e. A_t, S_{t+1}, A_{t+1}, …, S_T) occurring under any policy 𝜋 is:

$P (A_{t}, S_{t + 1}, A_{t + 1}, ..., S_{T} ∣ S_{t}, A_{t : T - 1} 𝜋) = 𝜋 (A_{t} ∣ S_{t}) p (S_{t + 1} ∣ S_{t}, A_{t}) 𝜋 (A_{t + 1} ∣ S_{t + 1}) ... p (S_{T} ∣ S_{T - 1}, A_{T - 1})$
$P (A_{t}, S_{t + 1}, A_{t + 1}, ..., S_{T} ∣ S_{t}, A_{t : T - 1} 𝜋) = \prod_{k = t}^{T - 1} 𝜋 (A_{k} ∣ S_{k}) p (S_{k + 1} ∣ S_{k}, A_{k})$

Thus, the relative probability of the trajectory under the target and behavior policies (importance sampling ratio) is:

$ρ_{t : T - 1} = \frac{\prod _{k = t}^{T - 1} π ( A _{k} ∣ S _{k} ) p ( S _{k + 1} ∣ S _{k} , A _{k} )}{\prod _{k = t}^{T - 1} b ( A _{k} ∣ S _{k} ) p ( S _{k + 1} ∣ S _{k} , A _{k} )}$
$ρ_{t : T - 1} = \frac{\prod _{k = t}^{T - 1} π ( A _{k} ∣ S _{k} )}{\prod _{k = t}^{T - 1} b ( A _{k} ∣ S _{k} )}$

The returns have wrong expectation

$E [G_{t} ∣ S_{t} = s] = v_{b} (s)$

and so cannot be averaged to obtain v_𝜋. Thus we add the ratio:

$E [ρ_{t : T - 1} G_{t} ∣ S_{t} = s] = v_{π} (s)$

Ordinary Importance Sampling

To estimate 𝑣_𝜋(𝑠) scale the returns by the ratios and average the results:

$V (s) = \frac{\sum _{t ∊ T (s)} ρ _{t : T (t) - 1} G _{t}}{∣ T ( s ) ∣}$

where:

$T (s) is the set of all time steps in which state s is visited$

Weighted Importance Sampling

$V (s) = \frac{\sum _{t ∊ T (s)} ρ _{t : T (t) - 1} G _{t}}{\sum _{t ∊ T (s)} ρ _{t : T (t) - 1}}$

Incremental Implementation

Ordinary Importance Sampling

Can reuse incremental methods of chapter 2.

Weighted Importance Sampling

Off-Policy MC prediction (policy evaluation) for estimating Q ~ q_𝜋:

Input: an arbitrary target policy ⇡
Initialize, for all s 2S, a 2A(s):
    Q(s, a) ∊ ℝ (arbitrarily)
    C(s, a) = 0

Loop forever (for each episode):
	b any policy with coverage of ⇡
	Generate an episode following b: S0, A0, R1, ..., ST1, AT1, RT
	G = 0
	W = 1
	Loop for each step of episode, t= T1, T2, ..., 0, while W != 0:
		G = 𝛾G + R_{t+1}
		C(St, At) = C(St, At) + W
		Q(St, At) = Q(St, At) + [W / C(St,At)] * [G - Q(St,At)]
        W = W * [𝜋(At|St) / b(At|St)]

Off-Policy Monte Carlo Control

Off-policy weighted IS for estimating 𝜋_* and 𝑞_*

Initialize, for all s∊S, a∊A(s):
	Q(s, a) ∊ ℝ (arbitrarily)
	C(s, a) = 0
	𝜋(s) = argmax_a Q(s, a) (with ties broken consistently)

Loop forever (for each episode):
	b = any soft policy
	Generate an episode using b: S0, A0, R1, ..., ST1, AT1, RT
	G = 0
	W = 1
	Loop for each step of episode, t= T1, T2, ..., 0:
		G = 𝛾G + R_{t+1}
		C(St, At) = C(St, At) + W
		Q(St, At) = Q(St, At) + [W / C(St,At)] * [G - Q(St,At)]
		𝜋(St) = argmax_a Q(St, a) (with ties broken consistently)
		If A_t != 𝜋(S_t) then exit inner Loop (proceed to next episode)
		W = W * [1 / b(At|St)]

／var／log marcus chiu

Explorer

RL Chapter 5 (Monte Carlo Methods)

Monte Carlo Methods

Monte Carlo Prediction

Monte Carlo Estimation of Action Values

Monte Carlo Control

Monte Carlo w/o Exploring Starts

Off-Policy Prediction via Importance Sampling

Ordinary Importance Sampling

Weighted Importance Sampling

Incremental Implementation

Ordinary Importance Sampling

Weighted Importance Sampling

Off-Policy Monte Carlo Control

／var／logmarcus chiu

Explorer

RL Chapter 5 (Monte Carlo Methods)

Monte Carlo Methods

Monte Carlo Prediction

Monte Carlo Estimation of Action Values

Monte Carlo Control

Monte Carlo w/o Exploring Starts

Off-Policy Prediction via Importance Sampling

Ordinary Importance Sampling

Weighted Importance Sampling

Incremental Implementation

Ordinary Importance Sampling

Weighted Importance Sampling

Off-Policy Monte Carlo Control

／var／log marcus chiu