unified view of:

model-based methods - rely on planning
- DP & heuristic search
model-free methods - rely on learning
- MC & TD

Models and Planning

all state-space planning methods share a common structure:

all involve computing value functions
all compute value functions by updates or backup operations applied to simulated experience

model --> simulated experience -backups-> values --> policy

planning methods uses simulated experience generated by a model
learning methods use real experience generated by environment

Random-sample one-step tabular Q-planning

loop forever:
- 
	1. select a state s in S, and action a in A(s) at random
	2. send s,a to a sample model and obtain reward r and next state s'
	3. apply one-step tabular Q-learning to s,a,r,s':
	       Q(s,a) = Q(s,a) + 𝛼 [r + 𝛾 * max_a' Q(s',a') - Q(s,a)]

Dyna: Integrated Planning, Acting, and Learning

Tabular Dyna-Q

Initialize Q(s,a) and Model(s,a) for all s∊S and a∊A(s)

loop forever:
	(a) S = current state
	(b) A = 𝜀-greedy(S,Q)
	(c) take action A, observe reward R and state S'
	(d) Q(S,A) = Q(S,A) + 𝛼 [R + 𝛾 max_a Q(S',a) - Q(S,A)]
	(e) Model(S,A) = R,S' (assuming deterministic environment)
	(f) loop repeat n times:
			S = random previously observed state
			A = random action previously taken in S
			R,S' = Model(S,A)
			Q(S,A) = Q(S,A) + 𝛼 [R +  𝛾max_a Q(S',a) - Q(S,A)]

where:

(d) - direct RL
(e) - model learning
(f) - planning

if (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning

In Dyna-Q, learning & planning is accomplished by exactly the same algorithm:

operating on real experience for learning
operating on simulated experience for planning

When the Model is Wrong

Dyna-Q does NOT work well when environment changes

Dyna-Q+

this agent keeps track for each state-action pair of how many time steps has passed since the pair was last tried in a real interaction.

if modeled reward for a transition is r, and the transition has not been tried in 𝜏 time steps, then planning updates are done as if the transition produced reward of (𝑟 + 𝜅√𝜏), for some small 𝜅.

Prioritized Sweeping

a queue of every state-action pair whose estimated value would change non-trivially if updated, prioritized by the size of the change. When top pair in queue is updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some threshold, then the pair is inserted in the queue with the new priority (if that pair already exists in queue, then one with higher priority remains).

Prioritized Sweeping for a Deterministic Environment

initialize:
- Q(s,a), Model(s,a) for all s,a
- PQueue to empty

loop forever:
	(a) S = current state
    (b) A = policy(S,Q)
	(c) take action A, observe R and S'
	(d) Model(S,A) = R,S'
	(e) P = |R + 𝛾 max_a Q(S',a) - Q(S,A)|
	(f) if P > 𝜃
			insert S,A into PQueue with priority P
	(g) loop n times, while PQueue is not empty
			S,A = first(PQueue)
			R,S' = Model(S,A)
			Q(S,A) = Q(S,A) + 𝛼 [R + 𝛾 max_a Q(S',a) - Q(S,A)]
			loop for all \overline{S} \overline{A} predicated to lead to S:
				\overline{R} = predicted reward for \overline{S}, \overline{A}, S
				P = |\overline{R} + 𝛾 max_a Q(S,a) - Q(\overline{S}, \overline{A})|
				if P > ??
					insert (\overline{S} \overline{A}) into PQueue with priority P

Expected Updates vs Sample Updates

Different kinds of value-function updates

focusing on 1-step updates there are 3 dimensions:

update state values or action values
estimate value of optimal-policy or arbitrary-policy
updates are:
- expected updates - considering all possible events
- sample updates - considering 1 event

Example

updates for approximating 𝑞_*:

expected update
- $Q (s, a) = \sum_{s^{'}, r} \overset{p}{^} (s^{'}, r ∣ s, a) [r + γ * ma x_{a} Q (s^{'}, a^{'})]$
sample update
- $Q (s, a) = Q (s, a) + α [R + γ * ma x_{a^{'}} Q (S^{'}, a^{'}) - Q (s, a)]$

Given a unit of computational effort, is it better devoted to a few expected updates or to b times as many sample updates?

Trajectory Sampling

two ways of distributing updates:

sweep through entire state space, updating each state once per sweep
sample state space according to some distribution and update states encountered along the way, i.e.:
- sample uniformly like Dyna-Q

Real-Time Dynamic Programming (RTDP)

RTDP is an on-policy trajectory-sampling version of the value-iteration algorithm of dynamic programming

optimal partial policy - a policy optimal for relevant states, but specify arbitrary actions for irrelevant states

for episodic tasks with exploring starts - RTDP is an asynchronous value-iteration algorithm that converges to optimal policies for discounted finite MDPs

RTDP converges with probability one to a policy that is optimal for all the relevant states provided:

1. the initial value of every goal state is zero
1. there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state
1. all rewards for transitions from non-goal states are strictly negative
1. all the initial values are equal to, or greater than, their optimal values (which can be satisfied by simply setting the initial values of all states to zero).

Planning at Decision Time

Decision-time planning is most useful in applications in which fast responses are not required. In chess playing programs, for example, one may be permitted seconds or minutes of computation for each move, and strong programs may plan dozens of moves ahead within this time.

Heuristic Search

are state-space planning methods
are decision-time planning methods

Rollout Algorithms

are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state
produce Monte Carlo estimates of action values only for each current state and for a given policy usually called the rollout policy

Monte Carlo Tree Search (MCTS)

a type of decision time planning method
is a rollout algorithm enhanced by accumulating value estimates from MC simulations
MCTS is executed after encountering a new state to select action for that state, and executed again for that new state
actions in simulated trajectories are generated by a simple rollout policy

Four steps:

Selection - starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node
Expansion - on some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions
Simulation - from the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy
Backup - the return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree. Figure 8.10 illustrates this by showing a backup from the terminal state of the simulated trajectory directly to the state–action node in the tree where the rollout policy began (though in general, the entire return over the simulated trajectory is backed up to this state–action node)

／var／log marcus chiu

Explorer

RL Chapter 8 (Planning & Learning with Tabular Methods)

Models and Planning

Random-sample one-step tabular Q-planning

Dyna: Integrated Planning, Acting, and Learning

Tabular Dyna-Q

When the Model is Wrong

Dyna-Q+

Prioritized Sweeping

Expected Updates vs Sample Updates

Example

Trajectory Sampling

Real-Time Dynamic Programming (RTDP)

Planning at Decision Time

Heuristic Search

Rollout Algorithms

Rollout Algorithms

Monte Carlo Tree Search (MCTS)

／var／logmarcus chiu

Explorer

RL Chapter 8 (Planning & Learning with Tabular Methods)

Models and Planning

Random-sample one-step tabular Q-planning

Dyna: Integrated Planning, Acting, and Learning

Tabular Dyna-Q

When the Model is Wrong

Dyna-Q+

Prioritized Sweeping

Expected Updates vs Sample Updates

Example

Trajectory Sampling

Real-Time Dynamic Programming (RTDP)

Planning at Decision Time

Heuristic Search

Rollout Algorithms

Rollout Algorithms

Monte Carlo Tree Search (MCTS)

／var／log marcus chiu