unified view of:

  • model-based methods - rely on planning
    • DP & heuristic search
  • model-free methods - rely on learning
    • MC & TD

Models and Planning

all state-space planning methods share a common structure:

  • all involve computing value functions
  • all compute value functions by updates or backup operations applied to simulated experience
model --> simulated experience -backups-> values --> policy
  • planning methods uses simulated experience generated by a model
  • learning methods use real experience generated by environment

Random-sample one-step tabular Q-planning

loop forever:
- 
	1. select a state s in S, and action a in A(s) at random
	2. send s,a to a sample model and obtain reward r and next state s'
	3. apply one-step tabular Q-learning to s,a,r,s':
	       Q(s,a) = Q(s,a) + 𝛼 [r + 𝛾 * max_a' Q(s',a') - Q(s,a)]

Dyna: Integrated Planning, Acting, and Learning

Tabular Dyna-Q
Initialize Q(s,a) and Model(s,a) for all s∊S and a∊A(s)

loop forever:
	(a) S = current state
	(b) A = 𝜀-greedy(S,Q)
	(c) take action A, observe reward R and state S'
	(d) Q(S,A) = Q(S,A) + 𝛼 [R + 𝛾 max_a Q(S',a) - Q(S,A)]
	(e) Model(S,A) = R,S' (assuming deterministic environment)
	(f) loop repeat n times:
			S = random previously observed state
			A = random action previously taken in S
			R,S' = Model(S,A)
			Q(S,A) = Q(S,A) + 𝛼 [R +  𝛾max_a Q(S',a) - Q(S,A)]

where:

  • (d) - direct RL
  • (e) - model learning
  • (f) - planning

if (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning

In Dyna-Q, learning & planning is accomplished by exactly the same algorithm:

  • operating on real experience for learning
  • operating on simulated experience for planning

When the Model is Wrong

Dyna-Q does NOT work well when environment changes

Dyna-Q+

this agent keeps track for each state-action pair of how many time steps has passed since the pair was last tried in a real interaction.

if modeled reward for a transition is r, and the transition has not been tried in 𝜏 time steps, then planning updates are done as if the transition produced reward of (𝑟 + 𝜅√𝜏), for some small 𝜅.

Prioritized Sweeping

a queue of every state-action pair whose estimated value would change non-trivially if updated, prioritized by the size of the change. When top pair in queue is updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some threshold, then the pair is inserted in the queue with the new priority (if that pair already exists in queue, then one with higher priority remains).

Prioritized Sweeping for a Deterministic Environment

initialize:
- Q(s,a), Model(s,a) for all s,a
- PQueue to empty

loop forever:
	(a) S = current state
    (b) A = policy(S,Q)
	(c) take action A, observe R and S'
	(d) Model(S,A) = R,S'
	(e) P = |R + 𝛾 max_a Q(S',a) - Q(S,A)|
	(f) if P > 𝜃
			insert S,A into PQueue with priority P
	(g) loop n times, while PQueue is not empty
			S,A = first(PQueue)
			R,S' = Model(S,A)
			Q(S,A) = Q(S,A) + 𝛼 [R + 𝛾 max_a Q(S',a) - Q(S,A)]
			loop for all \overline{S} \overline{A} predicated to lead to S:
				\overline{R} = predicted reward for \overline{S}, \overline{A}, S
				P = |\overline{R} + 𝛾 max_a Q(S,a) - Q(\overline{S}, \overline{A})|
				if P > ??
					insert (\overline{S} \overline{A}) into PQueue with priority P

Expected Updates vs Sample Updates

Different kinds of value-function updates

focusing on 1-step updates there are 3 dimensions:

  • update state values or action values
  • estimate value of optimal-policy or arbitrary-policy
  • updates are:
    • expected updates - considering all possible events
    • sample updates - considering 1 event

Example

updates for approximating 𝑞*:

  • expected update
  • sample update

Given a unit of computational effort, is it better devoted to a few expected updates or to b times as many sample updates?

Trajectory Sampling

two ways of distributing updates:

  • sweep through entire state space, updating each state once per sweep
  • sample state space according to some distribution and update states encountered along the way, i.e.:
    • sample uniformly like Dyna-Q

Real-Time Dynamic Programming (RTDP)

RTDP is an on-policy trajectory-sampling version of the value-iteration algorithm of dynamic programming

  • optimal partial policy - a policy optimal for relevant states, but specify arbitrary actions for irrelevant states

for episodic tasks with exploring starts - RTDP is an asynchronous value-iteration algorithm that converges to optimal policies for discounted finite MDPs

RTDP converges with probability one to a policy that is optimal for all the relevant states provided:

    1. the initial value of every goal state is zero
    1. there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state
    1. all rewards for transitions from non-goal states are strictly negative
    1. all the initial values are equal to, or greater than, their optimal values (which can be satisfied by simply setting the initial values of all states to zero).

Planning at Decision Time

Decision-time planning is most useful in applications in which fast responses are not required. In chess playing programs, for example, one may be permitted seconds or minutes of computation for each move, and strong programs may plan dozens of moves ahead within this time.

Heuristic Search

  • are state-space planning methods
  • are decision-time planning methods

Rollout Algorithms

Rollout Algorithms
  • are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state
  • produce Monte Carlo estimates of action values only for each current state and for a given policy usually called the rollout policy

Monte Carlo Tree Search (MCTS)

  • a type of decision time planning method
  • is a rollout algorithm enhanced by accumulating value estimates from MC simulations
  • MCTS is executed after encountering a new state to select action for that state, and executed again for that new state
  • actions in simulated trajectories are generated by a simple rollout policy

Four steps:

  • Selection - starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node
  • Expansion - on some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions
  • Simulation - from the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy
  • Backup - the return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree. Figure 8.10 illustrates this by showing a backup from the terminal state of the simulated trajectory directly to the state–action node in the tree where the rollout policy began (though in general, the entire return over the simulated trajectory is backed up to this state–action node)