Value Function
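- $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
- $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']]$
- $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$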
Action-Value Function
- $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Optimal Value Function
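- $v_*(s) = \max_a q_{\pi_*}(s, a)$
- $v_*(s) = \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a]$
- $v_*(s) = \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
- $v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$
- $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$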
Optimal Action-Value Function
- $q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]$
- $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \max_{a'} q_*(s', a')]$
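As a rough illustration, the Bellman optimality equation for q∗ can be turned into an update rule and iterated until the values stop changing. The sketch below assumes a small, made-up MDP stored as `P[s][a]`, a list of `(probability, next_state, reward)` triples; the MDP, the function name `q_value_iteration`, and the tolerance `theta` are all hypothetical choices for this example.

```python
# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def q_value_iteration(P, gamma, theta=1e-10):
    """Repeatedly apply q(s,a) <- sum_{s',r} p(s',r|s,a) [r + gamma * max_a' q(s',a')]."""
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                new_q = sum(
                    p * (r + gamma * max(q[s2].values())) for p, s2, r in P[s][a]
                )
                delta = max(delta, abs(new_q - q[s][a]))
                q[s][a] = new_q
        if delta < theta:  # stop once no entry changed by more than theta
            return q


q_star = q_value_iteration(P, GAMMA)
# v*(s) = max_a q*(s, a)
v_star = {s: max(q_star[s].values()) for s in P}
print(q_star, v_star)
```

For this toy MDP the optimal values can be checked by hand: staying in state 1 forever earns 2 per step, so v∗(1) = 2 / (1 − 0.9) = 20, and moving from state 0 to state 1 gives v∗(0) = 1 + 0.9 · 20 = 19.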
Policy Evaluation
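One common way to carry out policy evaluation is to use the Bellman equation for vπ as an update rule and sweep over the states until the values settle. The sketch below is a minimal version under assumptions: the two-state MDP, the dictionary layout `P[s][a]`, the stochastic policy format `policy[s][a]`, and the stopping threshold `theta` are all invented for the example.

```python
# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def policy_evaluation(P, policy, gamma, theta=1e-10):
    """Iterative policy evaluation; policy[s][a] is pi(a | s)."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v  # in-place update within the sweep
        if delta < theta:
            return v


# Evaluate the uniform random policy (each action with probability 0.5).
uniform = {s: {a: 0.5 for a in P[s]} for s in P}
v_uniform = policy_evaluation(P, uniform, GAMMA)
print(v_uniform)
```

Solving the two linear Bellman equations for this toy MDP by hand gives vπ(0) = 7.25 and vπ(1) = 7.75 under the uniform policy, which the iteration should approach.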
Finding New Greedy Policy
- $\pi'(s) = \arg\max_a q_\pi(s, a)$
- $\pi'(s) = \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
- $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$
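The greedy step above amounts to a one-step lookahead from vπ followed by an argmax over actions. Here is a small sketch, assuming a made-up MDP in the form `P[s][a]` = list of `(probability, next_state, reward)` and a value table `v_pi` that is simply supplied as input; the helper names `q_from_v` and `greedy_policy` are hypothetical.

```python
# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def q_from_v(P, v, s, a, gamma):
    """One-step lookahead: sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def greedy_policy(P, v, gamma):
    """Pick, in each state, an action maximizing the one-step lookahead.
    Ties go to the first action encountered."""
    return {s: max(P[s], key=lambda a: q_from_v(P, v, s, a, gamma)) for s in P}


# Assumed input: values of the uniform random policy on this toy MDP.
v_pi = {0: 7.25, 1: 7.75}
new_policy = greedy_policy(P, v_pi, GAMMA)
print(new_policy)
```

With these inputs the lookahead values are q(0,0) = 6.525, q(0,1) = 7.975, q(1,0) = 8.975 and q(1,1) = 6.525, so the greedy policy picks action 1 in state 0 and action 0 in state 1.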
Policy Iteration
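- $\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$
- here E denotes a policy evaluation step and I denotes a policy improvement step.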
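The alternation of evaluation and improvement can be sketched as a loop that stops once the greedy policy no longer changes. This is only an illustrative version for deterministic policies on a made-up MDP; the layout `P[s][a]` = list of `(probability, next_state, reward)`, the function names, and the threshold `theta` are assumptions of the example.

```python
# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def evaluate(P, policy, gamma, theta=1e-10):
    """E step: iterative evaluation of a deterministic policy (policy[s] = action)."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def improve(P, v, gamma):
    """I step: act greedily with respect to v."""
    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}


def policy_iteration(P, gamma):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        v = evaluate(P, policy, gamma)
        new_policy = improve(P, v, gamma)
        if new_policy == policy:  # policy is stable, so stop
            return policy, v
        policy = new_policy


pi_star, v_star = policy_iteration(P, GAMMA)
print(pi_star, v_star)
```

On this toy problem the loop starts from "always take action 0", needs one improvement step to reach the policy {0: 1, 1: 0}, and a second pass confirms that the policy is stable.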
Value Iteration
- policy evaluation is stopped after just one sweep (one update of each state).
- $v_{k+1}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a]$
- $v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_k(s')]$
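The update above can be sketched directly: each sweep builds the next value table from the previous one, and a greedy policy is read off at the end. The MDP below, its layout `P[s][a]` = list of `(probability, next_state, reward)`, and the stopping threshold `theta` are assumptions made for this example only.

```python
# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def backup(P, v, s, a, gamma):
    """sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]"""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, theta=1e-10):
    v = {s: 0.0 for s in P}
    while True:
        # Synchronous sweep: v_{k+1} is computed entirely from v_k.
        v_next = {s: max(backup(P, v, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(v_next[s] - v[s]) for s in P)
        v = v_next
        if delta < theta:
            break
    # Read off a greedy policy with respect to the final values.
    policy = {s: max(P[s], key=lambda a: backup(P, v, s, a, gamma)) for s in P}
    return v, policy


v_star, pi_star = value_iteration(P, GAMMA)
print(v_star, pi_star)
```

The values should approach v∗(0) = 19 and v∗(1) = 20 for this toy MDP. An in-place variant that overwrites `v[s]` during the sweep is also common and typically converges at least as fast.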
Asynchronous Dynamic Programming
Asynchronous DP algorithms are in-place iterative DP algorithms that do not sweep systematically through the entire state set. Examples include:
- updating the value of only one state at each value iteration step
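A minimal sketch of this single-state variant: at every step one state is picked and its value is overwritten in place with the value iteration backup. Choosing the state uniformly at random, the fixed update budget `num_updates`, the seed, and the toy MDP itself are all assumptions of this example; any selection rule that keeps visiting every state would serve.

```python
import random

# Hypothetical two-state, two-action MDP, for illustration only.
# P[s][a] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def async_value_iteration(P, gamma, num_updates=2000, seed=0):
    """Update ONE randomly chosen state per step, in place."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    states = list(P)
    v = {s: 0.0 for s in P}
    for _ in range(num_updates):
        s = rng.choice(states)
        # Value iteration backup for the chosen state only.
        v[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]
        )
    return v


v_async = async_value_iteration(P, GAMMA)
print(v_async)
```

Because every state keeps being selected, the values should still approach v∗ (19 and 20 for this toy MDP), even though no full sweep is ever performed.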
Generalized Policy Iteration (GPI)
policy iteration:
- policy evaluation (PE)
- policy improvement (PI)
GPI refers to the general idea of letting PE and PI interact, regardless of how finely the two processes are interleaved.
GPI is the family of methods that includes policy iteration, value iteration, and asynchronous dynamic programming.