Q-Function
- captures the expected total future reward an agent in state $s$ can receive by executing a certain action $a$
- $Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$
- where:
- $R_t$ - is the total reward, the discounted sum of all rewards obtained from time $t$, defined as:
- $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- $s_t$ - state
- $a_t$ - action
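As a rough illustration of the definitions above, the sketch below computes the discounted return for a finite list of rewards and approximates the expectation in the Q-function by averaging returns over sampled reward sequences. The function names `discounted_return` and `estimate_q`, and the idea of estimating the expectation from samples, are assumptions made for this example and are not part of the original notes.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Walk backwards so each step applies R_k = r_k + gamma * R_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_q(reward_sequences: Sequence[Sequence[float]], gamma: float) -> float:
    """Hypothetical helper: approximate E[R_t | s_t, a_t] by a sample average.

    Each inner sequence is assumed to hold the rewards observed after taking
    the same action a_t in the same state s_t.
    """
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(estimate_q([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5))
```

With a discount factor below 1, rewards further in the future contribute less to the total, which is why the same reward of 4 arriving two steps late counts for only 1 in the second sequence.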
How to Act Given Q-Function
The agent needs a policy $\pi(s)$ to infer the best action to take given state $s$.
Given $Q(s, a)$, the policy $\pi^*(s)$ is implemented as: