Q-Function
  • captures the expected total future reward an agent in state, 𝑠, can receive by executing a certain action, 𝑎
    • 𝑄(𝑠ₜ, 𝑎ₜ) = 𝐄[𝑅ₜ | 𝑠ₜ, 𝑎ₜ]
  • where:
    • 𝑅ₜ - is the total reward, the discounted sum of all rewards obtained from time 𝑡, defined as:
      • 𝑅ₜ = 𝑟ₜ + 𝛾𝑟ₜ₊₁ + 𝛾²𝑟ₜ₊₂ + …
    • 𝑠ₜ - state
    • 𝑎ₜ - action

How to Act Given Q-Function

The agent needs a policy 𝜋(𝑠) to infer the best action to take given state 𝑠.
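
To make the idea of a policy concrete, here is a minimal sketch of one as a function from state to action. It assumes a small tabular Q-function stored as nested dictionaries and picks the action with the highest value (a greedy choice); the state names, action names, and Q-values are invented for illustration.

```python
def greedy_policy(q_table, state):
    """Return the action with the largest Q-value in the given state."""
    action_values = q_table[state]
    # max over the action names, ranked by their Q-value
    return max(action_values, key=action_values.get)


# Hypothetical Q-values for two states and two actions.
q_table = {
    "s0": {"left": 0.2, "right": 1.3},
    "s1": {"left": 0.7, "right": 0.1},
}

print(greedy_policy(q_table, "s0"))  # right
print(greedy_policy(q_table, "s1"))  # left
```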

Given 𝑄(𝑠, 𝑎) the policy 𝜋*(𝑠) is implemented as: