(model free, Passive RL)
Compute the value of any state s by dividing the total utility obtained from s by the number of times s was visited
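A minimal Python sketch of this averaging, assuming episodes are given as lists of (state, reward) pairs collected while following the fixed policy π; the name direct_evaluation and the gamma parameter are illustrative, not from the notes:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V(s) as (total utility obtained from s) / (# times s was visited)."""
    total_return = defaultdict(float)  # sum of discounted returns observed from each state
    visit_count = defaultdict(int)     # number of visits to each state

    for episode in episodes:
        # Scan each episode backwards so the return following every visit
        # can be accumulated in a single pass.
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            total_return[state] += g
            visit_count[state] += 1

    return {s: total_return[s] / visit_count[s] for s in visit_count}
```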
Instead of keeping an average of total rewards like direct evaluation (DE), TD-Learning uses an exponential moving average to learn at every timestep, interpolating at each t (model free, Passive RL)
Sample = R(s, π(s), s’) + γVπ(s’)
Incorporate into moving average: Vπ(s) ← (1 - α)Vπ(s) + α · Sample
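A minimal sketch of one TD update under these definitions; the function name td_update and the particular α and γ values are illustrative assumptions:

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD update after observing the transition (s, pi(s), r, s_next)."""
    sample = r + gamma * V[s_next]              # Sample = R(s, pi(s), s') + gamma * Vpi(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample  # exponential moving average
    return V

# Values start at 0 and are updated at every observed timestep.
V = defaultdict(float)
td_update(V, s="A", r=1.0, s_next="B")
```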
Both DE and TD-Learning only learn state values; to extract a policy we need Q-values, but by definition computing Q(s, a) from values requires the transition function T(s, a, s')
Solution: Learn Q-values directly so we can extract a policy without a model (model free, Active RL). The approach is similar to TD-Learning
At each timestep:
Sample = R(s, a, s') + γ max_a' Q(s', a')
Incorporate into moving average: Q(s, a) ← (1 - α)Q(s, a) + α · Sample
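A minimal sketch of this update; the dictionary of (state, action) pairs and the name q_learning_update are assumptions for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from an observed transition (s, a, r, s_next).
    `actions` is the set of actions available in s_next."""
    # Sample uses the best current Q-value estimate in the successor state.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Same exponential-moving-average interpolation as TD-Learning.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

Q = defaultdict(float)
q_learning_update(Q, s="A", a="right", r=0.0, s_next="B", actions=["left", "right"])
```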
Q-learning stores a separate Q-value for every (state, action) pair, so it is not feasible if there are thousands of states or more
Solution: Describe each state by a small set of features and approximate the Q-value as a linear combination of feature values, Q(s, a) = w1·f1(s, a) + ... + wn·fn(s, a); learning then updates the weights instead of individual Q-values
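A minimal sketch of this approximate (linear-feature) Q-learning update, assuming a hypothetical featurize(s, a) helper that returns a dict of feature values:

```python
def approx_q_update(weights, featurize, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One approximate Q-learning update with Q(s, a) = sum_i w_i * f_i(s, a).
    `featurize(s, a)` returns a dict {feature_name: value} (hypothetical helper)."""
    def q(state, action):
        feats = featurize(state, action)
        return sum(weights.get(i, 0.0) * f for i, f in feats.items())

    # difference plays the role of (sample - current estimate).
    difference = (r + gamma * max(q(s_next, a2) for a2 in actions)) - q(s, a)

    # Update the weights of the active features instead of a table entry per state.
    for i, f in featurize(s, a).items():
        weights[i] = weights.get(i, 0.0) + alpha * difference * f
    return weights
```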