Model-Based vs. Model-Free Learning

Model-based learning estimates the transition function T(s, a, s') and rewards R(s, a, s') from experience and then solves the resulting MDP; model-free learning skips the model and estimates values or Q-values directly from samples. The methods below are all model free.

Direct Evaluation

(model free, Passive RL)

  1. Fix policy π
  2. Experience episodes following π
  3. Estimate Vπ(s) as the average of the total discounted rewards observed from s onward (a sketch follows this list)
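A minimal sketch of direct evaluation in Python, assuming a toy episode format of (state, reward) pairs; the states, rewards, and γ = 0.9 are illustrative, not from the notes:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Estimate V^pi(s) as the average total discounted reward
    observed from s onward, across all episodes following pi."""
    totals = defaultdict(float)  # sum of returns observed from each state
    counts = defaultdict(int)    # number of visits to each state

    for episode in episodes:
        # Walk backward to get the discounted return from each timestep.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        for (state, _), G_t in zip(episode, returns):
            totals[state] += G_t
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# Two toy episodes: each step is (state, reward received on that step).
episodes = [
    [("A", -1.0), ("B", 10.0)],
    [("A", -1.0), ("B", 1.0)],
]
print(direct_evaluation(episodes))  # V(B) = (10 + 1) / 2 = 5.5
```

Note that DE cannot update until an episode ends, and each state's estimate ignores what was learned about neighboring states; both weaknesses motivate TD learning.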

TD Learning

Instead of keeping an average of total rewards like DE, TD learning uses an exponential moving average, learning at every timestep by interpolating between the old estimate and each new sample (model free, Passive RL)

  1. Initialize Vπ(s) = 0 for all states and pick a learning rate α in [0, 1]
  2. At each timestep, observe a transition (s, π(s), s') and form the sample: sample = R(s, π(s), s') + γVπ(s')
  3. Interpolate: Vπ(s) ← (1 - α)Vπ(s) + α · sample (see the sketch after this list)
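A minimal sketch of the TD update, reusing the toy states above; α, γ, and the transitions are illustrative:

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.5, gamma=0.9):
    """One TD step: blend the old estimate of V(s) with the new
    sample via an exponential moving average."""
    sample = r + gamma * V[s_next]                # R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample    # interpolate old and new

V = defaultdict(float)  # V^pi(s) = 0 for every state
# Apply an update for each transition (s, r, s') observed while following pi.
for s, r, s_next in [("A", -1.0, "B"), ("B", 10.0, "exit"), ("A", -1.0, "B")]:
    td_update(V, s, r, s_next)
print(dict(V))  # learning happens at every step, not once per episode
```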

Q-Learning

Both DE and TD learning only produce V-values, but extracting a policy requires Q-values, and computing Q(s, a) from Vπ requires the transition function: Q(s, a) = Σ_s' T(s, a, s')[R(s, a, s') + γVπ(s')], which a model-free agent does not know

Solution: Learn Q-values directly so we can extract a policy (model free, Active RL). The approach is similar to TD learning

  1. Initialize Q(s, a) = 0 and α in [0, 1]

  2. At each timestep, observe a sample transition (s, a, r, s') and compute:

     sample = r + γ max_a' Q(s', a')

     Q(s, a) ← (1 - α)Q(s, a) + α · sample

     (a sketch follows this list)
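A minimal sketch of tabular Q-learning; the action set and the transitions fed in are invented for illustration:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step. The sample bootstraps off the best action in
    the next state, so Q converges toward optimal values even when the
    experienced actions are exploratory (off-policy learning)."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

Q = defaultdict(float)       # Q(s, a) = 0 for every pair
actions = ["left", "right"]  # hypothetical action set
for s, a, r, s_next in [("A", "right", -1.0, "B"), ("B", "right", 10.0, "A")]:
    q_update(Q, s, a, r, s_next, actions)

# Policy extraction: act greedily with respect to the learned Q-values.
best_action = max(actions, key=lambda a: Q[("A", a)])
print(dict(Q), best_action)
```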

Approximate Q-Learning

Tabular Q-learning is not feasible if there are thousands of states (or more): storing a value for every (s, a) pair is too expensive, and nothing learned about one state generalizes to similar, unvisited states

Solution: Describe each state by a vector of features and approximate the Q-value as a linear combination of them: Q(s, a) = Σ_i w_i f_i(s, a). On each transition, every weight is updated by w_i ← w_i + α · difference · f_i(s, a), where difference = [r + γ max_a' Q(s', a')] - Q(s, a) (sketched below)
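A minimal sketch of the linear weight update; the feature values and the reward are hypothetical, since real feature extractors f(s, a) are domain-specific:

```python
def q_value(w, f):
    """Approximate Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, f))

def approx_q_update(w, f_sa, r, next_qs, alpha=0.1, gamma=0.9):
    """Nudge every weight in the direction that shrinks the difference
    between the observed sample and the current estimate:
        difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
        w_i <- w_i + alpha * difference * f_i(s, a)
    """
    difference = (r + gamma * max(next_qs)) - q_value(w, f_sa)
    return [wi + alpha * difference * fi for wi, fi in zip(w, f_sa)]

# Hypothetical transition: 3 features for the (s, a) taken, reward 10,
# and Q(s', a') = 0 for both next-state actions (weights start at 0).
w = [0.0, 0.0, 0.0]
f_sa = [1.0, 0.5, -2.0]
w = approx_q_update(w, f_sa, r=10.0, next_qs=[0.0, 0.0])
print(w)  # every weight moves in proportion to its feature value
```

Only the weights are stored, so memory scales with the number of features rather than the number of states, and states with similar features automatically share what is learned.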