(model free, Passive RL)
Compute the value of any state s by dividing the total utility obtained from s by the number of times s was visited
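A minimal Python sketch of this averaging, assuming episodes are given as lists of (state, reward) pairs collected while following the fixed policy π; the name direct_evaluation and the gamma parameter are illustrative, not from the notes:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V(s) as (total utility obtained from s) / (# times s was visited)."""
    total_return = defaultdict(float)  # sum of discounted returns observed from each state
    visit_count = defaultdict(int)     # number of visits to each state

    for episode in episodes:
        # Scan each episode backwards so the return following every visit
        # can be accumulated in a single pass.
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            total_return[state] += g
            visit_count[state] += 1

    return {s: total_return[s] / visit_count[s] for s in visit_count}
```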
Instead of keeping an average of total rewards like direct evaluation (DE), TD-Learning uses an exponential moving average to learn at every timestep, interpolating at each t (model free, Passive RL)
Sample = R(s, π(s), s’) + γVπ(s’)
Incorporate into moving average: Vπ(s) ← (1 - α)Vπ(s) + α · Sample
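A minimal sketch of one TD update under these definitions; the function name td_update and the particular α and γ values are illustrative assumptions:

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD update after observing the transition (s, pi(s), r, s_next)."""
    sample = r + gamma * V[s_next]              # Sample = R(s, pi(s), s') + gamma * Vpi(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample  # exponential moving average
    return V

# Values start at 0 and are updated at every observed timestep.
V = defaultdict(float)
td_update(V, s="A", r=1.0, s_next="B")
```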
Both DE and TD-Learning only learn state values; to extract a policy we need Q-values, but by definition computing Q(s, a) from values requires the transition function T(s, a, s')
Solution: Learn Q-values directly so we can extract a policy without a model (model free, Active RL). The approach is similar to TD-Learning
At each timestep:
Sample = R(s, a, s') + γ max_a' Q(s', a')
Incorporate into moving average: Q(s, a) ← (1 - α)Q(s, a) + α · Sample
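A minimal sketch of this update; the dictionary of (state, action) pairs and the name q_learning_update are assumptions for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from an observed transition (s, a, r, s_next).
    `actions` is the set of actions available in s_next."""
    # Sample uses the best current Q-value estimate in the successor state.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Same exponential-moving-average interpolation as TD-Learning.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

Q = defaultdict(float)
q_learning_update(Q, s="A", a="right", r=0.0, s_next="B", actions=["left", "right"])
```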
Q-learning stores a separate Q-value for every (state, action) pair, so it is not feasible if there are thousands of states or more
Solution: Describe each state by a small set of features and approximate the Q-value as a linear combination of feature values, Q(s, a) = w1·f1(s, a) + ... + wn·fn(s, a); learning then updates the weights instead of individual Q-values
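A minimal sketch of this approximate (linear-feature) Q-learning update, assuming a hypothetical featurize(s, a) helper that returns a dict of feature values:

```python
def approx_q_update(weights, featurize, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One approximate Q-learning update with Q(s, a) = sum_i w_i * f_i(s, a).
    `featurize(s, a)` returns a dict {feature_name: value} (hypothetical helper)."""
    def q(state, action):
        feats = featurize(state, action)
        return sum(weights.get(i, 0.0) * f for i, f in feats.items())

    # difference plays the role of (sample - current estimate).
    difference = (r + gamma * max(q(s_next, a2) for a2 in actions)) - q(s, a)

    # Update the weights of the active features instead of a table entry per state.
    for i, f in featurize(s, a).items():
        weights[i] = weights.get(i, 0.0) + alpha * difference * f
    return weights
```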