Bellman Equation

U*(s) / V*(s): the optimal value of a state s, i.e., the expected utility an optimally-behaving agent that starts in s will receive over the rest of its lifetime.

Q*(s, a): the optimal value of a q-state (s, a), i.e., the expected utility an agent receives after starting in s, taking action a, and acting optimally from then on.

V*(s) = max_a Q*(s, a)

Q*(s, a) = ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V*(s′)]

⇒ Bellman equation: V*(s) = max_a ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V*(s′)]
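
As a sanity check, here is a minimal Python sketch of these two quantities as one-step lookaheads. The MDP encoding (T as nested dicts of (next state, probability) lists, the reward function R, the state names, and the numbers) is made up for illustration and is not from these notes.

GAMMA = 0.9

T = {  # T[s][a] -> list of (next_state, probability) pairs (illustrative toy MDP)
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.8), ("A", 0.2)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}

def R(s, a, s2):
    return 1.0 if s2 == "B" else 0.0  # toy reward: landing in B pays 1

def q_star(s, a, V):
    # Q*(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V*(s')]
    return sum(p * (R(s, a, s2) + GAMMA * V[s2]) for s2, p in T[s][a])

def v_star(s, V):
    # V*(s) = max_a Q*(s, a)
    return max(q_star(s, a, V) for a in T[s])

Given the true optimal values of the successor states, one such backup reproduces them; value iteration below just repeats this backup from an all-zero starting point.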

Value Iteration

V_0(s) = 0 for all s in S # 0 timesteps means 0 reward accrued, since no actions have been taken yet

repeat for k = 0, 1, 2, … until the values converge:
	for s in S:
		V_k+1(s) ← max_a ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V_k(s′)]

In words, for every state s, perform the update rule:

V_k+1(s) ← the maximum over actions a of the expected value of [the reward for the transition + the discounted value of the next state], where the expectation sums over all possible next states s′, each weighted by its transition probability T(s, a, s′).

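A small runnable sketch of the full loop, using the same made-up toy MDP encoding as above (redefined here so the snippet stands alone); the stopping tolerance and all names are illustrative assumptions, not part of the notes.

GAMMA = 0.9

T = {  # T[s][a] -> list of (next_state, probability) pairs (illustrative toy MDP)
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.8), ("A", 0.2)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}

def R(s, a, s2):
    return 1.0 if s2 == "B" else 0.0  # toy reward: landing in B pays 1

def value_iteration(T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in T}  # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in T:
            # V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V_k(s')]
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in outcomes)
                for a, outcomes in T[s].items()
            )
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new

print(value_iteration(T, R, GAMMA))  # roughly {"A": 9.76, "B": 10.0} for this toy MDP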

Q-Value Iteration

It is like value iteration, but for Q-values. The only difference is that the max over actions moves inside the sum: in a state we select an action before transitioning, whereas in a q-state we have already committed to an action, so we transition first and only then select the next action (hence the max_a′ Q_k(s′, a′) term).

Q_k+1(s, a) ← ∑_s′ T(s, a, s′)[R(s, a, s′) + γ max_a′ Q_k(s′, a′)]
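
A sketch of the corresponding loop for Q-values, using the same illustrative toy MDP and a fixed iteration count (both assumptions, not from the notes); note where the max over the next action sits.

GAMMA = 0.9

T = {  # T[s][a] -> list of (next_state, probability) pairs (illustrative toy MDP)
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.8), ("A", 0.2)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}

def R(s, a, s2):
    return 1.0 if s2 == "B" else 0.0  # toy reward: landing in B pays 1

def q_value_iteration(T, R, gamma, iters=200):
    Q = {(s, a): 0.0 for s in T for a in T[s]}  # Q_0(s, a) = 0
    for _ in range(iters):
        Q = {
            (s, a): sum(
                # transition first, then max over the *next* action a'
                p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in T[s2]))
                for s2, p in T[s][a]
            )
            for s in T
            for a in T[s]
        }
    return Q

print(q_value_iteration(T, R, GAMMA))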

Policy Extraction

In a state s, take the action a that yields the maximum expected utility, i.e., the action with the maximum q-value. Doing this in every state gives an optimal policy, π*:

π*(s) = argmax_a Q*(s, a) = argmax_a ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V*(s′)]
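
A sketch of the extraction step, reusing the same illustrative toy MDP; V_star below is just the (approximate) output of the value iteration sketch above, hard-coded so the snippet stands alone.

GAMMA = 0.9

T = {  # T[s][a] -> list of (next_state, probability) pairs (illustrative toy MDP)
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.8), ("A", 0.2)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}

def R(s, a, s2):
    return 1.0 if s2 == "B" else 0.0  # toy reward: landing in B pays 1

def extract_policy(V, T, R, gamma):
    policy = {}
    for s in T:
        # pi*(s) = argmax_a Q*(s, a): pick the action with the best one-step lookahead
        policy[s] = max(
            T[s],
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[s][a]),
        )
    return policy

V_star = {"A": 9.76, "B": 10.0}  # approximate V* from the value iteration sketch above
print(extract_policy(V_star, T, R, GAMMA))  # {"A": "go", "B": "stay"}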

Policy Evaluation