U*(s) / V*(s): optimal value of a state s (the expected utility an optimally-behaving agent that starts in s will receive over the rest of its lifetime).
Q*(s, a): optimal value of a q-state (the expected utility an agent receives after starting in s, taking action a, and acting optimally from then on).
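For reference, these two quantities fit together through the Bellman optimality equations (same T, R, γ notation as the update rule below):
V*(s) = max_a Q*(s, a)
Q*(s, a) = ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V*(s′)]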
V_0(s) = 0 for all s in S  # 0 timesteps means 0 reward accrued since 0 actions taken
repeat for k = 0, 1, 2, … until the values converge:
    for s in S:
        V_k+1(s) ← max_a ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V_k(s′)]
In words: for every state s, perform the update rule: set V_k+1(s) to the max over actions of the sum, over all possible successor states s′, of the transition probability times [the reward for landing in that state + the discounted value of that state].
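A minimal Python sketch of this loop, assuming an MDP given as a list of states, a function actions(s) for the legal actions in s, transitions(s, a) returning (s′, probability) pairs, a reward function R(s, a, s′), and a discount γ; this interface is illustrative, not from the notes:

# Value iteration sketch (assumed MDP interface, see note above).
def value_iteration(states, actions, transitions, R, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}          # V_0(s) = 0 for all s
    for _ in range(iterations):           # compute V_1, V_2, ..., V_k
        V_next = {}
        for s in states:
            # V_k+1(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma * V_k(s')]
            V_next[s] = max(
                (sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in transitions(s, a))
                 for a in actions(s)),
                default=0.0,               # states with no actions keep value 0
            )
        V = V_next
    return V

# Tiny made-up 2-state MDP, just to show the call shape:
states = ["A", "B"]
actions = lambda s: ["stay", "move"]
transitions = lambda s, a: [(s, 1.0)] if a == "stay" else [("B" if s == "A" else "A", 1.0)]
R = lambda s, a, s2: 1.0 if s2 == "B" else 0.0
print(value_iteration(states, actions, transitions, R))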
Q-value iteration is like value iteration but for Q-values; the only difference is that the max over actions moves inside, applied at the successor state: Q_k+1(s, a) ← ∑_s′ T(s, a, s′)[R(s, a, s′) + γ max_a′ Q_k(s′, a′)]. When we are in a state we select an action before transitioning, but when we are in a Q-state we transition before selecting the next action (hence the max_a′ Q_k(s′, a′) inside the sum).
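The same sketch adapted to Q-values, reusing the assumed MDP interface from above; note the max over a′ now sits inside the sum, at the successor state:

def q_value_iteration(states, actions, transitions, R, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions(s)}   # Q_0(s, a) = 0
    for _ in range(iterations):
        Q_next = {}
        for s in states:
            for a in actions(s):
                # Q_k+1(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q_k(s',a')]
                Q_next[(s, a)] = sum(
                    p * (R(s, a, s2)
                         + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                    for s2, p in transitions(s, a)
                )
        Q = Q_next
    return Q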
Policy extraction: in a state s, take the action a that yields the maximum expected utility, i.e., the action with the maximum Q-value. Doing this in every state gives an optimal policy: π*(s) = argmax_a Q*(s, a) = argmax_a ∑_s′ T(s, a, s′)[R(s, a, s′) + γ V*(s′)].
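A sketch of both extraction routes under the same assumed interface: from V* it takes a one-step lookahead over actions, from Q* it is just an argmax over actions.

def policy_from_values(states, actions, transitions, R, V, gamma=0.9):
    # pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma * V*(s')]
    policy = {}
    for s in states:
        policy[s] = max(
            actions(s),
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in transitions(s, a)),
            default=None,      # no legal actions -> no choice to make
        )
    return policy

def policy_from_q_values(states, actions, Q):
    # pi*(s) = argmax_a Q*(s, a): no one-step lookahead needed once Q* is known
    return {s: max(actions(s), key=lambda a: Q[(s, a)], default=None) for s in states}

Feeding in the V returned by value_iteration above (or the Q from q_value_iteration) recovers an optimal policy for the toy MDP.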