The Q function calculates the Q value of taking a specific action in a particular state; here it is approximated as a linear function. These Q values represent the expected future reward of taking a specific action in a particular state and following the optimal policy afterward.
For Fitted Value Iteration and Fitted Policy Iteration, the Q function is approximated with a linear model, and its coefficients are learned using linear regression.
\begin{equation*}
\hat{Q}(s,u,\theta)=\theta^T \phi,
\end{equation*}
where $\theta$ is the coefficient vector of the linear model and $\phi$ is the feature vector. For example, $\hat{Q}=[\theta_1, \theta_2, \theta_3, \theta_4] [1, s^2, su, u^2]^T=\theta_1+\theta_2 s^2+ \theta_3 su+ \theta_4 u^2$.
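The linear approximation above can be sketched directly in code; the feature vector $[1, s^2, su, u^2]$ follows the example, while the particular $\theta$ values are illustrative:

```python
import numpy as np

def phi(s, u):
    """Feature vector from the example: [1, s^2, s*u, u^2]."""
    return np.array([1.0, s**2, s * u, u**2])

def q_hat(s, u, theta):
    """Linear Q approximation: Q_hat(s, u, theta) = theta^T phi(s, u)."""
    return theta @ phi(s, u)

theta = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative coefficients
# Q = 1 + 2*s^2 + 3*s*u + 4*u^2; at s=1, u=1 this is 1 + 2 + 3 + 4 = 10
print(q_hat(1.0, 1.0, theta))
```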
In the real world, the Q function is obtained from the laws of physics governing the particular system.
The optimal Value function is defined as the Q function evaluated at the optimal policy for a particular state.
The common approach for Fitted Value Iteration and Fitted Policy Iteration is to find the optimal value at each state and then use linear regression to find the coefficients of the Q function.
(i) In Value Iteration, calculate $Q(s, u)$ at each state and each possible action using the Bellman update
\begin{equation*}
Q(s,u)=\mathrm{reward}(s,u)+\gamma\, \hat{Q}(s_{next}, u, \theta),
\end{equation*}
where $\gamma$ is the discount factor. This involves calculating the reward $\mathrm{reward}(s,u)$, the next state $s_{next}(s, u, w)$, and the approximation $\hat{Q}(s_{next}, u, \theta)$ at that next state. For example, $s_{next}=A s+B u+w$.
Since the next state contains a random variable $w$, $Q(s, u)$ must be computed as an expected value: sample $w$ a fixed number of times (a hyperparameter), compute $Q(s, u)$ for each sample, and average the results.
(ii) After the Q table is established, identify the optimal value $V(s)$ at each state.
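Steps (i) and (ii) can be sketched as follows. The scalar linear system $s_{next}=As+Bu+w$ with Gaussian noise, the quadratic reward, the discount factor, and the discrete state/action grids are all illustrative assumptions, not specifics from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem setup: scalar linear system, quadratic reward,
# and discrete grids over states and actions.
A, B, gamma = 1.0, 0.5, 0.9
states  = np.linspace(-1.0, 1.0, 5)
actions = np.linspace(-1.0, 1.0, 5)

def reward(s, u):
    return -(s**2 + u**2)          # assumed quadratic cost as reward

def phi(s, u):
    return np.array([1.0, s**2, s * u, u**2])

def q_hat(s, u, theta):
    return theta @ phi(s, u)       # linear Q approximation

def build_q_table(theta, n_samples=50):
    """Step (i): Q(s,u) = E_w[ reward(s,u) + gamma * Q_hat(s_next, u, theta) ],
    estimated by averaging over n_samples draws of the noise w."""
    q = np.empty((len(states), len(actions)))
    for i, s in enumerate(states):
        for j, u in enumerate(actions):
            w = rng.normal(0.0, 0.1, size=n_samples)   # random disturbance
            s_next = A * s + B * u + w                 # one next state per sample
            vals = [reward(s, u) + gamma * q_hat(sn, u, theta) for sn in s_next]
            q[i, j] = np.mean(vals)
    return q

# Step (ii): the optimal value at each state is the max over actions.
q_table = build_q_table(theta=np.zeros(4))
v = q_table.max(axis=1)
```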
At this same Step 1, instead of using the Q table to find the optimal value as Value Iteration does,
(i) the Policy Iteration uses the Q table to generate a trajectory:
(1) Find the optimal policy $u_{optimal}$ at the current state from the Q table.
(2) Generate the next state $s_{next}(s, u_{optimal})$; for example, $s_{next}=A s+B u_{optimal}+w$.
(3) Use the Q table to find the next pair of (next_state, next_optimal policy).
(4) Repeat steps (2) and (3) for a defined number of iterations (a hyperparameter $M$). This generates an optimal trajectory that takes the optimal policy at each state.
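Steps (1)–(4) can be sketched as follows. The scalar dynamics, the grids, and the placeholder Q table are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: scalar dynamics s_next = A*s + B*u + w and a
# precomputed Q table over discrete state/action grids.
A, B = 1.0, 0.5
states  = np.linspace(-1.0, 1.0, 5)
actions = np.linspace(-1.0, 1.0, 5)
q_table = -(states[:, None]**2 + actions[None, :]**2)  # placeholder Q table

def nearest_state_index(s):
    """Map a continuous next state back onto the discrete grid."""
    return int(np.argmin(np.abs(states - s)))

def generate_trajectory(s0_idx, M=10):
    """Steps (1)-(4): repeatedly pick the optimal action from the Q table,
    step the dynamics, and record the (state, optimal action) pairs."""
    trajectory = []
    i = s0_idx
    for _ in range(M):                    # M is the hyperparameter from step (4)
        j = int(np.argmax(q_table[i]))    # (1)/(3): optimal action at this state
        s, u = states[i], actions[j]
        trajectory.append((s, u))
        w = rng.normal(0.0, 0.1)
        s_next = A * s + B * u + w        # (2): generate the next state
        i = nearest_state_index(s_next)
    return trajectory

traj = generate_trajectory(s0_idx=0, M=5)
```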
(ii) Use the following equation to find the V value at each (state, optimal policy) step of the trajectory:
\begin{equation*}
V(s)=\mathrm{reward}(s,u_{optimal})+\gamma\, \hat{Q}(s_{next}, u_{optimal}, \theta).
\end{equation*}
Since the next state contains a random variable $w$, the V value must likewise be computed as an expected value: sample $w$ a fixed number of times (a hyperparameter), compute the V value for each sample, and average the results.
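This averaging at a single trajectory step can be sketched as follows, assuming (as illustrative choices not fixed by the text) the scalar dynamics $s_{next}=As+Bu+w$, a quadratic reward, a discount factor $\gamma$, and a bootstrap from the linear $\hat{Q}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: scalar dynamics, quadratic reward, assumed discount
# factor gamma, and illustrative coefficients theta for the linear Q_hat.
A, B, gamma = 1.0, 0.5, 0.9
theta = np.array([1.0, 2.0, 3.0, 4.0])

def reward(s, u):
    return -(s**2 + u**2)

def q_hat(s, u, th):
    return th @ np.array([1.0, s**2, s * u, u**2])

def v_estimate(s, u_opt, n_samples=100):
    """Average over n_samples draws of w to estimate
    V(s) = E_w[ reward(s, u_opt) + gamma * Q_hat(s_next, u_opt, theta) ]."""
    total = 0.0
    for _ in range(n_samples):
        w = rng.normal(0.0, 0.1)
        s_next = A * s + B * u_opt + w
        total += reward(s, u_opt) + gamma * q_hat(s_next, u_opt, theta)
    return total / n_samples
```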
At Step 1, the expected value of $V(s)$ is identified as the ground truth of the Value function, using either Fitted Value Iteration or Fitted Policy Iteration.
Once $V(s)$ is available, linear regression can be used to learn the coefficients of $\hat{Q}(s,u)=\theta^{T} \phi$, because $V(s)$ is the ground truth for $\hat{Q}(s, u_{optimal})$.
Define the cost function as
\begin{equation*}
J=\sum_{i=1}^{N_{states}} \left(\hat{Q}(s(i), u_{optimal})-V(s(i))\right)^2.
\end{equation*}
The gradient of the cost function with respect to the coefficients is
\begin{equation*}
\frac{\partial J}{\partial \theta} = 2 \sum_{i=1}^{N_{states}} \left(\hat{Q}(s(i), u_{optimal})-V(s(i))\right) \frac{\partial \hat{Q}}{\partial \theta},
\end{equation*}
where
\begin{equation*}
\frac{\partial \hat{Q}(s(i), u_{optimal})}{\partial \theta} =\phi(s(i), u_{optimal})
\end{equation*}
The coefficient $\theta$ is then updated by gradient descent,
\begin{equation*}
\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta},
\end{equation*}
where $\alpha$ is the learning rate.
Repeat the regression learning for many iterations until $\theta$ converges.
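The whole regression loop can be sketched as follows. The training set here is hypothetical: a grid of states, an assumed $u_{optimal}=0$ everywhere, and an assumed ground-truth $V(s)$ stand in for the values produced in Step 1:

```python
import numpy as np

# Hypothetical training data: states s(i), the optimal action at each,
# and the ground-truth values V(s(i)) that Step 1 would have produced.
s_grid = np.linspace(-1.0, 1.0, 21)
u_opt = np.zeros_like(s_grid)            # assume u_optimal = 0 at every state
v_true = 1.0 + 2.0 * s_grid**2           # assumed ground-truth V(s)

def phi(s, u):
    return np.array([1.0, s**2, s * u, u**2])

theta = np.zeros(4)
alpha = 0.01                             # learning rate

for _ in range(5000):                    # repeat until theta converges
    grad = np.zeros_like(theta)
    for s, u, v in zip(s_grid, u_opt, v_true):
        features = phi(s, u)
        q = theta @ features             # Q_hat(s(i), u_optimal)
        grad += 2.0 * (q - v) * features # dJ/dtheta, summed over states
    theta -= alpha * grad                # gradient-descent update

# theta converges so that theta^T phi matches V(s) on the training grid.
```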