In this project, a Hankel matrix model is used to approximate the trend of the housing-price time series for one city.
Once the approximation is derived, the noise data series is defined as the difference between the approximation from the Hankel matrix and the price ground truth. An autoregressive model is then used to simulate the noise data.
The final prediction of the housing price is the approximation of the major trend minus the predicted noise.
The housing price data source is https://www.zillow.com/research/data/.
For this particular project, the [Zillow Home Value Index] data was downloaded at the [Metro & US] geography.
The analysis focuses on the city of San Jose, California.
The data is divided into the training set and the test set.
The training set is composed of 100 data samples from the date [2013/07/31] to [2021/10/31].
The test set is composed of 36 data samples from the date [2021/11/30] to [2024/10/31].
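As a sketch, the split can be expressed with pandas date slicing. Here `prices` is assumed to be a monthly Series of ZHVI values for San Jose indexed by month-end dates; the exact layout of the downloaded CSV is an assumption, not something this article specifies.
import pandas as pd
# `prices`: monthly ZHVI values for San Jose, indexed by month-end dates (assumed layout)
train_series = prices.loc["2013-07-31":"2021-10-31"]   # 100 training samples
test_series = prices.loc["2021-11-30":"2024-10-31"]    # 36 test samples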
The definition of a Hankel matrix can be found on Wikipedia: https://en.wikipedia.org/wiki/Hankel_matrix
For example, a Hankel matrix with 4 rows and 6 columns requires a data series with 9 samples and is arranged as:
\begin{equation*}
\mathbf{H}_{4\times 6} = \begin{bmatrix}
h_1 & h_2 & h_3 & h_4 & h_5 & h_6\\
h_2 & h_3 & h_4 & h_5 & h_6 & h_7\\
h_3 & h_4 & h_5 & h_6 & h_7 & h_8\\
h_4 & h_5 & h_6 & h_7 & h_8 & h_9
\end{bmatrix}
\end{equation*}
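As a minimal sketch, scipy.linalg.hankel can assemble this matrix from a 9-sample series; its first argument is the first column and its second argument is the last row:
import numpy as np
from scipy.linalg import hankel
x = np.arange(1, 10)        # stands in for h_1 ... h_9
H = hankel(x[:4], x[3:])    # first column h_1..h_4, last row h_4..h_9
print(H.shape)              # (4, 6)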
The first step in the data prediction is to define the shape of the Hankel matrix.
The shape can be optimized by a grid search with the goal of minimizing the prediction error; this grid search is demonstrated in a later section.
Assume the Hankel matrix has a shape of ($HRow$, $HCol$); then the total number of data samples required to construct it is $L=HRow+HCol-1$.
In a later section, the autoregressive approximation of each element requires a few consecutive prior samples; the number of prior samples needed, beyond the Hankel matrix elements themselves, is at least the number of rows of the Hankel matrix.
Therefore, set the length of the training window to $T=L+HRow=2 \times HRow+HCol-1$.
For example, the shape of the Hankel matrix in this project is selected to be ($HRow=3$, $HCol=79$), requiring $T=84$ data samples in a continuous time series.
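A small helper (the name is illustrative) makes this sample-count bookkeeping explicit:
def training_window_length(h_row: int, h_col: int) -> int:
    # Hankel elements (h_row + h_col - 1) plus h_row prior samples
    return 2 * h_row + h_col - 1
assert training_window_length(3, 79) == 84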
Take a singular value decomposition of the Hankel matrix so that it is factorized into the form $H=U \cdot S \cdot V_h$,
where $S$ is a rectangular ($HRow$, $HCol$) matrix with non-negative real singular values on the diagonal.
References:
Singular Value Decomposition: https://en.wikipedia.org/wiki/Singular_value_decomposition
Numpy svd function: https://numpy.org/doc/2.1/reference/generated/numpy.linalg.svd.html
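A sketch of the decomposition with NumPy, rebuilding the rectangular $S$ from the returned vector of singular values:
import numpy as np
U, s, Vh = np.linalg.svd(H, full_matrices=True)   # s: singular values in descending order
S = np.zeros(H.shape)
np.fill_diagonal(S, s)                            # embed them on the diagonal
assert np.allclose(H, U @ S @ Vh)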
Sum the squares of the singular values $\sigma_i$, in descending order, until the cumulative sum exceeds 0.9 of the total $\sum_i \sigma_i^2$.
Record the number of terms $r$ as the required order for the approximation.
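A sketch of the order selection, assuming the 0.9 threshold is taken as a fraction of the total sum of squared singular values:
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.9)) + 1   # smallest order whose cumulative energy exceeds 0.9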
The original $S$ has the form of
\begin{equation*}
\mathbf{S}_{4\times 6} = \begin{bmatrix}
\sigma_1 & 0 & 0 & 0 & 0 & 0\\
0 & \sigma_2 & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_3 & 0 & 0 & 0\\
0 & 0 & 0 & \sigma_4 & 0 & 0
\end{bmatrix}
\end{equation*}
If the selected order $r$ is 2, then the simplified $\hat{S}$ becomes:
\begin{equation*}
\mathbf{\hat{S}}_{4\times 6} = \begin{bmatrix}
\sigma_1 & 0 & 0 & 0 & 0 & 0\\
0 & \sigma_2 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\end{equation*}
Recalculate the approximated Hankel Matrix: \begin{equation*} \hat{H}=U \cdot \hat{S} \cdot V_h \end{equation*}
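Continuing the sketch, zero out the singular values beyond order $r$ and recompose:
s_trunc = s.copy()
s_trunc[r:] = 0                   # keep only the first r singular values
S_hat = np.zeros(H.shape)
np.fill_diagonal(S_hat, s_trunc)
H_hat = U @ S_hat @ Vh            # rank-r approximation of the Hankel matrix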
Let the ground truth over the Hankel window be $h=[h_1, h_2, \cdots, h_L]$
and the approximation recovered from $\hat{H}$ be $f=[f_1, f_2, \cdots, f_L]$.
Assume there is a linear regression between the ground truth and the $HRow-1$ preceding approximation samples:
\begin{equation*}
\hat{h}_t=\sum_{i=1}^{HRow-1} \beta_i \cdot f_{t-HRow+i}
\end{equation*}
For example, with the $4 \times 6$ Hankel matrix above,
\begin{equation*}
\hat{h}_4=\beta_1 \cdot f_{1}+\beta_2 \cdot f_{2}+\beta_3 \cdot f_{3}
\end{equation*}
Arrange the approximated Hankel matrix by transposing it and removing the last column of the transpose; each row of the resulting matrix $F$ then holds $HRow-1$ consecutive approximation samples, and the removed column lines up, row by row, with the ground-truth targets $Y=[h_{HRow}, \cdots, h_L]$.
That means the linear regression can be summarized with one matrix equation:
\begin{equation*}
Y=F \cdot \beta
\end{equation*}
At each step of the rolling calculation, a new prediction at the next step can be derived from the fitted coefficients. That is,
\begin{equation*}
\hat{h}_{L+1}=\sum_{i=1}^{HRow-1} \beta_i \cdot f_{L+1-HRow+i}
\end{equation*}
so the latest value of the time series $h$ is derived from the history of its approximation $f$.
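A minimal sketch of this arrangement, assuming H holds the ground-truth Hankel matrix and H_hat its rank-$r$ approximation from the previous step ($\beta$ is fitted here with the pseudoinverse; the OLS route follows below):
F = H_hat.T[:, :-1]       # each row: HRow-1 consecutive approximation samples
Y = H[-1, :]              # ground-truth targets h_HRow ... h_L (last row of H)
beta = np.linalg.pinv(F) @ Y
x_next = H_hat[1:, -1]    # the latest HRow-1 approximation samples
h_next = x_next @ beta    # prediction of h_{L+1}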
The regression coefficients $\beta$ can be obtained by using the Ordinary Least Squares model in Python.
At each prediction step, a new set of regression coefficients is calculated.
import statsmodels.api as sm
model = sm.OLS(Y, F)
results = model.fit()
beta = results.params
If $F$ is rank-deficient, so that the least-squares problem is not invertible, use the pseudoinverse instead:
import numpy as np
beta = np.linalg.pinv(F) @ Y
References:
(1) https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html
(2) https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html
After the prediction is available, compare it with the ground truth.
As the prediction algorithm steps through the test series, the ground-truth window is assembled from subsets of the training and test series, so that the model always conditions on true observations rather than on its own predictions. An example at the end of this article demonstrates this concept.
(1) Define the noise as the difference between the prediction and the ground truth.
(2) Assume the noise is an autoregressive series, and use an ARIMA model for its prediction.
The (p, d, q) order of the ARIMA model specifies the autoregressive, difference, and moving-average components.
In this project, only the autoregressive order p is defined; d and q are set to 0.
(3) Set the ARIMA trend='n', indicating no trend term.
(4) Take the latest element of the prediction as the new noise estimate.
from statsmodels.tsa.arima.model import ARIMA
ARModel = ARIMA(noiseSeries, order=(ARorder, 0, 0), trend='n').fit()
ARPrediction = ARModel.predict()   # in-sample one-step predictions; the last element serves as the next noise estimate
References:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html
This project applies a rolling prediction at each time step: for each prediction, the sample set is renewed by shifting the observation window forward by one step.
The entire prediction algorithm, Steps 1 through 7, is repeated at each step through the whole test series, and both the non-stationary (trend) prediction and the noise prediction are accumulated.
After the entire test series has been estimated, the final prediction is derived by subtracting the noise series from the approximated non-stationary series.
In this project, the adjustable parameters are the Hankel matrix shape and the order of the autoregressive model.
Use sklearn.metrics.mean_squared_error to estimate the approximation error.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Test_series, finalPrediction)
References:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
Assume the Hankel matrix is defined to be a $4 \times 6$ matrix. This requires 9 elements to construct the Hankel matrix and 4 additional prior samples for the linear regression, for a total of 13 elements in the training series.
The ground truth is a time series $[ y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8, y_9, y_{10}, y_{11}, y_{12}, y_{13} ]$.
Define the test series to have 3 elements $[ y_{14}, y_{15}, y_{16}]$, following the training series.
Take the last 9 elements of the training series to construct the Hankel matrix:
\begin{equation*}
\mathbf{H}_{4\times 6} = \begin{bmatrix}
y_5 & y_6 & y_7 & y_8 & y_9 & y_{10}\\
y_6 & y_7 & y_8 & y_9 & y_{10} & y_{11}\\
y_7 & y_8 & y_9 & y_{10} & y_{11} & y_{12}\\
y_8 & y_9 & y_{10} & y_{11} & y_{12} & y_{13}
\end{bmatrix}
\end{equation*}
By going through Steps 1 through 6, the regression coefficients $\beta_i$ and the prediction $\hat{y}_{14}$ at the next step are obtained.
Store $\hat{y}_{14}$ in a prediction list.
\begin{equation*}
\hat{y}_{14}=\beta_1 \cdot y_{11}+\beta_2 \cdot y_{12}+\beta_3 \cdot y_{13}
\end{equation*}
$noise=[\hat{y}_5-y_5, \hat{y}_6-y_6, \cdots, \hat{y}_{13}-y_{13}]$
ARmodel = ARIMA(noise, order=(AR_order, 0, 0), trend='n').fit()
n_14 = ARmodel.predict()[-1]
Shift the observation window by 1. The new Hankel matrix is composed from $[ y_6, y_7, y_8, y_9, y_{10}, y_{11}, y_{12}, y_{13}, y_{14} ]$. Please note that $y_{14}$ is the ground truth from the test series, not the approximation $\hat{y}_{14}$. Repeat Steps 1 through 6 to obtain the new regression coefficients $\beta_i$, and iteratively collect the entire approximation $[ \hat{y}_{14}, \hat{y}_{15}, \hat{y}_{16}]$ and the noise $[ n_{14}, n_{15}, n_{16}]$.
The final prediction for the test window is $[\hat{t}_{14}, \hat{t}_{15}, \hat{t}_{16}]=[\hat{y}_{14}-n_{14}, \hat{y}_{15}-n_{15}, \hat{y}_{16}-n_{16}]$.
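A compact sketch of the rolling loop over the test window, where hankel_predict is a hypothetical helper wrapping Steps 1 through 6 (returning the next trend prediction and the in-window noise series) and AR_order is an illustrative value:
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
AR_order = 2                                 # illustrative value
full = np.concatenate([train_series, test_series])
T = 13                                       # training window length for the 4x6 example
trend_preds, noise_preds = [], []
for step in range(len(test_series)):
    window = full[step : step + T]           # shift the observation window by one step
    y_hat, noise = hankel_predict(window)    # hypothetical: Steps 1-6
    n_hat = ARIMA(noise, order=(AR_order, 0, 0), trend='n').fit().predict()[-1]
    trend_preds.append(y_hat)
    noise_preds.append(n_hat)
final_prediction = np.array(trend_preds) - np.array(noise_preds)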
Perform a grid search of Hankel matrix shape and AR order to minimize the Mean Squared Error.
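Wrapping the loop above in a function, say rolling_mse(train, test, h_row, h_col, ar_order) (hypothetical), the grid search is a plain exhaustive loop; the parameter ranges below are illustrative:
from itertools import product
best = None
for h_row, h_col, ar_order in product(range(2, 6), range(10, 90, 10), range(1, 6)):
    mse = rolling_mse(train_series, test_series, h_row, h_col, ar_order)
    if best is None or mse < best[0]:
        best = (mse, h_row, h_col, ar_order)
print(best)   # smallest MSE and the corresponding (HRow, HCol, AR order)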