In this project, the model takes a list of equity tickers and groups them according to their daily price momentum. The goal is to identify equity tickers that have similar price momentum.
The input file is a record of the equity tickers that had high price momentum on each date. Each column name is a date (YYMMDD), and the rows under it are the tickers that qualified on that date. See the figure below for a snapshot.

The goal is to cluster the ticker symbols. What characterizes a ticker is the set of dates on which it showed high price momentum. Therefore, the dates in the database are the features, and the tickers are the samples to be clustered. Consequently, the data file needs to be converted to a matrix of feature (date) columns versus ticker rows.
Each row is a unique ticker symbol. The value 1 indicates that the ticker had high price momentum on the date of that column. In this example, there are 418 ticker symbols in rows and 144 dates (numeric index) in columns. See the figure below for a snapshot.
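The conversion from date columns of tickers to a ticker-by-date indicator matrix can be sketched with pandas. This is a minimal sketch, assuming the input has been loaded as a DataFrame whose column names are dates and whose cells are ticker symbols, with shorter columns padded by missing values. The helper name `to_feature_matrix` and the sample tickers and dates are hypothetical.

```python
import pandas as pd

def to_feature_matrix(raw: pd.DataFrame) -> pd.DataFrame:
    """Convert date columns of tickers into a ticker-by-date 0/1 matrix."""
    # Long form: one (date, ticker) pair per row; missing padding is dropped.
    long = raw.melt(var_name="date", value_name="ticker").dropna(subset=["ticker"])
    # Cross-tabulate: rows are tickers, columns are dates, cells count occurrences.
    counts = pd.crosstab(long["ticker"], long["date"])
    # Clip to 1 so a ticker listed twice on one date is still a binary indicator.
    return counts.clip(upper=1).astype(int)

# Hypothetical sample: three dates and the tickers that qualified on each.
raw = pd.DataFrame({
    "230103": ["AAA", "BBB", None],
    "230104": ["BBB", "CCC", "AAA"],
    "230105": ["CCC", None, None],
})
features = to_feature_matrix(raw)
print(features)
```

The resulting `features` frame has one row per ticker and one column per date, which is the layout the clustering steps expect.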

The KMeans algorithm clusters data by trying to separate the samples (here, tickers) into n groups of equal variance, minimizing a criterion known as the inertia (within-cluster sum of squares).
This algorithm requires the number of clusters to be specified.
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
from sklearn.cluster import KMeans
Either the within-cluster sum of squares (inertia) or the silhouette score can be used to determine a suitable number of clusters. It is advantageous to take the principal components (PCA) of the feature matrix before computing these metrics, as explained below.
After the feature matrix is read in as a dataframe df,
kmeans=KMeans(n_clusters=clusterQty)
# fit_predict fits the model and returns the cluster label of each ticker
labels=kmeans.fit_predict(df)
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].
Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster. Therefore, silhouette analysis can be used to choose a suitable number of clusters for KMeans clustering.
Before running the silhouette analysis, apply PCA to the feature matrix.
from sklearn.decomposition import PCA
pca=PCA()
z=pca.fit_transform(df)
If the silhouette analysis shows that the clusters are well separated, with an acceptable silhouette score, use that cluster count as the number of clusters for the KMeans clustering model.
Plot the silhouette score distribution for 20 clusters. See the PCA plot and the silhouette score plot.
from sklearn.metrics import silhouette_score, silhouette_samples
from yellowbrick.cluster import SilhouetteVisualizer
from yellowbrick.style import rcmod
from yellowbrick.style.colors import resolve_colors
Reference:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
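The silhouette-based selection described above can be sketched as a loop over candidate cluster counts. This is a minimal sketch under stated assumptions: the 0/1 matrix is synthetic (three planted groups of tickers, not the project's data), and the choices of 5 principal components and the range of 2 to 7 clusters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the ticker-by-date matrix (assumption, not real data):
# three groups of 40 tickers, each mostly active on its own block of 10 dates.
rng = np.random.default_rng(0)
blocks = []
for g in range(3):
    p = np.full(30, 0.05)
    p[g * 10:(g + 1) * 10] = 0.9  # probability of high momentum on this group's dates
    blocks.append(rng.random((40, 30)) < p)
X = np.vstack(blocks).astype(int)

# Project onto principal components before scoring.
z = PCA(n_components=5, random_state=0).fit_transform(X)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(z)
    scores[k] = silhouette_score(z, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the scores may be much closer together than in this planted example, so the highest score is a guide rather than a definitive answer.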


The KMeans estimator computes the inertia, so it can be fitted repeatedly to visualize how the inertia falls as the number of clusters increases. The inertia is the sum of squared distances of the samples (ticker symbols) to their closest cluster center.
For example, plot the inertia for up to 25 clusters, and use the percentage reduction from one inertia value to the next to select a suitable number of clusters. The following plot shows the absolute inertia values against the number of clusters. The inertia reductions are:
[11%, 4.7%, 3%, 2.6%, 2%, 1.7%, 1.3%, 1.8%, 1.4%, 0.6%, 1.7%, 0.9%, 1.1%, 0.9%, 1.5%, -0.3%, 1%, 0.85%, 0.81%, 1%, 0.96%, 0.55%, 0.45%]. The reductions taper off gradually with no sharp break, so the curve may not have a clear elbow from which to read off the number of clusters.
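The percentage reductions can be computed along the following lines. This is a minimal sketch, assuming each reduction is measured relative to the previous inertia value (the text does not state the base). The helper `percent_reductions` is hypothetical, and the 0/1 matrix is synthetic rather than the project's data.

```python
import numpy as np
from sklearn.cluster import KMeans

def percent_reductions(inertias):
    """Percentage drop in inertia from each cluster count to the next."""
    return [100.0 * (prev - cur) / prev for prev, cur in zip(inertias, inertias[1:])]

# Synthetic 0/1 matrix standing in for the ticker-by-date features (assumption).
rng = np.random.default_rng(0)
X = (rng.random((120, 30)) < 0.2).astype(int)

# Fit KMeans for 1..10 clusters and record the inertia of each fit.
ks = list(range(1, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

for k, drop in zip(ks[1:], percent_reductions(inertias)):
    print(f"{k - 1} -> {k} clusters: {drop:.1f}% lower inertia")
```

A negative entry means the fit with one more cluster ended with a higher inertia than the previous fit, which can happen because KMeans converges to a local optimum that depends on its initialization.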
Use a cluster quantity of 20 as an experiment. For 418 ticker symbols, we would like the majority of clusters to contain small groups of ticker symbols. One reason is that multiple investment management firms manage Exchange Traded Funds (ETFs) with similar portfolios, distinguished only by their unique ticker symbols, and there is a limited number of such similar ETFs.

Multi-Dimensional Scaling (MDS) creates a map that displays the relative positions of a number of objects while representing the distances between them as accurately as possible.
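from sklearn.manifold import MDS
mds=MDS(n_components=2, verbose=1, eps=1e-5)
# fit_transform returns the 2-D coordinates of each ticker
z_mds=mds.fit_transform(df)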
Reference:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
The t-SNE technique preserves the local structure of the data, meaning similar data points stay close together in the visualization while dissimilar points are separated. t-SNE expands dense clusters and contracts sparse ones, and tends to place clusters at similar distances apart. This makes it easier to see possible sub-clusters within an individual cluster.
The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity.
from sklearn.manifold import TSNE
tsne=TSNE(n_components=2, verbose=1, perplexity=30)
z_tsne=tsne.fit_transform(df)
Reference:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
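To compare the two embeddings against the KMeans labels, the coordinates can be collected into one table per ticker, ready for a scatter plot colored by cluster. This is a minimal sketch under stated assumptions: the 0/1 matrix is synthetic, the ticker names are hypothetical, and the cluster count of 4 and perplexity of 10 are chosen only to suit the small sample.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import MDS, TSNE

# Synthetic 0/1 matrix standing in for the ticker-by-date features (assumption).
rng = np.random.default_rng(0)
X = (rng.random((90, 30)) < 0.2).astype(float)
tickers = [f"T{i:03d}" for i in range(len(X))]  # hypothetical ticker names

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# perplexity must be smaller than the number of samples.
z_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
z_mds = MDS(n_components=2, random_state=0).fit_transform(X)

# One row per ticker: cluster label plus both sets of 2-D coordinates.
coords = pd.DataFrame({
    "cluster": labels,
    "tsne_x": z_tsne[:, 0], "tsne_y": z_tsne[:, 1],
    "mds_x": z_mds[:, 0], "mds_y": z_mds[:, 1],
}, index=tickers)
print(coords.head())
```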
The output of KMeans clustering is twenty clusters of ticker symbols. A snapshot of these symbol clusters is shown below. Some clusters still contain a large group of symbols, which can be clustered further using another iteration of KMeans clustering.
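Listing the symbols in each cluster and running a second KMeans pass on a large cluster can be sketched as follows. This is a minimal sketch, assuming the feature matrix is a DataFrame indexed by ticker. The data is synthetic, the ticker names are hypothetical, and the counts of 5 clusters and 3 sub-clusters are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic ticker-by-date matrix (assumption); tickers are hypothetical names.
rng = np.random.default_rng(0)
df = pd.DataFrame((rng.random((100, 30)) < 0.2).astype(int),
                  index=[f"T{i:03d}" for i in range(100)])

labels = pd.Series(
    KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(df),
    index=df.index, name="cluster")

# Ticker symbols per cluster, and the size of each cluster.
members = {c: list(idx) for c, idx in labels.groupby(labels).groups.items()}
sizes = labels.value_counts()

# Second pass: split the largest cluster into smaller sub-clusters.
largest = sizes.idxmax()
subset = df.loc[labels == largest]
sub_labels = pd.Series(
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(subset),
    index=subset.index, name="sub_cluster")
print(sizes.to_dict(), sub_labels.value_counts().to_dict())
```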

Cluster index 0 gathers the ticker symbols associated with companies in Asia and emerging markets. It also indicates that they had price momentum similar to that of precious metals within the observed 144 days.

Cluster index 2 gathers the ticker symbols associated with US companies, which are large caps with strong momentum and quality.

Cluster index 3 gathers the ticker symbols associated with companies in international or emerging markets (less developed countries), excluding the US.

Cluster index 4 gathers the ticker symbols associated with the energy sector. The US materials sector happened to have similar momentum during the 144 days.

Cluster index 6 gathers the ticker symbols associated with the gold metal sector.

Cluster index 7 gathers the ticker symbols associated with the bond market and the consumer staples equity sector, both of which are defensive groups against an equity market downturn.

Cluster index 9 gathers the ticker symbols associated with US small and mid caps. There are sub-clusters such as infrastructure, utilities, industrials, construction, financials, and cruise lines. These groups are more interest-rate sensitive.

Cluster index 11 gathers the ticker symbols associated with technology growth companies.

Cluster index 17 gathers the ticker symbols associated with healthcare companies.
