Part 5: Clustering

Agree to disagree:
The uncertainty, the ambiguity, and the clustering algorithm

Rui Qiu

Updated on 2021-10-11.


Welcome back to the chapter on learning, practicing, and finding the limitation of the clustering algorithm. For now, the whole portfolio seems to be deviating from the original dozen questions raised. Nevertheless, the overall goal is to showcase the capability as a data scientist, isn’t it?

Back to the topic, the focus for this part of the portfolio is to apply various clustering algorithms with multiple sets of metrics, methods, and parameters to approximate the similarity among items. The items here are (1) NBA players active in the last regular season (2020-2021); (2) News titles and headlines from four websites within the time frame of a month.

Since the initially collected data won’t fit in well here, new waves of data are assembled in order to accomplish these tasks. The details will be mentioned below.

 

 

1. Clustering NBA players by their stats

The data comes directly from Basketball Reference. This is because the previously cleaned record data from the last chapter consist of lots of non-numeric data. Specifically, the categorical data are used to describe the shot attempt scenario.

So for the data used this time, these are elementary but thorough statistics appearing in any box score of a basketball game. They are nothing “advanced” like points per 36 minutes, etc. However, it’s undeniable that these numeric values describe the playing style of a player. (Honestly, stats don’t lie, but they don’t tell the whole story.) In this way, hopefully, they can tell the playing position in the game.

1.1 Data cleaning

You can access the raw data set with the link below:

The preliminary cleaning contains three major tasks:

  1. Keep the total statistics if a player got transferred from Team A to Team B within the season.
  2. Clean the player’s name.
  3. Ideally, the label will be the player’s position, so we remove some variables having nothing to with the position, such as rank, team, and position (itself).
  4. Drop players with missing data. (Very few players with limited playing minutes have this trait.)

The variables before cleaning:

glimpse-of-stats
540 rows, 28 columns.
Open in new tab to see in full size.

The variables after cleaning:

glimpse-of-stats-2
503 rows, 26 columns.
Open in new tab to see in full size.

1.2 K-means clustering with different $\ k$ values

A re-usable function iter_kmeans() is defined to take in the data, the value of $\ k$, a string “flag” remark, and the detailed caption, and outputs the clustering visualization.

all-cluster-k2
k = 2
Open in new tab to see in full size.
all-cluster-k3
k = 3
Open in new tab to see in full size.
all-cluster-k4
k = 4
Open in new tab to see in full size.
all-cluster-k5
k = 5
Open in new tab to see in full size.

Since we aim to cluster the data by their playing position, the clustering results shown above are more or less dominated by playing time. It’s pretty apparent players clustered on the right-hand side of the plot are generally starters. In contrast, the ones on the left are benchwarmers mostly. It is not hard to understand that players who have fewer opportunities have a high tendency to be short in stats. Moreover, the clusters generated are not satisfying, to be honest. Too much overlapping.

A proper fix to this is to restrict the selection of players to the ones with sufficient playing time. At the same time, further slice the involved variables to highlight the de facto playing style of a player. In other words, it’s compulsory to get rid of those variables not contributing enough to that. For instance:

In this way, a list of 50 players ranked by their playing minutes in the last season is kept. For the stats remained, they are:

Keep in mind that this selection of variables might be as flawed as the previous “all-inclusive” strategy. Nevertheless, at least it should highlight some playing styles.

If you run the iter_kmeans() function again with the modifications, you probably would get similar results below.

limited-cluster-k2
k = 2
Open in new tab to see in full size.
limited-cluster-k3
k = 3
Open in new tab to see in full size.
limited-cluster-k4
k = 4
Open in new tab to see in full size.
limited-cluster-k5
k = 5
Open in new tab to see in full size.

Significant improvement can be concluded as the clustering by small value $k$ seems gratifying. Still, the overlapping issue remains as $\ k$ increases to 4 and up.

1.3 Hierarchical clustering

Hierarchical clustering is an alternative way to k-means used above to detect similarities and group the records in datasets. One of the most attempting traits of hierarchical clustering is no requirement for a pre-determined $\ k$ value.

To be fair, this section is only touching the very surface of hierarchical clustering. It is not designed to dive deep into the topic. The post Hierarchical Cluster Analysis by the University of Cincinnati is a great reading on R package selection, the algorithm itself, and understanding the dendrograms.

For this part of the task, two types of hierarchical clustering are performed and ultimately compared.

Both hierarchical clusterings are agglomerative (bottom-up):

dendrogram-complete
Dendrogram with setup 1.
Open in new tab to see in full size.
dendrogram-ward-2d
Dendrogram with setup 2.
Open in new tab to see in full size.

Although the two dendrograms seem to be different and are not identical, they still share a lot in common. Please ignore the color mapping in these two, and it’s not hard to detect that some players are always grouped with other certain players. And the most important takeaway from these two is that they both pick $\ k=3$.

To further compare the dendrograms, a tanglegram is drawn to show the big picture in comparison:

tanglegram
Entanglement between two dendrograms is 0.24
Open in new tab to see in full size.

The entanglement shows the quality of the alignment of the two trees (dendrograms). 0.24 is a rather low entanglement coefficient which corresponds to a good alignment.

1.4 Distance matrices with various metrics

This section focuses more on the similarity/dissimilarity of numeric vectors. A total of four different metrics are used to calculate the distances between two players. Note that the player names on x-axis are removed to keep the visualization result tidy. They are essentially the same as y-axis.

euclidean
Euclidean distance.
Open in new tab to see in full size.
manhattan
Manhattan distance.
Open in new tab to see in full size.

$\text{cosine similarity} = \frac{\mathbf{x}\cdot \mathbf{y}}{\sqrt{\mathbf{x}\cdot\mathbf{x}}\sqrt{\mathbf{y}\cdot\mathbf{y}}}$
The dist() function in R does not have Cosine Similarity as a distance metric. It is not hard to implement with R though.
cosine
Cosine similarity.
Open in new tab to see in full size.
canberra
Canberra distance.
Open in new tab to see in full size.

Even though the player names are mixed up in each of the heatmaps, three out of four, except the one generated by Canberra distance, roughly show the evidence of three clustered groups in the sense of similarity.

1.5 Selecting the optimal $\ k$ value

In previous sections of this part, different $\ k$ values are visually evaluated but never rigorously determined.

1.5.1 Elbow method

optimal-k-elbow
Elbow method suggests k=2
Open in new tab to see in full size.

The elbow method suggests $\ k=2$ is the optimal value since the drop from 1 to 2 is the greatest, thus making $\ k=2$ an "elbow."

1.5.2 Silhouette score

optimal-k-silhouette
Silhouette score suggests k=3
Open in new tab to see in full size.

The Silhouette score at $\ k=3$ is maximized, suggesting $\ k=3$ to be optimal. However, $\ k=2$ is very close in the score as well. Hence, the selection here is not decisive.

1.5.3 Gap statistic

optimal-k-gap-stat
Gap statistic suggests k=1
Open in new tab to see in full size.

The gap statistic is slightly off since it hints at an optimal value of $\ k=1$

This one basically makes no sense.

1.5.4 Use fviz_nbclust() to automate

Another approach to automate the selection procedure is to utilize fviz_hbclust() function from {factoextra}. The function determines and visualizes the optimal number of clusters using different methods, within-cluster sum of squares (elbow), average silhouette and gap statistics. The most convenient part is its text output, telling you the optimal $\ k$ in plain text.

*** : The Hubert index is a graphical method of
    determining the number of clusters. In the plot
    of Hubert index, we seek a significant knee
    that corresponds to a significant increase of
    the value of the measure i.e the significant
    peak in Hubert index second differences plot.

*** : The D index is a graphical method of determining
    the number of clusters. In the plot of D index, we
    seek a significant knee (the significant peak in
    Dindex second differences plot) that corresponds
    to a significant increase of the value of the
    measure.

*****************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 8 proposed 3 as the best number of clusters
* 4 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters

                ***** Conclusion *****

* According to the majority rule, the best number of
  clusters is  3


*****************************************************
Among all indices:
===================
* 2 proposed  0 as the best number of clusters
* 1 proposed  1 as the best number of clusters
* 7 proposed  2 as the best number of clusters
* 8 proposed  3 as the best number of clusters
* 4 proposed  4 as the best number of clusters
* 4 proposed  5 as the best number of clusters

Conclusion
=========================
* According to the majority rule, the best number of
  clusters is  3 .
            

Again, this method agrees on $\ k=3$. The voting results can be also visualized:

optimal-k-vote
Voting for k=3
Open in new tab to see in full size.

1.6 Cluster prediction

Please see the other tab "Cluster Prediction" for more details.

1.7 Summary

The previous clustering suggests either $\ k=3$ or $\ k=2$ are the most suitable number of clusters hidden within the data. However, the original label is the position of a player, which contains five possible values: PG, SG, SF, PF, and C. Despite that, the clustering with $\ k=5$ is horrible, with lots of overlapping areas. This means the clustering method does not learn quite well to describe a player’s playing position by his on-paper statistics.

Nevertheless, here are takeaways from this experiment:

2. Clustering news (hopefully) to distinguish the categories

Due to the limit of the effective width of this web design, please bear that most visualizations down there are in better visual experience when clicked and enlarged.

2.1 Data cleaning

The original design is to distinguish sources that cover NBA news. But the test run is not very convincing as the clustering messes up everything. An assumption would be that word usage is pretty much identical for most of them. In the end, this is not identifying authors by their writing styles. Therefore, a further trial is conducted after the introduction of more diversified news sources.

The new inclusion includes:

  • ESPN: news about sports.
  • MacRumors: news about consumer electronics.
  • NASA: news about astronomy.
  • Polygon: news about gaming.
A wordcloud of the terms excluding the stopwords is shown below.
wordcloud
Wordcloud of terms from news data.
Open in new tab to see in full size.

2.2 K-means clustering with different $\ k$ values

The 400 news articles are concatenated into one data frame. Unlike last time, the title and headline are combined into one string variable. The text will later be vectorized by two different vectorizers, CountVectorizer and TfidfVectorizer, respectively.

  Source Date Text
0 ESPN 2021-09-09 The battle for the Cy Hawk another test for Oh...
1 ESPN 2021-09-09 Dak is back Why the Cowboys have zero concerns...
2 ESPN 2021-09-09 Why the WNBA draft class has struggled to make...
3 ESPN 2021-09-09 Is UCLA finally on track with Chip Kelly After...
4 ESPN 2021-09-12 Broncos WR Jeudy carted off with lower leg inj...
... ... ... ...
395 MacRumors 2021-10-05 Apple Agrees to Pay Million to Settle Lawsuit ...
396 MacRumors 2021-10-05 Apple Commemorates th Anniversary of Steve Job...
397 MacRumors 2021-09-30 Apple Seeds macOS Big Sur Beta to Developers W...
398 MacRumors 2021-10-01 MacRumors Giveaway Win a Flight Bag Leather Br...
399 MacRumors 2021-09-27 Google Criticizes EU Regulators for Ignoring A...
The cleaned data set, which are Document-Term Matrix and Document-Frequency Matrix in this case, are stored in these two intermediate files:

The screenshots of these two data frames above are shown below.

DTM by CountVectorizer.
Open in new tab to see in full size.
DFM by TfidfVectorizer.
Open in new tab to see in full size.

A re-usable function iterate_w_k() is implemented to perform dimension reduction, K-means clustering over the reduced features, and output the visualization along with two metrics (the homogeneity and Silhouette score.)

The clusterings with CountVectorizer preprocessed data and k from 2 to 5:

Clustering with CountVectorizer.
Open in new tab to see in full size.

The clusterings with TfidfVectorizer preprocessed data and k from 2 to 5:

Clustering with TfidfVectorizer.
Open in new tab to see in full size.

With the prior knowledge that if the clustering truly learns anything about resources, k should be 4 without any debate.

The following table lists the homogeneity score, which tells how alike the data are with a cluster, and the Silhouette score, which describes how far the clusters are from each other. Ideally, the higher, the better for both.

  k=2 k=3
homogeneity score (CountVectorizer) 0.2914526756355663 0.36946391298809556
Silhouette score (CountVectorizer) 0.16836203241503322 0.17904908336242356
homogeneity score (TfidfVectorizer) 0.06146851457954772 0.39324808032485964
Silhouette score (TfidfVectorizer) 0.029815094893238428 0.034968169736253746
  k=4 k=5
homogeneity score (CountVectorizer) 0.29680912347283095 0.45727954760694456
Silhouette score (CountVectorizer) 0.17863738778270835 0.08412074984413664
homogeneity score (TfidfVectorizer) 0.5800599275998554 0.5257363389005074
Silhouette score (TfidfVectorizer) 0.04392759797869041 0.04084535197279959

Roughly speaking, k=4 with TfidfVectorizer gives a decent clustering result. For the procedures below, k=4 will be selected as an example.

2.3 DBSCAN

The Density-Based Spatial Clustering of Applications with Noise detects the samples of high density and expands clusters from them. Since the Silhouette scores above are relatively low, DBSCAN is an alternative to be considered. With k=4, the DBSCAN clusters are:


Open in new tab to see in full size.

However, hyper-parameter tuning in DBSCAN is really tricky. The following two articles provide very detailed solutions to tuning.

It will take pages to dissect the problem, which is not the focus of this part of the portfolio.

2.4 Hierarchical clustering

The hierarchical clustering, unlike anything above, suggests that k=3 should also be considered.


Open in new tab to see in full size.

2.5 Optimal $\ k$

So back to the Silhouette scores. A thorough but tedious Silhouette analysis is conducted on k from value 2 to 5.


Open in new tab to see in full size.

The thickness of each color strip indicates the size of a cluster under its k value. k=3 is a lousy pick since there are some clusters with below-average Silhouette scores. k=5 is another poor choice due to the fluctuation in the size of the silhouette plots.

For n_clusters = 2
The average silhouette_score is : 0.7721933252661253
For n_clusters = 3
The average silhouette_score is : 0.741629848098952
For n_clusters = 4
The average silhouette_score is : 0.7170194571369254
For n_clusters = 5
The average silhouette_score is : 0.47913326363536746
            

Although k=2 delivers a higher average Silhouette score, its lead against k=4 is not significant. In addition, the cluster sizes of k=2 are more extreme than those of k=4.

With prior knowledge generally known to everyone, k=4 is a more logical choice.

2.6 Summary

tf-idf and counts give different results. This is totally reasonable since two of them have unique emphasis. Count is just the plain frequency of certain words' appearances in a document, while tf-idf, according to Wikipedia, "increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general."

When TfidfVectorizer is selected as the text processing vectorizer, a series of trials to find the optimal k value begins. Although some of the results don't coincide, they still manage to give some insights about the text data from their own perspectives.

A potential explanation for poor performance in text clustering could be: the sample size is limited, and the text material within each news piece is not long enough at the same time. A more appropriate approach to study text data will be introduced in next section.

3. Topic modeling with news

This is a bonus part that might be interesting under the same topic. Frankly, text clustering is something fundamental to do in natural language processing. But the story never ends there.

A topic is defined by a set of keywords, which are so important that each of them have a probability of occurrence for the topic. Naturally, different topics have their own keyword sets with corresponding probabilities, while sharing some common keywords at the same time. And, looking at the this in a larger picture, a document in the corpus can contain several topics.

On the contrary, the clustering algorithm clusters documents into different groups based on a selected similarity metric.

For the use case in this chapter of the portfolio, topic modeling seems to be a preferable choice, especially the prior knowledge we have about the documents are their sources, which basically are equivalent to news topics they report and cover.

A Latent Dirichlet allocation (LDA) is used to perform a topic modeling on the same news data.

R script for topic modeling: topic-modeling.R

A quick and dirty LDA gives the following 4 topics ranked by beta value, which is the per-topic-per-word probability.

The tf-idf data frame containing all the information of 400 text segements is:

# A tibble: 10,172 × 7
document term        count    tf   idf tf_idf source
                
1 text303  cinderella      7 0.259  5.99  1.55  Polygon
2 text325  castlevania     5 0.179  5.99  1.07  Polygon
3 text89   reid            2 0.154  5.99  0.922 ESPN
4 text308  battlefield     3 0.15   5.99  0.899 Polygon
5 text21   ap              2 0.154  5.30  0.815 ESPN
6 text106  storage         5 0.147  5.30  0.779 MacRumors
7 text50   nhl             3 0.158  4.89  0.773 ESPN
8 text27   trolls          2 0.125  5.99  0.749 ESPN
9 text27   bunch           2 0.125  5.99  0.749 ESPN
10 text47   redick          2 0.125  5.99  0.749 ESPN
# … with 10,162 more rows
                    

The per-topic-per-word (gamma) probability for all topic-document combination is:

# A tibble: 1,600 × 3
document topic    gamma
        
1 text1        1 0.000808
2 text2        1 0.00136
3 text3        1 0.996
4 text4        1 0.00143
5 text5        1 0.00123
6 text6        1 0.000891
7 text7        1 0.000834
8 text8        1 0.000957
9 text9        1 0.469
10 text10       1 0.00117
# … with 1,590 more rows
                    

The visualization of top words by topic is listed below.

lda
LDA 4 topic modeling.
Open in new tab to see in full size.

It's not a secret that this LDA topic modeling has its own problem as well: although the stopwords are removed in the preparation of dfm, there are still lots of meaningless terms in the final output.

References