Part 8: Naïve Bayes

A tale of two classifiers, part 1

Rui Qiu

Updated on 2021-11-21.


The main focus of this part of the portfolio and the next is to apply two classifiers, namely naïve Bayes and SVM, to make the following two predictions:

  1. To predict the shot attempt results of the Toronto Raptors (2020-2021 season).
  2. To predict the popularity of Reddit threads based on their titles.

Mixed record data with R

Data and scripts

The data consists of Toronto Raptors’ shot attempt data (mostly from 2020-2021 season) with the following general structure (with both numeric and categorical features):

tor-data
A glimpse of shot attempts data.

The data set is also split 70-30 into training and testing sets, stratified by the shot result.

tor-split
Splitting data.
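
A minimal sketch of the split, assuming the shots live in a data frame called tor_shots with a binary result column (both names are assumptions):

```r
library(caret)

set.seed(42)  # reproducibility

# createDataPartition() samples within each level of `result`, so the
# 70-30 split keeps roughly the same made/missed ratio on both sides.
train_idx <- createDataPartition(tor_shots$result, p = 0.7, list = FALSE)
tor_train <- tor_shots[train_idx, ]
tor_test  <- tor_shots[-train_idx, ]
```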

Model tuning

The {caret} package is used to conduct 10-fold cross-validation.

A first, truly naïve version of the naïve Bayes classifier, without any parameter tuning, is fitted as below:

nb-m1
nb.m1
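
A sketch of how this baseline could be fitted, assuming {caret}'s "naive_bayes" method (backed by the {naivebayes} package) and the training split from above:

```r
library(caret)

# 10-fold cross-validation, as described above.
ctrl <- trainControl(method = "cv", number = 10)

# Untuned baseline: caret picks a small default grid of hyperparameters.
nb.m1 <- train(
  result ~ .,
  data      = tor_train,
  method    = "naive_bayes",
  trControl = ctrl
)

# Cross-validated confusion matrix (in percentages) and mean accuracy.
confusionMatrix(nb.m1)
```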

Take a look at its confusion matrix and mean accuracy:

nb-m1-cm
Confusion matrix of nb.m1.

The overall accuracy is 81.12%, which is not bad. But the beauty of model selection lies in fine-tuning the hyperparameters so that the model can improve.

The following three hyperparameters are involved (these are the three exposed by {caret}'s naïve Bayes method):

  1. laplace (a.k.a. fL): the Laplace smoothing correction.
  2. usekernel: whether to estimate the numeric predictors with a kernel density instead of a Gaussian.
  3. adjust: the bandwidth adjustment of that kernel density estimate.

Again, the grid search approach is used to accomplish this mission, as sketched below. In addition, the numeric predictors are preprocessed with “center” and “scale”, i.e., standardized.
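
A minimal sketch of the tuning call, reusing tor_train from above; the grid values here are illustrative assumptions, not the exact grid used:

```r
# Candidate values for the three tuned hyperparameters (illustrative).
search_grid <- expand.grid(
  laplace   = 0:2,                # Laplace smoothing correction
  usekernel = c(TRUE, FALSE),     # kernel density estimate for numeric predictors
  adjust    = seq(0.5, 1.5, 0.5)  # KDE bandwidth adjustment
)

nb.m2 <- train(
  result ~ .,
  data       = tor_train,
  method     = "naive_bayes",
  trControl  = trainControl(method = "cv", number = 10),
  tuneGrid   = search_grid,
  preProcess = c("center", "scale")  # standardize the numeric predictors
)

# Rank the candidate models by cross-validated accuracy and keep the top 5.
head(nb.m2$results[order(-nb.m2$results$Accuracy), ], 5)
```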

The resulting models are ranked by accuracy, the selection criterion. The top five are displayed below:

nb-m2-top5
Top 5 models from nb.m2 by accuracy.

The tuning process can also be visualized in the following two charts:

nb-tuning
The tuning process of nb.m2.

The overall accuracy of the tuned naïve Bayes is 81.21%, slightly better than the untuned version.

nb-m2-cm
Confusion matrix of nb.m2 (best model).

The selected naïve Bayes model is then validated on the testing data, which returns the following confusion matrix:

nb-m2-pred-cm
Confusion matrix of nb.m2 (best model) on testing data.
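
The scoring step is short in {caret}; a sketch assuming the objects defined above:

```r
# Predict the held-out 30% and compare against the true shot results.
pred <- predict(nb.m2, newdata = tor_test)
confusionMatrix(pred, tor_test$result)
```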

The result is very uplifting, as the model actually performs better on the testing data than on the training data, reaching an accuracy of 82.63%.

Feature importance

Feature importance is rather hard to extract from a train object when the naïve Bayes model is fitted on imbalanced categorical data. Some features include more than a dozen possible categories; player, for example, covers almost every active player on the Toronto Raptors roster from last season. An alternative way to represent the relative importance of the features is therefore to display all of the conditional probabilities: the features whose conditional probabilities vary the most between the two target classes are the more important ones.

nb-feat
Conditional probabilities of all features of nb.m2 (best model) on testing data.
Open in a new tab for more details.
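
With {caret}'s "naive_bayes" method, these conditional tables can be pulled straight from the underlying fitted model; a sketch (the player column name is an assumption):

```r
# The fitted {naivebayes} model is stored in nb.m2$finalModel.
tables(nb.m2$finalModel)            # all per-feature conditional distributions
# nb.m2$finalModel$tables$player    # e.g. a single categorical feature
```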

Visualization of NB prediction

To see how the model works in practice, two visualizations are plotted: the actual shots made/missed versus the predicted shots made/missed.

nb-actual
Actual shot attempts.
nb-predicted
Predicted shot attempts.

It turns out most of the deviations between reality and prediction are in the midrange and on 45-degree three-pointers.

Finally, a visualization is carried out to show the overall prediction accuracy of a fine-tuned naïve Bayes classifier:

nb-prediction-accuracy
nb.m2 prediction accuracy.

Interpretation

Recall the previous attempts to predict whether a shot is made with tree-based classifiers: the overall accuracy was barely over 50%. As mentioned last time, if a classifier, especially a binary classifier, performs no better than flipping a coin, it is a really bad model.

Is it due to the nature of the data? The answer from last time was, “it's possible.”

However, a very similar subset of the data is used here to capture the shot attempt pattern with a naïve Bayes classifier. The core idea is the following multinomial likelihood used with Bayes' theorem:

$p(\mathbf{x}|C_k)=\frac{(\sum^n_{i=1}x_i)!}{\prod^n_{i=1}x_i!}\prod^n_{i=1}p_{ki}^{x_i}$

where the $x_i$'s are the observed values of the predictors and $p_{ki}$ is the probability of predictor value $i$ under class $C_k$.

Then the class is selected as the argmax of such a product over all classes $C_k$. It is more or less a voting procedure: the class that results in the higher probability becomes the predicted class for the record.
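
Written out with the class prior $p(C_k)$, this is the usual maximum a posteriori decision rule:

$\hat{y}=\underset{k}{\operatorname{argmax}}\ p(C_k)\prod^n_{i=1}p_{ki}^{x_i}$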

Intuitively, given a set of predictors, the classifier calculates, for each class $C_k$, the separate probabilities of the observed predictor values under that class. The posterior probability is then proportional to the joint product of these separate, assumed-independent probabilities.

However, one should note that in reality, true independence among features/variables is very rare.

Still, the naïve Bayes classifier gives a decent prediction to start this “tale of two classifiers.” More to be discussed in the next part of this portfolio.

Text data with Python

Data and scripts

Just like the record data, the text data in this section is also from the tree-based classification chapter.

The data is preprocessed by removing stopwords and converting to lower case, and is then vectorized as tf-idf. What is different from last time is that the texts are tokenized as unigrams rather than bigrams.
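
A sketch of this preprocessing step, assuming scikit-learn's TfidfVectorizer and a list of thread titles called titles (the variable name is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # remove English stopwords
    lowercase=True,        # convert to lower case
    ngram_range=(1, 1),    # unigrams this time, not bigrams
)
X = vectorizer.fit_transform(titles)  # sparse tf-idf document-term matrix
```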

Additionally, the target variable upvote_ratio is replaced by popularity, a categorical variable indicating how popular a thread is within the time span of a year. To keep the classes balanced, popularity is simply set to either Extremely Popular (top 50) or Very Popular (top 51-100).

The tf-idf values are also standardized, although the matrix is still very sparse due to the nature of text data: not many words are repeated across documents.

The general structure of the preprocessed data looks like this:

reddit-data
Reddit threads text data.

A wordcloud is also plotted like last time.

reddit-title-wordcloud2
A wordcloud of text data.

Of course, the text data is also split into 80-20 training and testing sets.

Model fitting

Then the preprocessed data is fitted with a multinomial naïve Bayes model. The resulting confusion matrix is plotted below:

reddit-nb-cm
Confusion matrix of naïve Bayes classifier on text data.
reddit-nb-metrics
Metrics table of naïve Bayes classifier on text data.

The overall accuracy is 60%.
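
A minimal sketch of the fitting step, assuming scikit-learn, the tf-idf matrix X from above, and labels y holding the two popularity classes (names are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# The 80-20 split described earlier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
print(accuracy_score(y_test, y_pred))   # ~0.60 in this experiment
print(confusion_matrix(y_test, y_pred))
```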

Furthermore, a naïve Bayes model in fact calculates the probabilities of each class. The following histogram displays the predicted probabilities of the test records for both classes. Orange stands for Extremely Popular, while blue means Very Popular.

reddit-nb-prediction-hist
Prediction histogram of naïve Bayes classifier on text data.

Finally, an ROC curve is plotted to assess the “goodness” of such a model, which will be discussed in a later paragraph.

reddit-nb-roc
ROC of naïve Bayes classifier on text data.
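
A sketch of how such a curve could be produced from the fitted model, reusing nb, X_test, and y_test from the previous snippet:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the second class (sklearn orders classes alphabetically).
proba = nb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba, pos_label=nb.classes_[1])

plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line for reference
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```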

Feature importance

The feature importance of the naïve Bayes model on text data can be represented in the following numpy array. Since the document-term matrix (DTM) used in this part is rather large and sparse, the feature importance array is a correspondingly long, hard-to-interpret list of “relative importance” values. A further dive into the data reveals that the most important feature is the word “deep”, which could be related to some amazing long-distance three-pointers in the games.

In addition, the feature standard deviations are also included:

reddit-nb-feat
Feature importance of naïve Bayes classifier on text data.
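
The exact computation behind the array above is not shown; one common way to get a comparable ranking from a fitted MultinomialNB is to contrast the per-class log probabilities, sketched here with the nb and vectorizer objects from the earlier snippets:

```python
import numpy as np

# feature_log_prob_ has shape (n_classes, n_features): log P(word | class).
log_prob_diff = nb.feature_log_prob_[0] - nb.feature_log_prob_[1]
words = np.array(vectorizer.get_feature_names_out())  # sklearn >= 1.0

# Words whose probability differs most between the two classes.
order = np.argsort(np.abs(log_prob_diff))[::-1]
for w, d in zip(words[order][:10], log_prob_diff[order][:10]):
    print(f"{w:15s} {d:+.3f}")  # "deep" should rank near the top here
```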

Interpretation

Without doubt, the naïve Bayes classifier with an accuracy of 60% is not ideal, especially on a testing set of only 20 records. It is just slightly better than flipping a coin, or than blindly guessing that everything belongs to a single class. One cannot even be sure whether the 60% is simply “luck”. It is almost safe to conclude that predicting popularity based solely on the word choice of titles is impossible.

The ROC curve in the last plot also demonstrates the model's poor performance. In contrast, a decent model's curve should hug the top-left corner as much as possible.

Nevertheless, there are some highlights of such a model:

  1. The training process is extremely fast. (Thanks to the small size of the data as well.)
  2. The results are easy to interpret. A vector of probabilities directly tells the reader which class the model “thinks” the record belongs to.
  3. The model needs almost no tuning.

But we also need to be careful about the assumption that the variables are independent of each other. This rarely holds in real life, including in this case. Just imagine whether the occurrences of the following two words are independent: “Washington” and “Wizards”.

No model is perfect; some even violate their own assumptions.