Tune and interpret decision trees
Rui Qiu
Updated on 2021-11-05.
In this part of the portfolio, the goal is to build a few basic decision tree models based on various selection criteria to make two rough but compelling predictions. The first one is the upvote ratio of a thread in Reddit’s subreddit r/nba. Each user-created thread’s title is entirely text-based, which makes it an excellent indicator of its content. Other Reddit users can either upvote or downvote the threads based on their personal preferences. The second prediction is whether a player could make a shot in a particular on-court scenario.
Data
Script
- reddit-dt.ipynb
- reddit-scraper-v2.R
- shot-dt.R (including the random forest model tuning)
Reddit thread upvote ratio prediction
Data
Objectively, the model would be better if the response stayed numeric, since the raw data already provides an upvote percentage; a regression model would suffice to predict its value. However, as required, the ratio is factorized into a set of customized strata.
The modified scraper extracts the top 100 most upvoted Reddit threads from the past twelve months. The ratio of upvote/(upvote + downvote) is selected as the dependent variable. It is a numeric value between 0 and 1 by nature, but it is manually categorized using the following stratifying conditions (a binning sketch follows the list):
- highly top rated if upvote_ratio >= 0.95
- very top rated if 0.90 <= upvote_ratio < 0.95
- somewhat top rated if 0.80 <= upvote_ratio < 0.90
- top rated if upvote_ratio < 0.80
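To make the binning concrete, here is a minimal sketch of how the ratio could be mapped to these labels, assuming the scraped threads live in a pandas DataFrame with a hypothetical upvote_ratio column (the notebook may do this differently):

```python
import pandas as pd

# Minimal sketch of the stratification above. The DataFrame and the
# `upvote_ratio` column name are assumptions, not the notebook's exact code.
def label_ratio(r: float) -> str:
    if r >= 0.95:
        return "highly top rated"
    elif r >= 0.90:
        return "very top rated"
    elif r >= 0.80:
        return "somewhat top rated"
    return "top rated"

df = pd.DataFrame({"upvote_ratio": [0.97, 0.93, 0.85, 0.60]})
df["rating"] = df["upvote_ratio"].apply(label_ratio)
print(df)
```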
The text preprocessing of thread titles includes stopword removal, punctuation removal, and conversion to lowercase. Some meaningless tokens, mainly single-letter first names, are also trimmed. The text is then tokenized with TfidfVectorizer() and transformed into a document-term matrix.
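A rough sketch of that vectorization step (the titles here are placeholders, not the scraped data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: bigram TF-IDF features from cleaned thread titles.
titles = [
    "lebron james triple double",
    "stephen curry game winner",
    "charles barkley on the playoffs",
]
vectorizer = TfidfVectorizer(ngram_range=(2, 2))  # bigram tokens
X = vectorizer.fit_transform(titles)              # sparse document-term matrix
print(X.shape)
print(vectorizer.get_feature_names_out())
```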
A glimpse of the data frame after cleaning is shown below:

Open in new tab to see in full size.
Wordcloud
The preprocessing of text data creates bigram tokens. Therefore, the wordcloud is created in the same manner.

Open in new tab to see in full size.
Clearly, the unigram tokens still dominate in frequency, while some bigrams are more interpretable. Due to the limited number of threads scraped using Reddit’s public API, there is nothing profound to conclude.
tree0: text representation
The most effortless decision tree visualization can be accomplished by applying sklearn.tree.export_text() to a DecisionTreeClassifier object. Since parameter tuning is not the major focus in this chapter, the parameters are chosen for no particular reason. For this one, the classifier has max_depth=10 and min_samples_leaf=2, uses the Gini index as the criterion, and the splitter is set to random. And of course, the model is trained on an 80-20 split of the raw data.
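A sketch of what this setup might look like, assuming X is the TF-IDF document-term matrix and y the rating labels from the earlier steps (the variable names and random_state are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# 80-20 split of the (assumed) document-term matrix X and labels y.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# tree0: Gini criterion, random splitter, max_depth=10, min_samples_leaf=2.
tree0 = DecisionTreeClassifier(
    criterion="gini",
    splitter="random",
    max_depth=10,
    min_samples_leaf=2,
    random_state=0,
)
tree0.fit(X_train, y_train)

# Text representation of the fitted tree.
print(export_text(tree0, feature_names=list(vectorizer.get_feature_names_out())))
```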
The text representation of this tree 0 is:

Open in new tab to see in full size.
tree1: split by Gini index
A more vivid visualization of the same tree is produced by sklearn.tree.plot_tree(). There is not much to tweak when drawing this one.
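The call is roughly like the sketch below (figure size and styling are guesses, not the notebook’s exact settings):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Sketch of the plot_tree visualization of tree0.
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    tree0,
    feature_names=list(vectorizer.get_feature_names_out()),
    class_names=list(tree0.classes_),
    filled=True,
    ax=ax,
)
plt.show()
```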

Open in new tab to see in full size.
This decision tree model has a tendency to grow “one-way.” What the model does is basically detect some keywords, then brutally slams the corresponding thread title into a category.

Open in new tab to see in full size.
The confusion matrix is generated from the held-out portion of test data. The accuracy on the test set is 13/20 = 0.65.
Both text and visualization versions of a confusion matrix are produced in the notebook. The confusion matrix indicates a high concentration in the very top rated category. All the samples in this category are correctly labeled. Still, as mentioned, the model also seems to have a tendency to misclassify other types as very top rated. An educated guess would be that the training data is generally populated with very top rated samples, which eventually leads to a biased model.
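For reference, a sketch of how both versions of the confusion matrix could be produced (ConfusionMatrixDisplay.from_predictions needs scikit-learn 1.0 or later):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Evaluate tree0 on the held-out 20 test threads.
y_pred = tree0.predict(X_test)
print(confusion_matrix(y_test, y_pred, labels=list(tree0.classes_)))
print("accuracy:", accuracy_score(y_test, y_pred))

# Plotted version of the same confusion matrix.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```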
tree2: split by entropy
As previously mentioned, the focus for the next decision tree model is still not tuning parameters for better performance. Nevertheless, while keeping max_depth=10, min_samples_leaf is set to 1, the splitting criterion is changed from the Gini index to entropy, and the splitter is set to best.
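Only the classifier settings change; a minimal sketch, reusing the earlier split:

```python
from sklearn.tree import DecisionTreeClassifier

# tree2: entropy criterion, best splitter, max_depth=10, min_samples_leaf=1.
tree2 = DecisionTreeClassifier(
    criterion="entropy",
    splitter="best",
    max_depth=10,
    min_samples_leaf=1,
    random_state=0,
).fit(X_train, y_train)
```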
As a result, the decision tree now looks like this:

Open in new tab to see in full size.
The confusion matrix, again, indicates some lousy performance.

Open in new tab to see in full size.
The accuracy on the test set is 12/20 = 0.60.
In fact, the only category that this decision tree predicts correctly is the very top rated one. Given that the training and testing data are stratified, this is really a below-average accuracy. A naïve substitute would be to count each category in the training data and blindly guess that all the testing data belong to the most frequent one; a sketch of that baseline follows.
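A sketch of that majority-class baseline, assuming the same y_train/y_test split as before:

```python
import numpy as np

# Naive baseline: always predict the most frequent class seen in training.
values, counts = np.unique(y_train, return_counts=True)
majority_class = values[counts.argmax()]
baseline_accuracy = np.mean(np.asarray(y_test) == majority_class)
print(majority_class, baseline_accuracy)
```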
tree3: redo the decision tree with unigram tokens
At this point, one might ask: what about tokenizing the text data as unigrams instead of bigrams? Maybe decreasing the specificity of the tokens would expose more hidden patterns within words. And since the model is not competing with others, it is tested for prediction accuracy on both the testing data and the whole set.
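The only change on the feature side is the n-gram range; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram variant of the earlier vectorizer; everything else stays the same.
unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_uni = unigram_vectorizer.fit_transform(titles)  # `titles` as in the earlier sketch
```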

Open in new tab to see in full size.
It looks a little different from the previous bigram + entropy tree. The unigram model generally seems to have a smaller entropy when reaching the maximum depth of nodes, but the advantage is not that obvious.

Open in new tab to see in full size.
The accuracy on the test data is 13/20 = 0.65 again.
The confusion matrix on the testing data finally shows some improvement in correctly labeling classes other than very top rated.

Open in new tab to see in full size.
The accuracy on the complete data set (training included) is (5+62+11+5)/100 = 0.83.
Finally, the confusion matrix on the complete data set shows some seemingly impressive accuracy on top rated. Besides, even though the model is only 50% correct on highly top rated and somewhat top rated, it at least misclassifies them into a similar class.
Comparison of 3 DT on text data
The three decision trees above do share some similarities. For example, if a document (the title of a Reddit post) has a TF-IDF weight for “Barkley” or “Charles Barkley” greater than roughly 0.2, it has a higher chance of being classified as a top-quality post. This might hint that quoting Charles Barkley in a post title is an indicator of popularity.
As for the effectiveness of these three trees, none of them clearly classifies all the data. Even at the deepest nodes, their impurity is still high: tree2 and tree3 share an entropy of around 0.80, while the Gini index of tree1 is at 0.5.
The accuracy calculated by a confusion matrix tells the same story.
- The accuracy of tree1 is 13/20 = 0.65
- The accuracy of tree2 is 12/20 = 0.60
- The accuracy of tree3 is 13/20 = 0.65
- The accuracy of tree3 on complete data is 83/100 = 0.83
The sample size is quite limited, so even the highest accuracy is not that ideal.
Thoughts
Honestly, all three decision trees above perform poorly. This might be caused by the size of the training text, especially for this type of partitioning in supervised learning. The features acquired in training might not be as representative as expected. A larger body of text data would definitely help.
A second thought: maybe the manual splitting of a numeric target is just too arbitrary. An imbalanced set of classes increases the classification difficulty.
Nevertheless, hopefully after some parameter tuning, the models can contribute more to predicting post quality. Still, the first prediction is framed here as a classification task; an alternative regression model should suffice with the raw data. Anyway, it is a test run at the end of the day.
Stephen Curry's three-point shooting prediction
The next part investigates some on-court offensive situations that a player could face. This is different from the study on shot selection last time. Hopefully, after some parameter tweaking, a decision tree could accurately predict whether a shot will go in.
Data

This model utilizes the play-by-play shot attempts data. A slice of Stephen Curry’s 3-point shots is the object of interest for the decision tree training.

Open in new tab to see in full size.
Spec tuning
The data is divided into 5 folds, and different combinations of parameters are tested for the best performance. The process closely follows the documentation of the R package {tidymodels}. A 4-level grid search is applied to find the optimal combination of parameters.

Open in new tab to see in full size.
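The actual tuning is done in R with {tidymodels} (see shot-dt.R); purely as a rough analogue of the idea, 5-fold cross-validation over a small hyperparameter grid could look like the sketch below in scikit-learn (the grid values and variable names are made up):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Rough analogue only -- not the {tidymodels} code used in this project.
param_grid = {
    "max_depth": [2, 5, 10, 15],           # 4 levels per parameter (values are guesses)
    "min_samples_leaf": [2, 5, 10, 20],
    "ccp_alpha": [0.0, 0.001, 0.01, 0.1],  # cost-complexity pruning, akin to rpart's cp
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring=["accuracy", "roc_auc"],  # the two metrics considered below
    refit="roc_auc",
)
# search.fit(X_shots, y_made)  # hypothetical features / made-or-missed labels
```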
The model training process takes over a minute. Two metrics for classification are considered: accuracy and roc_auc.

Open in new tab to see in full size.
The model selection step can be visualized as below:

Open in new tab to see in full size.
And this selection can be automated with commands too.

Open in new tab to see in full size.
In the following sections, the metrics accuracy and roc_auc are selected respectively as the tree-picking criteria.
tree1: selected by accuracy
Warning: I’ve been trying really hard to elegantly visualize an rpart object with the available R packages. Unfortunately, there are no good ones compatible with {ggplot2} yet, so I cannot really tweak the plot in my familiar syntax. Plus, the trees generated here are really complex, so the final output looks very messy and the resolution is not optimal.
Note that since the two metrics overlap in selecting the best model available, we decide to keep the optimal one for roc_auc and choose the fourth-best by accuracy.
The first tree is picked by selecting the fourth-best model ranked by accuracy on the training data.

Open in new tab to see in full size.
Its feature importance is ranked like this:

Open in new tab to see in full size.
It seems that the variables y (the coordinate indicating the vertical distance), margin (leading or trailing in score), and x (the coordinate indicating the horizontal distance) are the top three variables that determine whether Stephen’s shot scores or not.

Open in new tab to see in full size.
And finally, a test run is carried out on testing data, and a corresponding confusion matrix is generated.
The accuracy on the test data is (24+26)/(24+26+22+16) = 0.57.
tree2: selected by roc_auc
Similarly, for the second tree, the metric used here is roc_auc.

Open in new tab to see in full size.
The feature importance composition remains largely unchanged.

Open in new tab to see in full size.
The order stays the same, but the relative importance shifts slightly. Coincidentally, both decision trees have very similar performance on the testing data.

Open in new tab to see in full size.
The accuracy on the test data is (27+23)/(27+23+19+19) = 0.57.
Although the true positive and true negative allocations are different from the previous decision tree, the final accuracy on the test data is the same. The accuracy probably would not stay the same with a larger sample.
tree3: hack
The same hack is applied to the complete data set for the third tree, which means the decision tree is trained with all the available data. In reality, this is risky due to overfitting. Nonetheless, this is a test run. The third tree looks like this:

Open in new tab to see in full size.
This time the feature importance varies a lot. Both shot_type and shot_clock become more critical factors in making decisions.

Open in new tab to see in full size.
The performance, as expected, is better than the previous two. Apparently, this is the result of overfitting. Such a model could be extremely good-looking if it “remembers” everything it learned. That’s technically not a “prediction,” though.

Open in new tab to see in full size.
The in-sample accuracy is (28+36)/(28+36+6+18)=0.73
Comparison of 3 DT on mixed data
A quick and dirty radar plot would be perfect for feature importance comparison among the three decision tree models.
Tree1 and tree2 share a very similar feature composition, while tree3 is the odd one out.
The accuracy calculated by a confusion matrix tells the same story.
- The accuracy of tree1 is (24+26)/(24+16+22+26) = 0.568
- The accuracy of tree2 is (27+23)/(27+23+19+19) = 0.568
- The accuracy of tree3 is (28+36)/(28+36+6+18) = 0.727
Tree3 has the best accuracy on paper. However, this is due to the inclusion of all the available data. It is also strange to see that even though tree1 and tree2 have different parameters, they still look alike in many ways, such as the tree structure, feature importance, and final accuracy. They are definitely not identical, as the confusion matrices also suggest. Maybe adding more levels to the grid search would produce more combinations of model parameters, so that a better model would finally emerge.
Summary (of two predictions)
Frankly speaking, decision trees are useful and interpretable supervised learning models for some simple tasks, both classification and regression. However, nothing is truly omnipotent. The impaired performance in the two scenarios above definitely exposes the drawbacks of decision trees.
Decision trees are sensitive to small perturbations. The trees are interpretable but could vary dramatically. Furthermore, trees are overfitting-prone. In addition, they don’t handle out-of-sample predictions well.
For instance, if a new category of Reddit thread upvote ratio is introduced, or a new shot-attempt scenario is brought up, any decision tree model built above will just randomly guess the result. Something new to them is just something new to us: when we face the unknown, we take a guess.
But luckily, the big brother of decision trees, the random forest, might just be an easily reachable solution. It will be discussed in another part of the portfolio.
So, back to the topic, some other thoughts might not be closely affiliated with this part but are worth discussing:
- A predictive model for guessing if a shot can be made has an accuracy slightly higher than 50%. Is it good?
- Does that mean another model based on tossing a coin is basically the same as it?
- Maybe taking a player’s field goal percentage is a good idea when building a specific decision tree for him.
- As for the prediction of upvotes of a Reddit post, some non-textual data might be an excellent addition to the model, such as the general popularity of the subreddit, the media type of the thread, the posting time, etc.