Part 7 Extra: Random Forest

The underperformance of a random forest

Rui Qiu

Updated on 2021-11-05.


In this part, the data is inherited from Part 7 Decision Tree, where Stephen Curry’s three-point shooting data was used to build a predictive model. The goal is to rebuild the model with a random forest, expecting better overall model performance.

Spec tuning

Just like the tuning process for the decision tree models, a grid search is applied over a set of parameter combinations.

rf-01

Specifically, the parameter mtry() is set from 3 to 6 because there are only 10 predictors in the data set. The usual expectation is that good mtry() values fall somewhere in the middle between 1 and the number of predictors, avoiding values close to either end.
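As a rough sketch (not necessarily the exact code behind the screenshot above), the spec and grid might look like the following with {tidymodels} and the {ranger} engine; the object names are hypothetical.

```r
library(tidymodels)

# Random forest spec with tunable hyperparameters, using the {ranger} engine.
rf_spec <- rand_forest(
  mtry  = tune(),  # number of predictors sampled at each split
  trees = 1000,    # a fixed, reasonably large number of trees
  min_n = tune()   # minimum node size required to attempt a split
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Keep mtry between 3 and 6, since the data set only has 10 predictors.
rf_grid <- grid_regular(
  mtry(range = c(3L, 6L)),
  min_n(range = c(2L, 10L)),
  levels = 4
)
```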

Afterwards, show_best("roc_auc") lists the top 5 models ranked by their roc_auc.

Of course, the models are trained on training data only.
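A minimal sketch of the tuning step itself, assuming the training split is called curry_train, the recipe curry_recipe, and the response column result (all hypothetical names carried over from Part 7):

```r
# Cross-validation folds built from the training data only.
curry_folds <- vfold_cv(curry_train, v = 5, strata = result)

# Bundle preprocessing and model into one workflow.
rf_wf <- workflow() %>%
  add_recipe(curry_recipe) %>%
  add_model(rf_spec)

# Evaluate every parameter combination in the grid on the resamples.
rf_res <- tune_grid(
  rf_wf,
  resamples = curry_folds,
  grid      = rf_grid,
  metrics   = metric_set(roc_auc, accuracy)
)

# Top 5 candidates ranked by ROC AUC.
show_best(rf_res, metric = "roc_auc", n = 5)
```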

rf-02

Evaluation

The random forest (workflow) with the best roc_auc is chosen and fitted on the training data, then used to generate predictions on the testing data.

last_fit() will fit our workflow to the training data and generate predictions on the testing data.
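Concretely, the finalize-and-fit step might look like this; curry_split is a hypothetical name for the initial train/test split from Part 7:

```r
# Lock the best hyperparameters (by ROC AUC) into the workflow.
best_rf  <- select_best(rf_res, metric = "roc_auc")
final_wf <- finalize_workflow(rf_wf, best_rf)

# last_fit() trains on the training portion of the split and
# predicts on the held-out testing portion in one call.
rf_last  <- last_fit(final_wf, curry_split)

collect_metrics(rf_last)                  # test-set accuracy and roc_auc
rf_preds <- collect_predictions(rf_last)  # per-shot test-set predictions
```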

rf-03

The confusion matrix, however, shows a concerning result again. The random forest barely improves on the decision tree’s performance. It’s almost a 50-50 situation once more.

           Truth
Prediction  FALSE  TRUE
     FALSE     24    21
     TRUE      22    21

The out-of-sample (test set) accuracy of this random forest model is (24+21)/(24+21+21+22) ≈ 0.51.
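Both numbers can be reproduced with {yardstick} from the predictions collected above (rf_preds and the truth column result are the hypothetical names used earlier):

```r
# Confusion matrix on the test-set predictions.
conf_mat(rf_preds, truth = result, estimate = .pred_class)

# Accuracy: (24 + 21) / 88, roughly 0.51.
accuracy(rf_preds, truth = result, estimate = .pred_class)
```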

It's quite shocking that the accuracy is actually lower than that of the previous three decision trees. ;)

Visualization

Visualizing a random forest is always a challenge, especially in R. To start, there’s no built-in function to plot a {ranger} or {randomForest} tree. Moreover, a random forest is made up of hundreds of decision trees: plotting every one of them is definitely not a sensible option, while plotting a few sample trees does not actually provide any insight either.

Still, plotting the feature importance of a random forest is reasonable.

rf-fi
Feature importance of the random forest.
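A plot like this can be produced with the {vip} package, assuming the ranger engine was fitted with an importance measure (as in the spec sketch earlier) and a reasonably recent {tune} that provides extract_workflow():

```r
library(vip)

# Pull the fitted parsnip model out of the last_fit() result and
# plot its impurity-based variable importance.
rf_last %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```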

Another workaround is to use {rfviz} to generate interactive plots.

rfviz
Interactive plots generated with {rfviz}.
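A rough sketch of that workaround, based on my reading of the {rfviz} docs: rf_prep() fits the forest and computes local importance scores, and rf_viz() opens the linked interactive views. The data objects here are hypothetical.

```r
library(rfviz)

# Fit a random forest and prepare local importance scores:
# x holds the predictor columns, y the response factor.
rf_prepared <- rf_prep(x = curry_train[, predictor_cols],
                       y = curry_train$result)

# Launch the linked, brushable input / importance plots.
rf_viz(rf_prepared, input = TRUE, imp = TRUE, cmd = TRUE)
```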

Thoughts

In the end, we still end up with an underperforming random forest model. It feels like we have come full circle and achieved nothing. Stephen Curry’s career three-point percentage is 0.432. In other words, technically, if a model constantly guesses the response to be FALSE, it will attain an overall accuracy of 1 − 0.432 = 0.568, which is close to that of the optimal decision tree from Part 7.

A random forest is suitable for most cases where a large dataset is available, and interpretability is not a major concern.

In this post on StackExchange, Sycorax lists three cases where random forests could underperform. Two of them are worth emphasizing here.

  1. Sparsity. This is not an issue for the random forest above, but it is a concern for the Reddit upvote ratio prediction.
  2. Random forests basically only work well on tabular data whose structure can be captured by axis-aligned splits. The variables x and y here are coordinates on the court, and the shot outcome as a function of location is not well approximated by rectangular partitions.

Finally, maybe a logistic regression is a better fit for this three-pointer prediction task.