Part 6: ARM & Networks

Typical shot selection: From Zion’s floater to Dame’s logo shot

Rui Qiu

Updated on 2021-10-19.

It’s a pity that the shot attempt data from chapter 3 hasn’t been utilized since then. Luckily, this won’t be the case anymore. This chapter aims to perform associate rule mining (ARM) on a portion of NBA shot data. An explanation will be given later on why only partial data is used here.

The ARM will identify a list of controlled frequent itemsets, thus leading to a list of controlled frequent rules. They are controlled because only rules contain the information of if this specific shot is made matter. Otherwise, rules such as {shottype=Arc3=>value=3} will be valid for sure but not informative.

Although the shot data won’t provide any highlights of a player’s signature move, it will hopefully give some insights into his favorite shot selection. To some extent, this also draws the boundary of a player’s hot zone.

Setup

Data: pbpstats-tracking-shots-cleaned.csv

Script: arm-shots.R

Intermediate transaction rules data:

shot-rules.RDS

06-arm-shot-rules.csv

Intermediate frequent itemsets and rules:

06-arm-frequent-items.csv

06-arm-top-supp-rules.csv

06-arm-top-conf-rules.csv

06-arm-top-lift-rules.csv

Output JSON: network-data.json

The script above fulfills all the tasks in this chapter, including both ARM and network visualization.

The packages used in the script are:

{tidyverse} for standard data analytics pipelines
{arules} for associate rule mining
{arulesViz}, {networkD3} and {visNetwork} for network visualization

Data cleaning

The data cleaning procedure starts by loading a previously cleaned, tidy data set. Since the ARM has a specific requirement for categorical variables as the input, variables will be factorized accordingly.

In short, we are doing the following three types of data mutation:

Transform numeric data to factors
Transform logical data to factors with details
Transform NA to data as well.

Unsurprisingly, since the data preserved lots of NAs in some variables for the integrity of the data, it is relatively reasonable to discard them in this task. Are they providing some information? Yes. But are they providing enough information to be included in the rule? Not really. A trial run with NAs gives out many rules with uninterpretable items. Therefore, the following variables are considered to be the participants of rules:

player: the name of the shot taker.
period: the time period of a game that the shot takes place.
shottype: the specific type of the shot, e.g., Arc3, Corner3, etc.
value: either 2 pts or 3 pts.
and1: whether the shot is fouled, and a free throw is given.
made: whether the shot is made and corresponding pts are rewarded.

Associate rule mining

After that, the data frame will again be processed by the generic R function as(), which coerces an object to another class, to transaction data. It can be examined by inspect().

The transactions data is a single R object, therefore stored as an RDS file in the repository.

In addition, another copy of CSV is stored in the repository. And it can be viewed as an embedded Airtable spreadsheet. Due to this web design’s effective width limitation, screenshots won’t be an appropriate option to display the transaction dataset.

Apriori algorithm is an easy-to-implement and interpretable algorithm frequently used in ARM. It prunes all the combinations of itemsets by setting a few parameters.

The parameters used for the generation of these rules are:

support = 0.0001
confidence = 0.5
minlen = 6
And the RHS of the rule must be either made=is_made or made=not_made

These parameters generate hundreds of meaningful rules, which are listed in that embedded spreadsheet.

As mentioned above, although the complete set of rules should be more than what is presented here, many of them are meaningless, considering the goal of analyzing this dataset is to outline players’ typical shot selection. Therefore, the parameters are set to be:

The frequent itemsets and top rules listed below are stored as intermediate CSV files, also accessible in the repository. At the same, they are uploaded to an Airtable spreadsheet as well. You can observe different sets of data by switching the tabs.

Note: you can click on the burger menu button on top-left corner to switch tables.

Top itemsets

First, let’s take a look at some frequent items in all transactions.

Apparently, many star players’ signature shots are successfully retrieved. For instance:

Stephen Curry’s tons of 3-pointer from the arc, both made and missed.

Damian Lillard’s iconic logo shot.

Zion Williamson’s unguardable attack-the-rim (and possibly with and-1)

Top rules

Then, the rules are sorted by three metrics separately. For each metric, the top 15 results will be listed. You can switch the tabs of the spreadsheet above to view the datasets.

Rank by support. This generally copies the results of the most frequent itemsets since support indicates how frequently the items appear in the dataset.

Rank by confidence. This shows how often the rule has been found to be true. Compared to the top 15 rules by support, there are some differences.

Rank by lift. The lift greater than 1 indicates that the two events (LHS and RHS) are dependent on one another. It shows that some players’ shot selections are frequent, but they are their major attacking approaches. And again, more signature shot selections are identified. All of the 15 typical shot selection combinations are 2-pointers. The top three are Zion and Giannis’ shots with and-1. How unstoppable!

More rules

Before jumping into the network visualization, let’s take a look at the shot rules dataset again. Suppose the LHS is filtered to be a particular player. In that case, a list of rules involving this player can be returned easily.

Again, the length of the output is controlled by the parameter. For example, by changing the minlen to 2:

Conclusion

To conclude, ARM is a basic rule-based machine learning method used to identify interesting relations between variables in a rather large dataset. The threshold used to define what a “strong” rule is, however, can be tuned by manual selection. Based on this method, a list of meaningful rules about NBA players and their shot selections are generated.

Network Viz

The network visualization is a perfect choice to demonstrate the interconnection among different rules and rules’ items. Too many rules can, of course, be visualized within one html widget, but it won’t be easy to be interact with. For the following three visualization packages, each only visualizes a partition of the rules data for the apparent aesthetic reason.

{arulesViz}

Two versions of networks are generated by the {arulesViz} package. They are essentially the same.

network-igraph — Rules network generated by
{arulesViz}
.
Open in new tab to see in full size.

Click HERE to open in a new tab.

Although the rules are ranked by support, the visual cues focus on the other two metrics. The associate rules are marked as red circles in the network. The darker the circle, the higher confidence it indicates. The larger, the higher lift it has. The ones in the corner, for example, are the rules of Zion Williamson and Giannis Antetokounmpos’ shot at the rim. There are also some nodes with high degrees, which means they are pretty common in many rules. The nodes with made=is_made and made=not_made are automatically two with high traffic since any rule connects with either of them. Some other nodes, which are the variables used to describe a shot, are also connected by a large number of edges.

Data preparation

With the rule dataset and some basic knowledge of graph theory, we can transform the dataset into something more D3-friendly. Roughly speaking, the new datasets should contain nodes and edges, including some miscellaneous aspects like the group and the color of those elements.

Unlike the sub-dataset used by {arulesViz}, the one visualized by the following two packages uses the complete ruleset. Then we extract two subsets from the data,

one containing top 50 made shots scenario in the 4th quarter ranked by counts,
one containing only Zion Williamson and Giannis Antetokounmpos’ shot attempts.

The detailed data pipeline scripts are included in the R script under “Network Viz - data preparation.”

{networkD3}

Since {networkD3} utilizes D3.js, the index of records in nodes and edges needs to subtract 1.

Click HERE to open in a new tab.

The network visualized above shows some common shot selections in the 4th quarter of an NBA game, a.k.a, “the clutch time.” The orange nodes are the associate rules, while the blue ones are shot variables (including the player). The most common two variables are automatically made=is_made and period=4, since all rules are apparently about these two. Then there are shottype=AtRim and shottype=ShortMidRange. This might be explained by many tactics built on “safe plays.” Another cause might be filtering by counts, such that many risky but successful 3-pointer made are excluded due to small quantity.

To give some specific examples, the network features more plays like this:

And fewer like this:

{VisNetwork}

Click HERE to open in a new tab.

Even if they are not Giannis’ top choices, Arc3,ShortMidrange, and LongMidrange are in his arsenal. On the contrary, Zion is insanely efficient with AtRim (majorly because of his floaters and dunks).

The two clips below are examples of their typical shot attempts.

Zion's dunk:

Giannis' midrange, though relatively rare in previous seasons:

Sankey diagram

Last but not least, it just takes a one-liner to generate a Sankey diagram of a subset of rules by networkD3::sankeyNetwork().

Click HERE to open in a new tab.

Although the diagram presents the same dataset as {VisNetwork} did, it actually unwraps in a more tidy manner. One can clearly tell the quantitative difference in each node by grouping the same variable with the same color.

Summary

In summary, the data scraped and cleaned from the last few weeks are finally put into action. By using ARM, one can quickly unearth the typical shot selection patterns among NBA players. With some proper data manipulation, even more, information is revealed with the assistance of network visualization.

For the development of modern basketball, splitting the data season by season will clearly highlight the changes in shot selections of a player, a team, or the whole league, which is an indicator of tactical changes. More importantly, ARM and networks can be utilized when dealing with tons of data. Imagine a scenario that an analyst needs to study the shot preferences of an opponent team. Therefore, they can extract the frequent rules in a flash instead of watching hundreds of hours of videotapes.

Admittedly, the dataset used in this chapter is barely a fraction of the entire shot scenario, as lots of other descriptive variables are removed for various reasons. If more data is provided, it’s definitely possible to build a more captivating list of basketball “rules” so that both analysts and ordinary audiences can read the game from another angle. Furthermore, a predictive model based on associate rule mining is practical to predict how threatening a shot could be. Thus, it could serve as one piece of the xT model we initially aim to build from the beginning.