Part 4: Exploring Data

Data is (not always) beautiful.

Rui Qiu

Updated on 2021-09-26.


There are so many aspects to cover when it comes to exploratory data analysis. Some might give the analyst an unseen new perspective, and some might need their prior knowledge before plotting.

Script: shots-viz.R

Although it is called "shots-viz", it contains all the coding used in this part.

Location-based shot attempts

Back to the very core of the model to build, it is vital to show shot attempts whether it is a success or failure.

coordinates
Shot attempts coordinates of all 30 teams.

To keep the plot visually pleasing, the area restricted by the coordinates is made close to a square. Some records are backcourt "yeets," which usually happen when a quarter runs out and the offensive team does not have enough time to organize an attack. In this case, the ball-controlling player will desperately throw the ball as hard as towards the rim to try his luck. Thus some of the data are discarded as a result.

Washington Wizards' shooting locations

As the data are split by teams, it is very handy to load and plot a team's performance in a season. Let's take Washington Wizards as an example.

was-shots
Shot attempts coordinates of Washington Wizards.

Time-based shot attempts

Another angle to look at the data as a whole is to check when those shots are taken. An NBA has four quarters, and each quarter has 12 minutes. If the score is a draw at the end of 4 quarters, another 5 minutes of overtime (OT) will be added to the game. The number of OTs can go on and on. Each time has no more than 24 seconds to run the attack as long as they still have possession of the ball. So, in general, the distribution should be uniform from the second 0 to 2880. But due to some tactical reasons, one team might hold the ball a little longer than or shorter (like run-and-gun), so that's where the teams differ.

shot times
Shot time distribution of all 30 teams.

Most frequent words in /r/nba

This is data from one day only. The plot reflects what the popular topics are about on that day.

reddit top words
Just data from one day.

With or without URLs

The label of players' tweets is whether a tweet contains a URL. The bar chart shows that it's almost a 50-50 situation. Plain and simple.

tweets urls
Basically a 50-50.

For the players' name and id, frankly speaking, there is nothing to visualize for now. It's just a reference.

Correlation of top features in a DTM

For the last plot, it's somewhat interesting to see the correlation between the frequent features of a document-term matrix. The more blue the dot is, the more positively correlated the two terms are, like "los" and "angeles."

correlation
Again, just data from one day.

In the end

To conclude, the plots drawn in this part of the portfolio are rather sketchy. Nonetheless, they still managed to keep some formality, as the plotting style. Are there any brilliant observations emerging from the data? Maybe not. Still, this is just a preliminary step before modeling.