Part 2: Data Gathering

The sources and the methods

Rui Qiu

Updated on 2021-09-16.


Check out the GitHub repository for the source code and the data.
Flawed Data
xkcd 2494

This part includes two sections, one listing all (potential) data sources and one detailing how I retrieved the data.

Data sources

Gathering methods

Play-by-play data from PBP Stats

Data

Thanks to Darryl Blackport for putting together one of the most complete and detailed NBA in-game data sets online. Although it is a paid subscription on Patreon, he deserves every penny of it. The dedication he puts into this project is unparalleled. Here is a snapshot of what the data set looks like.

PBP Stats
Each shot was hand-tracked, with a video link attached.

After applying a set of filter conditions on his website, the data can be exported directly in CSV format.

PBP Stats Raw
PBP raw data.
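
Once exported, the CSV can be inspected with pandas. A minimal sketch, assuming a hypothetical file name for the export:

```python
import pandas as pd

# Load the CSV exported from pbpstats.com; the file name is a placeholder
shots = pd.read_csv("pbpstats-export.csv")

print(shots.shape)   # number of rows (shots) and columns
print(shots.head())  # peek at the first few tracked shots
```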

Twitter data

Text data from players' tweets

Data

Python script

The Twitter developer API and Python’s tweepy library are used to scrape the tweets that appear in @NBA’s player list.
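
A rough sketch of that setup with tweepy is shown below; the credentials are placeholders, and the list slug is an assumption since the actual list name may differ:

```python
import tweepy

# Placeholder credentials from the Twitter developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull recent tweets from @NBA's public player list
# (the slug "nbaplayers" is an assumption)
tweets = api.list_timeline(
    owner_screen_name="NBA",
    slug="nbaplayers",
    count=200,
    include_rts=False,
)
for tweet in tweets:
    print(tweet.id, tweet.created_at, tweet.text)
```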

The raw data looks like this:

Player Tweets
Tweets raw data.

Since the Twitter API does not allow searching by time, tweet ids are used to mark the time range, and the tweets are scraped backward manually. This id trick is very inefficient, and what’s worse, each request is limited to 500 tweets, which makes scraping backward from a Twitter list even harder.
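
Concretely, the id trick amounts to repeatedly passing the oldest id seen so far as max_id. A hedged sketch, reusing the api object and the hypothetical list slug from above:

```python
# Page backward through the list timeline via max_id: each request
# returns tweets with ids at or below max_id, so we step past the
# oldest tweet seen to avoid refetching it.
all_tweets = []
max_id = None
for _ in range(10):  # cap the number of requests for illustration
    batch = api.list_timeline(
        owner_screen_name="NBA",
        slug="nbaplayers",
        count=200,
        max_id=max_id,
    )
    if not batch:
        break
    all_tweets.extend(batch)
    max_id = batch[-1].id - 1
```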

We might take a detour and get a shortlist of players first, then scrape their tweets individually. The data at this point is not enough for the text mining that follows.

Player accounts data

Scraped data, data from BR

Python script

The player accounts data is acquired in the same manner.

Account Data
Player account data.

Additionally, Basketball Reference’s list is included as a complement.

Account Data BR
Basketball Reference's player account data.

It might be a good idea to also include some metadata such as follower counts, account creation time, etc.
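
Those fields are already exposed on tweepy’s User object, so adding them would be cheap. A minimal sketch, with an example handle:

```python
# Look up account-level metadata for one player handle
user = api.get_user(screen_name="StephenCurry30")

print(user.followers_count)  # follower count
print(user.created_at)       # account creation time
print(user.statuses_count)   # lifetime tweet count
```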

Reddit data

RDS data, JSON data

R script 1, R script 2 (GitHub Action)

For Reddit content scraping, the API is more straightforward, but the documentation is poorly structured. Moreover, only a limited amount of data can be acquired: specifically, there is no way to retrieve all the comments from a thread, and no way to work around this restriction. If complete coverage is really necessary, it is strongly suggested to also consider data dumps like Pushshift.
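
For instance, Pushshift exposes a plain HTTP search endpoint. A sketch of pulling comments from one thread with requests; the link id is a placeholder and the parameters follow Pushshift’s public documentation:

```python
import requests

# Query Pushshift's comment search endpoint for a single submission;
# "abc123" stands in for the base-36 id of the thread.
resp = requests.get(
    "https://api.pushshift.io/reddit/search/comment/",
    params={"link_id": "abc123", "size": 100},
)
resp.raise_for_status()

for comment in resp.json()["data"]:
    print(comment["author"], comment["body"][:80])
```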

To show that using the Reddit API with R is viable, some tangential data is still scraped this way: the script retrieves the metadata of the user rexarski.

Metadata of Reddit user rexarski.

The next step is to reconsider what type of data should be prioritized in the gathering process; in this case, it is the metadata of posts. It is worth taking another shot at Reddit’s open API, the one accessible even without registering a dev app. For instance, https://api.reddit.com/r/nba/top/?t=day should return a JSON of the 25 posts on the front page of the subreddit r/nba.

Direct API access to Reddit.

Naturally, the data is saved as a raw JSON file, which is accessible as test-run.json in the repo.
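
A minimal sketch of that request with Python’s requests, saving the raw JSON in the same spirit as test-run.json:

```python
import json
import requests

# Reddit expects a descriptive User-Agent even on the open endpoints
headers = {"User-Agent": "nba-data-gathering script (by u/rexarski)"}
resp = requests.get("https://api.reddit.com/r/nba/top/?t=day", headers=headers)
resp.raise_for_status()

listing = resp.json()
# Each post sits under a "children" entry; the metadata of interest
# lives in the child's "data" field.
for child in listing["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])

# Keep a raw copy of the listing
with open("test-run.json", "w") as f:
    json.dump(listing, f)
```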

But this approach is still not painless enough. After browsing some community posts, an R package called RedditExtractoR emerged. With it, it becomes much easier to save the top 10 posts in the r/nba subreddit in an R list. Since an R list is a single R object, it is a perfect fit for an .RDS file, and one iteration of the script above keeps a daily copy of such data. The general structure of that list is shown below:

Reddit thread data structure.

Moreover, it is good practice to keep such a time-sensitive data collection routine as automated as possible, so a GitHub Action is deployed to do the job. Basically, a scheduled cron job inside the repository runs a specified script in a specified environment, and thanks to the Git Auto Commit action developed by Stefan Zweifel, the cron job can commit and push the scraped data to the repo automatically. To summarize, the script runs once a day (at, say, 21:02 UTC), fetches the data, and saves it.

GitHub Actions of the Reddit scraper.

The Action setting file is located here.
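
That linked file is the source of truth; as a rough sketch (the script path, schedule, and commit message below are illustrative), such a workflow might look like:

```yaml
name: reddit-scraper

on:
  schedule:
    - cron: "2 21 * * *"  # once a day at 21:02 UTC

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1
      - name: Install dependencies
        run: Rscript -e 'install.packages("RedditExtractoR")'
      - name: Run the scraper
        run: Rscript scripts/reddit-scraper.R  # script path is illustrative
      - name: Commit and push the scraped data
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "Update Reddit data"
```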

We can even check the workflow status by looking at an SVG badge that GitHub provides.
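
The badge follows GitHub’s standard URL pattern; the user, repo, and workflow file below are placeholders:

```markdown
![reddit-scraper](https://github.com/<user>/<repo>/actions/workflows/<workflow-file>.yml/badge.svg)
```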

FiveThirtyEight data

From here on, the current location of each data set is listed, followed by a brief introduction of what it is about. All data are copied directly from their original repositories.

RAPTOR

Data

RAPTOR stands for Robust Algorithm (using) Player Tracking (and) On/Off Ratings. It is FiveThirtyEight’s original NBA statistic.1

DRAYMOND

Data

DRAYMOND stands for Defensive Rating Accounting for Yielding Minimal Openness by Nearest Defender.2 (538 is really good at coming up with acronyms.)

More

The Pudding data

These data sets are copied from their original repositories as well.

last-two-minute-report

Data

Data of NBA’s last two minute officiating reports.3

hype

Data and script

Data about the careers of 1,873 players.4

NBA player names with hyphens

Data

The player names data5 is provided as a JSON file, downloaded via wget.

wget JSON
Use bash command to retrieve data.
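
Generically, the command boils down to a one-liner; the actual URL is in the original repo, and <JSON_URL> stands in for it here:

```bash
# Download the player names JSON to a local file
wget -O player-names.json <JSON_URL>
```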

three-seconds

Data

Data of every defensive three seconds call in the NBA between 2015 and 2018 (including playoffs).6

Supporting data

Collecting and tidying those supporting data will surely involve a lot of the so-called “labor of love” in the future. Nevertheless, it is still a lot of fun.


1. Introducing RAPTOR, Our New Metric For The Modern NBA. (fivethirtyeight.com)

2. A Better Way To Evaluate NBA Defense. (fivethirtyeight.com)

3. NBA Last Two Minute Report. (pudding.cool)

4. How many high school stars make it to the NBA? (pudding.cool)

5. Spell Jam. (pudding.cool)

6. The NBA Has a Defensive Three Seconds Problem. (pudding.cool)