The sources and the methods
Rui Qiu
Updated on 2021-09-16.
Check out the GitHub repository for the source code and the data.

This part includes two sections, one listing all (potential) data sources and one detailing how I retrieved the data.
Data sources
- Play-by-play data from PBP Stats Tracking, which merges hand-tracked data and tracking stats from NBA Stats.
- Tweets (text data) from @NBA's official NBA Players list.
- That list of NBA players itself is also a solid and reliable data source.
- Additionally, Basketball Reference also provides a list of NBA players' Twitter handles.
- Reddit posts from /r/nba with Reddit API.
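- FiveThirtyEight’s NBA open data, including RAPTOR and DRAYMOND.
- The Pudding’s NBA open data, including last-two-minute-report, hype, NBA player names with hyphens, and three-seconds.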
- Search term data from Google Trends.
- Supporting data (partially included at this moment, will be updated when needed) from
  - Basketball Reference;
  - BasketBall-GM-Rosters and potentially some simulation data generated from Basketball GM;
  - NBA.com;
  - NBA Team Abbreviations from Wikipedia;
  - NBA Team Color Codes from Team Color Codes;
  - 21-22 season’s team schedules for computing the travel distance of each team.
- More to come!
Gathering methods
Play-by-play data from PBP Stats
Data
Thanks to Darryl Blackport for maintaining one of the most complete and detailed NBA in-game data sets online. Although it requires a paid subscription on Patreon, he deserves every penny of it. The dedication he puts into this project is unparalleled. Here is a snapshot of what the data set looks like.

After applying a set of filters on his website, the data can be exported directly in CSV format.

Twitter data
Text data from players' tweets
The Twitter Developer API and Python’s tweepy library are used to scrape tweets from the accounts in @NBA’s player list.
The raw data looks like this:

The Twitter API does not allow searching by time, so tweet IDs are used instead to mark the time range and scrape the tweets backward manually. This ID trick is very inefficient. Worse, each request is limited to only 500 tweets, which makes it even harder to scrape backward from a Twitter list.
We might take a detour and get a shortlist of players first, then scrape their tweets individually. The data collected so far is not enough for the text mining that follows.
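The ID trick can be sketched as a loop that keeps lowering an upper bound on the tweet ID. This is a minimal illustration and not the script used for this project: `scrape_backward`, `fetch_page`, and the fake timeline are hypothetical names introduced here. With tweepy, `fetch_page` would presumably wrap a call such as `API.list_timeline(list_id=..., max_id=...)`, which is an assumption about how the real scraper is wired.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, object]  # minimal stand-in for a tweet record with an integer "id"


def scrape_backward(
    fetch_page: Callable[[Optional[int]], List[Tweet]],
    max_pages: int = 10,
) -> List[Tweet]:
    """Collect tweets newest-first by repeatedly lowering the max_id bound."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None  # None means "start from the newest tweet"
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Next request: only tweets strictly older than the oldest one seen so far.
        max_id = min(int(t["id"]) for t in page) - 1
    return collected


# Offline demonstration: a fake timeline with tweet IDs 12..1, five tweets per request.
FAKE_TIMELINE: List[Tweet] = [{"id": i, "text": f"tweet {i}"} for i in range(12, 0, -1)]


def fake_fetch(max_id: Optional[int], page_size: int = 5) -> List[Tweet]:
    """Hypothetical fetcher that mimics an inclusive max_id filter."""
    eligible = [t for t in FAKE_TIMELINE if max_id is None or int(t["id"]) <= max_id]
    return eligible[:page_size]


tweets = scrape_backward(fake_fetch)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])  # prints: 12 12 1
```

Subtracting one from the oldest ID avoids fetching the same tweet twice when the upper bound is inclusive. The `max_pages` cap keeps the loop from running indefinitely against a rate-limited API.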
Player accounts data
The player accounts data is acquired in the same manner.

Additionally, Basketball Reference’s list is included as a complement.

It might be a good idea to include some metadata such as follower counts, account creation time, etc.
Reddit data
Reddit’s API is more straightforward to use for content scraping, but its documentation is poorly structured. Moreover, only a limited amount of data can be acquired: there is no way to retrieve all the comments from a thread, and no workaround exists. If complete data is really necessary, it is strongly suggested to also consider data dumps such as Pushshift.
To show that using the Reddit API with R is viable, some irrelevant data is still scraped this way. The script retrieves the metadata of user rexarski.

Then reconsider what type of data should be prioritized in the data gathering process. For this project, it should be the metadata of posts. It is worth taking another shot at Reddit's open API, the one accessible even without registering a dev app. For instance, https://api.reddit.com/r/nba/top/?t=day should return a JSON of 25 posts on the front page of the subreddit r/nba.

Naturally, the data is saved as a raw JSON file, which is accessible as test-run.json in the repo.
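For illustration, here is a minimal Python sketch of the same idea: request the open endpoint, then pull post metadata out of the listing. It is not the script from the repo. The `User-Agent` string, the function names, and the sample values are assumptions made for this example; the sample only mimics the nested `data` / `children` / `data` shape of a Reddit listing.

```python
import json
import urllib.request
from typing import Dict, List

URL = "https://api.reddit.com/r/nba/top/?t=day"
FIELDS = ("id", "title", "author", "score", "num_comments", "created_utc")


def fetch_listing(url: str = URL) -> dict:
    """Download a listing as a dict. The User-Agent value is a made-up placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "nba-data-notes/0.1 (hypothetical)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_posts(listing: dict) -> List[Dict[str, object]]:
    """Keep only a few metadata fields for each post in the listing."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        posts.append({key: data.get(key) for key in FIELDS})
    return posts


# Offline sample shaped like a listing response (all values are invented).
SAMPLE = {
    "kind": "Listing",
    "data": {
        "children": [
            {
                "kind": "t3",
                "data": {
                    "id": "abc123",
                    "title": "Example post",
                    "author": "someone",
                    "score": 4200,
                    "num_comments": 310,
                    "created_utc": 1631750400.0,
                    "selftext": "ignored by extract_posts",
                },
            }
        ]
    },
}

posts = extract_posts(SAMPLE)
print(posts[0]["title"], posts[0]["score"])

# With network access, the raw response could be saved first, for example:
#   listing = fetch_listing()
#   with open("test-run.json", "w", encoding="utf-8") as handle:
#       json.dump(listing, handle)
```

Keeping the raw JSON on disk and extracting fields in a separate step means the selection of metadata can change later without requesting the data again.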
But this approach is not satisfying enough; it should be more painless. After browsing some community posts, I came across the R package RedditExtractoR. With it, saving the top 10 posts in the r/nba subreddit to an R list becomes much easier. Since an R list is a single R object, it is well suited to being stored in an .RDS file. Each run of the script above keeps a daily copy of such data. The general structure of that list is shown below:
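The daily-copy idea does not depend on R. As a rough Python analogue (swapping the .RDS file for a date-stamped JSON file, which is a substitution made only for this sketch), the snippet below keeps the top posts by score and writes one file per day. The function names, the `nba-top-` file prefix, and the sample posts are all hypothetical.

```python
import json
import tempfile
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Union

Post = Dict[str, object]


def top_posts(posts: List[Post], n: int = 10) -> List[Post]:
    """Return the n highest-scoring posts, best first."""
    return sorted(posts, key=lambda p: p["score"], reverse=True)[:n]


def save_daily_snapshot(
    posts: List[Post], directory: Union[str, Path], day: Optional[date] = None
) -> Path:
    """Write posts to <directory>/nba-top-YYYY-MM-DD.json and return the path."""
    day = day or datetime.now(timezone.utc).date()  # default to today's date in UTC
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"nba-top-{day.isoformat()}.json"
    path.write_text(json.dumps(posts, indent=2), encoding="utf-8")
    return path


# Offline demonstration with invented posts and a temporary directory.
SAMPLE_POSTS: List[Post] = [
    {"title": "A", "score": 10},
    {"title": "B", "score": 30},
    {"title": "C", "score": 20},
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_daily_snapshot(top_posts(SAMPLE_POSTS, n=2), tmp, day=date(2021, 9, 16))
    restored = json.loads(saved.read_text(encoding="utf-8"))

print(saved.name, [p["title"] for p in restored])
```

Putting the date in the file name means a second run on the same day overwrites that day's copy instead of creating a duplicate.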

Moreover, it is good practice to keep such a time-sensitive data collection routine as automated as possible, so a GitHub Action is deployed to do the job. Basically, a scheduled cron job inside the repository runs a specified script in a specified environment. Thanks to the Git Auto Commit action developed by Stefan Zweifel, the cron job can commit and push the scraped data to the repo automatically. To summarize, the script runs once a day (at, say, 21:02 UTC), gets the data, and saves it.
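A scheduled workflow of this kind might look roughly like the following configuration fragment. It is a sketch under assumptions, not the actual setting file of this project: the workflow name, the script name `scrape.R`, the package installation step, and the pinned action versions are all placeholders chosen for illustration.

```yaml
name: daily-scrape            # hypothetical workflow name

on:
  schedule:
    - cron: "2 21 * * *"      # minute 2, hour 21: every day at 21:02 UTC

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Install the scraping package
        run: Rscript -e 'install.packages("RedditExtractoR")'
      - name: Run the scraping script
        run: Rscript scrape.R  # hypothetical script name
      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "Add daily r/nba snapshot"
```

GitHub evaluates the `cron` expression in UTC, which is why the schedule is stated in UTC in the first place.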

The Action setting file is located here.
We can even check the workflow status by looking at an SVG badge that GitHub provides.
FiveThirtyEight data
Starting from here, the current location of the data will be listed, and a brief introduction of what they are about will follow. All data are copied from their original repositories directly.
RAPTOR
RAPTOR stands for Robust Algorithm (using) Player Tracking (and) On/Off Ratings. It is FiveThirtyEight’s original NBA statistic.1
DRAYMOND
DRAYMOND stands for Defensive Rating Accounting for Yielding Minimal Openness by Nearest Defender.2 (538 is really good at coming up with acronyms.)
More
The Pudding data
It is copied from the original repositories as well.
last-two-minute-report
Data of NBA’s last two minute officiating reports.3
hype
Data about the careers of 1,873 players.4
NBA player names with hyphens
The player names data is provided in a JSON, downloaded by wget.5

three-seconds
Data on every defensive three seconds call in the NBA from 2015 to 2018 (including playoffs).6
Google Trends search term data
Data.
The data is extracted from the Google Trends page. After selecting some terms of interest, we click the download button in the top right corner to export the data as CSV. Note that the numbers show relative search interest on the Internet; they are not absolute values.
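To make "relative" concrete, the toy example below rescales a series so that its peak becomes 100, which is how Google Trends presents interest over time. It is a simplified model: the function name and the input numbers are invented, and the real service works from shares of searches, not the raw counts used here.

```python
from typing import List, Sequence


def to_relative_interest(values: Sequence[float]) -> List[int]:
    """Rescale a series so that its maximum maps to 100 (rounded half up)."""
    peak = max(values)
    if peak <= 0:
        return [0 for _ in values]
    return [int(100 * v / peak + 0.5) for v in values]


# Two invented series with very different volumes end up looking identical.
small_term = [10, 20, 40]
large_term = [1000, 2000, 4000]

print(to_relative_interest(small_term))  # prints: [25, 50, 100]
print(to_relative_interest(large_term))  # prints: [25, 50, 100]
```

Because the scale is set by each chart's own peak, values downloaded in separate exports are not directly comparable unless the terms were queried together.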

Supporting data
- Basketball-GM’s 2021-22 NBA roster.
- Team information including names, short codes, and colors.

There will surely be a lot of so-called “labor of love” in collecting and tidying this supporting data in the future. Nevertheless, it is still a lot of fun.
1. Introducing RAPTOR, Our New Metric For The Modern NBA. (fivethirtyeight.com)
2. A Better Way To Evaluate NBA Defense. (fivethirtyeight.com)
3. NBA Last Two Minute Report. (pudding.cool)
4. How many high school stars make it to the NBA? (pudding.cool)
6. The NBA Has a Defensive Three Seconds Problem. (pudding.cool)