Part 2: Data Gathering

The sources and the methods

Rui Qiu

Updated on 2021-09-16.


Check out the GitHub repository for the source code and the data.
Flawed Data
xkcd 2494

This part includes two sections, one listing all (potential) data sources and one detailing how I retrieved the data.

Data sources

Gathering methods

Play-by-play data from PBP Stats

Data

Thanks to Darryl Blackport for putting together one of the most complete and detailed NBA in-game data sets online. Although it is a paid subscription on Patreon, he deserves every penny of it. The dedication he puts into this project is unparalleled. Here is a snapshot of what the data set looks like.

PBP Stats
Each shot was hand-tracked, with a video link attached.

After applying a set of filter conditions on his website, the data can be exported directly in CSV format.

PBP Stats Raw
PBP raw data.
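
Once exported, the CSV can be inspected with pandas. A minimal sketch, assuming a hypothetical file name for the export:

```python
import pandas as pd

# Load the CSV exported from pbpstats.com; the file name is a placeholder
shots = pd.read_csv("pbpstats-export.csv")

print(shots.shape)   # number of rows (shots) and columns
print(shots.head())  # peek at the first few tracked shots
```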

Twitter data

Text data from players' tweets

Data

Python script

The Twitter developer API and Python’s tweepy library are used to scrape the tweets that appear in @NBA’s player list.
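
A rough sketch of that setup with tweepy is shown below; the credentials are placeholders, and the list slug is an assumption since the actual list name may differ:

```python
import tweepy

# Placeholder credentials from the Twitter developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull recent tweets from @NBA's public player list
# (the slug "nbaplayers" is an assumption)
tweets = api.list_timeline(
    owner_screen_name="NBA",
    slug="nbaplayers",
    count=200,
    include_rts=False,
)
for tweet in tweets:
    print(tweet.id, tweet.created_at, tweet.text)
```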

The raw data looks like this:

Player Tweets
Tweets raw data.

Since the Twitter API does not allow searching by time, tweet ids are used to mark the time range, and the tweets are scraped backward manually. This id trick is very inefficient, and what’s worse, each request is limited to 500 tweets, which makes scraping backward from a Twitter list even harder.
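
Concretely, the id trick amounts to repeatedly passing the oldest id seen so far as max_id. A hedged sketch, reusing the api object and the hypothetical list slug from above:

```python
# Page backward through the list timeline via max_id: each request
# returns tweets with ids at or below max_id, so we step past the
# oldest tweet seen to avoid refetching it.
all_tweets = []
max_id = None
for _ in range(10):  # cap the number of requests for illustration
    batch = api.list_timeline(
        owner_screen_name="NBA",
        slug="nbaplayers",
        count=200,
        max_id=max_id,
    )
    if not batch:
        break
    all_tweets.extend(batch)
    max_id = batch[-1].id - 1
```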

We might take a detour and get a shortlist of players first, then scrape their tweets individually. The data at this point is not enough for the text mining that follows.

Player accounts data

Scraped data, data from BR

Python script

The player accounts data is acquired in the same manner.

Account Data
Player account data.

Additionally, Basketball Reference’s list is included as a complement.

Account Data BR
Basketball Reference's player account data.

It might be a good idea to also include some metadata such as follower counts, account creation time, etc.
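
Those fields are already exposed on tweepy’s User object, so adding them would be cheap. A minimal sketch, with an example handle:

```python
# Look up account-level metadata for one player handle
user = api.get_user(screen_name="StephenCurry30")

print(user.followers_count)  # follower count
print(user.created_at)       # account creation time
print(user.statuses_count)   # lifetime tweet count
```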

Reddit data

RDS data, JSON data

R script 1, R script 2 (GitHub Action)

For Reddit content scraping, the API is more straightforward, but the documentation is poorly structured. Moreover, only a limited amount of data can be acquired: specifically, there is no way to retrieve all the comments from a thread, and no way to work around this restriction. If complete coverage is really necessary, it is strongly suggested to also consider data dumps like Pushshift.
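
For instance, Pushshift exposes a plain HTTP search endpoint. A sketch of pulling comments from one thread with requests; the link id is a placeholder and the parameters follow Pushshift’s public documentation:

```python
import requests

# Query Pushshift's comment search endpoint for a single submission;
# "abc123" stands in for the base-36 id of the thread.
resp = requests.get(
    "https://api.pushshift.io/reddit/search/comment/",
    params={"link_id": "abc123", "size": 100},
)
resp.raise_for_status()

for comment in resp.json()["data"]:
    print(comment["author"], comment["body"][:80])
```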

To show that using the Reddit API with R is viable, some tangential data is still scraped this way: the script retrieves the metadata of the user rexarski.

Metadata of Reddit user rexarski.

The next step is to reconsider what type of data should be prioritized in the gathering process; in this case, it is the metadata of posts. It is worth taking another shot at Reddit’s open API, the one accessible even without registering a dev app. For instance, https://api.reddit.com/r/nba/top/?t=day should return a JSON of the 25 posts on the front page of the subreddit r/nba.

Direct API access to Reddit.

Naturally, the data is saved as a raw JSON file, which is accessible as test-run.json in the repo.
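
A minimal sketch of that request with Python’s requests, saving the raw JSON in the same spirit as test-run.json:

```python
import json
import requests

# Reddit expects a descriptive User-Agent even on the open endpoints
headers = {"User-Agent": "nba-data-gathering script (by u/rexarski)"}
resp = requests.get("https://api.reddit.com/r/nba/top/?t=day", headers=headers)
resp.raise_for_status()

listing = resp.json()
# Each post sits under a "children" entry; the metadata of interest
# lives in the child's "data" field.
for child in listing["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])

# Keep a raw copy of the listing
with open("test-run.json", "w") as f:
    json.dump(listing, f)
```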

But this approach is still not painless enough. After browsing some community posts, an R package called RedditExtractoR emerged. With it, it becomes much easier to save the top 10 posts in the r/nba subreddit in an R list. Since an R list is a single R object, it is a perfect fit for an .RDS file, and one iteration of the script above keeps a daily copy of such data. The general structure of that list is shown below:

Reddit thread data structure.

Moreover, it is good practice to keep such a time-sensitive data collection routine as automated as possible, so a GitHub Action is deployed to do the job. Basically, a scheduled cron job inside the repository runs a specified script in a specified environment, and thanks to the Git Auto Commit action developed by Stefan Zweifel, the cron job can commit and push the scraped data to the repo automatically. To summarize, the script runs once a day (at, say, 21:02 UTC), fetches the data, and saves it.

GitHub Actions of the Reddit scraper.

The Action setting file is located here.
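
That linked file is the source of truth; as a rough sketch (the script path, schedule, and commit message below are illustrative), such a workflow might look like:

```yaml
name: reddit-scraper

on:
  schedule:
    - cron: "2 21 * * *"  # once a day at 21:02 UTC

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1
      - name: Install dependencies
        run: Rscript -e 'install.packages("RedditExtractoR")'
      - name: Run the scraper
        run: Rscript scripts/reddit-scraper.R  # script path is illustrative
      - name: Commit and push the scraped data
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "Update Reddit data"
```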

We can even check the workflow status by looking at an SVG badge that GitHub provides.
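
The badge follows GitHub’s standard URL pattern; the user, repo, and workflow file below are placeholders:

```markdown
![reddit-scraper](https://github.com/<user>/<repo>/actions/workflows/<workflow-file>.yml/badge.svg)
```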

FiveThirtyEight data

From here on, the current location of each data set is listed, followed by a brief introduction of what it is about. All data are copied directly from their original repositories.

RAPTOR

Data

RAPTOR stands for Robust Algorithm (using) Player Tracking (and) On/Off Ratings. It is FiveThirtyEight’s original NBA statistic.1

DRAYMOND

Data

DRAYMOND stands for Defensive Rating Accounting for Yielding Minimal Openness by Nearest Defender.2 (538 is really good at coming up with acronyms.)

More

The Pudding data

These data sets are copied from their original repositories as well.

last-two-minute-report

Data

Data of NBA’s last two minute officiating reports.3

hype

Data and script

Data about the careers of 1,873 players.4

NBA player names with hyphens

Data

The player names data5 is provided as a JSON file, downloaded via wget.

wget JSON
Use bash command to retrieve data.
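
Generically, the command boils down to a one-liner; the actual URL is in the original repo, and <JSON_URL> stands in for it here:

```bash
# Download the player names JSON to a local file
wget -O player-names.json <JSON_URL>
```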

three-seconds

Data

Data of every defensive three seconds call in the NBA between 2015 and 2018 (including playoffs).6

Supporting data

Collecting and tidying those supporting data will surely involve a lot of the so-called “labor of love” in the future. Nevertheless, it is still a lot of fun.


1. Introducing RAPTOR, Our New Metric For The Modern NBA. (fivethirtyeight.com)

2. A Better Way To Evaluate NBA Defense. (fivethirtyeight.com)

3. NBA Last Two Minute Report. (pudding.cool)

4. How many high school stars make it to the NBA? (pudding.cool)

5. Spell Jam. (pudding.cool)

6. The NBA Has a Defensive Three Seconds Problem. (pudding.cool)