Part 3: Data Cleaning

Chaos isn't a pit. Chaos is a ladder.

Rui Qiu

Updated on 2021-09-26.



Without a doubt, cleaning is painful. But hopefully it pays off later.

For this part of the portfolio, the collected data was not only cleaned; more data was also brought into the game. Compared with the collection two weeks ago, it expands in the following three ways.

  1. The hand-tracked play-by-play NBA data previously collected from PBP Stats only included three-point attempts, which account for roughly one-sixth of all shots from the 2020-2021 regular season. More data, categorized by team, has now been pushed to the repository.
  2. Some sports news articles are pulled with NewsAPI to fulfill the objective of cleaning a text corpus with Python. No intermediate files are saved; the final document-term matrices are stored as text files instead.
  3. The daily collection of the top 10 posts in Reddit’s /r/NBA subreddit keeps growing.


Section 1: Record cleaning with R

Script: pbp-cleaner.R

Data (raw): data/nba-pbp/2020-raw-by-team

Data (cleaned): a complete list of shots, a player-id lookup table, and subsets grouped by team and by opponent

As mentioned above, the previous data is only a partition of the data of interest. After retrieving more data from the same source, a directory of all teams’ shot attempts from the 2020-2021 regular season is ready for cleaning.

All the CSV files are row-bound into one large tibble. A glimpse() demonstrates the overall structure of the result.
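A minimal sketch of the loading step, assuming one CSV per team under the raw directory:

```r
library(tidyverse)

# Read every per-team CSV and row-bind them into one tibble
shots_dat <- list.files("data/nba-pbp/2020-raw-by-team",
                        pattern = "\\.csv$", full.names = TRUE) %>%
  map_dfr(read_csv)
```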

glimpse(shots_dat)
Output of glimpse(shots_dat).

The shots_dat tibble has 188,975 rows of shots recorded and 37 variables for each shot.

Remove redundant variables

Some of the variables are not that useful, either because they don’t provide much information for building the expected score/threat model, or because they are just ids of other variables. Since no two players on current NBA rosters share the same name, dropping those id variables loses little.

The following variables are discarded: gameid, eventnum, oreboundedshoteventnum, oreboundedrebeventnum, offense_team_id, defense_team_id, possession_start_type, possession_start_time, blockplayerid, assistplayerid.
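In dplyr terms, that is a single select(), sketched here with the column names as they appear in the raw data:

```r
# Drop id and bookkeeping columns; player names are unique on current
# rosters, so the id columns carry no extra information here
shots_dat <- shots_dat %>%
  select(-c(gameid, eventnum, oreboundedshoteventnum, oreboundedrebeventnum,
            offense_team_id, defense_team_id, possession_start_type,
            possession_start_time, blockplayerid, assistplayerid))
```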

Reorder remaining variables

Usually, relocate() fits the need here. However, select() is the better choice, since every variable deserves a reconsideration before moving on to the next stage.

By doing so, the variables are grouped into eight subgroups, roughly as sketched below.
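A sketch of the reordering; the grouping shown here is illustrative, not the actual eight subgroups:

```r
# Reorder by listing columns explicitly; everything() keeps the rest
shots_dat <- shots_dat %>%
  select(player, playerid, team, lineupid, opponent, opplineupid,  # who
         date, period, time, margin,                               # when
         x, y, shottype, value, putback, made, and1,               # the shot
         everything())
```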

So far, this collection has one serious defect compared with the previous one: there is no trace of assist or pass coordinates. That is to say, to build a time-based or location-based Markov chain of ball movement before a score, something like a detailed game log would have to be investigated to retrieve those pass locations. This is really disheartening.

Get a list of player-id pairs for reference

This step is not urgent but probably necessary. A list of player and playerid pairs is exported into a CSV file. Even though some id variables have been deleted, lineupid and opplineupid are still string variables with playerid values separated by hyphens (-).
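A sketch of the export (the output path is illustrative):

```r
# One row per distinct player/playerid pair
shots_dat %>%
  distinct(player, playerid) %>%
  arrange(player) %>%
  write_csv("data/nba-pbp/players-id.csv")
```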

Player and playerid pairs.

Check legitimacy

Ideally, this step is where outliers and NAs are removed. There was a lot of trial and error in the actual cleaning procedure. For the following variables, NAs are strictly prohibited, simply because an NA makes no sense in any of them: playerid, player, team, lineupid, opponent, opplineupid, date, period, time, margin, x, y, shottype, putback, value, made, and and1.
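A minimal filter for this, assuming the column names above:

```r
# Drop any row with an NA in a column where NA is meaningless
required <- c("playerid", "player", "team", "lineupid", "opponent",
              "opplineupid", "date", "period", "time", "margin",
              "x", "y", "shottype", "putback", "value", "made", "and1")

shots_dat <- shots_dat %>%
  filter(if_all(all_of(required), ~ !is.na(.x)))
```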

For the rest of the variables, an NA may be exactly the right value for some cells. For instance, if a defender executed no block attempt, blocked should be NA instead of FALSE.

Additionally, possessionnum is deleted as well, since it records the sequential order of the event within the game, rather than “how many passes came before the shot.”

Then, a summarytools::descr() can be performed to get some descriptive statistics.

Descriptive statistics of shots_dat.

The real work is the legitimacy checking across variables. In other words, if things don’t add up, the offending rows are examined and possibly thrown away. What do these checks look like? (A sketch follows the list.)

  1. Check whether time (the seconds remaining in a quarter, which the data calls a period) is between 0 and 720.
  2. Check whether a shot is made but the assist flag is missing (is.na(assisted)).
  3. Check whether a shot is blocked but no blocking player is recorded (is.na(block_player)).
  4. Check whether a shot is both made and blocked, which is extremely unlikely in real life.
  5. Check whether a shot is assisted but no assisting player is recorded (is.na(assist_player)).
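As a sketch, assuming assisted, blocked, and made are logical columns and block_player / assist_player hold player names:

```r
# Pull out rows that fail any consistency check for inspection
suspect <- shots_dat %>%
  filter(
    !between(time, 0, 720) |             # 1. impossible clock value
      (made & is.na(assisted)) |         # 2. made, assist flag missing
      (blocked & is.na(block_player)) |  # 3. blocked, no blocker recorded
      (made & blocked) |                 # 4. made and blocked at once
      (assisted & is.na(assist_player))  # 5. assisted, no assister recorded
  )
```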

There are surely more common facts one could check. Luckily, one entry error was spotted while checking the fifth condition.

The Philadelphia 76ers player Tobias Harris was assisted by Ben Simmons, but Simmons’ name was recorded as NA here. This was probably caused by mixed-up team and opponent variables: Harris should be playing for PHI against DET. It was fixed with some manipulations.
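Roughly along these lines; the row condition and field values are reconstructed for illustration, not copied from the script:

```r
# Hypothetical fix: the bad row lists Harris with team/opponent swapped
# and the assister missing
bad <- shots_dat$player == "Tobias Harris" & shots_dat$team == "DET"
shots_dat$team[bad]          <- "PHI"
shots_dat$opponent[bad]      <- "DET"
shots_dat$assist_player[bad] <- "Ben Simmons"
```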

Create new features

As mentioned above, new features are created by splitting the hyphen-connected id strings, lineupid and opplineupid, into individual player ids.
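With tidyr, for instance (the new column names are illustrative):

```r
# Split each five-player lineup string into five playerid columns
shots_dat <- shots_dat %>%
  separate(lineupid, into = paste0("lineup_p", 1:5), sep = "-") %>%
  separate(opplineupid, into = paste0("opp_p", 1:5), sep = "-")
```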

Another glimpse() shows that shots_dat is now a 188,975 × 34 tibble.

If the next step were analysis or model building, there would be more standard procedures to apply, such as turning some character variables into factors. However, since the tidy data frame is about to be written back to a text file, those conversions would not survive the round trip, so they are skipped for now.

Save cleaned data

The cleaned data is saved both as one file as a whole and, grouped by team and by opponent, split into multiple sub-files. The new data is not unbelievably large (roughly 60 MB), so splitting it into various files might be overkill.
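A sketch of the grouped export; the per-opponent split is analogous, and paths are illustrative:

```r
# The full table in one file, plus one file per team
write_csv(shots_dat, "data/nba-pbp/2020-cleaned/shots.csv")

shots_dat %>%
  group_by(team) %>%
  group_walk(~ write_csv(.x,
    paste0("data/nba-pbp/2020-cleaned/by-team/", .y$team, ".csv")))
```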


Section 2: Text cleaning with R

Script: reddit-cleaner.R

Data: data/nba-reddit/, with raw data stored as .RDS and cleaned data stored as .csv.

The text data scraped from the subreddit is stored in temporary RDS files, each named for the time its scrape was performed.

The daily text data varies in size and length:

Text data from /r/nba.

The logic here is clear: turn each day’s raw posts into a tidy bag of words.
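A sketch of one day’s pass, assuming the RDS holds a data frame with a text column and the usual tidytext steps (tokenize, drop stop words, count):

```r
library(tidytext)

daily <- readRDS("data/nba-reddit/2021-09-25.RDS")  # file name illustrative

daily_bow <- daily %>%
  unnest_tokens(word, text) %>%           # one token per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(word, sort = TRUE)                # bag of words

write_csv(daily_bow, "data/nba-reddit/2021-09-25.csv")
```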

The final look of one day’s Reddit text data is like this:

Bag-of-words of one day's data.

Cron job update

Script: cron-job.r

Since the Reddit scraping script runs once a day in the evening, it makes sense to update it with the text-processing commands, so that the tidy text data is pushed to the remote as well. Additionally, the line that saved a temporary list into an .RDS file is removed.

Output of the updated cron job.

No issue is detected in the automation.

Section 3: Record cleaning with Python

Script: tweet-clean.ipynb

Data (raw): data/nba-tweets/player-accounts-br.csv, data/nba-tweets/player-accounts.csv

Data (cleaned): data/nba-tweets/player-accounts-cleaned.csv

The raw data consists of the two files listed above, both holding player name-account pairs.

The target output for this part is a list of name-handle pairs with no duplicates. Two kinds of duplicates show up. First, the same player may own multiple accounts:

|      | name            | account        |
|-----:|-----------------|----------------|
|  295 | Deonte Burton   | DeonteBurton   |
|  618 | Deonte Burton   | DeeBurton30    |
| 1065 | Devin Booker    | DevinBooker31  |
| 1303 | Devin Booker    | DevinBook      |
| 1406 | Marcus Thornton | M3Thornton     |
| 2184 | Marcus Thornton | OfficialMT23   |
|  485 | Mike Conley     | mconley11      |
| 1683 | Mike Conley     | MCONLEY10      |
| 1992 | Tony Mitchell   | TonyMitchUNENO |
| 1993 | Tony Mitchell   | tmitch_5       |
Second, the same account may appear under multiple spellings of a name:

|      | name              | account      |
|-----:|-------------------|--------------|
| 1442 | Justin James      | 1JustinJames |
|  212 | JJ                | 1JustinJames |
| 2224 | Andrew Wiggins    | 22wiggins    |
|  397 | andrew wiggins    | 22wiggins    |
| 1594 | Kent Bazemore     | 24Bazemore   |
|  ... | ...               | ...          |
| 1378 | Willy Hernangómez | willyhg94    |
|  490 | Thad Young        | yungsmoove21 |
| 2247 | Thaddeus Young    | yungsmoove21 |
|   38 | Zhaire            | zhaire_smith |
| 1408 | Zhaire Smith      | zhaire_smith |
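The notebook does this in pandas; the same logic sketched here in R terms (the matching rules are illustrative, and players with genuinely multiple accounts still need a manual pick):

```r
# Normalize case on both fields, then keep one row per name/handle pair
accounts <- bind_rows(
  read_csv("data/nba-tweets/player-accounts-br.csv"),
  read_csv("data/nba-tweets/player-accounts.csv")
) %>%
  mutate(name = str_to_title(name),
         handle_key = str_to_lower(account)) %>%
  distinct(name, handle_key, .keep_all = TRUE) %>%
  select(name, account)

write_csv(accounts, "data/nba-tweets/player-accounts-cleaned.csv")
```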


Section 4: Text (CSV) cleaning with Python

Script: tweet-clean.ipynb

Data (raw): data/nba-tweets/player-tweets.csv, data/nba-tweets/player-tweets-2.csv

Data (cleaned): data/nba-tweets/tweets-corpus-cleaned.csv

Two CSV files, each containing around 500 tweets, are loaded.

The label picked here comes from the last column of the tweets CSV, urls, which stores a list of dictionaries describing the URLs that appeared in a tweet. A funny thing about loading the CSV into a pandas data frame is that the “list” arrives as a string, starting with a [ and ending with a ]. The trick to determining whether a tweet contains a URL is therefore a length check: exactly 2 characters means the empty list [], i.e., no URL.
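In R terms (the notebook itself uses pandas), the label is just that string-length check, as a sketch:

```r
# "[]" is two characters, so anything longer holds at least one URL
tweets <- bind_rows(read_csv("data/nba-tweets/player-tweets.csv"),
                    read_csv("data/nba-tweets/player-tweets-2.csv")) %>%
  mutate(contains_url = nchar(urls) > 2)
```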

CountVectorizer is then used to generate the bag of words from the data frame. Note that at most 500 features are kept (max_features=500).

Then the document-term matrix is generated, with the boolean variable contains_url attached as the first column.
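An equivalent sketch in R (the notebook uses scikit-learn's CountVectorizer; the tweet text column name is assumed):

```r
# Keep the 500 most frequent terms, then cast the counts to a wide DTM
tokens <- tweets %>%
  mutate(doc = row_number()) %>%
  unnest_tokens(word, text)

top500 <- tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 500)

dtm <- tokens %>%
  semi_join(top500, by = "word") %>%
  count(doc, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0) %>%
  left_join(tweets %>% transmute(doc = row_number(), contains_url),
            by = "doc") %>%
  relocate(contains_url)
```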

A DTM of NBA players' tweets.

Afterward, it is saved to the destination directory.

Section 5: Text (corpus) cleaning with Python

Script: tweet-clean.ipynb

Data (raw): data/nba-news-source/news-2021-09-25.csv

Data (cleaned): data/nba-news-source/dtm-2021-09-25.csv

Strictly speaking, nothing in the current data collection is a text corpus suitable for text preprocessing. A quick and dirty approach is to call NewsAPI and grab some news articles from different sources.

Exposing the API key in the script is a bad idea.
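A safer pattern reads the key from an environment variable; here is a sketch with httr (the query values are illustrative):

```r
library(httr)

# The key lives in the NEWSAPI_KEY environment variable, not in the script
res <- GET("https://newsapi.org/v2/everything",
           query = list(q        = "NBA",
                        from     = "2021-08-25",
                        to       = "2021-09-25",
                        pageSize = 100,
                        apiKey   = Sys.getenv("NEWSAPI_KEY")))
articles <- content(res)$articles
```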

The data contains news headlines from August 25 to September 25, 2021, from four sources: SB Nation, ESPN, Fox Sports, and Bleacher Report.

|     | Source          | Date       | Title                                             | Headline                                          |
|-----|-----------------|------------|---------------------------------------------------|---------------------------------------------------|
| 0   | ESPN            | 2021-09-02 | What if Skills that could change the games of ... | What if Luka was automatic from the stripe Wha... |
| 1   | ESPN            | 2021-09-02 | NBA eyes strict rules for unvaccinated players    | Unvaccinated NBA players will have lockers as ... |
| 2   | ESPN            | 2021-09-21 | Redick to retire after seasons in NBA             | JJ Redick who played for the Pelicans and Mave... |
| 3   | ESPN            | 2021-08-25 | Lapchick NBA sets high bar for health policies... | The NBA has played a leading role in men s pro... |
| 4   | ESPN            | 2021-09-24 | Sources Ginobili returning to Spurs as advisor    | Four time NBA champion Manu Ginobili one of th... |
| ... | ...             | ...        | ...                                               | ...                                               |
| 95  | Bleacher Report | 2021-09-14 | NBA Re Draft Does Jayson Tatum or Donovan Mitc... | In the instant analysis culture of today s spo... |
| 96  | Bleacher Report | 2021-09-15 | Ben Simmons Rumors ers Expect PG to Play Next ... | Despite a href https nba nbcsports com report ... |
| 97  | Bleacher Report | 2021-09-14 | NBA Trade Rumors John Wall Rockets Mutually Ag... | John Wall and the Houston Rockets have reporte... |
| 98  | Bleacher Report | 2021-08-25 | Woj Mike Budenholzer Bucks Agree to New Year C... | Mike Budenholzer will remain with the Milwauke... |
| 99  | Bleacher Report | 2021-09-22 | NBA Offseason Moves with the Most Bust Potenti... | Every offseason transaction carries downside r... |

Due to the API limit, only 100 articles can be retrieved at a time.

Features are extracted from the Headline column and used to generate another sparse document-term matrix. Naturally, the Source column is attached at the end as the label.
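The recipe mirrors the tweets DTM, as a sketch (assuming the articles sit in a data frame called news):

```r
# DTM from the headlines, with Source appended as the label column
news_dtm <- news %>%
  mutate(doc = row_number()) %>%
  unnest_tokens(word, Headline) %>%
  count(doc, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0) %>%
  left_join(news %>% transmute(doc = row_number(), Source), by = "doc")

write_csv(news_dtm, "data/nba-news-source/dtm-2021-09-25.csv")
```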

Conclusion

There is nothing too spectacular to say about data cleaning. The process itself is messy and chaotic, to be honest. But it will definitely help later. It is, after all, a ladder.