Thursday, June 28, 2018

MiLB Statcast Project Part One: Data Collection


Statcast data has revolutionized the way analysts and teams treat baseball data - never before has a sport had such comprehensive and precise measurements available for nearly every play of every game. Launch angle and exit velocity have dramatically changed the way we look at and evaluate hitters, and even if the publicly available data is only the tip of the iceberg, it gives us powerful tools for evaluating and scouting players. We can quickly identify Statcast superstars based on their exit velocity (EV) and launch angle (LA), guys like Aaron Judge and Matt Olson who jump off the page.

Frustratingly, though, we have to wait for players to reach MLB before we can evaluate them on their Statcast metrics. Scouting reports predict hard-hit balls, lift tendencies, etc. - but not with the same precision as Statcast. Minor league baseball has little to no precise tracking information publicly available, so it's impossible to know which minor league hitters have the highest EV or what kind of launch angles they produce.

But some tracking information does exist. To create its Gameday overlays for MLB games, MLBAM uses Statcast tracking to place pitches and hits, creating visual representations of where pitches crossed the zone and where batted balls fell. To create Gameday overlays for MiLB games, MLBAM doesn't use Statcast - instead, it relies on an army of stringers who manually input where each pitch was located and where each batted ball fell.

Can we manipulate this data (tracked through Gameday's API) to approximate hitters' Statcast metrics while in the minors?

This five-part series will attempt to answer that question by manipulating AAA-level batted ball data from MLB's Gameday API. The five parts will cover the following topics.

Part One: Scraping and collecting MiLB pitch and batted ball data.
Part Two: Visualizing and cleaning MiLB pitch data.
Part Three: Visualizing and cleaning MiLB batted ball data.
Part Four: Approximating average launch angle from elevation tendencies.
Part Five: The next steps for this data.

This article covers Part One of the series.

Scraping MiLB and MLB Data

The first and simplest task is to scrape MLB data from Statcast or from the GD2 server. Here are a couple of options, essentially different flavors of the same approach.
Being an R guy myself, I prefer Petti's method, especially because of how easy it is to write everything into a MySQL DB - but any of these should be capable of grabbing data from either Baseball Savant or GD2.

The next task is to scrape GDX, the MLB server that houses minor league data (edit: apparently, GD2 also houses MiLB data - I haven't seen any differences between the two, but was unaware that GDX and GD2 simultaneously existed). I was worried I'd have to write a scraper from scratch to grab MiLB data, but I realized that GDX is structured almost identically to GD2, so I just needed to find a GD2 scraper and modify it slightly.

A couple of years ago, current Mariners analyst John Choiniere wrote a GD2 scraper designed to grab raw PitchFx data. A recent GD2 change broke the scraper in its current form, but it's a quick fix, and with a little fiddling, I was able to edit the scraper to successfully scrape GDX data as well. The original scraper also did not collect the hit locations that stringers record at the MiLB level, so I edited it to grab that information too.
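To make the "structured almost identically" point concrete, here is a minimal sketch of how a scraper might build the per-day directory URL on either server. It is not taken from the scraper itself: the host names, the `components/game/<league>/year_YYYY/month_MM/day_DD/` layout, and the `aaa` league code are my assumptions about how the servers were organized, and `day_directory_url` is a hypothetical helper name.

```python
from datetime import date

# Assumed host names; the only difference between the two servers
# in this sketch is the host, which is why one scraper can serve both.
HOSTS = {
    "gd2": "http://gd2.mlb.com",
    "gdx": "http://gdx.mlb.com",
}


def day_directory_url(day: date, league: str = "aaa", host: str = "gdx") -> str:
    """Build the (assumed) URL of the directory listing all games on one day.

    `league` is the assumed league code in the path ("mlb", "aaa", ...).
    """
    return (
        f"{HOSTS[host]}/components/game/{league}/"
        f"year_{day.year:04d}/month_{day.month:02d}/day_{day.day:02d}/"
    )


if __name__ == "__main__":
    # Same date, two servers: only the host and league code change.
    print(day_directory_url(date(2017, 6, 28), league="mlb", host="gd2"))
    print(day_directory_url(date(2017, 6, 28), league="aaa", host="gdx"))
```

If the layout really is shared, pointing an existing GD2 scraper at MiLB data is mostly a matter of swapping the host and league code, which matches how little I had to change.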

You can find my updated scraper here.

For the work that I'll be doing in this series, I'm grabbing AAA data from 2016-2017. The files from the scraper are huge for CSVs - my play-by-play file is about 120 MB, and my pitch-by-pitch file is about 415 MB.
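With files this size, it can help to stream them row by row rather than load everything at once. Below is a minimal sketch that tallies rows by the value of one column without holding the file in memory. The helper name `count_by_column` and the file name in the usage line are hypothetical; `game_type` is one of the at-bat columns the scraper writes.

```python
import csv
from collections import Counter


def count_by_column(path, column="game_type"):
    """Stream a CSV and count rows per distinct value of `column`."""
    counts = Counter()
    with open(path, newline="") as f:
        # DictReader yields one row at a time, so memory use stays flat
        # no matter how large the file is.
        for row in csv.DictReader(f):
            counts[row[column]] += 1
    return counts


if __name__ == "__main__":
    # "atbats.csv" is a placeholder; substitute your own play-by-play file.
    print(count_by_column("atbats.csv"))
```

A quick tally like this is a cheap sanity check that the scrape covers the game types you expect before doing any heavier processing.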

Here's a glimpse at the head of the at-bat data.

I'll explain the unintuitive column headers here: retro_game_id is a unique game identifier that, for MLB games, can be used to cross-reference the game's information with Retrosheet data. st_fl, regseason_fl, and playoff_fl indicate whether the game is a spring-training game, a regular-season game, or a playoff game, and game_type indicates precisely what type of game it is (in this case, "E" for "Exhibition"). start_bases_cd and end_bases_cd encode the base state before and after the play, which is useful for calculating RE24 values.
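As a rough illustration of how a base-state code can be unpacked, here is a sketch that assumes start_bases_cd and end_bases_cd follow the Retrosheet-style convention of a number from 0 to 7, where 1 means a runner on first, 2 a runner on second, 4 a runner on third, and the values add together. I haven't confirmed that this scraper's output uses that encoding, so check it against a few known plays first. `decode_bases` is a hypothetical helper name.

```python
def decode_bases(code):
    """Unpack an assumed 0-7 base-state code into per-base occupancy.

    Assumption: bit 1 = first base, bit 2 = second base, bit 4 = third base.
    """
    code = int(code)  # CSV values arrive as strings
    if not 0 <= code <= 7:
        raise ValueError(f"base-state code out of range: {code}")
    return {
        "first": bool(code & 1),
        "second": bool(code & 2),
        "third": bool(code & 4),
    }


if __name__ == "__main__":
    # Under the assumed encoding, 5 = 1 + 4: runners on first and third.
    print(decode_bases("5"))
```

Combined with the number of outs, a decoded base state gives the 24 base-out situations that RE24 is built on.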

Here's a sampling of the pitch-by-pitch values from the same game.

Game information is basically repeated here, but it's available for each individual pitch. Note that since the scraper was built for MLB PitchFx data, the PitchFx headers are present, but since this is MiLB data, those columns are empty.

In part two, we're going to start putting this data to good use, clean it up, and visualize it.
