Friday, June 29, 2018

MiLB Statcast Project Part Two: Visualizing pitch data

For the second part of our series, we'll import our pitch-by-pitch data into R, and visualize it with ggplot2. Here, we'll clean it up and work on visualizing it effectively. So, let's fire up R.

On a microscopic level, our data looks nice and clean. For example, here are the pitch locations and outcomes for Dominic Smith's pitches in a random game.

On the surface, this looks okay - the approximate locations of balls and strikes are immediately apparent. But if we zoom out to the level of a full season, we start to see a lot of problems.

To give you an idea of why this is really problematic, let me show you Smith's pitch data from Baseball Savant from his MLB career to date.

This chart shows pitch types against Smith, not results, but the point is clear - Statcast, a much more precise measuring system, shows that pitch placement for almost all hitters is akin to a multivariate normal distribution. There is no large gap at the edges of the strike zone, as there is in Smith's MiLB data.

This gap is present in almost every player's pitch chart. Here's J.P. Crawford in 2017.

And here's Ozzie Albies' pitch chart from the same year.

For all hitters, there appears to exist a substantial amount of pro-umpire bias in pitch placement from the stringers. MiLB pitch data records extremely few marginal pitches, and as a result, pitch data is heavily skewed. 

The bias appears to come directly from the umpires' calls - there are very few balls or called strikes on the edge of the zone. It's impossible that pitchers were so precise in placing their pitches in and out of the zone, or that umpires were so accurate in calling balls and strikes - rather, stringers see where the pitch was placed, see the umpire's call, and either consciously or unconsciously adjust the pitch placement to reflect the call. Hence, all called strikes land in the strike zone, and all balls land outside it.

If we were to calculate plate-discipline statistics, like O-Swing%, Z-Swing%, O-Contact%, or Z-Contact%, we'd get worthless data. We can still calculate stats like SwStrk% or F-Strike%, and the pitch placement data appears to have some utility, but this is a stark reminder that this data set has a lot of problems.
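The location-independent stats can be computed from pitch outcomes alone. Here is a minimal sketch of SwStrk% and F-Strike% on a toy pitch log; the column names (`pa_id`, `pitch_num`, `result`) and the outcome labels are assumptions for illustration, not the scraper's actual headers.

```r
# Toy pitch log; column names and outcome labels are hypothetical.
pitches <- data.frame(
  pa_id     = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  pitch_num = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
  result    = c("Ball", "Swinging Strike", "In play",
                "Called Strike", "In play",
                "Ball", "Ball", "Foul", "Swinging Strike",
                "In play"),
  stringsAsFactors = FALSE
)

# SwStrk%: swinging strikes divided by all pitches seen.
swstr_pct <- mean(pitches$result == "Swinging Strike")

# F-Strike%: share of plate appearances whose first pitch was a strike
# (called, swinging, foul, or put in play).
first_pitches <- pitches[pitches$pitch_num == 1, ]
strike_like   <- c("Called Strike", "Swinging Strike", "Foul", "In play")
fstrike_pct   <- mean(first_pitches$result %in% strike_like)

swstr_pct    # 0.2
fstrike_pct  # 0.5
```

Neither calculation touches the X and Y coordinates, which is why the stringer bias doesn't affect them.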

We have another problem as well - we have no idea of the realistic size, shape, or scale of the strike zone! The X and Y coordinates have no immediately available units, and we have only the Y limits of the strike zone given - in completely different units than the ones for locating pitches.

What units are the X and Y coordinates in? They appear to be simply coordinates on a 250x250 grid, similar to that of batted ball placement. From eyeballing Smith's zone, it appears as though the top and bottom of Smith's strike zone occur at Y coordinates of 180 and 120 units, so Smith's zone is about 60 units tall. According to Smith's sz_top and sz_bot, his strike zone is approximately 22 inches tall. 22 divided by 60 is about 0.37 inches per unit, close to the 0.39 inches in a centimeter - so we can treat the grid units as roughly centimeters.
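The scale arithmetic can be checked in a few lines. The inputs are the eyeballed values from the paragraph above, so treat the result as a rough estimate; `units_to_inches` is a hypothetical helper, not part of any package.

```r
# Values read off the chart by eye; rough estimates only.
zone_top_units    <- 180
zone_bottom_units <- 120
zone_height_in    <- 22   # approximate height from sz_top and sz_bot

inches_per_unit <- zone_height_in / (zone_top_units - zone_bottom_units)
inches_per_cm   <- 1 / 2.54

round(inches_per_unit, 3)  # 0.367
round(inches_per_cm, 3)    # 0.394

# Hypothetical helper: convert grid units to inches with the estimated scale.
units_to_inches <- function(u, scale = inches_per_unit) u * scale
```

The two factors differ by about 7%, so the centimeter reading is a convenient approximation rather than an exact unit.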

To determine the edges of the strike zone, we can look at called strikes - we've already seen that almost all called strikes are recorded as being within the zone, so we can base our algorithmic strike zone on the called-strike zone. For the bottom of the strike zone, let's find the 3rd lowest y coordinate, and for the top, the 3rd highest (think of this as finding the median of the 5 highest or lowest coordinates). Repeating the process for the x coordinates yields the following plot.
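The 3rd-lowest and 3rd-highest rule can be sketched as follows. This is one possible reading of the approach, using made-up called-strike coordinates; `kth_extremes` and `estimate_zone` are hypothetical names.

```r
# Hypothetical helper: k-th lowest and k-th highest value of a coordinate vector.
kth_extremes <- function(v, k = 3) {
  v <- sort(v[!is.na(v)])
  stopifnot(length(v) >= k)
  c(low = v[k], high = v[length(v) - k + 1])
}

# Estimate the zone edges from called-strike locations (x, y in grid units).
estimate_zone <- function(x, y, k = 3) {
  xs <- kth_extremes(x, k)
  ys <- kth_extremes(y, k)
  list(left = xs[["low"]], right = xs[["high"]],
       bottom = ys[["low"]], top = ys[["high"]])
}

# Made-up called strikes with a couple of stray points on each side.
called_x <- c(60, 95, 100, 104, 110, 125, 140, 146, 150, 190)
called_y <- c(90, 118, 121, 130, 150, 165, 178, 181, 184, 220)

zone <- estimate_zone(called_x, called_y)
# zone$left = 100, zone$right = 146, zone$bottom = 121, zone$top = 181
```

Taking the 3rd value rather than the minimum or maximum means the two most extreme points on each side are ignored, so a stray misrecorded strike doesn't stretch the zone.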

Let's turn this into a working function, and remove automatic intentional balls (which are recorded as pitches at 1,1).
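A minimal sketch of the intentional-ball filter, assuming the pitch coordinates live in columns named `x` and `y` (hypothetical names; the scraper's headers may differ). `drop_intentional_balls` is likewise a made-up helper.

```r
# Hypothetical helper: drop automatic intentional balls, which are
# recorded at (1, 1). Rows with missing coordinates are kept.
drop_intentional_balls <- function(df, x_col = "x", y_col = "y") {
  is_ibb <- !is.na(df[[x_col]]) & !is.na(df[[y_col]]) &
    df[[x_col]] == 1 & df[[y_col]] == 1
  df[!is_ibb, , drop = FALSE]
}

# Toy data: rows 2 and 5 sit at (1, 1); row 4 only shares the x value.
pitches <- data.frame(
  x   = c(101.2, 1, 87.5, 1, 1),
  y   = c(140.0, 1, 172.3, 150.1, 1),
  des = c("Called Strike", "Automatic Ball", "Ball", "Ball", "Automatic Ball"),
  stringsAsFactors = FALSE
)

clean <- drop_intentional_balls(pitches)
nrow(clean)  # 3
```

Requiring both coordinates to equal 1 keeps a legitimate pitch that happens to share only one of them.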

It's blindingly apparent why zone/out-of-zone metrics won't function with such a rigidly defined zone. However, this is not to say that the MiLB data is completely worthless - there's still some information we might stand to glean from this data.

As a practical example, let's take a look at whiffs. Joey Gallo has one of the highest swinging strike rates in the MLB, and he had similar problems while in the MiLB - Gallo had a swinging strike rate of 18.0% in AAA in 2016. Where were his whiffs coming from? I created a heat map of Gallo's whiffs in 2016.
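The counting step behind a whiff heat map can be sketched in base R by binning locations into grid cells. The coordinates below are made up, the 25-unit cell size is an arbitrary choice, and this is not necessarily how the heat map in this post was built.

```r
# Made-up whiff locations in grid units; real values would come from the pitch file.
whiff_x <- c(92, 97, 103, 110, 118, 121, 125, 131, 134, 140)
whiff_y <- c(105, 98, 112, 101, 95, 108, 99, 115, 104, 97)

# Bin edges: 25-unit cells across the 250x250 grid.
edges <- seq(0, 250, by = 25)

# Count whiffs per cell; rows are x bins, columns are y bins.
counts <- table(
  x_bin = cut(whiff_x, breaks = edges, include.lowest = TRUE),
  y_bin = cut(whiff_y, breaks = edges, include.lowest = TRUE)
)

# Row and column index of the cell(s) with the most whiffs.
hot <- which(counts == max(counts), arr.ind = TRUE)
max(counts)  # 3
```

The resulting count matrix can be handed to any plotting layer that draws filled tiles.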

I then grabbed Gallo's whiff heat map from Baseball Savant for his 2015-2016 seasons in the MLB, and surprise surprise, his whiffs occurred in largely the same location in the zone.

Despite the bias in terms of the zone, we can use MiLB pitch data to look at other significant trends, and explore pitchers and hitters in the MiLB to see how they might perform in the MLB. We're dealing with some messed up data, but it's not damaged beyond repair.

Thursday, June 28, 2018

MiLB Statcast Project Part One: Data Collection


Statcast data has revolutionized the way analysts and teams treat baseball data - never before has any sport had such comprehensive and precise analytics available for nearly every play of every game. Launch angle and exit velocity have dramatically changed the way that we look at and evaluate hitters, and even if the data available publicly is the tip of the iceberg, it gives us powerful tools for evaluating and scouting players. We can quickly identify Statcast Superstars based on their exit velocity and LA, guys like Aaron Judge and Matt Olson who jump off the page.

Frustratingly, though, we have to wait for players to join the MLB in order to evaluate them based on their Statcast metrics. Scouting reports predict hard hit balls, lift tendencies, etc. - but not to the same degree of precision as Statcast does. Minor league baseball has little to no precise tracking information publicly available, so it's impossible to know which minor league hitters have the highest EV or what type of launch angles they have.

But some tracking information does exist. To create its Gameday overlays for MLB games, MLBAM uses Statcast tracking to place pitches and hits, creating visual representations of where pitches hit the zone and where batted balls fall. To create Gameday overlays for MiLB games, MLBAM doesn't use Statcast - rather, it relies on an army of stringers who manually input where each pitch was located and where each batted ball fell.

Can we manipulate this data (tracked through Gameday's API) to approximate hitters' Statcast metrics while in the minors?

This five-part series will attempt to answer that question by manipulating AAA-level batted ball data from MLB's Gameday API. The five parts will cover the following topics.

Part One: Scraping and collecting MiLB pitch and batted ball data.
Part Two: Visualizing and cleaning MiLB pitch data.
Part Three: Visualizing and cleaning MiLB batted ball data.
Part Four: Approximating average launch angle from elevation tendencies.
Part Five: The next steps for this data.

This article covers Part One of the series.

Scraping MiLB and MLB Data

The first and simplest task is to scrape MLB data from Statcast or from the GD2 server. Here are a couple options, essentially different flavors of accomplishing the same task.
Being an R guy myself, I prefer Petti's method, especially because of how easy it is to write everything into a MySQL DB - but any of these should be capable of grabbing data from either Baseball Savant or GD2.

The next task is to scrape GDX, the MLB server that houses minor league data (edit: apparently, GD2 also houses MiLB data - I haven't seen any differences between the two, but was unaware that GDX and GD2 simultaneously existed). I was worried I'd have to write a scraper from scratch to grab MiLB data, but I realized that GDX is structured almost identically to GD2, so I just needed to find a GD2 scraper and modify it slightly.

Current Mariners analyst John Choiniere wrote a GD2 scraper designed to grab raw PitchFx data a couple years ago. A recent GD2 change broke the scraper in its current form, but it's a quick fix, and with a little fiddling, I was able to edit the scraper to successfully scrape GDX data as well. This scraper also did not collect the location of hits at the MiLB level according to stringers, so I edited the scraper to grab that information as well.

You can find my updated scraper here.

For the work that I'll be doing in this series, I'm grabbing AAA data from 2016-2017. The files from the scraper are huge for CSVs - my play-by-play file is about 120 MB, and my pitch-by-pitch file is about 415 MB.

Here's a glimpse at the head of the at-bat data.

I'll explain the unintuitive column headers here: retro_game_id is a unique game identifier that, for MLB games, can be used to cross-reference the game's information with Retrosheet data. st_fl, regseason_fl, and playoff_fl indicate if the game is a spring-training game, a regular-season game, or a playoff game, and game_type indicates precisely what type of game it is (in this case, "E" for "Exhibition"). start_bases_cd and end_bases_cd refer to particular base states, useful in calculating RE24 values.

Here's a sampling of the pitch-by-pitch values from the same game.

Game information is basically repeated here, but it's available for each individual pitch. Note that since the scraper was built for MLB PitchFx data, the PitchFx headers are present, but since this is MiLB data, no PitchFx values are available.

In part two, we're going to start putting this data to good use, clean it up, and visualize it.