In this second part of our series, we'll import our pitch-by-pitch data into R, clean it up, and work on visualizing it effectively with ggplot2. So, let's fire up R.
On a microscopic level, our data looks nice and clean. For example, here are the locations and outcomes of the pitches thrown to Dominic Smith in a randomly chosen game.
To see why the MiLB data is really problematic, compare it with Smith's pitch data from Baseball Savant for his MLB career to date.
This chart shows pitch types against Smith, not results, but the point is clear: Statcast, a much more precise measuring system, shows that pitch placement for almost all hitters is akin to a multivariate normal distribution. There is no large gap at the edges of the strike zone like the one in Smith's MiLB data.
This gap is present in almost every player's MiLB pitch chart. Here's J.P. Crawford in 2017.
And here's Ozzie Albies' pitch chart from the same year.
For all hitters, the stringers' pitch placement appears to carry a substantial amount of pro-umpire bias. MiLB pitch data records extremely few marginal pitches, and as a result, the location data is heavily skewed.
The bias appears to come directly from the umpires' calls: there are very few balls or called strikes on the edge of the zone. It's implausible that pitchers were so precise in placing their pitches in and out of the zone, or that umpires were so accurate in their ball and strike calls. Rather, stringers see where the pitch was placed, see the umpire's call, and either consciously or unconsciously adjust the pitch placement to reflect the call. Hence, all called strikes land inside the strike zone, and all balls land outside it.
If we were to calculate plate-discipline statistics that depend on whether a pitch was in the zone, like O-Swing%, Z-Swing%, O-Contact%, or Z-Contact%, the results would be worthless. We can still calculate stats like SwStr% or F-Strike%, which depend only on pitch outcomes, and the pitch placement data appears to have some utility, but this is a stark reminder that this data set has a lot of problems.
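To make the distinction concrete, here is a minimal sketch of the two outcome-only stats in base R. The data frame, its column names (`pa_id`, `pitch_num`, `result`), and the result labels are all made up for illustration and are not the fields of the actual feed. Counting every non-ball first pitch as a strike is also an assumption about how to treat balls in play.

```r
# Hypothetical pitch log: one row per pitch. Column names and result
# labels are assumptions for illustration, not the real feed's fields.
pitches <- data.frame(
  pa_id     = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  pitch_num = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
  result    = c("Called Strike", "Ball", "In play",
                "Ball", "Swinging Strike",
                "Swinging Strike", "Foul", "Ball", "Swinging Strike",
                "In play"),
  stringsAsFactors = FALSE
)

# SwStr%: swinging strikes divided by all pitches seen.
swstr_pct <- mean(pitches$result == "Swinging Strike")

# F-Strike%: share of plate appearances whose first pitch was not a ball
# (assumption: balls in play on the first pitch count as strikes).
first_pitches <- pitches[pitches$pitch_num == 1, ]
f_strike_pct <- mean(first_pitches$result != "Ball")
```

Neither calculation touches the recorded pitch location, which is why the stringer bias does not contaminate them.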
We have another problem as well - we have no idea of the realistic size, shape, or scale of the strike zone! The X and Y coordinates have no immediately available units, and we have only the Y limits of the strike zone given - in completely different units than the ones for locating pitches.
What units are the X and Y coordinates in? They appear to be simply coordinates on a 250x250 grid, similar to that of batted ball placement. From eyeballing Smith's zone, it appears as though the top and bottom of Smith's strike zone occur at Y coordinates of 180 and 120 units, so Smith's zone is about 60 units tall. According to Smith's sz_top and sz_bot, his strike zone is approximately 22 inches tall. 22 divided by 60 is about 0.37 inches per unit, which is close to the 0.39 inches in a centimeter, so we can treat the grid units as roughly centimeters.
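The scale check is plain arithmetic, sketched here in R using only the numbers quoted in the paragraph. The variable names are illustrative.

```r
# Values taken from the text: zone edges eyeballed at y = 180 and y = 120,
# and a zone height of roughly 22 inches from sz_top and sz_bot.
zone_height_units  <- 180 - 120   # 60 grid units
zone_height_inches <- 22

inches_per_unit <- zone_height_inches / zone_height_units  # about 0.367
inches_per_cm   <- 1 / 2.54                                # about 0.394

# A ratio near 1 means one grid unit is close to one centimeter.
unit_in_cm <- inches_per_unit / inches_per_cm              # about 0.93
```

The match is approximate (a grid unit comes out a little under a centimeter), which is about as much precision as an eyeballed zone supports.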
To determine the edges of the strike zone, we can look at called strikes. We've already seen that almost all called strikes are recorded as being within the zone, so we can base our algorithmic strike zone on the called-strike zone. For the bottom of the strike zone, let's find the 3rd lowest y coordinate, and for the top, the 3rd highest (think of this as finding the median of the 5 highest or lowest coordinates, which keeps one or two stray pitches from stretching the zone). Repeating the process for the x coordinates yields the following plot.
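The "3rd lowest, 3rd highest" rule is just a pair of order statistics. Here is a minimal base R sketch of the idea; the helper name `kth_extremes` and the sample coordinates are invented for illustration.

```r
# Return the k-th lowest and k-th highest values of a numeric vector.
kth_extremes <- function(v, k = 3) {
  v <- sort(v[!is.na(v)])
  if (length(v) < k) stop("not enough values for the requested k")
  c(low = v[k], high = v[length(v) - k + 1])
}

# Made-up called-strike y coordinates with one stray value at each end.
called_y <- c(95, 121, 122, 124, 130, 141, 150, 163, 176, 178, 179, 205)

y_edges <- kth_extremes(called_y)
```

With these values the stray pitches at 95 and 205 are ignored and the edges land at 122 and 178, whereas `range(called_y)` would have stretched the zone to the outliers.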
Let's turn this into a working function, and remove automatic intentional balls (which are recorded as pitches at 1,1).
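One possible shape for such a function is sketched below. It is not the original code: the column names (`x`, `y`, `result`) and the `"Called Strike"` label are assumptions about how the data frame is laid out, and the sample rows are made up.

```r
# Estimate a rectangular zone from called strikes.
# Assumed columns: x, y (grid coordinates) and result (pitch outcome).
estimate_zone <- function(pitches, k = 3) {
  # Drop automatic intentional balls, which are logged at (1, 1).
  keep <- !(pitches$x %in% 1 & pitches$y %in% 1)
  pitches <- pitches[keep, ]

  called <- pitches[pitches$result %in% "Called Strike", ]
  xs <- sort(called$x)
  ys <- sort(called$y)
  if (length(xs) < k || length(ys) < k) {
    stop("too few called strikes to estimate a zone")
  }

  # k-th lowest and k-th highest coordinate on each axis.
  list(x_min = xs[k], x_max = xs[length(xs) - k + 1],
       y_min = ys[k], y_max = ys[length(ys) - k + 1],
       pitches = pitches)
}

# Made-up example: nine called strikes, one ordinary ball, two (1, 1) rows.
sample_pitches <- data.frame(
  x = c(1, 100, 105, 110, 115, 120, 125, 130, 135, 140, 60, 1),
  y = c(1, 125, 130, 135, 140, 150, 160, 165, 170, 175, 200, 1),
  result = c("Ball", rep("Called Strike", 9), "Ball", "Ball"),
  stringsAsFactors = FALSE
)

zone <- estimate_zone(sample_pitches)
```

Returning the cleaned pitches alongside the edges keeps the plotting step simple: the same object supplies both the points and the rectangle to draw over them.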
It's blindingly apparent why zone/out-of-zone metrics won't function with such a rigidly defined zone. However, this is not to say that the MiLB data is completely worthless - there's still some information we might stand to glean from this data.
As a practical example, let's take a look at whiffs. Joey Gallo has one of the highest swinging strike rates in MLB, and he had similar problems in the minors: Gallo had a swinging strike rate of 18.0% in AAA in 2016. Where were his whiffs coming from? I created a heat map of Gallo's whiffs in 2016.
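A whiff heat map boils down to counting swinging strikes per cell of the coordinate grid. The sketch below does the binning in base R and builds a tile plot only if ggplot2 is installed. The coordinates are invented rather than Gallo's actual pitches, and the 50-unit bin width is an arbitrary choice for the example.

```r
# Hypothetical swinging-strike locations on the 250x250 grid (made-up values).
whiffs <- data.frame(
  x = c(110, 112, 118, 125, 127, 131, 133, 140, 95, 160),
  y = c(190, 188, 195, 192, 186, 198, 191, 185, 140, 150)
)

# Cut each axis into 50-unit bins and count whiffs per cell.
breaks <- seq(0, 250, by = 50)
counts <- table(
  x_bin = cut(whiffs$x, breaks, include.lowest = TRUE),
  y_bin = cut(whiffs$y, breaks, include.lowest = TRUE)
)

# The densest cell is where the hitter swings through the most pitches.
peak <- max(counts)

# Optional: build (but do not print) a tile plot if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(as.data.frame(counts),
                       ggplot2::aes(x = x_bin, y = y_bin, fill = Freq)) +
    ggplot2::geom_tile()
}
```

With real data, finer bins or a smoothed density would give a more readable picture; the coarse bins here only keep the counts easy to check by hand.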
I then grabbed Gallo's whiff heat map from Baseball Savant for his 2015-2016 seasons in the MLB, and surprise surprise, his whiffs occurred in largely the same location in the zone.
Despite the bias around the edges of the zone, we can use MiLB pitch data to look at other significant trends, and explore pitchers and hitters in the minors to see how they might perform in MLB. We're dealing with some messed up data, but it's not damaged beyond repair.