Sunday, July 8, 2018

Effective Chase Score

Swinging at pitches outside the zone is generally bad. After all, players are essentially giving up free balls in exchange for either a strike or a poorly hit ball. But hey, if you can put the ball in play, it's not the worst thing in the world. With this in mind, yesterday, I looked at which hitters were best at avoiding chasing those pitches, while making contact on said pitches.
I decided to refine this methodology further and look at players' ability to chase effectively, meaning that they (A) don't chase frequently, (B) make contact on the pitches they chase, and (C) make quality contact on the pitches they chase.

The three components I incorporated were 1 - O-Swing% (how frequently players did not swing at outside pitches), O-Contact% (how frequently players make contact on their swings outside the zone), and xwOBA on out-of-zone pitches (the quality of contact made on pitches outside the zone). After pulling all of these figures from Baseball Savant for players with 1000+ pitches this season, I calculated each player's z-score for each metric, then added the three z-scores together. I called the end result the "Effective Chase Score".
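The calculation above can be sketched roughly as follows. This is a minimal illustration, not the exact spreadsheet logic: the field names (`o_swing`, `o_contact`, `o_xwoba`), the sample hitters, and the choice of population standard deviation for the z-scores are all assumptions.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; the original may have used sample SD
    return [(v - mu) / sigma for v in values]


def effective_chase_scores(players):
    """Sum the z-scores of (1 - O-Swing%), O-Contact%, and out-of-zone xwOBA.

    `players` is a list of dicts with the hypothetical keys
    name, o_swing, o_contact, and o_xwoba (rates as decimals).
    """
    no_chase = z_scores([1 - p["o_swing"] for p in players])
    contact = z_scores([p["o_contact"] for p in players])
    quality = z_scores([p["o_xwoba"] for p in players])
    return {
        p["name"]: a + b + c
        for p, a, b, c in zip(players, no_chase, contact, quality)
    }


# Made-up inputs purely for illustration; these are not real player figures.
sample = [
    {"name": "Hitter A", "o_swing": 0.18, "o_contact": 0.75, "o_xwoba": 0.330},
    {"name": "Hitter B", "o_swing": 0.30, "o_contact": 0.62, "o_xwoba": 0.280},
    {"name": "Hitter C", "o_swing": 0.45, "o_contact": 0.55, "o_xwoba": 0.240},
]
scores = effective_chase_scores(sample)
```

Because each component is standardized before summing, the three metrics carry equal weight regardless of their original scales.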

Here are 2018's leaders in Effective Chase Score.

Player               Effective Chase Score
Joey Votto           6.85
Mookie Betts         5.43
Brett Gardner        4.45
Alex Bregman         4.42
Nick Markakis        4.23
Jesse Winker         4.20
Ben Zobrist          3.61
Andrew Benintendi    3.56
Aaron Hicks          3.43
Jose Ramirez         3.40
Andrelton Simmons    3.36
Travis Shaw          3.22
Carlos Santana       3.15
Mike Trout           3.01
Shin-Soo Choo        2.96
Lorenzo Cain         2.87
Matt Chapman         2.82
Buster Posey         2.81
Ian Kinsler          2.80
Denard Span          2.70

As we would expect, Votto is miles ahead of everyone else in effective chasing - in addition to having an extremely low chase rate, Votto makes contact frequently on his outside swings and makes extremely effective contact on outside pitches.

Here are the worst batters by the same metric.

Player                 Effective Chase Score
Freddy Galvis          -2.54
Michael A. Taylor      -2.65
Kevin Pillar           -2.76
Joey Gallo             -2.83
Chris Davis            -2.84
Carlos Gomez           -2.84
Odubel Herrera         -2.85
Eduardo Escobar        -3.06
Robinson Chirinos      -3.11
Teoscar Hernandez      -3.61
Adam Jones             -3.62
Tim Anderson           -3.62
Giancarlo Stanton      -3.63
Luis Valbuena          -3.67
JaCoby Jones           -4.07
Nicholas Castellanos   -4.13
Jonathan Schoop        -4.16
Lewis Brinson          -4.76
Ryon Healy             -4.82
Javier Baez            -5.65

There are a lot of free swingers here, including Gomez, Davis, Gallo, etc. Baez, however, is almost as bad as Votto is good - Baez has the worst O-Swing% by 5.5 percentage points (Baez is at 46.0%; second-worst is Kevin Pillar at 40.5%), a bottom-tier O-Contact%, and just a .237 xwOBA on outside pitches.

To view the full list of hitters with 1000+ pitches faced, see the spreadsheet I published below.

Friday, July 6, 2018

MiLB Statcast Project Part Five: Next Steps

What's next for MiLB batted ball data? Clearly, there are issues with it, thanks to biased stringers, but there's also a wealth of valuable information in here.

Having already calculated launch angle, it seems logical that the next step would be to calculate exit velocity. It would seem as though some relationship between hit distance (calculated using the home plate location found in part three and the coordinates of the batted balls) and launch angle would yield an approximation for exit velocity, and indeed, such a relationship appears to exist at the major league level.

Despite this, using the model that I reverse engineered from Statcast and correcting for differences in hit-tracking between the stringers and the MiLB, I found that such a model was grossly inaccurate at the minor league level. Shown below are MiLB hitters with at least 200 BIP in 2016 and 200+ BIP in the majors in 2017.

Perhaps the depth of the batted ball locations is inaccurate, or perhaps the model itself has issues. I think this is a difficult challenge because we're trying to measure the size of an intangible object using its shadow - it's not as simple as plugging the values into Excel's equation solver, as we need to have a method behind our model. I think of this challenge as a WIP, and I hope to update this post with a solution soon, but for now I have no clear way of estimating MiLB exit velocity.

Still, the rest of the data that we're working with appears solid and powerful. I've already revealed a couple of the functions that I've been using, and I hope to develop an R package for all of these functions, including heatmaps, splits, date ranges, a built-in R scraper, and more. I hope to keep y'all posted on this later this summer.

Thank you for reading this series! I hope this was insightful or at least entertaining. In my opinion, not enough public analysts are using MiLB data, and while it's certainly rough around the edges, there's still valuable information to be gleaned from it.

Wednesday, July 4, 2018

MiLB Statcast Project Part Four: Reverse Engineering Launch Angle

In the previous two parts, we focused heavily on creating visualizations of MiLB pitch and batted ball data, but we did not use our data to create any workable numbers or analogs for MiLB players. I consider it arbitrary to do things like calculate batting average or slugging percentage; what I'm after are metrics like the ones available from Statcast, such as launch angle and exit velocity. We have neither of these values available to us in any form with minor league data, but we can approximate them. This section will focus on launch angle for MiLB hitters.

While we do not have launch angle available in any form for MiLB hitters, we do have limited batted ball classification data - stringers will manually tabulate which balls are ground balls, which are fly balls, which are line drives, and which are pop-flies. While the stringers do not operate with anything close to the precision of BIS's ball classification system, it still gives a rough idea of players' batted ball tendencies. 

For example, let's say we want to know how frequently Ozzie Albies hit fly balls in 2017 in AAA. There are two ways of calculating fly ball rates - FanGraphs includes pop-ups in their calculation of FB%, but Baseball Savant does not. We'll calculate both for completeness.
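As a rough sketch of the two definitions, assuming each batted ball carries one of four stringer labels (the label strings and the sample counts below are hypothetical, not Albies' actual data):

```python
from collections import Counter


def fly_ball_rates(batted_ball_types):
    """Return fly ball rate with and without pop-ups counted as fly balls.

    `batted_ball_types` is a list of labels; the label names used here
    ("fly_ball", "popup", etc.) are placeholders for whatever the data uses.
    """
    counts = Counter(batted_ball_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no batted balls supplied")
    fly = counts["fly_ball"]
    pop = counts["popup"]
    return {
        # FanGraphs-style: pop-ups are included in FB%
        "fb_pct_with_popups": (fly + pop) / total,
        # Baseball Savant-style: pop-ups are their own category
        "fb_pct_without_popups": fly / total,
    }


# Made-up batted ball log for illustration only.
sample_log = (
    ["ground_ball"] * 45 + ["line_drive"] * 20 + ["fly_ball"] * 28 + ["popup"] * 7
)
rates = fly_ball_rates(sample_log)
```

The only difference between the two figures is whether pop-ups land in the numerator, which is why the FanGraphs-style rate is always at least as large as the Savant-style one.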

FanGraphs has limited MiLB batted ball data from STATS, and their figure for Albies in AAA in 2017 is fairly consistent with what we calculated from our dataset (37.9% from FanGraphs compared to 38.4% from our dataset). Albies hit fly balls in the MLB in 2017 at a 40.3% rate according to FanGraphs, and at a 32.1% rate according to Baseball Savant, so our minor league figures appear fairly accurate given that Albies' fly ball rate was consistent with his measured values both in MiLB play and in the majors.

So how can we extrapolate launch angle from this? Launch angle plays a large part in batted ball classification. We can use batted ball tendencies to reverse engineer launch angle at the MLB level, and apply that to MiLB data. Using my personal Statcast DB, I found the average launch angle for each batted ball classification for all batted balls ever recorded by Statcast.

BB Type        Avg. Launch Angle (degrees)
Fly Ball       36.646
Line Drive     16.756
Ground Ball    -12.553

If we treat each batted ball as having been hit at the average launch angle for its classification, we can theoretically get a solid estimate of a hitter's average launch angle from batted ball classifications alone. I pulled 2017 hitters with at least 200 PA and compared their estimated launch angle from batted ball classification to their actual launch angle, and the results were extremely promising. To clarify, the exact equation used was:

Our R-squared value is .93, indicating that our formula does an excellent job of estimating launch angle solely from batted ball data - not surprising considering that Baseball Savant likely uses launch angle as a major factor in classifying batted balls.

If we re-scale our values to get a 1:1 relationship, we have a fairly strong model for estimating launch angle from batted ball classification.

As strong as the correlation is between xLA and LA, our RMSE is a bit weak. In looking at the relationship between residual values and batted ball frequencies, it looks like we're introducing a bit of error with our POP% value.

I found that my RMSE was minimized at a pop-fly coefficient of about 60.65 - my guess is that since Statcast has difficulties tracking some balls at extreme launch angles, the true pop-fly angle is skewed upward.
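Putting the pieces together, the weighted-average idea can be sketched like this. It uses the average launch angles from the table above and the ~60.65 pop-fly coefficient, but it leaves out the re-scaling step, whose parameters aren't given here, so treat it as an illustration of the approach rather than the exact model. The dictionary keys and sample rates are hypothetical.

```python
# Average launch angle (degrees) per batted ball type, from the table above,
# plus the RMSE-minimizing pop-fly coefficient of about 60.65.
AVG_LAUNCH_ANGLE = {
    "fly_ball": 36.646,
    "line_drive": 16.756,
    "ground_ball": -12.553,
    "popup": 60.65,
}


def estimated_launch_angle(rates):
    """Weighted average of per-type launch angles.

    `rates` maps batted ball type to its share of all batted balls
    (decimals that should sum to roughly 1). No re-scaling is applied.
    """
    return sum(
        angle * rates.get(bb_type, 0.0)
        for bb_type, angle in AVG_LAUNCH_ANGLE.items()
    )


# Made-up batted ball profile for illustration only.
sample_rates = {
    "ground_ball": 0.45,
    "line_drive": 0.20,
    "fly_ball": 0.28,
    "popup": 0.07,
}
xla = estimated_launch_angle(sample_rates)
```

A ground-ball-heavy profile pulls the estimate down and a fly-ball-heavy one pushes it up, which is the behavior the model relies on.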

We've marginally improved our RMSE and r-squared with our model. I think there are probably some bigger steps we could take to improve the model's accuracy, but at the moment, I think our r-squared value is superb, and our RMSE value is acceptable as a model of launch angle.

Armed with our model, we are now prepared to determine MiLB launch angle from the batted ball data found in our dataset.

This somewhat-intimidating wall of code grabs the batted ball classifications and calculates an estimated launch angle from them. In our csv, we now have estimated launch angle values for minor league hitters in 2016 and 2017! Of course, we need to check ourselves - how accurate are these launch angle values?

To determine the accuracy of our results, we'll compare year n to year n+1 correlation. I pulled hitters who registered 200+ balls in play in 2016 and 2017 (210 of them), and found a correlation between 2016's launch angle and 2017's launch angle of .6606, so this is our benchmark.
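For reference, a year n to year n+1 check like this boils down to matching hitters across two seasons and computing a Pearson correlation. The sketch below assumes each season is a mapping from a unique player ID to average launch angle; the IDs and values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def year_to_year_r(year_n, year_n1):
    """Correlate a stat across seasons for players who appear in both."""
    shared = sorted(set(year_n) & set(year_n1))
    return pearson_r(
        [year_n[pid] for pid in shared],
        [year_n1[pid] for pid in shared],
    )


# Hypothetical player IDs and launch angles, for illustration only.
la_2016 = {"p001": 8.0, "p002": 12.5, "p003": 15.0, "p004": 10.0}
la_2017 = {"p001": 9.0, "p002": 11.0, "p003": 16.5, "p005": 7.0}
r = year_to_year_r(la_2016, la_2017)
```

Keying on a unique ID rather than a player's name avoids accidentally merging two different players who share a name.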

We're not going to compare players with 200+ BIP in the minors from 2016 to players with 200+ BIP in the minors from 2017 - it just tells us the correlation between our measured values of FB%, GB%, LD%, and POP% in a rougher form. Instead, we're going to compare hitters with 200+ BIP in the minors from 2016 to hitters with 200+ BIP in the majors from 2017 - in this sense, we're looking at how well MiLB launch angle predicts MLB launch angle.

After pulling these values, I only found 27 hitters who registered both 200+ BIP in AAA in 2016 and 200+ BIP in the MLB in 2017, which was a bit of a disappointment. Still - the r-squared value between these hitters' estimated launch angle from their 2016 MiLB campaign and their launch angle from their 2017 MLB campaign was .7274. Because we're dealing with fewer hitters (27 versus 210) and because we're dealing with consistent young hitters (there's no decline due to age or dramatic changes in LA, unlike in our MLB dataset), our r-squared looks better than our benchmark of .6606, but I don't think for a second that our xLA is somehow a better predictor of launch angle than the previous year's launch angle. (EDIT: it also might have something to do with the fact that I accidentally included multiple Jose Martinezes here)

Still, xLA appears to have undeniable predictive value.

We have a reasonable predictor of launch angle using minor league data! If we want to compare MiLB launch angles to MLB launch angles to draw comparisons between hitters, we now have that ability, and we can be reasonably confident in our ability to do so.

Monday, July 2, 2018

MiLB Statcast Project Part Three: Cleaning up and visualizing batted ball data

In our previous section, we looked at the issues involved with minor league pitch placement data, strategies for cleaning and visualizing the data, and then compared that data to MLB data. In this section, we'll grab MiLB hit data, and use similar strategies for cleaning and visualizing that data.

Looking at our data, we can see that our batted ball data-set has issues similar to the ones we found in our pitching data-set.

The issues with the batted ball data-set are as follows:

  1. There exists bias in the way that batted balls are grouped - batted balls are clustered around where fielders play, especially in the outfield.
  2. The units of the x and y coordinates are not immediately apparent.
  3. The field's dimensions are not cleanly defined.
We have no realistic way of fixing the bias in clustering, but we can address problems 2 and 3.

Let's start by discussing the units. When stringers are tracking a game, they place each batted ball on a 250x250 pixel map of the field. Where they click is then recorded, in pixels, as the location of the hit. We have to determine a realistic scale from pixels to a real-world unit in order to calculate things like hit distance.

So then, let's try to establish some concrete markers for scale. If we look at the 10th lowest y-value for ground balls for a stadium, we get a rough idea of where home-plate is.

If we move that intercept down slightly and plot the median x-value of all batted balls, we should find the tip of the baseball "diamond".
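A minimal sketch of that home plate estimate, assuming each batted ball is a dict with `x`, `y`, and `type` keys (hypothetical names) and that y increases toward the outfield. The size of the downward nudge isn't specified above, so it is exposed as a `y_shift` parameter here.

```python
from statistics import median


def estimate_home_plate(batted_balls, rank=10, y_shift=0.0):
    """Estimate the (x, y) pixel location of home plate.

    y comes from the `rank`-th lowest ground ball y-value, moved down by
    `y_shift` pixels; x comes from the median x of all batted balls.
    """
    ground_ball_ys = sorted(
        ball["y"] for ball in batted_balls if ball["type"] == "ground_ball"
    )
    if len(ground_ball_ys) < rank:
        raise ValueError("not enough ground balls to take the requested rank")
    home_y = ground_ball_ys[rank - 1] - y_shift
    home_x = median(ball["x"] for ball in batted_balls)
    return home_x, home_y


# Synthetic batted balls for illustration only.
balls = [{"x": 110 + i, "y": 40 + i, "type": "ground_ball"} for i in range(31)]
balls += [{"x": 110 + i, "y": 150 + i, "type": "fly_ball"} for i in range(31)]
home = estimate_home_plate(balls, y_shift=2.0)
```

Using the 10th-lowest value instead of the minimum keeps a handful of badly misplaced clicks from dragging the estimate off the field.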

From here, we can construct foul-lines knowing that a baseball field is constructed with a 90 degree angle between the lines. As long as the field is not rotated beyond what we've already done, we can simply construct perpendicular lines from the tip of our diamond outward.
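Geometrically, two perpendicular lines meeting at home plate on an unrotated field each sit 45 degrees off the line to center field. Here is a rough sketch of that construction, assuming y increases toward center field and using a hypothetical home plate location and line length in pixels:

```python
from math import cos, radians, sin


def foul_line_endpoints(home_x, home_y, length):
    """Endpoints of the left and right foul lines, `length` pixels long.

    Each line runs 45 degrees to either side of straight-away center,
    so the two lines meet at a 90 degree angle at home plate.
    """
    dx = length * sin(radians(45))
    dy = length * cos(radians(45))
    left = (home_x - dx, home_y + dy)
    right = (home_x + dx, home_y + dy)
    return left, right


def is_fair(x, y, home_x, home_y):
    """True if a point lies on or between the two foul lines."""
    return (y - home_y) >= abs(x - home_x)


# Hypothetical home plate at (125, 50) and 140-pixel foul lines.
left_end, right_end = foul_line_endpoints(125, 50, 140)
```

The same `is_fair` test doubles as a quick sanity filter for batted balls that were recorded well into foul territory.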

To determine the dimensions of our park (to both plot the outfield lines and to figure out the scale of pixels to park), let's look at the placement of home runs in the park.

There are a surprising number of misplaced HR balls - a bunch of long balls never left the infield, according to our data. We'll filter them out. Then, we'll plot a line of best fit along the outfield wall.

This looks like a decent approximation of Coca-Cola Field's outfield wall. If we shift all the values downward and mess with the colors a bit, we have a decent approximation of what Coca-Cola Field looks like.

The wall is slightly below almost all home runs. Coca-Cola Field does not have a perfectly round wall, but this approximation gives a good visualization, and looks useful for spray charts. And we can finally approximate the pixel to feet conversion factor! Home plate is at ~50 pixels, and the centerfield wall is at ~210, which gives a distance of 160 pixels. In real life, Coca-Cola Field measures ~400' from home to centerfield, so our conversion factor is 400/160 = 2.5 feet per pixel. It's ~140 pixels down the left and right field lines, which works out to 350' down the lines. Coca-Cola Field is actually 325' down both lines, but the field itself curves inwards quite a bit. Our approximation can't capture that curve, but the values line up reasonably well.
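The conversion and the resulting hit distance can be sketched as below. The pixel values and the 400' figure come from the paragraph above; the home plate x-coordinate of 125 in the example calls is an assumption (the middle of a 250-pixel-wide map), as are the function and constant names.

```python
from math import hypot

HOME_PLATE_Y_PX = 50   # approximate home plate location, in pixels
CF_WALL_Y_PX = 210     # approximate centerfield wall location, in pixels
CF_DISTANCE_FT = 400   # approximate real-world distance to centerfield

# 400 feet over 160 pixels gives 2.5 feet per pixel.
FEET_PER_PIXEL = CF_DISTANCE_FT / (CF_WALL_Y_PX - HOME_PLATE_Y_PX)


def hit_distance_ft(x, y, home_x, home_y, feet_per_pixel=FEET_PER_PIXEL):
    """Straight-line distance from home plate to a batted ball, in feet."""
    return hypot(x - home_x, y - home_y) * feet_per_pixel


# A ball recorded at the centerfield wall, with home plate assumed at x = 125.
to_center = hit_distance_ft(125, CF_WALL_Y_PX, 125, HOME_PLATE_Y_PX)

# 140 pixels down a foul line converts to 350 feet.
down_the_line = 140 * FEET_PER_PIXEL
```

Since the factor is derived from a single park, it is worth re-deriving it per stadium rather than assuming 2.5 holds everywhere.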

With all of this implemented, let's turn this into a function!

I've overlaid Rhys Hoskins' 2017 batted balls over Coca-Cola Park (no relation to Coca-Cola Field). Our estimations of power look fairly accurate - Rhys has ~25 HR by my count on this chart, when he recorded 29 total in 2017 in AAA. Not bad for completely estimating the outfield wall as a semi-circle.

But what's more important is the information that the chart presents - from this spray chart alone, it's apparent that Hoskins hits a lot of ground balls to the right side of the infield, making him an excellent shift target. He also has substantial pull power.

We can glean this information from a scouting report, but it's important to have a visual confirmation of what's reported, and we can also pick up on systematic changes in approach. We can go deeper in terms of visualizing prospects and MiLB players.