Monday, July 2, 2018

MiLB Statcast Project Part Three: Cleaning up and visualizing batted ball data

In our previous section, we looked at the issues involved with minor league pitch placement data, strategies for cleaning and visualizing the data, and then compared that data to MLB data. In this section, we'll grab MiLB hit data, and use similar strategies for cleaning and visualizing that data.

Looking at our data, we can see that the batted ball data-set has issues similar to those in our pitching data-set.

The issues with the batted ball data-set are as follows:

  1. There exists bias in the way that batted balls are grouped - batted balls are clustered around where fielders play, especially in the outfield.
  2. The units of the x and y coordinates are not immediately apparent.
  3. The field's dimensions are not cleanly defined.

We have no realistic way to correct the clustering bias, but we can address problems 2 and 3.

Let's start by discussing the units. When stringers are tracking a game, they place each batted ball by clicking on a 250x250 pixel map of the field. The position of that click is then recorded in pixels as the location of the hit. We have to determine a realistic scale from pixels to a real-world unit in order to calculate factors like hit distance.

So then, let's try to establish some concrete markers for scale. If we look at the 10th lowest y-value for ground balls at a given stadium, we get a rough idea of where home plate is.

If we move that intercept down slightly and plot the median x-value of all batted balls, we should find the tip of the baseball "diamond".
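
As a rough sketch of those two steps in Python, something like the following could work. It assumes each batted ball is a dict with hypothetical `x`, `y`, and `type` keys (the real column names may differ) and that y increases toward the outfield. The `rank` and `y_offset` parameters are illustrative knobs, not values taken from our data.

```python
from statistics import median


def estimate_home_plate(batted_balls, rank=10, y_offset=0.0):
    """Estimate home plate as (x, y) in pixel coordinates.

    batted_balls: list of dicts with hypothetical keys 'x', 'y', 'type'.
    rank: which lowest ground-ball y-value to use (10 -> 10th lowest).
    y_offset: how far to nudge the estimate down (assumed, tune by eye).
    """
    ground_ys = sorted(
        b["y"] for b in batted_balls if b["type"] == "ground_ball"
    )
    if len(ground_ys) < rank:
        raise ValueError("not enough ground balls to take that rank")

    # The rank-th lowest ground ball skips a few stray low clicks.
    home_y = ground_ys[rank - 1] - y_offset

    # The median x of all batted balls approximates the center line.
    home_x = median(b["x"] for b in batted_balls)
    return home_x, home_y
```

Using a rank instead of the minimum keeps a handful of badly placed clicks from dragging the estimate below the real plate.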

From here, we can construct foul-lines knowing that a baseball field is constructed with a 90 degree angle between the lines. As long as the field is not rotated beyond what we've already done, we can simply construct perpendicular lines from the tip of our diamond outward.
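
Since the two foul lines meet at 90 degrees and the field is symmetric about the center line, each one sits 45 degrees off that line, so in pixel space they have slopes of +1 and -1. Here is a minimal sketch of that geometry, assuming y increases toward the outfield and that `home_x` and `home_y` are the tip of the diamond (the function names are made up for illustration):

```python
def foul_line_y(x, home_x, home_y):
    """Height of the foul line at horizontal position x.

    Both lines leave home plate at 45 degrees from the center line,
    so together they form the V-shape y = home_y + |x - home_x|.
    """
    return home_y + abs(x - home_x)


def is_fair(x, y, home_x, home_y):
    """True if the point (x, y) lies on or between the foul lines."""
    return y >= foul_line_y(x, home_x, home_y)
```

A check like `is_fair` is also a cheap way to flag batted balls that were recorded in foul territory.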

To determine the dimensions of our park (to both plot the outfield lines and to figure out the scale of pixels to park), let's look at the placement of home runs in the park.

There are a surprising number of misplaced HR balls - a bunch of long balls never left the infield, according to our data. We'll filter them out. Then, we'll plot a line of best fit along the outfield wall.
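
To give an idea of the filtering step, here is a simplified sketch. It is not the line of best fit itself: as a stand-in, it treats the wall as an arc centered on home plate and takes a low quantile of the remaining home run distances as the radius. The cutoff `min_px` and the `quantile` default are assumptions that would need tuning per park.

```python
import math


def filter_misplaced(home_runs, home_x, home_y, min_px):
    """Drop home runs recorded closer than min_px pixels to home plate.

    home_runs: list of (x, y) pixel coordinates.
    min_px: assumed cutoff; anything shorter cannot have left the park.
    """
    return [
        (x, y)
        for x, y in home_runs
        if math.hypot(x - home_x, y - home_y) >= min_px
    ]


def estimate_wall_radius(home_runs, home_x, home_y, quantile=0.05):
    """Rough wall radius in pixels: a low quantile of HR distances.

    Home runs land beyond the wall, so a low quantile of their
    distances sits just under most of them.
    """
    distances = sorted(
        math.hypot(x - home_x, y - home_y) for x, y in home_runs
    )
    if not distances:
        raise ValueError("no home runs left after filtering")
    index = int(quantile * (len(distances) - 1))
    return distances[index]
```

Filtering first matters here, because a single home run placed in the infield would otherwise pull the low quantile far too short.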

This looks like a decent approximation of Coca-Cola Field's outfield wall. If we shift all the values downward and mess with the colors a bit, we have a decent approximation of what Coca-Cola Field looks like.

The wall is slightly below almost all home runs. Coca-Cola Field does not have a perfectly round wall, but this approximation gives a good visualization, and looks useful for spray charts. And we can finally approximate the pixel to feet conversion factor! Home plate is at ~50 pixels, and the centerfield wall is at ~210, which gives a pixel distance of 160 pixels. In real life, Coca-Cola Field measures ~400' from home to centerfield, so our conversion factor is 400/160 = 2.5 feet per pixel. Our foul lines run ~140 pixels, which converts to 350' down each line. Coca-Cola Field is actually 325' down both lines, but the field itself curves inwards quite a bit. Our approximation can't reproduce that curve, but the values otherwise line up quite well.
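
With the conversion factor in hand, hit distance is just the pixel distance from home plate times 2.5. Here is a small sketch of that arithmetic. The function name is made up, and the `home_x` of 125 used in the example below is an assumption (the middle of the 250 pixel map), not a measured value.

```python
import math

# ~400 ft to center field over ~160 px, per the measurements above.
FEET_PER_PIXEL = 400 / 160  # 2.5 ft per pixel


def hit_distance_feet(x, y, home_x, home_y, feet_per_pixel=FEET_PER_PIXEL):
    """Straight-line distance from home plate to a hit, in feet."""
    pixel_distance = math.hypot(x - home_x, y - home_y)
    return pixel_distance * feet_per_pixel
```

For a ball placed at the centerfield wall, `hit_distance_feet(125, 210, 125, 50)` covers 160 pixels and comes out to 400 feet, matching the figure we started from.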

With all of this implemented, let's turn this into a function!

I've overlaid Rhys Hoskins' 2017 batted balls over Coca-Cola Park (no relation to Coca-Cola Field). Our estimations of power look fairly accurate - Rhys has ~25 HR by my count on this chart, while he recorded 29 total in AAA in 2017. Not bad for completely estimating the outfield wall as a semi-circle.

But what's more important is the information that the chart presents - from this spray chart alone, it's apparent that Hoskins hits a lot of ground balls to the right side of the infield, making him an excellent shift target. He also has substantial pull power.

We can glean this information from a scouting report, but it's important to have a visual confirmation of what's reported, and we can also pick up on systematic changes in approach. We can go deeper in terms of visualizing prospects and MiLB players.
