Tuesday, May 1, 2018

The Complete Idiot's Guide To Playing With Statcast Data

In baseball analytics, arguably the most powerful dataset publicly available is Statcast data, which includes advanced tracking data on almost every baseball event since 2015, including exit velocity, pitch placement, launch angle, and others. But unlike FanGraphs or Baseball Reference, it can be tricky to play around with the data because Statcast's data doesn't have a glossary or a key - so staring at a sheet of raw Statcast data can feel awfully intimidating, even for someone who is otherwise comfortable with baseball data. So I've thrown together this comprehensive guide to playing with Statcast data - if you've ever wondered what vx0 and ax referred to, here's your chance to find out.

Part One: Scraping The Data

Obviously, in order to play around with Statcast data, we need some Statcast data. If you're familiar with R, Bill Petti put together an excellent tutorial on scraping Statcast data from its host site, Baseball Savant, using his baseballr package, but if you're not acquainted with R, you can just use the following URL to download Statcast data in CSV format for yourself.

https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea="year"%7C&hfSit=&player_type=pitcher&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt="start_date"&game_date_lt="end_date"&team=&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=details&

This link will fetch all games in year starting with start_date and ending with end_date, where start_date and end_date are formatted as "YYYY-MM-DD" (replace each quoted placeholder, quotation marks included, with your own value). Be warned that if you attempt to scrape data from a large span of time, say, a week, your CSV file will start to get extremely large and your computer may be unable to handle the file (at which point, it would probably behoove you to use R, or better, MySQL, to play around with your data).
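If you'd rather not edit the URL by hand, here is a minimal Python sketch that fills in the three placeholders. The template is copied from the link above; the function name build_statcast_url is my own invention, and the sketch assumes Baseball Savant still accepts these query parameters. It only builds the URL and does not download anything.

```python
from datetime import date

# Template copied from the URL above, with the quoted placeholders
# turned into format fields.
SAVANT_CSV_TEMPLATE = (
    "https://baseballsavant.mlb.com/statcast_search/csv?all=true"
    "&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C"
    "&hfC=&hfSea={year}%7C&hfSit=&player_type=pitcher&hfOuts=&opponent="
    "&pitcher_throws=&batter_stands=&hfSA="
    "&game_date_gt={start_date}&game_date_lt={end_date}"
    "&team=&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn="
    "&min_pitches=0&min_results=0&group_by=name&sort_col=pitches"
    "&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=details&"
)


def build_statcast_url(year: int, start_date: str, end_date: str) -> str:
    """Return the CSV download URL for the given season and date range."""
    # fromisoformat raises ValueError unless the string is YYYY-MM-DD.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date must not be earlier than start_date")
    return SAVANT_CSV_TEMPLATE.format(
        year=year, start_date=start.isoformat(), end_date=end.isoformat()
    )


url = build_statcast_url(2017, "2017-04-02", "2017-04-03")
print(url)
```

Keeping the date range to a few days per request, as warned above, keeps each CSV at a manageable size.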

Part Two: Understanding The Data

So we have a dataset! It might be as big as 10 seasons, or it might be as small as one game, but the dataset contained within is fairly standardized. So let's talk about what each datapoint means.

Understanding The Dimensions of Statcast Data

Before jumping into the data, it's important to understand the way Statcast views a baseball field. Statcast's data works by tracking the baseball as a point through three-dimensional space, and measuring its coordinates and characteristics as it passes through multiple planes. Statcast describes this movement using X, Y, and Z coordinates. The Y-axis runs between home plate and second base, the X-axis runs across the field parallel to the line between first base and third base, and the Z-axis is perpendicular to the plane of the field.


In playing with the data, you'll notice that it does not contain many Y-coordinates in terms of positional data. Statcast usually represents the ball as moving through a series of X-Z planes, each of which has a fixed Y value - for example, when discussing where a pitch crossed home plate, Statcast gives us the X and Z coordinates, but the Y coordinate is fixed - we are looking at where the ball intersects the X-Z plane at the front of home plate, which is constant regardless of batter.

Statcast also represents everything from the perspective of the catcher and umpire. Here's something from an article I just published on The Athletic - Baseball Savant generated a heatmap of pitches that Amed Rosario swung at. If we were to transpose this graphic into real life, we'd see the pitcher standing sixty feet away from us, and Rosario standing in the batter's box off to the left.

  

Okay! Let's get into the data headers. Here's what's in each column - I've grouped the columns by the type of data that they represent. If you're interested in knowing what the values of a given column represent, press "Ctrl+F" and search for the column header - the exact order of the columns might vary, so I've grouped them in a way that makes some kind of sense.

Game Situation Information

game_pk: game_pk is the unique key used by the MLB to represent individual games, represented as a six-digit numerical code.

game_date: This is a representation of the day that the game took place on in YYYY-MM-DD format.

game_year: This is a representation of the year of the game in YYYY format.

game_type: This field represents what type of game is being played with a single letter code. This should almost always be "R", and my database contains only regular-season data for convenience's sake, but if you're going to incorporate every game under the sun in there, make sure that you throw in a "where game_type = 'R'" clause into every query. 
  • R = Regular season game
  • S = Spring training game
  • F = Wild Card game
  • D = Division Series game
  • L = League Championship Series game
  • W = World Series game
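If you're working with the raw CSV rather than a database, the same regular-season filter can be sketched in Python with the standard csv module. The sample rows below are made up for illustration, and regular_season_rows is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows; a real download has many more columns.
SAMPLE_CSV = """game_pk,game_type,events
490100,R,single
490101,S,strikeout
490102,R,
"""


def regular_season_rows(csv_text):
    """Return only the rows whose game_type is 'R' (regular season)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["game_type"] == "R"]


rows = regular_season_rows(SAMPLE_CSV)
print([row["game_pk"] for row in rows])
```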
player_name: Depending on how you got your dataset, player_name might represent the name of the pitcher throwing the pitch or the name of the batter at the plate. Generally, it represents the pitcher's name, and it should be fairly obvious.

batter, pitcher, on_1b, on_2b, on_3b, posX_person_id: These fields might look like gobbledygook, but they're ID representations of the players filling each role. Like FanGraphs and Baseball Reference, the MLB gives every player a unique identifier, and in place of using names (which are not necessarily unique), Statcast uses these keys, which consist of a six-digit code, to track players on the field. For example, Baseball Reference uses the key "alcanar01" to refer to Arismendy Alcantara, FanGraphs uses "10711", and the MLB uses "570489".

batter and pitcher represent the IDs of the players batting and on the mound, respectively; on_1b, on_2b, and on_3b represent the IDs of the players on base, and posX_person_id (where X is some number) represents the person playing the field at position X during the pitch, using the scoring guidelines (so pos6_person_id refers to the player ID of the shortstop).

MLB does not make a list of their IDs available, but it's fairly simple to scrape their player list to grab a list of players. You can use this to assign names to the players involved on each pitch, though you should certainly be using the player IDs to group events.
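Here is a rough sketch of that idea in Python: group by the player ID, and only attach a name at the end for display. The lookup table is hand-built from the example above (you would fill it from whatever player list you scrape), and the second ID is made up.

```python
from collections import Counter

# Hand-built lookup; 570489 is the MLB ID quoted above for Alcantara.
PLAYER_NAMES = {570489: "Arismendy Alcantara"}

# Made-up pitch rows; 999999 is a fictional ID with no name on file.
pitches = [
    {"batter": "570489", "events": "single"},
    {"batter": "570489", "events": ""},
    {"batter": "999999", "events": "walk"},
]


def pitches_seen_by_batter(rows, names):
    """Count pitches per batter ID, then label each ID with a name."""
    counts = Counter(int(row["batter"]) for row in rows)
    return [
        (player_id, names.get(player_id, "unknown"), n)
        for player_id, n in counts.most_common()
    ]


summary = pitches_seen_by_batter(pitches, PLAYER_NAMES)
print(summary)
```

Grouping on the ID first means two players who share a name never get merged by accident.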

stand: What side of the plate the batter is standing on. L is "left", and R is "right".

p_throws: What arm a pitcher uses to throw the ball. L is "left", and R is "right".

home_team, away_team: These letter codes are the abbreviations that the MLB uses to refer to the home and away teams. While the MLB tends to use codes that abbreviate a team's city name and represent which league they belong to (for example the New York Mets as NYN - New York National League), Statcast's codes are the more traditional abbreviations - for example, the San Diego Padres are simply "SD", and the Mets are "NYM".

inning: The inning number where the pitch was thrown.

inning_topbot: This field represents whether it is the top or bottom of the inning when a pitch is thrown. "Top" means that it's the top of the inning, "Bot" means that it's the bottom of the inning.

balls, strikes: These fields represent the counts of balls and strikes immediately before a pitch is thrown.

outs_when_up: The number of outs immediately before a pitch is thrown - either 0, 1, or 2.

home_score, away_score, bat_score, fld_score: These are the scores on each side immediately before a given pitch is thrown. home_score and away_score are the scores of the home and away teams, respectively, and bat_score/fld_score are the scores of the batting team and the fielding team, respectively.

post_home_score, post_away_score, post_bat_score, post_fld_score: These are essentially the same as the above fields, but they represent the scores after the pitch. These can be used in calculating run expectancy and run values.
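As a small illustration, the runs that scored on a given pitch are just the difference between the "post" score and the "pre" score for the batting team. This sketch assumes the scores arrive as strings from a CSV (possibly with a trailing ".0"); runs_scored_on_pitch is a hypothetical helper name and the sample row is made up.

```python
def runs_scored_on_pitch(row):
    """Runs the batting team scored on this pitch."""
    # float() first so that values such as "3.0" parse cleanly.
    before = int(float(row["bat_score"]))
    after = int(float(row["post_bat_score"]))
    return after - before


# Made-up row: the batting team goes from 2 runs to 4 on this pitch.
example_row = {"bat_score": "2", "post_bat_score": "4.0"}
runs = runs_scored_on_pitch(example_row)
print(runs)
```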

at_bat_number: This field is the chronological number of the plate appearance within the game, counted across both teams. The first batter of the game is 1, the second is 2, and so on.

pitch_number: This is the chronological order of pitches within a plate appearance. The first pitch to a batter is 1, the second is 2, and so on, and the count resets with each new batter.

sz_top, sz_bot: While an umpire's strike zone might vary dramatically from pitch to pitch, Statcast establishes a standard strike zone - the left and right sides of the strike zone are .71 feet away from the center of home plate, the bottom of the strike zone (sz_bot) is at the bottom of the batter's knees, and the top of the strike zone (sz_top) is at the middle of his torso. Both values are heights above the ground, so sz_top by itself is the top of the zone and should not be added to sz_bot. Statcast calculates sz_bot and sz_top as a function of the batter's height. sz_bot and sz_top are both in feet.

Pitch Information

pitch_type, pitch_name: pitch_type is a two-letter code describing the type of pitch thrown, and pitch_name is the explicit name of the pitch. Statcast's classification algorithm is not perfect (my database has 36 instances of curveballs thrown 90+ MPH) but it's generally accurate.

release_speed: This is the speed at which the pitch in the event was thrown. Through 2016, Statcast used Pitchf/x data, adjusted to approximate out-of-hand velocity. Starting in 2017, Statcast began using its own system, which tracks pitch velocity out of hand (which turned into a lawsuit for the MLB, surprisingly). Release speed is measured in MPH.

release_pos_x, release_pos_z, release_pos_y: These coordinates track exactly where a pitch is released by the pitcher. release_pos_x is measured from the center of the rubber, so RHPs will have negative values, and LHPs will have positive values. release_pos_z is measured from the bottom of the rubber up. Submarine pitcher Kazuhisa Makita has an average release_pos_z of just 1.70 feet off the ground, whereas Tyler Anderson has an over-the-top delivery resulting in a release_pos_z of 6.30 feet. release_pos_y tracks the extension of the pitcher, measuring the distance from home plate at which a pitch is released. All of these values are in feet.

vx0, vy0, vz0: In physics, one can describe the motion of an object using the components of its velocity in three dimensions. These fields are the initial velocities of a pitch along each axis, where x, y, and z correspond to their given axis. Generally, vy0 is the largest in magnitude (and negative, since the ball travels toward home plate along the Y-axis), followed by either vx0 or vz0. These values are not in miles per hour, but instead in feet per second.
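To connect these components back to a familiar number, you can combine them into a single speed and convert feet per second to miles per hour. This is a minimal sketch; initial_speed_mph is my own helper name, and the result is the speed at the point where the components are measured, which may differ slightly from release_speed.

```python
import math

# 3600 seconds per hour divided by 5280 feet per mile.
FPS_TO_MPH = 3600 / 5280


def initial_speed_mph(vx0, vy0, vz0):
    """Magnitude of the velocity vector (ft/s), converted to MPH."""
    speed_fps = math.sqrt(vx0 ** 2 + vy0 ** 2 + vz0 ** 2)
    return speed_fps * FPS_TO_MPH


# Made-up pitch moving straight down the Y-axis at 132 ft/s.
speed = initial_speed_mph(0.0, -132.0, 0.0)
print(round(speed, 1))
```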

ax, ay, az: Pitches, however, don't travel in straight lines - each pitch is subject to generally three forces, which accelerate the pitch - gravity (which is seen in az), drag (which is seen in ax, ay, and az), and a Magnus effect (seen in ax, ay, and az). These values are in feet per second per second.
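If you treat these accelerations as constant over the flight of the pitch, basic kinematics gives you the time it takes the ball to reach the plate and where it is at any moment. The sketch below rests on assumptions that are not stated in this guide: that the initial values are reported at y = 50 feet, and that the front of the plate sits at y = 17/12 feet. Both function names are hypothetical.

```python
import math


def time_to_plane(vy0, ay, y_start=50.0, y_end=17.0 / 12.0):
    """Seconds for the ball to travel from y_start to y_end.

    Solves y_start + vy0*t + 0.5*ay*t**2 = y_end for the first positive t.
    The default y values are assumptions, not part of the data itself.
    """
    if vy0 >= 0:
        raise ValueError("vy0 should be negative (ball moving toward the plate)")
    distance = y_start - y_end
    if ay == 0:
        return -distance / vy0
    discriminant = vy0 ** 2 - 2 * ay * distance
    if discriminant < 0:
        raise ValueError("ball never reaches the target plane")
    return (-vy0 - math.sqrt(discriminant)) / ay


def position_at(t, p0, v0, a):
    """Position along one axis after t seconds of constant acceleration."""
    return p0 + v0 * t + 0.5 * a * t * t


# Made-up values: -130 ft/s toward the plate, slowed by 25 ft/s^2 of drag.
flight_time = time_to_plane(-130.0, 25.0)
print(round(flight_time, 3))
```

The same position_at helper works for the X and Z axes, using vx0 with ax or vz0 with az.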

effective_speed: Using a radar gun, it's easy to measure the speed of a given pitch, but to a batter's eye, a pitch might look much faster or slower than it actually is. effective_speed looks at how fast a pitch appears to be going as perceived by the batter. effective_speed is a function of the pitch's velocity and the pitcher's release extension, and is recorded in MPH.

release_spin_rate: The Magnus effect will cause pitches to act in strange ways, and it's the principal driver of pitch movement, especially in pitches like curveballs. To generate additional movement, pitchers generally want a high spin rate. Statcast tracks the spin rate of a pitch as it comes off the pitcher's hand, measured in RPM.

release_extension: Pitchers don't release the ball in the same place every time - instead of 60 feet and 6 inches away, pitchers might release the pitch much closer to home plate, resulting in a higher velocity as perceived by the batter. release_extension is measured in feet from the center of the rubber, and is measured along the y-axis - it's essentially the complement of release_pos_y, so the two should add up to roughly 60.5 feet.

pfx_x, pfx_z: These fields represent the movement of the pitch along the X axis and Z axis, in feet - that is, how far the ball has deviated from the path a spinless pitch would have taken by the time it reaches the plate, rather than the total distance it travels. A sweeping curveball will show a lot of movement along these axes compared to a pitch with little spin-induced break.

plate_x, plate_z: The horizontal and vertical coordinates of the pitch as it crosses the plane formed by the front of the plate. Like release_pos_x and release_pos_z, plate_x and plate_z are measured from the center of the plate and from the ground up, respectively - note that a pitch inside to an RHB will record a negative plate_x. For example, here are the pitch locations of pitches that Amed Rosario swung at in 2017 - plate_x is graphed along the bottom axis, and plate_z is graphed along the side axis. From this, we can observe that Rosario tends to swing at pitches down and away.
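Putting plate_x and plate_z together with sz_top and sz_bot gives a quick way to test whether a pitch crossed the plate inside the standard zone. This is a simplified sketch: it assumes sz_top and sz_bot are both heights above the ground in feet, uses the .71-foot half-width mentioned earlier, and ignores the radius of the ball. in_strike_zone is a hypothetical helper name.

```python
# Half the width of the zone, in feet, as described for sz_top/sz_bot.
HALF_PLATE_WIDTH_FT = 0.71


def in_strike_zone(plate_x, plate_z, sz_top, sz_bot):
    """True if the pitch location falls inside the rectangular zone."""
    inside_horizontally = abs(plate_x) <= HALF_PLATE_WIDTH_FT
    inside_vertically = sz_bot <= plate_z <= sz_top
    return inside_horizontally and inside_vertically


# Made-up pitch over the heart of the plate for a batter with a
# zone running from 1.6 to 3.4 feet.
middle_middle = in_strike_zone(0.0, 2.5, 3.4, 1.6)
print(middle_middle)
```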
 

zone: Baseball Savant handily classifies pitch location into a couple buckets. This field is an integer which represents a given bucket, shown below. Remember that these pitches are seen from the perspective of the catcher, so if you were interested in looking at which hitters hit best on fastballs up and in, you could look at how RHB hit fastballs in zone 1, or how LHB hit fastballs in zone 3.

 

Event Information

events: Statcast has a lot of codes to represent the different events that happen on-field during a play. It's possible that there are a couple events that have not occurred yet and aren't represented in the following list, but it's a fairly comprehensive list of plays.
  • batter_interference = A batter interferes with a catcher's attempt to throw out a runner, blocks a catcher from making a tag play at the plate, or hits the catcher on the backswing.
  • catcher_interference = The catcher's mitt touches a bat during the swing (not on the backswing).
  • caught_stealing_2b = A runner is thrown out attempting to steal 2B.
  • caught_stealing_3b = A runner is thrown out attempting to steal 3B.
  • caught_stealing_home = A runner is thrown out attempting to steal home plate.
  • double = A batter hits a ball that results in the batter advancing two bases.
  • double_play = A batter hits a ball that results in two outs.
  • field_error = A fielder makes an error in an attempt to field a batted ball.
  • fielders_choice = A batter advances to a base because of a fielder attempting to make a play on another runner.
  • fielders_choice_out = A batter advances to a base because of a fielder making an out on another runner.
  • force_out = A batter advances to a base because of a fielder making an out via force-out on another runner.
  • grounded_into_double_play = A batter hits a ground ball that results in two outs.
  • hit_by_pitch = A batter is hit by a pitched ball.
  • home_run = A batter circles the bases on a given play.
  • intent_walk = A batter is given first base intentionally by the opposing team.
  • null = No play was recorded on the pitch. A general strike or ball on a pitch is recorded with this value.
  • other_out = An unclassified out was recorded by the fielding team.
  • pickoff_1b = A pitcher attempts to throw out the runner at 1B on the basepaths between pitches.
  • pickoff_2b = A pitcher attempts to throw out the runner at 2B on the basepaths between pitches.
  • pickoff_3b = A pitcher attempts to throw out the runner at 3B on the basepaths between pitches.
  • pickoff_caught_stealing_2b = A runner is successfully picked off between 1B and 2B.
  • pickoff_caught_stealing_3b = A runner is successfully picked off between 2B and 3B.
  • pickoff_caught_stealing_home = A runner is successfully picked off between 3B and home plate.
  • run = A run scores on the pitch without the ball being put in play (i.e. a runner scoring on a passed ball or on a balk).
  • sac_bunt = A batter intentionally bunts the ball in an effort to get themselves out while advancing the runner.
  • sac_bunt_double_play = A batter intentionally bunts the ball in an effort to get themselves out while advancing the runner, but the play results in two outs.
  • sac_fly = A batter hits a fly ball out that results in a runner scoring.
  • sac_fly_double_play = A batter hits a fly ball out that results in the runner attempting to advance but being tagged out.
  • single = A batter hits a ball that results in the batter advancing one base.
  • strikeout = A batter records an out by registering three strikes in a PA.
  • strikeout_double_play = A batter strikes out and a runner on the bases runs into an out as well.
  • triple = A batter hits a ball that results in the batter advancing three bases.
  • walk = A batter takes four balls in a PA and advances to first base.
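Since most pitches carry no event at all, a quick tally of this column is a handy first sanity check on a new download. Here is a small sketch using a Counter; it assumes that "no event" shows up as either an empty string or the literal text "null" in your CSV, and the sample rows are made up.

```python
from collections import Counter


def count_events(rows):
    """Tally the events column, skipping pitches with no recorded event."""
    return Counter(
        row["events"] for row in rows if row["events"] not in ("", "null")
    )


# Made-up rows standing in for a parsed CSV.
sample_rows = [
    {"events": "single"},
    {"events": ""},
    {"events": "strikeout"},
    {"events": "null"},
    {"events": "single"},
]
event_counts = count_events(sample_rows)
print(event_counts.most_common())
```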
description: This is a more general textual description of the result of a pitch. The possible results are below:
  • automatic_ball = By the rules, a batter is automatically granted a ball.
  • ball = A pitch is placed outside of the strike zone and the batter does not swing at it.
  • blocked_ball = A pitch goes into the dirt but is successfully blocked from going past the backstop by the catcher.
  • called_strike = A pitch is placed in the strike zone and the batter does not swing at it.
  • foul = A batter makes contact with a pitch, but the ball does not fall into fair territory.
  • foul_bunt = A batter attempts to bunt a pitch but the ball does not land in fair territory.
  • foul_pitchout = A batter fouls off a pitch before a catcher can catch the ball and throw to a base in an attempt to throw out a runner at the base.
  • foul_tip = A batter makes glancing contact with a pitch, but the catcher is still able to catch the ball without moving from the backstop.
  • hit_by_pitch = A pitch makes contact with the batter, and he is awarded first base.
  • hit_into_play = A batter makes contact with a pitch and the ball lands in fair territory.
  • hit_into_play_no_out = A batter makes contact with a pitch and the ball lands in fair territory, while no out is recorded on the play.
  • hit_into_play_score = A batter makes contact with a pitch and the ball lands in fair territory, scoring a run.
  • intent_ball = The opposing team issues a walk to the batter, giving him first base.
  • missed_bunt = A batter attempts to bunt a pitch but is unable to make contact.
  • pitchout = After a pitch, a catcher throws to a base in an attempt to throw out a runner at the base.
  • pitchout_hit_into_play_score = A batter makes contact with a pitch intended for a pitchout and the ball falls in fair territory, resulting in a run scoring.
  • swinging_pitchout = After a pitch that a batter swung at, a catcher throws to a base in an attempt to throw out a runner at the base.
  • swinging_strike = A batter swings at a pitch, but is unable to make contact with it.
  • swinging_strike_blocked = A batter swings at a pitch and is unable to make contact, but the pitch is in the dirt and the catcher is able to successfully block the pitch from traveling behind the backstop.
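These codes are enough to compute simple plate-discipline numbers. As a sketch, here is a whiff rate (swings and misses divided by total swings). Which codes count as a "swing" or a "whiff" is my own grouping of the list above, not an official definition; for example, I've left bunt attempts out entirely.

```python
# My own grouping of the description codes listed above.
WHIFF_CODES = {"swinging_strike", "swinging_strike_blocked"}
SWING_CODES = WHIFF_CODES | {
    "foul",
    "foul_tip",
    "hit_into_play",
    "hit_into_play_no_out",
    "hit_into_play_score",
}


def whiff_rate(descriptions):
    """Whiffs per swing, or None if the batter never swung."""
    swings = sum(1 for d in descriptions if d in SWING_CODES)
    whiffs = sum(1 for d in descriptions if d in WHIFF_CODES)
    return whiffs / swings if swings else None


# Made-up sequence of pitch results: three swings, one of them a miss.
rate = whiff_rate(
    ["ball", "swinging_strike", "foul", "hit_into_play", "called_strike"]
)
print(rate)
```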
des: This field is a textual representation of what happened on a play, describing all of the fielders involved, what happened to the batter, and where any runners on base ended up. Here are a couple examples from the first game of the 2015 season between the Cardinals and Cubs:
  • Jhonny Peralta flies out to center fielder Dexter Fowler.
  • Mike Olt strikes out swinging.
  • Jon Jay doubles (1) on a line drive to center fielder Dexter Fowler, deflected by right fielder Jorge Soler.
type: This is the field that Statcast uses to record balls and strikes. "S" represents strikes, "B" represents balls, and "X" represents neither - generally an indication of a ball in play.

hit_location: When a fielder goes to field a batted ball, their position number is recorded here. For example, a ball fielded by the center fielder is represented by '8'.

hc_x, hc_y: We're finally dealing with Y coordinates! Imagine looking down on the field from above, as a view from the sky - the exact placement of where the ball lands can be represented as a series of coordinates. hc_x and hc_y are unfortunately not super helpful, because they're not in feet or yards, but in pixels. Yeah.

The way that these values are generated is that for each batted ball, an MLB employee (or intern) manually places a marker down on a 250x250 pixel image of the field, estimating where the ball landed. hc_x and hc_y represent the coordinate of the landing location in pixels. I don't know what the exact image that is used by the MLB looks like, but this is a decent approximation - the red dot below represents where the MLB employee best thinks that the ball landed.
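Even in pixels, these coordinates are good enough to estimate the direction of a batted ball. The sketch below re-centers the coordinates on home plate and computes a spray angle. The offsets 125.42 and 198.27 are a convention I've seen used in the analytics community for where home plate sits on the image; they are assumptions, not official values, and the function names are hypothetical.

```python
import math

# Assumed pixel location of home plate on the field image
# (a community convention, not an official constant).
HOME_PLATE_X = 125.42
HOME_PLATE_Y = 198.27


def recenter(hc_x, hc_y):
    """Shift the origin to home plate and flip y so it points to center field."""
    # Image coordinates grow downward, hence the subtraction for y.
    return hc_x - HOME_PLATE_X, HOME_PLATE_Y - hc_y


def spray_angle_degrees(hc_x, hc_y):
    """Angle off dead center: negative toward left field, positive toward right."""
    x, y = recenter(hc_x, hc_y)
    return math.degrees(math.atan2(x, y))


# Made-up batted ball marked straight up the middle.
angle = spray_angle_degrees(125.42, 98.27)
print(round(angle, 1))
```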


hit_distance_sc: So how can we know how far a batted ball traveled? Fortunately Statcast tracks the hit distance of a batted ball from barrel to glove (or seat if it goes yard). This value is in feet.

launch_speed: You know the whole exit velocity craze that everyone's been on as of late? This is exit velocity! Measured in MPH, this is the tracked speed of a baseball as it leaves the bat.

launch_angle: If you've heard of exit velocity, you've undoubtedly heard of launch angle too. Launch angle is measured in degrees, and it looks at the angle (relative to the ground) at which a ball leaves the bat.

bb_type: There are a variety of batted balls recorded by hitters. Baseball Savant uses launch_speed and launch_angle to classify the batted ball. The types of batted balls are shown below:
  • fly_ball = A batted ball that went fairly high in the air with a considerable amount of distance.
  • ground_ball = A batted ball that rolled along the ground for some distance.
  • line_drive = A batted ball that traveled a considerable distance in the air despite not traveling very high.
  • null = No batted ball resulted from the pitch.
  • popup = A batted ball that went high in the air but did not travel a great distance.
estimated_ba_using_speedangle: Statcast has tracked thousands of batted balls since 2015, and knows the results of each. In this metric, Statcast compares the launch speed and launch angle of the ball to others like it, and based on how frequently balls like it have fallen in for hits, estimates the likelihood that a given batted ball will fall in for a hit.

estimated_woba_using_speedangle: The methodology for estimated_woba_using_speedangle (or xwOBA, as it's more commonly known) is virtually the same as estimated_ba_using_speedangle. Statcast compares one batted ball to many others with similar launch speed and angle, and calculates how many runs, on average, those balls were worth. Then, it converts the run value to the same scale as wOBA, and returns that value.

woba_value, woba_denom: Given plays are worth a certain amount of runs. Statcast tracks how many runs each event is worth (the woba_value), and tracks whether or not the play would be counted by wOBA (the woba_denom). Essentially, add up the woba_values, divide by the number of woba_denoms a player recorded, and you have a player's wOBA.
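That recipe translates directly into code. Here is a minimal sketch that sums both columns over a player's rows and divides; it assumes blank cells (pitches that didn't end a plate appearance) should count as zero, and the sample values are made up rather than real wOBA weights.

```python
def to_float(cell):
    """Treat blank CSV cells as zero."""
    return float(cell) if cell not in ("", None) else 0.0


def woba(rows):
    """Sum of woba_value divided by sum of woba_denom, or None if no PAs."""
    numerator = sum(to_float(row["woba_value"]) for row in rows)
    denominator = sum(to_float(row["woba_denom"]) for row in rows)
    return numerator / denominator if denominator else None


# Made-up rows: one hit, one out, and one mid-PA pitch with blank cells.
sample_rows = [
    {"woba_value": "0.9", "woba_denom": "1"},
    {"woba_value": "0", "woba_denom": "1"},
    {"woba_value": "", "woba_denom": ""},
]
player_woba = woba(sample_rows)
print(player_woba)
```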

babip_value: If a play resulted in a hit on a ball in play (that is, a hit other than a home run), then this value is a 1. If not, it's 0. This is used in calculating a player's BABIP.

iso_value: This field tracks the number of extra bases a player recorded on a hit.

launch_speed_angle: Statcast loves to classify things, so they use launch speed and launch angle to classify batted balls based on quality of contact. Each number corresponds to a batted-ball classification, shown below.
  • 1 = Poor/Weak contact
  • 2 = Poor/Topped contact
  • 3 = Poor/Under contact
  • 4 = Flare/Burner contact
  • 5 = Solid contact
  • 6 = Barrel
barrel: If a batted ball is barreled, this field is "1". If not, it's "0".

Deprecated

spin_dir, spin_rate_deprecated, break_angle_deprecated, break_length_deprecated, tfs_deprecated, tfs_zulu_deprecated: All of these column fields have been deprecated and are no longer tracked by Baseball Savant.

These are all of the values you can play around with using Statcast's database. In the words of Jack Buck, "Go Crazy Folks!" with this resource, and enjoy!

5 comments:

  1. This comment has been removed by the author.

  2. I've got a question about plate_x. Is 0 the ground? I have a few negative values. Do you know how those are calculated?

    Replies
    1. Plate x is actually the measure of the horizontal position measured from the middle of the plate so a negative value is on the left side of the plate (from perspective of the catcher).

  3. I'm not so sure that the top of the strike zone is really supposed to be equal to sz_bot + sz_top. I made a few plots with both definitions and it doesn't check out. Are you able to confirm where you are getting that information from? I could not find it anywhere else.

  4. How would I construct the URL to retrieve all of the 2019 batted ball events that were in play?

    game_year = 2019
    game_type = R
    only the events = single or double or triple or home_run or any kind of batted ball in play out

    then for each of those events i would want to have the following data

    the event type = single or double or triple or home_run or whatever kind of batted ball out
    hit_location
    launch_angle
    launch_speed
    launch_speedangle
    hit_distance_sc
    bb_type
    barrel
    estimated_ba_using_speedangle
    estimated_woba_using_speedangle

    Thanks!
