Friday, January 12, 2018

CFB Pythagorean Records Revisited

A couple of months ago, I published an article discussing the use of Pythagorean expectation to evaluate CFB teams, extending it by using differentials not just in points but in yardage and yardage efficiency. If you're interested, that article can be found here. Today, I will revisit the concepts discussed in that article and take them a step further by looking at the year-to-year trends in Pythagorean expectation and how they relate to overall win percentage.

The idea of a Pythagorean Record in baseball is that a team tends to regress to their Pythagorean Record. For example, if a team finishes 10 games above its Pythagorean Record in a given season, that same team the next season (barring major personnel changes) would finish closer to their Pythagorean Record for the previous year than to their overall record for the previous year. It's called regression - if you flip a fair coin 10 times and it comes up heads 9 times, it's unlikely that the next ten times that you flip it, it will come up heads 9 times again.

Does this notion hold ground in football? The answer initially appears to be no.

The relationship between Pythagorean Expectation in 2016 and Win Percentage in 2017 is almost non-existent, with an R^2 of 0.01.
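For readers who want to reproduce a figure like this, the R^2 of a one-variable linear fit is just the squared Pearson correlation between the two columns. The sketch below is only illustrative: the function name and the short sample lists are made up for the example and are not the 2016-2017 team data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values, for illustration only (not real team data).
pyth_2016 = [0.30, 0.45, 0.55, 0.70, 0.85]
win_pct_2017 = [0.50, 0.25, 0.75, 0.42, 0.58]
print(round(r_squared(pyth_2016, win_pct_2017), 3))
```

A value near 0 means the previous season's expectation explains almost none of the variation in the next season's win percentage.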

This makes a good amount of sense - unlike MLB teams, college teams lose approximately one-quarter of their team every year and replace them with an entirely different group of players. Personnel changes can also dramatically impact a program - see Lane Kiffin's arrival at FAU as an example. As a result, the overall strength of CFB teams from season to season varies dramatically, and Pythagorean expectation from the previous season is not indicative of

Does luck automatically regress year to year? Our data from 2016-2017 fails to support that conclusion as well.

We see that luck does not correlate from season to season. In general, a team that outperforms their Pythagorean Expectation is just as likely to outperform it again as they are to underperform it by the same margin the next season. In addition, if we look at the average of residuals for CFB teams since 2014, it becomes very obvious that it's difficult to consistently overperform or underperform.

You'll notice that the data is largely normally distributed, but skewed a little to the right - I suspect that this results from the fact that it is easier to consistently lose close games than it is to win them. Hence, if you lose more close games than you win, you'll post a negative residual.

But over the past 3 seasons, very few teams have outperformed their Pythagorean expectation consistently, and those that did so did not do so by extreme margins. The "luckiest" team in this span, Georgia Southern, outperformed their expected record by winning about 2.6 games more than they should have over the stretch, which is by no means a huge margin. The "unluckiest" team, Washington, lost only about 4.4 games more than they should have. In the longer run, I would expect these differences to become only marginal, and I would avoid calling Washington's spate of bad luck (or GaSo's spate of good luck) anything other than a fluke.

Ultimately, we can conclude that without context, Pythagorean Record does not predict future performance. We can also conclude that college football teams do not really possess the means of consistently outperforming or underperforming their Pythagorean Expectation over long periods of time, indicating that teams will trend towards their Pythagorean Expectation in the long run.

To view the updated CFB 2017 advanced records, click here.

To view the CFB 2016 advanced records, click here.

Sunday, January 7, 2018

Pythagorean Yardage: A New Way to Evaluate CFB Teams

Note: A version of this article appeared on The Unbalanced on October 17th, 2017. The original article can be found here.

You’ve probably seen ESPN NY radio host Don La Greca’s rant against the use of the Pythagorean theorem in football. If you haven’t, you can watch it here. It’s highly amusing, especially considering that no one uses the Pythagorean theorem in football — most football players today learned it back in middle school (or at their senior year at UNC), and have never used it since. What La Greca might be trying to rant about is Pythagorean expectation: a formula used to predict a team’s win percentage based on point differentials. La Greca’s rant got me thinking — could we use Pythagorean expectation in football? And how can we apply it?

What is Pythagorean Expectation?

Pythagorean expectation originated from baseball sabermetrics, an invention of Bill James. The idea behind Pythagorean expectation is that a team’s run differential can be used to predict a team's win percentage. In other words, a team scoring more runs than they allow will tend to have a higher win percentage than a team that scores fewer runs than it allows.

There are three approaches to this problem — we can use a linear regression, a second order Pythagorean expectation, or a Pythagorean expectation using a specifically calculated exponent.

For linear regression, I downloaded CFB teams’ point differentials and used R to create an equation of a line that would predict their win percentage based on their point differential. This model, while fairly accurate, has significant limitations in that it allows for teams to have win percentages greater than 1 or less than 0.

I performed a similar calculation using a second order Pythagorean expectation formula – (Points scored)^2 / ((Points scored)^2+(Points allowed)^2) = expected win percentage. This equation fares better than the linear differential because it makes it impossible for teams to have win percentages greater than one or less than zero. However, using a power of two in the above equation is only an approximation — we can find the exact exponent to use with some algebra.
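As a minimal sketch of the formula just described, here it is in Python. The function name and the sample point totals are assumptions made for this example; the exponent is left as a parameter that defaults to 2.

```python
def pythagorean_expectation(scored, allowed, exponent=2.0):
    """Expected win percentage from points scored and points allowed."""
    if scored < 0 or allowed < 0:
        raise ValueError("point totals cannot be negative")
    if scored == 0 and allowed == 0:
        raise ValueError("expectation is undefined when both totals are zero")
    s = scored ** exponent
    a = allowed ** exponent
    # The result is always between 0 and 1, unlike a straight-line fit.
    return s / (s + a)


# Hypothetical season totals: 400 points scored, 300 allowed.
print(pythagorean_expectation(400, 300))  # 0.64
```

Because the numerator is one of the two terms in the denominator, the result can never leave the 0-to-1 range, which is the advantage over the linear model.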

Setting our exponents in the above equation from “2” to “K” and solving yields that log (Wins / Losses) = K * log (Points Scored / Points Allowed), where K is a constant. We can use linear regression to solve for K, the exact exponent to be used based on a team's current Wins and Losses, such that our equation becomes (Points scored)^K / ((Points scored)^K + (Points allowed)^K) = expected win percentage.
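The algebra runs as follows: if Wins / (Wins + Losses) = PS^K / (PS^K + PA^K), then Wins / Losses = (PS / PA)^K, and taking logs of both sides gives the linear relationship above. One way to estimate K is a least-squares fit with the intercept forced to zero, sketched below. The zero-intercept choice, the function name, and the rule for skipping winless or unbeaten teams (whose log ratio is undefined) are assumptions for this example and may differ from the fit actually run in R.

```python
from math import log


def fit_exponent(teams):
    """Estimate K from rows of (wins, losses, points_scored, points_allowed).

    Fits log(wins / losses) = K * log(scored / allowed) through the origin.
    """
    numerator = 0.0
    denominator = 0.0
    for wins, losses, scored, allowed in teams:
        # log(W/L) is undefined for winless or unbeaten teams, so skip them
        # (an assumption of this sketch, not a rule from the article).
        if wins == 0 or losses == 0 or scored <= 0 or allowed <= 0:
            continue
        x = log(scored / allowed)
        y = log(wins / losses)
        numerator += x * y
        denominator += x * x
    if denominator == 0:
        raise ValueError("no usable teams to fit")
    return numerator / denominator


# Hypothetical rows, for illustration only.
sample = [(9, 4, 420, 310), (6, 6, 350, 345), (3, 9, 260, 390)]
print(round(fit_exponent(sample), 2))
```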

Expanding Upon Pythagorean Expectation

Pythagorean expectation is nothing new to football, and the Pythagorean expectation equation has been applied to a multitude of sports other than baseball. Rather than just rehash already-existing equations, I decided to explore new ways to look at Pythagorean expectation, especially in regards to football. I explored the relationship between win percentage and yardage differential, yards per play differential, and yards per point differential. I performed the same calculations as described above using these new variables — substituting in Yards gained for Points scored, Yards allowed for Points allowed, etc. and then looked at their accuracy.

Results from 2016

To test my methods initially, I used data of 2016 BCS teams from Sports Reference’s college football site. I downloaded the following sets of data on a team by team basis:
  • Points scored
  • Points allowed
  • Total Yards of Offense
  • Yards Per Play
  • Yards Per Play Allowed
  • Yards Per Point
  • Yards Per Point Allowed
I then calculated each team's expected win percentage based on the differential for each metric: Points, Yards, Yards Per Play, and Yards Per Point. Using Root Mean Square Error, I then evaluated each model’s accuracy. For context, the closer RMSE is to zero, the more accurate the model is. As a point of scale, the RMSE of a traditional (second order) Pythagorean expectation for a decade of MLB scores is about 0.02.
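For reference, RMSE is the square root of the average squared gap between predicted and actual win percentage. A minimal sketch follows; the function name and the sample numbers are hypothetical.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical expected vs. actual win percentages for four teams.
expected = [0.62, 0.48, 0.75, 0.31]
observed = [0.69, 0.42, 0.77, 0.25]
print(round(rmse(expected, observed), 3))
```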

As expected, the RMSE values for BCS teams in the 2016 season were significantly greater than the RMSE values for MLB teams. Since CFB teams play relatively fewer games than MLB teams, there is only a small range of win percent values for teams to fall upon (in a 13-game season, the only possible win percent values are multiples of 1/13 ranging from 0.000 to 1.000). However, for our purposes, the values in the exact-exponent Pythagorean are very accurate and give us a good picture of the strength of each BCS team.

The Argument for Using Pythagorean Yards

As evidenced above, using the exact-exponent Pythagorean method is superior, since it yields the best results by far. However, there are still merits to the other methods, specifically yardage differentials.

Presented below are several graphs displaying the normalized values of the rolling averages of points per game, yards per game, yards per play, and yards per point for teams in the 2016 season. In most instances, yards per game tends to normalize more quickly than any other value.

In general, the rolling average of yards per play and yards per point swing around drastically throughout the season, but points and yards tend to approach their final values more quickly, and yards tend to approach its final value more quickly than points (though this distinction is not quite as concrete as the assertion that yards per play and yards per point vary more than points and yards).

If we believe that our rolling average for yardage approaches its final value more quickly than points, then earlier in the season, our exact-exponent Pythagorean expectation based on yardage will be more indicative of a team’s end-of-season expected win-percent than the exact-exponent Pythagorean expectation based on points.

Any of the models with exact exponents, however, are technically appropriate to use. As a matter of personal preference, and not out of some quantifiable law, I prefer to use exact-exponent Pythagorean expectation for yardage compared mid-way through the season, and exact-exponent Pythagorean expectation for points approaching the end of the season. In general, the most accurate model is the exact-exponent Pythagorean expectation for points at the end of the season.

Evaluating Outliers

The model hardly fits every team exactly. Presented below are the residuals for each 2016 BCS team based on the exact exponent Pythagorean model for points. If a team has a residual above zero, that means it outperformed its expectation, and if it was below zero, that means it underperformed its expectation.

The team that overperformed its expectation the most during last season was Idaho, who should have had a win percent of .491, but instead posted a win percentage of .692. Meanwhile, the team that underperformed the most was Notre Dame, who was expected to post a .560 win percentage, but instead finished with a .333 win percentage.
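The residual here is simply actual win percentage minus expected win percentage. A small sketch using only the two teams quoted above (the function name and data layout are assumptions for this example):

```python
def rank_by_residual(teams):
    """Return (name, actual - expected) pairs, largest overperformer first."""
    rows = [(name, actual - expected) for name, (expected, actual) in teams.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# (expected W%, actual W%) as quoted in the text for 2016.
teams_2016 = {"Idaho": (0.491, 0.692), "Notre Dame": (0.560, 0.333)}
for name, resid in rank_by_residual(teams_2016):
    print(f"{name}: {resid:+.3f}")
# Idaho: +0.201
# Notre Dame: -0.227
```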

What causes teams to over/underperform Pythagorean expectations? Generally, performance in close games can cause teams’ overall win percentages to go out-of-skew with their expected win percentages. In 2016, Notre Dame was 1-4 in games decided by three points or less, but Idaho went 3-0 in such games. Such performances can account for the swings from expected win percent to actual win percent.

Performing poorly in close games can be largely attributed to luck, and is considered unsustainable. In general, teams that outperform or underperform their Pythagorean expectations one season find themselves regressing toward their expectation the next season. Indeed, Notre Dame is 5-1 in 2017, and Idaho is a mere 2-4 thus far this season.

Limitations of the Model

No model is ever 100 percent perfect, and Pythagorean expectation is no different. The residuals above should be evidence enough of this, but I want to take some time to address the limitations of the model in regards to college football.

The most obvious limitation is that, due to BCS’s small schedule, win percent values will fall into certain sets. Notice the residuals chart: each of the residuals falls into slanted lines, each representing a certain record. For a 13-game season, there will only be 14 possible win percent values – 0/13, 1/13, 2/13, etc. but Pythagorean expectations do not fall into such sets. Over a long season, such as baseball’s, these differences would be minimal.

Another caveat is that of “blowout games” and non-FBS opponents. If a team were to run up the score on a much weaker team (say, scoring 222 points while shutting out their opponents), its Pythagorean record will show it as being a much stronger team than it might be. However, the degree to which a team can beat up a weaker opponent is still indicative of the strength of a team. Also, since most teams only schedule one or two non-FBS opponents in a season, usually the impact of such games does not affect a team’s record (though I would certainly caution people about using Pythagorean records to evaluate teams three games into a season when one or two of those opponents are non-FBS!).

It’s also worth noting the yardage model essentially assumes the odds of getting a third down stop or forcing a turnover are the same, regardless of field position. If a team’s defense consistently allows opponents to get to the red zone but makes red-zone stops, then that team will outperform its Pythagorean differential. If a team is consistently unable to capitalize on scoring while in the red-zone, then it will underperform its Pythagorean expectation.

The final major consideration is that Pythagorean record does not consider strength of schedule. In the future, I might like to look at calculating Pythagorean records while weighting differentials based on opponent SRS, or opponent Pythagorean expectation, but as it currently stands, this model does not consider opponent strength.

Even with these considerations involved, the model still performs admirably in determining a team's strength and expected win percentage.

Evaluating 2017 Teams

So far, we’ve dealt only with 2016’s data for the purposes of explaining each variable. We can use this data for the current college football season as well! Using the exact-exponent Pythagorean expectation for points, here are the top ten teams in college football.
  1. Penn State (expected W%: 1.000, actual W%: 1.000)
  2. Alabama (expected W%: 1.000, actual W%: 1.000)
  3. Washington (expected W%: 1.000, actual W%: .857)
  4. Ohio State (expected W%: 1.000, actual W%: .857)
  5. UCF (expected W%: 1.000, actual W%: 1.000)
  6. Georgia (expected W%: 1.000, actual W%: 1.000)
  7. Wisconsin (expected W%: .999, actual W%: 1.000)
  8. South Florida (expected W%: .999, actual W%: 1.000)
  9. Clemson (expected W%: .999, actual W%: .857)
  10. Virginia Tech (expected W%: .998, actual W%: .833)
You’ll notice there is very little differentiating the top teams; Penn State has an edge of only about .00001 over Alabama. But, I’m wary of any evaluation that doesn’t put Alabama directly on top. Remember what I said about how yards normalize more quickly than points? Let’s take a look at the top-10 teams from exact-exponent Pythagorean expectation for yards.
  1. Alabama (expected W%: 1.000, actual W%: 1.000)
  2. Ohio State (expected W%: 1.000, actual W%: .857)
  3. Georgia (expected W%: .999, actual W%: 1.000)
  4. Wisconsin (expected W%: .999, actual W%: 1.000)
  5. South Florida (expected W%: .999, actual W%: 1.000)
  6. Washington (expected W%: .999, actual W%: .857)
  7. Michigan (expected W%: .998, actual W%: .833)
  8. Oklahoma State (expected W%: .998, actual W%: .833)
  9. UCF (expected W%: .998, actual W%: 1.000)
  10. Penn State (expected W%: .997, actual W%: 1.000)
There’s slightly more differentiation present in the exact-exponent Pythagorean expectation for yards, and Alabama is at the top; this list “passes the eye test” a little bit more than the previous list. Again, though, whichever expectation you wish to use is a matter of personal preference, as most of the exact-exponent models have comparable accuracy (though the point-based model is most accurate).

Which teams have been the least lucky this season? Here are the largest negative differences between actual win percent and exact-exponent Pythagorean expectation for yards.
  1. Massachusetts (expected W%: .533, actual W%: .000)
  2. Air Force (expected W%: .838, actual W%: .333)
  3. New Mexico State (expected W%: .919, actual W%: .429)
  4. Louisville (expected W%: .993, actual W%: .571)
  5. Texas (expected W%: .908, actual W%: .500)
These teams have underperformed in terms of wins and losses, but have consistently outgained their opponent in terms of total yardage. Moving forward, I would expect these teams to perform better than they have thus far this season.

The luckiest teams this season are as follows, based on the largest positive difference between actual win percent and exact-exponent Pythagorean expectation for yards.
  1. Kentucky (expected W%: .226, actual W%: .833)
  2. Wyoming (expected W%: .073, actual W%: .667)
  3. Akron (expected W%: .044, actual W%: .571)
  4. South Carolina (expected W%: .225, actual W%: .714)
  5. California (expected W%: .083, actual W%: .571)
These teams have managed to win a lot of games despite being significantly outgained — hardly a recipe for success. As a result, these teams are much weaker than their records might appear, and I would not expect them to find continued success in the rest of the season.

A note of caution: at this point in the season, some teams have a large divide between their Pythagorean expectation for yardage and for points. If you notice that a team has a large difference between the two, in evaluating them, I would exercise caution and look more closely at a team's overall strength. RMSE for the 2017 figures was approximately twice as large as it was for the 2016 season, because we have only half a season’s worth of figures presented. These figures should become more accurate as the season goes on.

Major thanks go to Chapman and Hall for their book, Analyzing Baseball Data with R, Bill James for inventing Pythagorean wins, and College Football Reference for its wonderful database.

Note: Following the conclusion of the college football playoffs, I hope to publish the results from the 2017-2018 season sometime this week.

Giving Players the Bonds Treatment

Note: This article was originally published at the FanGraphs community site on August 14, 2017. The original version of this article can be found here.

There is no higher compliment that can be given to a ballplayer than to be given “The Bonds Treatment” — being intentionally walked with the bases empty, or even better, with the bases loaded. It’s called “The Bonds Treatment” because Barry Bonds recorded an astounding 41 IBBs with the bases empty, and is one of only two players to ever record a bases-loaded intentional walk. In other words, 28% of IBBs ever issued with the bases empty were given to Bonds — and 50% of IBBs with the bases loaded. Bonds was great, no denying that — but is there anyone out there today who is worthy of such treatment?

We can find out using a Run Expectancy matrix. An RE matrix is based on historical data, and it can tell you how many runs, on average, a team could expect to score in a given situation. A sample RE matrix, from Tom Tango’s site, is shown below.

The chart works as follows — given a base situation (runners on the corners, bases empty, etc.) move down to the corresponding row, then move to the corresponding column and year to find out how many runs a team could expect to score from that situation. In 2015, with a runner on 3rd and 1 out, teams could expect to score .950 runs on average (or, RE is .950). If the batter at the plate struck out, the new RE would be .353.

We can take this a step further. Sean Dolinar created a fantastic tool that allows us to (roughly) examine RE in terms of a batter’s skill. Having Mike Trout at the plate vastly improves your odds of scoring more than having Alcides Escobar, and the tool takes this into account. We can use this tool to look at who deserves the Bonds treatment in 2017 (or, to see if anyone deserves the Bonds treatment): defined as being walked with the bases empty, or the bases loaded.

First, we can look at a given player and their RE scores for having the bases empty or full. In this instance, we will use Michael Conforto, who batted leadoff for the Mets against the Texas Rangers on August 9. Conforto’s wOBA entering the game was .404, and the run environment for the league is 4.65 runs per game, so Conforto’s relevant run expectancy matrix looks like this:

Batting behind him was Jose Reyes, who, entering the game, had a wOBA of .283. Let’s assume that Conforto receives the Bonds Treatment, and is IBB’d in a given PA with bases empty or loaded. What would the run expectancy look like with Reyes up? In other words, what is Reyes’ run expectancy with a runner on first, or with the bases loaded after a run has been IBB’d in?

To do this, we can look at Reyes’ RE with a runner on first and with the bases loaded. Reyes’ RE with a man at 1B is indicative of what the RE would be like if Conforto had been given an intentional free pass. For a bases-loaded walk, we look at Reyes’ RE with the bases loaded, and then add a run onto it (to account for Conforto walking in a run).

Then, we can compare the corresponding cells of the matrices to see if the Texas Rangers would benefit any from walking Conforto. If RE with Conforto up and the bases empty is higher than RE with a runner on first and Reyes up, or RE with the bases loaded and Conforto up is higher than RE with Reyes up and a run already scored, then we can conclude that it makes sense to give Conforto that free pass.

In this instance, we can see that if the Rangers were to face Conforto with the bases empty and two out, it would make more sense for them to IBB Conforto and pitch to Reyes than it would for them to pitch to Conforto, because RE with Conforto up (.172) is higher than RE with Reyes up and Conforto on (.145). As a result, Conforto is a candidate for the Bonds treatment in this lineup configuration, if the right situation arises.
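The comparison above boils down to a simple rule, sketched here. The function and parameter names are made up for this example; the .172 and .145 figures are the ones quoted above, and the bases-loaded numbers in the second call are purely hypothetical.

```python
def should_walk(re_batter, re_next_hitter, bases_loaded=False):
    """Return True if an intentional walk lowers run expectancy.

    re_batter: RE when pitching to the current batter in the current state.
    re_next_hitter: the next hitter's RE in the state the walk creates
        (runner on first, or bases still loaded).
    bases_loaded: if True, the walk forces in a run, so one run is added.
    """
    re_after_walk = re_next_hitter + (1.0 if bases_loaded else 0.0)
    return re_batter > re_after_walk


# Conforto up, bases empty, two out (.172) vs. Reyes up with Conforto on first (.145).
print(should_walk(0.172, 0.145))  # True

# Hypothetical bases-loaded case: the forced-in run usually outweighs the matchup gain.
print(should_walk(2.50, 2.00, bases_loaded=True))  # False
```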

Who else could be subjected to the Bonds treatment? It would take me a few months of work to run through every single individual lineup for every team to figure out who should have been pitched to and who should have gotten a free pass, so to simplify things, I looked at hitters with 400+ PA, looked at when they most frequently batted, who batted behind them most frequently, and whether or not they should have received the Bonds treatment based on who was on deck. While no lineup remains constant throughout the season, looking at these figures gave me a good idea of who regularly batted behind whom.

Three candidates emerged to be IBB’d with the bases empty every time, regardless of outs— Yasiel Puig, Jordy Mercer, and Orlando Arcia. These players usually bat in the eighth slot on NL teams, and right behind them is the pitchers’ slot — considering how historically weak pitchers are with the bat, it makes sense that RE tells us to walk them with the bases empty every single time.

The same could be said of almost anyone batting ahead of a pitcher — according to our model, given an average-hitting pitcher, any hitter with a wOBA over .243 should be IBB’d with the pitcher on deck (only one qualified hitter — Alcides Escobar — has a lower wOBA than .243). The three names above stuck out in the analysis because they were the only players with 400+ PA that had spent most of their PAs batting eighth.

So, an odd takeaway of this exercise is that in the NL, unless a pinch-hitter is looming on deck, the eighth hitter should almost always be intentionally walked with the bases empty, because it lowers the run expectancy. Weird!

The model also identified two hitters who deserved similar treatment to Michael Conforto in the above example (IBB with 2 out and no one on): Buster Posey and Chase Headley.

Posey has batted with almost alarming regularity ahead of Brandon Crawford, who is running an abysmal .273 wOBA on the season. Headley is a little more curious — Headley is usually a weak hitter, but earlier in the season, Headley batted ahead of Austin Romine frequently, who was even worse than Crawford.

Headley technically isn’t that much of a candidate for the Bonds Treatment since Romine hasn’t batted behind him since June 30, but Crawford has backed up Posey as recently as August 3. If he bats behind Posey again, the situation could very well arise where it becomes beneficial for teams to simply IBB Posey with two out and bases empty.

But ultimately, no one, aside from NL hitters in the eighth slot, emerges as a candidate to be IBB’d every time with the bases empty. And no one, regardless of the situation, deserves a bases-loaded intentional walk. Which raises the question — was it appropriate to give the man himself, Barry Bonds, the Bonds Treatment?

Bonds received an incredible 19 bases-empty IBBs in 2004 (more than doubling the record he set in 2002), so we’ll use 2004 Bonds and his .537 wOBA as the center of our analysis.

In 2004, Bonds batted almost exclusively fourth, and the two men who shared the bulk of playing time batting fifth behind him (Edgardo Alfonzo and Pedro Feliz) had almost identical wOBAs that season (.333 and .334, respectively), so we’ll assume that the average hitter behind Bonds in 2004 posted a wOBA of .333. This yields RE matrices that look like this:

Bonds not only proves himself worthy of a bases-empty IBB with two out, but also just barely misses qualifying for a bases-loaded IBB. While no one ended up giving Bonds a bases-loaded IBB in 2004, they did give him one in 1998.

For perspective, Bonds was running a .434 wOBA in 1998, and Brent Mayne (who was on deck) was running a .324 wOBA, so this actually wasn’t a move that moved RE or win probability in the right direction.

The final spike in WPA is Bonds’ IBB, which gave the Giants a better chance of winning. Ultimately, it was a bad idea that didn’t backfire in the Diamondbacks’ faces.

And of course, I would be remiss in not mentioning the other player to have ever received a bases-loaded IBB — Josh Hamilton.

With apologies to Hamilton, he wasn’t the right guy to get the Bonds treatment here, either — Hamilton ran a .384 wOBA in 2008, and Marlon Byrd, who was on deck, had a .369 wOBA, which means that an IBB in this instance was a really awful move. An awful move that, like Bonds’ IBB, was rewarded by Byrd striking out in the next AB.

Have there been other players deserving of bases-loaded IBBs? It’s possible, but the most likely candidates — Ted Williams and Babe Ruth — usually had good enough protection in the lineup. Of course, there are few hitters that could have protected Bonds from himself — hence why it’s almost a good idea to IBB him with the bases loaded.

An Exercise in Generating Similarity Scores

Note: This post was originally published on the FanGraphs community page on December 5, 2017, and can be accessed on that site here.

In the process of writing an article, one of the more frustrating things to do is generate comparisons to a given player. Whether I’m trying to figure out who most closely aligns with Rougned Odor or Miguel Sano, it’s a time-consuming and inexact process to find good comparisons. So I tried to simplify the process and make it more exact — using similarity scores.

An Introduction to Similarity Scores

The concept of a similarity score was first introduced by Bill James in his book The Politics of Glory (later republished as Whatever Happened to the Hall of Fame?) as a way of comparing players who were not in the Hall of Fame to those who were, to determine which non-HOFers deserved a spot in Cooperstown. For example, since Phil Rizzuto’s most similar players per James’ metric are not in the HOF, Rizzuto’s case for enshrinement is questionable.

James’ similarity scores work as such: given one player, to compare them to another player, start at 1000 and subtract one point for every difference of 20 games played between the two players. Then, subtract one point for every difference of 75 at-bats. Subtract a point for every difference of 10 runs scored…and so on.
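To make the mechanics concrete, here is a partial sketch covering only the three penalties listed above; the remaining rules ("and so on") are not spelled out here, so they are left out rather than guessed. The function name, the dictionary layout, the use of whole-number points (integer division), and both stat lines are assumptions for this example.

```python
# Size of the gap that costs one point, for the three stats named in the text.
PENALTY_UNITS = {"games": 20, "at_bats": 75, "runs": 10}


def partial_similarity(player_a, player_b):
    """Start at 1000 and subtract one point per full unit of difference."""
    score = 1000
    for stat, unit in PENALTY_UNITS.items():
        score -= abs(player_a[stat] - player_b[stat]) // unit
    return score


# Hypothetical career lines, for illustration only.
hitter_a = {"games": 2000, "at_bats": 7500, "runs": 1000}
hitter_b = {"games": 1900, "at_bats": 7000, "runs": 950}
print(partial_similarity(hitter_a, hitter_b))  # 984
```

Here the gaps cost 5, 6, and 5 points respectively, which leaves 984.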

James’ methodology is flawed and inexact, and he’s aware of it: “Similarity scores are a method of asking, imperfectly but at least objectively, whether two players are truly similar, or whether the distance between them is considerable” (WHHF, Chapter 7). But it doesn’t have to be perfect and exact. James is simply looking to find which players are most alike and compare their other numbers, not their similarity scores.

Yes, there are other similarity-score metrics that have built upon James’ methodology, ones that turn those similarities into projections: PECOTA, ZiPS, and KUBIAK come to mind. I’m not interested in making a clone of those because these metrics are obsessed with the accuracy of their score and spitting out a useful number. I’m more interested in the spirit of James’ metric: it doesn’t care for accuracy, only for finding similarities.

Approaching the Similarity Problem

There is a very distinct difference between what James wants to do and what I want to do, however. James is interested in result-based metrics like hits, doubles, singles, etc. I’m more interested in finding player similarities based on peripherals, specifically a batted-ball profile. Thus, I need to develop some methodology for finding players with similar batted-ball profiles.

In determining a player’s batted-ball profile, I’m going to use three measures of batted-ball frequencies: launch angle, spray angle, and quality of contact. For launch angle, I will use GB%/LD%/FB%; for spray angle, I will use Pull%/Cent%/Oppo%; and for quality of contact, I will use Soft%, Med%, Hard%, and HR/FB (more on why I’m using HR/FB later).

In addition to the batted-ball profiles, I can get a complete picture of a player’s offensive profile by looking at their BB% and K%. To do this, I will create two separate similarity scores — one that measures similarity based solely upon batted balls, and another based upon batted balls and K% and BB%. All of our measures for these tendencies will come from FanGraphs.

Essentially, I want to find which player is closest to which overall in terms of ALL of the metrics that I’m using. The term “closest” is usually used to convey position, and it serves us well in describing what I want to do.

Gettin’ Geometrical

In order to find the most similar player, I’m going to treat every metric (GB%, LD%, FB%, Pull%, and so on) as an axis in a positioning system. Each player has a unique “position” along that axis based on their number in that corresponding metric. Then, I want to find the player nearest to a given player’s position within our coordinates system — that player will be the most similar to our given player.

I can visualize this up to the third dimension. Imagine that I want to find how similar Dee Gordon and Daniel Murphy are in terms of batted balls. I could first plot their LD% values and find the differences.

So the distance between Murphy and Gordon, based on this, is 4.8%. Next, I could introduce the second axis into our geometry, GB%.

The distance between the two players is given by the Pythagorean formula for distance — sqrt(ΔX^2 + ΔY^2), where X is LD% and Y is GB%. To take this visualization to a third dimension and incorporate FB%…

… I would add another term to the distance calculation: sqrt(ΔX^2 + ΔY^2 + ΔZ^2). And so on, for each subsequent term. You’ll just have to use your imagination to plot the next 14 data points because Euclidean geometry can’t handle dimensions greater than three without some really weird projections, but essentially, once I find the distance between those two points in our 10 or 12-dimensional coordinate system, I have an idea how similar they are. Then, if I want to find the most similar batter to Daniel Murphy, I would find the distance between him and every other player in a given sample, and find the smallest distance between him and another player.
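The same formula works for any number of axes: square each difference, add them up, and take the square root. A minimal sketch, with a made-up function name and hypothetical (LD%, GB%, FB%) profiles rather than real player data:

```python
from math import sqrt


def euclidean_distance(profile_a, profile_b):
    """Straight-line distance between two profiles with the same metrics in the same order."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of metrics")
    return sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical (LD%, GB%, FB%) profiles, for illustration only.
hitter_one = (22.0, 45.0, 33.0)
hitter_two = (19.0, 49.0, 32.0)
print(round(euclidean_distance(hitter_one, hitter_two), 2))
```

Adding more metrics only lengthens the tuples; the function itself does not change.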

If you’ve taken a computer science course before, this problem might sound awfully familiar to you — it’s a nearest-neighbor search problem. The NNS problem is about finding the best way to determine the closest neighbor point to a given point in some space, given a set of points and their position in that space. The “naive” solution, or the brute-force solution, would be to find the distance between our player and every other player in our dataset, then sort the distances. However, there exists a more optimized solution to the NNS problem, called a k-d tree, which progressively splits our n-dimensional space into smaller and smaller subspaces and then finds the nearest neighbor. I’ll use the k-d tree approach to tackling this.
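To make the two approaches concrete, here is a small pure-Python sketch of both the brute-force search and a bare-bones k-d tree. It is an illustration under stated assumptions, not the implementation the R package uses: the function names are hypothetical, the "players" are random points in four dimensions, and the target is assumed not to be in the dataset (when comparing a real player, you would exclude him first).

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_nearest(points, target):
    """Naive approach: measure every distance and keep the smallest."""
    return min(points, key=lambda p: euclidean(p, target))

def build_kdtree(points, depth=0):
    """Split the points at the median of one axis per level, cycling axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nearest(node, target, best=None):
    """Walk down the side of each split that contains the target, and only
    visit the other side if it could still hold a closer point."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or euclidean(point, target) < euclidean(best, target):
        best = point
    diff = target[axis] - point[axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = kdtree_nearest(near, target, best)
    # The far side is at least |diff| away along this axis, so skip it
    # unless that gap is smaller than the best distance found so far.
    if abs(diff) < euclidean(best, target):
        best = kdtree_nearest(far, target, best)
    return best

# Random stand-in data: 200 hypothetical players with 4 metrics each.
random.seed(0)
players = [tuple(random.random() for _ in range(4)) for _ in range(200)]
target = (0.5, 0.5, 0.5, 0.5)

tree = build_kdtree(players)
print(brute_force_nearest(players, target))
print(kdtree_nearest(tree, target))
```

Both searches should land on the same neighbor; the tree simply gets there without measuring every point. With only a few hundred qualified hitters per season, the brute-force version would be perfectly workable too.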

Why It’s Important to Normalize

I used raw data values above in an example calculation of the distance between two players. However, I would like to issue caution against using those raw values because of the scale that some of these numbers fall upon.

Consider that in 2017, the difference between the largest LD% and smallest LD% among qualified hitters was only 14.2%. For GB%, however, that figure was 30.7%! Clearly, there is a greater spread with GB% than there is with LD%, so a difference in GB% of 1% is much less significant than a difference in LD% of 1%. But in using the raw values, I weight that 1% difference the same in both metrics, which lets the wider-ranging GB% dominate the distance instead of treating LD% as equally important.

To resolve this issue, I need to “normalize” the values. To normalize a series of values is to place differing sets of data all on the same scale. LD% and GB% will now have roughly the same range, but each will retain their distribution and the individual LD% and GB% scores, relative to each other, will remain unchanged.
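The post does not say exactly which normalization is applied, so the sketch below assumes z-scores (subtract each metric's mean, divide by its standard deviation), which is one common way to do it. The function name and the (LD%, GB%) values are made up for illustration.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Rescale each metric (column) to mean 0 and standard deviation 1.

    `rows` is a list of tuples, one tuple of metrics per player.
    Assumption: z-score standardization; the original tool may differ.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]

# Made-up (LD%, GB%) pairs: GB% is deliberately much more spread out.
raw = [(18.0, 35.0), (20.0, 50.0), (22.0, 42.0), (24.0, 61.0)]
normalized = zscore_columns(raw)

for before, after in zip(raw, normalized):
    print(before, "->", tuple(round(v, 2) for v in after))
```

After this step, one unit on any axis means "one standard deviation in that metric," so a typical gap in LD% counts about as much as a typical gap in GB%.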

Note: After publishing this article, it was suggested that I instead use Mahalanobis Distances in my calculation. While that is certainly useful and arguably a more correct solution, normalizing the values as described above means that I can assume a spherical shape of data and find a geometric distance between data points. It is rough, but it still accomplishes the task practically.

Now, here’s the really big assumption that I’m going to make. After normalizing the values, I won’t scale any particular metric further. Why? Because personally, I don’t believe that in determining similarity, a player’s LD% is any more important than the other metrics I’m measuring. This is my personal assumption, and it may not be true; there’s not really a way to tell otherwise. If I believed LD% was really important, I might apply some scaling factor and weight it differently than the rest of the values, but I won’t, simply out of personal preference.
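For anyone who does want to emphasize one metric, here is a hedged sketch of what such a scaling factor could look like. The function and its `scales` parameter are hypothetical; the tool described here leaves every metric at a scale of 1.

```python
import math

def scaled_distance(a, b, scales=None):
    """Euclidean distance after multiplying each axis by a scaling factor.

    With no scales (or all 1.0) this is the plain, equal-weight distance.
    """
    if scales is None:
        scales = [1.0] * len(a)
    return math.sqrt(sum((s * (x - y)) ** 2 for x, y, s in zip(a, b, scales)))

# Two hypothetical players, already normalized, on two metrics.
a = (0.0, 0.0)
b = (3.0, 4.0)

equal_weight = scaled_distance(a, b)             # every metric counts the same
ld_doubled = scaled_distance(a, b, (2.0, 1.0))   # first metric counts double

print(equal_weight, ld_doubled)
```

Doubling the first axis stretches the gap on that metric, so players who differ there drift further apart while everything else stays put.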

Putting it All Together

I’ve identified what needs to happen; now it’s just a matter of making it happen.

So, go ahead, get to work. I expect this on my desk by Monday. Snap to it!

Oh, you’re still here.

If you want to compare answers, I went ahead and wrote up an R package containing the function that performs this search (as well as a few other dog tricks). I can do this in two ways, either using solely batted-ball data or using batted-ball data with K% and BB%. For the rest of this section, I’ll use the second method.

Taking FanGraphs batted-ball data and the name of the target player, the function returns a number of players with similar batted-ball profiles, as well as a score for how similar they are to that player.

For similarity scores, use the following rule of thumb:

0-1 -> The same player having similar seasons.

1-2 -> Players that are very much alike.

2-3 -> Players who are similar in profile.

3-4 -> Players who share some qualities but are distinct.

4+ -> Distinct players with distinct offensive profiles.

Note that because of normalization, similarity scores can vary based on the dataset used. Similarity scores shouldn’t be used as strict numbers — their only use should be to rank players based on how similar they are to each other.
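Putting the pieces together, a ranking like this could be sketched in Python as follows. This is not the R package's code: the function names, the z-score normalization, the half-open score bands, and the four made-up player profiles are all assumptions for illustration.

```python
import math
from statistics import mean, stdev

def most_similar(profiles, target, n=5):
    """Rank every other player by distance to `target` after z-scoring.

    `profiles` maps a player name to a tuple of metrics (same order for all).
    Returns up to `n` (name, score) pairs, smallest score first.
    """
    names = list(profiles)
    columns = list(zip(*(profiles[name] for name in names)))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    scaled = {
        name: [(v - m) / s for v, m, s in zip(profiles[name], means, sds)]
        for name in names
    }
    scores = [
        (name, math.dist(scaled[name], scaled[target]))
        for name in names
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1])[:n]

def describe(score):
    """Map a score to the rule-of-thumb bands, treating each as [low, high)."""
    bands = [
        "same player having similar seasons",
        "very much alike",
        "similar in profile",
        "some shared qualities, but distinct",
    ]
    return bands[int(score)] if score < 4 else "distinct offensive profiles"

# Made-up (LD%, GB%, FB%, BB%, K%) profiles for hypothetical players.
profiles = {
    "Player A": (20.0, 45.0, 35.0, 8.0, 18.0),
    "Player B": (20.5, 44.0, 35.5, 8.5, 17.0),
    "Player C": (25.0, 35.0, 40.0, 12.0, 25.0),
    "Player D": (17.0, 55.0, 28.0, 5.0, 12.0),
}

ranking = most_similar(profiles, "Player A", n=3)
for name, score in ranking:
    print(f"{name}: {score:.3f} ({describe(score)})")
```

Because the means and standard deviations come from whatever players are in `profiles`, adding or removing players shifts every score, which is why the scores are best read as a ranking.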

To show the tool in action, let’s get someone at random, generate similarity scores for them, and provide their comparisons.

Here’s the offensive data for Elvis Andrus in 2017, his five neighbors in 12-dimensional space (all from 2017), and their similarity scores.

The lower the similarity score, the better, and the guy with the lowest similarity score, J.T. Realmuto, is almost a dead ringer for Andrus in terms of batted-ball data. Mercer, Gurriel, Pujols, and Cabrera aren’t too far off, either.

After extensively testing it, the tool seems to work really well in finding batters with similar profiles — Yonder Alonso is very similar to Justin Smoak, Alex Bregman is similar to Andrew McCutchen, Evan Longoria is similar to Xander Bogaerts, etc.

Keep in mind, however, that not every batter has a good comparison waiting in the wings. Consider poor, lonely Aaron Judge, whose nearest neighbor was the second-farthest away of any player in baseball in 2017: Chris Davis is closest to him with a similarity score of 3.773. Only DJ LeMahieu had a more distant nearest neighbor (similarity score of 3.921!).

The HR/FB Dilemma

While I’m on the subject of Aaron Judge, let’s talk really quickly about HR/FB and why it’s included in the function.

When I first implemented my search function, I designed it to only include batted-ball data and not BB%, K%, and HR/FB. I ran it on a couple players to eye-test it and make sure that it made sense. But when I ran it on Aaron Judge, something stuck out like a sore thumb.

Players 2-5 I could easily see as reasonable comparisons to Judge’s batted balls. But Nick Castellanos? Nick Castellanos? The perpetual sleeper pick?

But there he was, and his batted balls were eerily similar to Judge’s.

Judge hits a few more fly balls, Castellanos hits a few more liners, but aside from that, they’re practically twins!

Except that they’re not. Here’s that same chart with HR/FB thrown in.

There’s one big difference between Judge and Castellanos, aside from their plate discipline — exit velocity. Judge averages 100+ MPH EV on fly balls and line drives, the highest in the majors. Castellanos posted a meek 93.2 MPH AEV on fly balls and line drives, and that’s with a juiced radar gun in Comerica Park. Indeed, after incorporating HR/FB into the equation, Castellanos drops to the 14th-most similar player to Judge. 

HR/FB is partially considered a stat that measures luck, and sure, Judge was getting lucky with some of his home runs, especially with Yankee Stadium’s homer-friendly dimensions. But luck can only carry you so far along the road to 50+ HR, and Judge was making great contact the whole season through, and his HR/FB is representative of that.

In that vein, I feel that it is necessary to include HR/FB even though it has a significant randomness component, which is very much in contrast with the rest of the metrics used in making this tool. The skill-based component of that stat makes it a necessary inclusion nevertheless.

Using this Tool

If you want to use this tool, you are more than welcome to do so! The code for this tool can be found on GitHub here, along with instructions on how to download it and use it in R. I’m going to mess around with it and keep developing it and hopefully do some cool things with it, so watch this space…

Although I’ve done some bug testing (thanks, Matt!), this code is still far from perfect. I’ve done, like, zero error-catching with it. If in using it, you encounter any issues, please @ me on twitter (@John_Edwards_) and let me know so I can fix them ASAP. Feel free to @ me with any suggestions, improvements, or features as well. Otherwise, use it responsibly!

Saturday, January 6, 2018


Uh... hello there.

If the website couldn't tell you, I'm John Edwards. I'm an Atlanta-based baseball writer and analyst, though I sometimes do research in other fields. I currently contribute to Sporting News, write for The Unbalanced, and publish research on the FanGraphs community site.

This site is here to compile all the different pieces of research that I've done. What little personal/pop-culture stuff that I write about can be found on my Medium site, but this site will be exclusively for research and analysis.

Over the next few days, I expect to compile and publish some of my past research, put up a few interesting leaderboards (Pythagorean Yardage leaderboards, for example, updated for the end of the 2017 season), and share a few more projects that I've been working on that aren't particularly suited for publication elsewhere but still might be interesting to readers. I'll also crosspost any significant research that I do in the future on this site.

I hope you'll enjoy, and I look forward to frequently updating this site.