Developing the Bestest xBABIP Equation Yet

As a projectionist, I am seemingly on a never-ending quest to develop equations for every result statistic. By result statistic, I mean something like home runs, which are fueled by underlying skills such as hitting the ball far, a skill summarized by the average batted ball distance we reference here quite often.

Another one of those result statistics is batting average. A hitter’s batting average is derived from two underlying skills — his ability to make contact (strikeout rate) and turn balls in play into hits (batting average on balls in play). While a hitter’s strikeout rate is quite stable from year to year, unfortunately his BABIP is not. It’s one of the metrics we still struggle to explain, with luck considered to play a major role.

There have been numerous attempts to come up with xBABIP equations and calculators, none of which proved reliable enough to use in our everyday analyses. But I refuse to give up, throw up my hands, and simply go with some sort of three-year average to project the following season’s BABIP. So my newest adventure involves journeying to the bestest xBABIP ever developed. And now I share that trip with you.

I began by poring over all the metrics available on the FG player pages. I decided that the likeliest metrics that would affect BABIP would be those in the Plate Discipline and Batted Ball sections, along with ISO and Spd. My thoughts for including these metrics for my tests were as follows:

Plate Discipline — hitters swinging at more and making greater contact with pitches outside the strike zone would seemingly lead to weaker contact; similarly, balls hit inside the strike zone would be struck better and go for hits more often

Batted Ball — we know that line drives go for hits most often of all the batted ball types, while infield fly balls are almost guaranteed outs

ISO — isolated slugging percentage is used as a proxy for a hitter’s power; greater power should result in harder hit balls, which we know go for hits more often than weakly hit balls

Spd — the faster a player is, the more likely he’ll be able to record infield hits

In addition to looking into these metrics, I discovered a new secret bullet. Aside from giving us the batted ball distance, Baseball Heat Maps also has a “Ground Ball and Line Drive Pull Percentages” leaderboard. It provides us with the average angle of those batted ball types. You will notice on that leaderboard that all the lefties are at the top and the righties are at the bottom. That’s because right field is measured as positive, and the higher the number, the further from center the ball was hit. Conversely, left field is negative, and a larger negative number means the ball was hit closer to the left field line. What we care about is the absolute value of those numbers, so the right-handers’ negative values turn into positive numbers. Now the higher the number, the more pull happy the batter is.
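The conversion described above can be sketched in a couple of lines. The function name and the sample angles here are made up for illustration; only the sign convention (right field positive, left field negative) comes from the leaderboard description.

```python
def pull_angle(avg_angle: float) -> float:
    """Return the absolute value of a hitter's average GB/LD angle.

    Input convention (from the leaderboard description):
    positive = right field side, negative = left field side.
    """
    return abs(avg_angle)


# Hypothetical angles, not real leaderboard values.
lefty_angle = 12.5    # a lefty pulling toward right field
righty_angle = -12.5  # a righty pulling toward left field

# Both hitters end up with the same pull measure.
print(pull_angle(lefty_angle), pull_angle(righty_angle))
```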

In recent years, teams have shifted more and more and hitters who pull the majority of their ground balls are seeing their BABIP marks plummet as a result. So the hope is that this new data could significantly boost our ability to estimate and project BABIP marks.

After collecting all the necessary data, I calculated the correlation of each metric to BABIP. Let’s take a look at the results:

Metric | Correlation with BABIP
O-Swing% | 0.02
Z-Swing% | 0.04
Swing% | 0.03
O-Contact% | -0.07
Z-Contact% | -0.06
Contact% | -0.08
SwStr% | 0.08

Metric | Correlation with BABIP
Absolute Val of Angle | -0.29
LD% | 0.40
GB% | 0.16
FB% | -0.32
IFFB% | -0.38
Actual IFFB% | -0.43
IFH% | 0.19
BUH% | 0.09
ISO | 0.11
Spd | 0.25

The Plate Discipline metrics in the first table came in lower than I expected. However, there is some negative correlation between O-Contact% and BABIP, which is logical, but not as much as I expected. The second table is where all the fun is. Yup, it looks like angle plays a real role, while line drive rate, IFFB% and Spd also factor into the equation.
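For anyone who wants to reproduce this kind of check, here is a minimal sketch of computing a correlation between one metric and BABIP. It assumes an ordinary Pearson correlation, which the article does not specify, and the sample values are invented purely to exercise the function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up player seasons, not real data.
ld_pct = [0.18, 0.20, 0.22, 0.24, 0.26]
babip = [0.280, 0.295, 0.290, 0.315, 0.320]

r = pearson(ld_pct, babip)
print(round(r, 2))
```

With real data, each list would hold one value per player season, in the same order.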

You might wonder why there is an IFFB% correlation, as well as an “Actual IFFB%” one. The “Actual” version represents the percentage of pop-ups hit out of all batted balls, not just fly balls. It’s calculated simply by multiplying IFFB% by FB%. A hitter who posts a 15% IFFB% and 50% FB% is going to hit significantly more pop-ups that are easy outs than one with the same IFFB% but only a 30% FB%. So performing this adjustment results in a more accurate picture of the hitter’s batted ball distribution. And sure enough, the correlation is higher with this adjusted number.
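The arithmetic behind that example is short enough to show directly. This sketch assumes the rates are expressed as decimals (15% as 0.15); the function name is my own.

```python
def actual_iffb_rate(iffb_pct: float, fb_pct: float) -> float:
    """Pop-ups as a share of all batted balls: IFFB% multiplied by FB%."""
    return iffb_pct * fb_pct


# The two hitters from the example: same IFFB%, different FB%.
high_fb_hitter = actual_iffb_rate(0.15, 0.50)
low_fb_hitter = actual_iffb_rate(0.15, 0.30)

print(f"{high_fb_hitter:.1%}")  # 7.5%
print(f"{low_fb_hitter:.1%}")   # 4.5%
```

So the fly ball heavy hitter pops up on 7.5% of his batted balls versus 4.5% for the other, even though their IFFB% marks are identical.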

My population set was composed of 2,375 player seasons from 2007 to 2014. After testing several combinations of components, I landed on the following winning equation:

xBABIP = 0.2530 + (O-Contact% * -0.0484) + (ISO * 0.1814) + (Absolute Value of Angle * -0.0024) + (LD% * 0.3657) + (FB% * IFFB% * -0.4531) + (Spd * 0.0046)

Adjusted R-Squared = 0.423
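Here is a minimal sketch of the equation as a function. The coefficients are copied from the equation above; the assumption that the percentage inputs are decimals (a 20% LD% entered as 0.20) is mine, as are the function name, parameter names and the sample hitter.

```python
def xbabip(o_contact, iso, angle, ld, fb, iffb, spd):
    """Estimate BABIP from the article's regression coefficients.

    Assumption: O-Contact%, LD%, FB% and IFFB% are decimals (0.20 = 20%).
    `angle` is the average GB/LD angle; its absolute value is used.
    """
    return (
        0.2530
        + o_contact * -0.0484
        + iso * 0.1814
        + abs(angle) * -0.0024
        + ld * 0.3657
        + fb * iffb * -0.4531  # "Actual IFFB%" term
        + spd * 0.0046
    )


# Hypothetical hitter, not a real player's line.
estimate = xbabip(
    o_contact=0.65, iso=0.150, angle=-10.0,
    ld=0.20, fb=0.35, iffb=0.10, spd=4.5,
)
print(f"{estimate:.3f}")  # 0.303
```

If the decimal assumption is wrong for any input, the output will land far from a believable BABIP, which makes for an easy sanity check.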

And now a plot of the results:

[Plot: xBABIP]

The next step was to determine if using xBABIP would help us forecast a player’s next season BABIP better than BABIP itself. So I calculated the correlation of xBABIP Year 1 to BABIP Year 2 and compared it to the correlation of BABIP Year 1 to BABIP Year 2. Here are the results:

Metric | Correlation
xBABIP Yr 1 to BABIP Yr 2 | 0.404
BABIP Yr 1 to BABIP Yr 2 | 0.352
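The pairing step behind that comparison can be sketched as follows. The data layout (a dict keyed by player and year) and all names and values are assumptions for illustration; the resulting pairs can be handed to any correlation routine.

```python
def year_over_year_pairs(seasons, predictor):
    """Pair a Year 1 stat with the same player's Year 2 BABIP.

    seasons: dict mapping (player, year) -> dict of stats.
    predictor: key of the Year 1 stat, e.g. "xbabip" or "babip".
    Players without a following season in the data are skipped.
    """
    pairs = []
    for (player, year), stats in seasons.items():
        next_season = seasons.get((player, year + 1))
        if next_season is not None:
            pairs.append((stats[predictor], next_season["babip"]))
    return pairs


# Made-up seasons, not real data.
seasons = {
    ("Player A", 2013): {"babip": 0.310, "xbabip": 0.300},
    ("Player A", 2014): {"babip": 0.295, "xbabip": 0.305},
    ("Player B", 2014): {"babip": 0.280, "xbabip": 0.290},
}

xbabip_pairs = year_over_year_pairs(seasons, "xbabip")
babip_pairs = year_over_year_pairs(seasons, "babip")
print(xbabip_pairs, babip_pairs)
```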

Success! Now personally, I would prefer to look at a hitter’s previous three seasons of xBABIP marks to forecast BABIP Season 4 and perhaps take a three-year average of those xBABIP marks. But we don’t always have three seasons’ worth of data, so this is as good as it gets.

It’s time to put this equation into action and investigate the 2014 season. What follows is a spreadsheet with two separate tabs — the first is the overperformers and the second is the underperformers.

Overperformers

We knew that Drew Stubbs couldn’t possibly deserve a .404 BABIP, but an xBABIP mark of .345 confirms that he was still knocking the snot out of the ball. Along with his blend of power and speed, he should once again be one of the most valuable fourth outfielders in fantasy baseball and be a solid contributor in NL-Only leagues or mixed leagues with daily transactions.

J.D. Martinez’s power surge may or may not be for real, but his BABIP surge most certainly wasn’t. Still, a significantly lower rate of pop-ups led to an xBABIP mark that was still higher than any of his previous actual BABIP marks. His draft cost is sure to be highly variable and there will be no telling if he’ll come at a discount, a fair price or be an expensive risk.

Since he rarely takes a walk and strikes out too often for a hitter with mediocre power, Danny Santana is highly dependent on his BABIP for his offensive output. Unfortunately, that’s going to come crashing down in 2015, though he apparently did do all the right things last year to carry an inflated mark. Without any sort of BABIP luck, he’s at risk of losing grip on the leadoff spot in the lineup, which if lost, would take a huge bite out of his fantasy value.

Yasiel Puig outperformed his xBABIP mark in 2013 by even more than in 2014, so perhaps he’s doing something else not being captured by our equation. He’s due for a HR/FB rate rebound anyway, which will offset any decline in BABIP and keep his fantasy value near elite levels.

By doubling his stolen base total and posting a crazy BABIP, Lorenzo Cain put himself back on fantasy owners’ radars. But that high BABIP isn’t going to last and he’ll have just his legs to fall back on to earn himself respectable fantasy value.

Underperformers

Surprise, surprise, Chris Davis tops our underperformers list. Despite finishing fifth in baseball in pull-happiness with his ground balls and line drives, xBABIP still thinks Davis profiles as a strong BABIPer. What helped? A low O-Contact%, a high ISO and LD% and very few pop-ups. His pull tendency lost him a whopping .041 points of xBABIP! While the extreme pull guys are losing hits to the shift that the equation can’t account for, there’s little doubt that Davis should enjoy a significant BABIP rebound this season.
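To see how the angle term produces a number like that, here is a rough back-calculation. It assumes the .041 figure is simply the angle coefficient times Davis’s absolute angle, measured against an angle of zero; the implied angle is derived from that assumption, not taken from the leaderboard.

```python
ANGLE_COEF = -0.0024  # angle coefficient from the xBABIP equation


def pull_penalty(abs_angle: float) -> float:
    """xBABIP change attributable to the angle term alone."""
    return ANGLE_COEF * abs_angle


def implied_angle(penalty: float) -> float:
    """Absolute angle that would produce a given angle-term penalty."""
    return penalty / ANGLE_COEF


davis_angle = implied_angle(-0.041)
print(round(davis_angle, 1))  # 17.1
```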

Brian McCann and Mark Teixeira are two more pull-happy guys whose xBABIP marks may be a bit inflated without any additional reductions from the shifts they face. McCann actually wasn’t totally about pulling the ball as he ranked just 55th among left-handed hitters in angle, but Teixeira ranked third in all of baseball in angle. Since 2012, McCann has underperformed his xBABIP by a large margin, after previously being close to it. That’s surely due to the shift. Teixeira is in the same boat, but his underperformance dates back to 2011.

What’s up with players whose first name ends in “hris” and last name is “Davis”?! Khris Davis doesn’t get shifted like the other Davis, but still hit into some apparent bad luck. Like the other Davis though, this one also hits for excellent power and avoids the pop-up. There’s some serious playing time risk here though due to the presence of Gerardo Parra, but he should perform better from a rate perspective.

Jedd Gyorko underperformed his xBABIP in 2013 as well, but not to the same degree. Expectations of a healthier season and a much improved supporting cast make him a nice option as a cheap second baseman or middle infielder.

Jay Bruce has underperformed his xBABIP in five of his seven seasons and he was one of the most shifted players in 2014. He was a complete disaster at the plate last year, but entering his age 28 season, you have to figure some sort of rebound is in store.

Joey Votto has shown no pattern of xBABIP under or overperformance, which is a good sign for 2015. His BABIP should fully rebound, but his health is obviously the bigger question mark.

In mid-January, Jeff Zimmerman published 2014 xBABIP values using his most updated equation. His formula incorporated “hard hit ball percentage” data from Inside Edge, as well as Speed Score. Let’s compare the xBABIP values using both equations and take a look at whose values differ most.

Pod’s xBABIP More Bullish

My equation loves Jose Abreu’s huge power, relatively low angle, above average line drive rate and pop-up avoidance skills. But Zimm’s equation isn’t so enamored and thinks he should have been a sub-.300 hitter during his phenomenal rookie campaign.

Christian Yelich is a BABIP dream. He hits liners and grounders to all fields, has a bit of pop, above average speed and hit all of one pop-up all season long. Perhaps he didn’t rate highly in terms of “well-hit” balls, which is the type that boosts Zimm’s xBABIP numbers.

Boy oh boy, if Joe Mauer ever lost his sensational ability to turn balls in play into hits, his fantasy value would crater. With a 27.2% LD% and 0 (ZERO!!) pop-ups all year, I have to admit that my .346 xBABIP value appears more reasonable here than Zimm’s mark of .310, which would have represented a career low.

My equation thinks that Jean Segura will enjoy a BABIP rebound this season if his profile holds up, while Zimm’s version thinks his low BABIP was deserved. Either way, it will be tough for him to recover his fantasy value while batting eighth in the lineup.

What’s Dee Gordon without a high BABIP? A one-category standout with lots of risk. My xBABIP thinks he deserved a high BABIP, while Zimm’s disagrees.

Zimm’s xBABIP More Bullish

Zimm’s xBABIP believes that Carlos Santana was quite unlucky last year, while mine thinks that wasn’t necessarily the case. He led baseball in pull percentage and hit a high percentage of pop-ups, which led to the low mark from my version of the equation.

By my equation, Andrew McCutchen was one of the luckier hitters in BABIP this past year. However, Zimm’s equation thinks an inflated BABIP was well deserved. McCutchen has beaten his xBABIP marks in five of his six seasons, and significantly so in the last three years. So perhaps Zimm’s version better reflects his true BABIP talent.

Edwin Encarnacion hasn’t posted a .300 BABIP since 2007, but Zimm’s equation thinks he deserved one in 2014. A low line drive rate, lots of pop-ups and a pull tendency hurt his xBABIP in my formula.

On the whole, I think I made real progress in explaining and projecting hitter BABIP marks. We do still have a lot of room to improve though as dealing with shifts is proving difficult. But this is another step in the right direction and I finally feel comfortable using xBABIP values to help me project next season BABIP marks.





Mike Podhorzer is the 2015 Fantasy Sports Writers Association Baseball Writer of the Year. He produces player projections using his own forecasting system and is the author of the eBook Projecting X 2.0: How to Forecast Baseball Player Performance, which teaches you how to project players yourself. His projections helped him win the inaugural 2013 Tout Wars mixed draft league. Follow Mike on Twitter @MikePodhorzer and contact him via email.

9 Responses to “Developing the Bestest xBABIP Equation Yet”

  1. Rotoholic says:
    FanGraphs Supporting Member

    I like that you’re looking at the predictive value, here. I’m always the guy that pokes and prods about how formulas predict year to year performance (sorry!).

    One thing I’ll say, is that it might make the equation more predictive if you ran the regression with independent variables for Year N and the dependent variable (BABIP) for Year N+1. Describing past performance is a different matter than predicting future performance. My guess is that it would lower the coefficient on LD% since it’s not very predictive, and would likely give a higher overall correlation.

    Anyway, keep up the good work, I enjoy reading your stuff. It’s cool to see you and Jeff tackle the same subject from different angles.

  2. brian_msbc says:

    Where do we find Absolute Value of Angle? Is that data available?

  3. FanGraphs Supporting Member

    It’s the angle column in the Baseball Heat Maps leaderboard linked to. You just take the absolute value of that value. Positive numbers remain the same and negative numbers become positive.

  4. Rotoholic says:
    FanGraphs Supporting Member

    Thinking about this some more, I think there are possible issues with handedness, especially switch hitters. As a case study, let’s look at Ben Zobrist.

    The angle data only uses left-handed plate appearances for switch-hitters, and throws out all data from right-handed plate appearances. This is good. Or at least, better than lumping all the data together. But the problem is that the ISO and batted ball data are (presumably, correct me if I’m wrong) for all plate appearances lumped together. We can correct this by using splits vs RHP, but still, we are ignoring the 30% of his plate appearances that come vs LHP, and many switch hitters have a huge platoon difference in their BABIP. Considering Zobrist has a 0.326 vs LHP and 0.278 vs RHP, it’s no surprise to see him on the list of overperformers with his real-life .301 BABIP in 2014. His xBABIP calculation said he should have had a .278 BABIP. But it’s only using Average Angle data vs RHP. And in 2014, almost exactly mirroring his career, Ben Zobrist had a .276 BABIP against RHP. The formula was correct. I didn’t cherry-pick Zobrist either, he was the first one I thought of. There aren’t that many switch hitters, but it’s just another way to better predict BABIP and improve that correlation. Although, it seems the correlation was taken using in-sample data (the formula was derived by a regression using data from 2007-2014, and then the correlation was tested on those very same years) which would inflate the correlation, and not be a true test of the predictability.

    Another issue with handedness: I assume Average Angle would factor differently for LHB than for RHB. For RHB shifting is rarer, and usually not as extreme, since there needs to be a guy close to 1B. Also, the throw to first on a righty pull hitter is much longer than for lefty pull hitters. So the Angle coefficient used for LHB might be lower than -0.0024 and for RHB, it might be higher. In any case, they are likely different, which means a different equation for each side would make the xBABIP even better. I know all of this takes the simplicity out of it, but when you get to the point of deriving multiple xBABIP formulas, those diminishing returns are embraced!

  5. FanGraphs Supporting Member

    You are absolutely correct. I realized the issue of switch hitters while doing the study, but there was nothing I could do about it since the data I need just isn’t available. It probably has limited effect on the research though since there are few switch hitters screwing up the coefficients.

    Next, the equation isn’t meant to be predictive, but rather descriptive. Instead of using something like a 3 year average BABIP for your year 4 forecasted BABIP, now you can use a 3 year average xBABIP. Just like I wouldn’t use one year of BABIP to forecast year 2, I wouldn’t use one year of xBABIP.

  6. Rotoholic says:
    FanGraphs Supporting Member

    Well, I meant for the correlation done on xBABIP (0.40) vs BABIP (0.35). To truly test the correlation, you’d have to do something like run the regression on data from 2007-2013 (would result in a similar formula to what you have, but slightly different) and then find the correlation of 2013 BABIP and 2013 xBABIP with 2014 BABIP. That way it’s a fair fight, and neither system already knows the data that it’s trying to predict. Since you used an 8 year sample size it won’t be a huge effect, but it’ll definitely make a difference.

  7. FanGraphs Supporting Member

    Thanks for the helpful feedback! I’ll look into doing that and will post the results.

  8. esolney33 says:

    Do you have a formula for pitchers as well?

  9. FanGraphs Supporting Member

    I do not. There have been many attempts at a pitcher xBABIP already and I don’t have any new metrics to incorporate. I tried the hitter version because of the new pull/oppo data.