Mining for Under (and Over) Performers: Strikeouts
A while back, I had a little pet project to try to simplify the process of sniffing out the over- and under-performers relative to strikeout rates. More specifically, recognizing the sometimes wild fluctuations in strikeout rates from year to year, I wanted a better idea of whether a particular pitcher earned their increase (or decrease) in the category. Was there a process, similar to the one we use on ERA with batting average on balls in play and strand rates, that we could go through for strikeout rates?
Obviously, a high swinging strike rate suggests an inherent ability to strike batters out. Makes sense: if you don’t miss many bats, you’re not likely to register many strikeouts. So using the swinging strike rate to separate the pretenders from the contenders has merit as the season wears on. But I wanted to tighten that up a bit by adding variables that would control for other parts of a pitcher’s skill set, to help us identify who should reasonably be expected to strike out more, or fewer, batters. And of course, this is with fantasy baseball in mind, so the idea was that we can all outsmart the next guy in the strikeout column.
I originally looked at strikeouts per nine innings pitched (K/9) as my dependent variable in the analysis, but as colleague Jeff Zimmerman astutely pointed out to me later, the strikeout percentage is a stronger measure to show actual ability. That is, “three straight strikeouts in an inning is better than three strikeouts and three hits in an inning even though they both end up as the same K/9.” An excellent observation. So back to the drawing board I went, with some improvements to the sample size, the overall data set, and yes, we’ll use strikeout percentage as our dependent variable.
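Jeff's point is easy to check with a little arithmetic. Here is a minimal sketch (the function names are my own, for illustration only) that computes both rates for the two hypothetical innings he describes:

```python
def k_per_9(strikeouts, innings):
    """Strikeouts per nine innings pitched."""
    return 9 * strikeouts / innings


def k_pct(strikeouts, batters_faced):
    """Strikeouts as a share of batters faced."""
    return strikeouts / batters_faced


# Inning 1: three batters, three strikeouts.
clean_k9 = k_per_9(strikeouts=3, innings=1)
clean_kpct = k_pct(strikeouts=3, batters_faced=3)

# Inning 2: three strikeouts plus three hits, so six batters faced.
messy_k9 = k_per_9(strikeouts=3, innings=1)
messy_kpct = k_pct(strikeouts=3, batters_faced=6)

print(clean_k9, messy_k9)      # identical K/9 for both innings
print(clean_kpct, messy_kpct)  # K% is twice as high in the clean inning
```

K/9 only sees outs recorded, while K% sees every batter who came to the plate, which is why it separates the two innings.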
First, the data set.
I’m looking at starting pitchers who tossed at least 100 innings in 2011. In the last study, I only used qualified starters, which limited the sample considerably, so this opens things up accordingly. In order to make a strikeout rate comparison with the previous year, which will help us with the model, I need these pitchers to have thrown at least 100 innings in 2010 as well. So we lose some names in that cut (most notably, guys like Jeremy Hellickson, Jordan Zimmermann, Michael Pineda, Vance Worley, Brandon Beachy, Cory Luebke, and Alexi Ogando).
Next, I only wanted to compare pitchers that use their fastball enough to make the velocity on that pitch relevant, and thus kicked out R.A. Dickey and Tim Wakefield. Comparing their fastball velocity in this sample would only serve to skew the results since they’re obviously working on the knuckler for their outs, by and large. So we lose two more.
What we’re left with is a sample of 97 starting pitchers who have thrown at least 100 innings in the last two seasons.
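If you want to rebuild a sample like this yourself, the cuts amount to a simple filter. The sketch below is only an illustration: the field names are hypothetical and every innings total is an invented placeholder, not a real stat line.

```python
MIN_INNINGS = 100
# The two knuckleballers removed from the sample in the text above.
KNUCKLEBALLERS = {"R.A. Dickey", "Tim Wakefield"}

# Placeholder rows: all innings totals here are invented for illustration.
pitchers = [
    {"name": "Pitcher A", "ip_2010": 180.0, "ip_2011": 201.0},
    {"name": "Pitcher B", "ip_2010": 36.0, "ip_2011": 189.0},   # fails the 2010 cut
    {"name": "Pitcher C", "ip_2010": 150.0, "ip_2011": 95.0},   # fails the 2011 cut
    {"name": "Tim Wakefield", "ip_2010": 120.0, "ip_2011": 120.0},  # placeholder innings
]


def in_sample(pitcher):
    """Keep pitchers with 100+ innings in both seasons who are not knuckleballers."""
    enough_innings = (
        pitcher["ip_2010"] >= MIN_INNINGS and pitcher["ip_2011"] >= MIN_INNINGS
    )
    return enough_innings and pitcher["name"] not in KNUCKLEBALLERS


sample = [p["name"] for p in pitchers if in_sample(p)]
print(sample)
```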
Running correlations on this group, the three predictors we have in mind, 2010 K%, fastball velocity (FBv), and swinging strike rate (SwStr%), all have statistically significant correlations with 2011 K% (represented in the graphs as K%) at the 0.01 significance level. (Incidentally, age was interestingly correlated at -.231, significant at the .05 level, but that’s for another study.) The correlations are as follows:
Graphically, looking at the relationships of K% (2011) with 2010 K%, FBv, and SwStr%, you can see that all relationships in the model are statistically significant, and the associated R-squared in each case helps define how much of the variance can be explained by that given relationship. In other words, how much of a pitcher’s 2011 strikeout percentage can be explained by his 2010 K%? How much of that percentage is explained by his fastball velocity? Forgive me for the odd formatting, but if you click on the graph, you should have full functionality to hover over each case to get the name of the pitcher and the two sets of associated data, which is rather fun:
The R-squared for the 2010 K% is 0.568 with a p-value of <0.0001
The R-squared for the FBv is 0.295 with a p-value of <0.0001
The R-squared for the SwStr% is 0.681 with a p-value of <0.0001
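With a single predictor, R-squared is simply the square of the Pearson correlation, so the three figures above imply correlations of roughly 0.75, 0.54, and 0.83 in magnitude. A small sketch, using made-up numbers rather than the actual 97-pitcher sample, shows the two calculations agreeing:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared_simple(xs, ys):
    """R-squared of a one-predictor least-squares line, computed from residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented SwStr% and K% values (as fractions), for illustration only.
swstr = [0.06, 0.08, 0.09, 0.11, 0.12]
k_rate = [0.14, 0.19, 0.18, 0.24, 0.25]

r = pearson_r(swstr, k_rate)
r2 = r_squared_simple(swstr, k_rate)
print(round(r ** 2, 6), round(r2, 6))  # the two values match
```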
Plugging all three into a linear regression model that uses 2011 K% as our dependent variable — in an effort to come up with an expected K% in 2011 — we find a model represented by this:
xK% = -.278 + (.003)*FBv + (1.428)*SwStr% + (.321)*K% 2010
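To make the formula concrete, here is a sketch that plugs in one pitcher. I'm assuming the rates are entered as fractions (0.20 rather than 20) and velocity in miles per hour, since that is what makes the intercept produce sensible results. The inputs are invented, not any real pitcher's line. Note too that the published coefficients are rounded to three decimals, and the velocity coefficient in particular is coarse, so don't expect this to reproduce the expected rates in the results table exactly.

```python
def expected_k_pct(fbv_mph, swstr, k_pct_2010):
    """Expected 2011 K% from the regression equation above.

    Assumes swstr and k_pct_2010 are fractions (e.g. 0.09, 0.20)
    and fbv_mph is average fastball velocity in miles per hour.
    """
    return -0.278 + 0.003 * fbv_mph + 1.428 * swstr + 0.321 * k_pct_2010


# Invented inputs: a 91 mph fastball, 9% swinging strikes, 20% K rate in 2010.
example = expected_k_pct(fbv_mph=91.0, swstr=0.09, k_pct_2010=0.20)
print(f"{example:.1%}")  # about 18.8%
```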
The model summary and fit:
| R | R Square | Adjusted R Square | Std. Error of the Estimate |
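For anyone unfamiliar with the adjusted figure in that summary, it discounts R-squared for the number of predictors in the model. Here is a sketch of the standard formula, using this study's 97 pitchers and three predictors but a placeholder R-squared (the 0.75 below is not the model's actual value):

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared: penalizes each extra predictor."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# 97 pitchers and 3 predictors come from the study; 0.75 is a placeholder.
adj = adjusted_r_squared(r_squared=0.75, n_obs=97, n_predictors=3)
print(round(adj, 4))
```

With 97 observations and only three predictors the penalty is small, so the adjusted value sits just below the raw one.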
So the model explains a good deal more variance in strikeout rate than any single variable, which of course is what this is all about. Applying this to our 97 starting pitchers from the sample set, there are 50 starting pitchers whose expected rate is within 1.5% (either above or below) of their actual 2011 K%, there are 35 more than 1.5% above it (outperforming their expected strikeout rate), and there are 12 more than 1.5% below it (under-performing their expected strikeout rate). The full results, with analysis of a couple of cases thereafter:
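The bucketing itself is mechanical. Here is a sketch, assuming the difference is actual minus expected K% measured in percentage points and that the 1.5 band is inclusive (both are my assumptions). The Greinke and Wood figures are the ones quoted later in this piece, and "Pitcher X" is invented.

```python
BAND = 1.5  # percentage points; assumed inclusive


def classify(actual, expected, band=BAND):
    """Label a pitcher by how far his actual K% sits from the expected K%."""
    diff = actual - expected
    if diff > band:
        return "over-performer"
    if diff < -band:
        return "under-performer"
    return "within band"


cases = {
    "Zack Greinke": (28.1, 21.4),  # actual, expected (from the article)
    "Travis Wood": (15.5, 15.5),   # actual, expected (from the article)
    "Pitcher X": (15.0, 18.0),     # invented example
}

labels = {name: classify(actual, expected) for name, (actual, expected) in cases.items()}
for name, label in labels.items():
    print(name, label)
```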
| Name | x2011K% | 2010 K% | 2011 K% | FBv | SwStr% | Difference |
Zack Greinke topping this list as the resident over-performer is probably misleading for a couple of very obvious reasons. One, his 2010 K% of 19.7% was about seven percentage points behind 2009. Two, his move to the National League should also help his strikeout figures. His career rate of 21.1% would have been a more reasonable number to use instead of 2010 K%, but hey, we can’t cherry-pick. 28.1% might be a little lofty looking forward for Greinke, but I also think the model probably underrates him at 21.4%.
Cliff Lee is an interesting one. His 25.9% K rate in 2011 is far and away the highest of his career. His career K% is 19.3%, so using the 2010 K% figure of 22% isn’t necessarily under-representing his skill set. But Lee also moved leagues, and the last time he was with the Phillies, he posted the second-highest K rate of his career. I’d bet that he won’t repeat 25.9%, but he should best 20% in the National League.
Also interesting is the pretty dramatic drop in K% for Travis Wood between 2010 and 2011. Objectively, because he doesn’t have a terribly long track record, you might expect him to land somewhere comfortably in between his 20.5% rate from 2010 and his 15.5% rate from 2011. But in this sample, he was one of only two pitchers to perfectly match his expected K%, so according to the model, he fully earned his 15.5% K rate over the course of 2011.
The model predicts that several pitchers ought to have performed significantly better than they did in 2011 as well. In particular, it sees pitchers such as Jaime Garcia, Hiroki Kuroda, and Luke Hochevar having the skills to post stronger strikeout rates than they did in either of the last two seasons. And while it suggests Francisco Liriano is capable of more relative to strikeout rates, I suspect the rest of the Twins faithful feels the same way — and you’ll want to weigh his fantasy value less on whether he underperformed in strikeouts than whether he is healthy and manages to find the strike zone again going forward.
There are many interesting nuggets in the overall sample, but I encourage you to take each as merely another information point as you make plans for 2012. For pitchers stacked towards the middle, there’s probably not much of a story, other than the fact that you can take their 2011 rate with a little more confidence. For those on the poles, consider the back story (Greinke, for instance) but you might also consider them to be good candidates to regress or improve going forward.
Michael was born in Massachusetts and grew up in the Seattle area but had nothing to do with the Heathcliff Slocumb trade although Boston fans are welcome to thank him. You can find him on twitter at @michaelcbarr.
wow. really interesting article. I’m definitely going to be on the look-out for those variables as I get closer to draft day.