Clutch Hitting: Fact or Fiction?

August 14, 2003 (revised Feb. 2, 2004)

A generally accepted principle of sabermetrics is that clutch hitting, a staple of TV broadcasts and common baseball commentary, is mostly imaginary. While there are important plays, those that cause a large swing in the likely outcome of a game, it does not seem that some players are any more or less likely to make those key plays. This is the source of scorn in both directions: statheads mock conventional wisdom, while baseball experts see this as proof that sabermetrics is overlooking major parts of the game.

While I don't wish to downplay the significance of previous work (after all, it was a significant finding that 0.250 batters do not turn into 0.400 batters in clutch situations), I also do not feel that the full range of statistical tests has been used to look for clutch factors. In particular, clutch studies generally look for correlations between a player's clutch performance in various years, thus dividing the data sample down to sizes where random noise overwhelms whatever clutch factor one hopes to find.

What is needed, therefore, is a test that avoids chopping the data into small samples and whose results can be unambiguously tested against the null hypothesis (that players perform statistically identically in clutch and non-clutch situations). If a statistically significant difference is found, the follow-up questions are how large it is and how one can determine who is a clutch hitter and who is a choker.

Is Clutch Real?

In order to allow easy reproduction of my calculations, I detail my procedure here. The basic idea was as follows. First, the simplest statistical distribution (and thus the most unambiguous to analyze) is the binomial, which gives the probability of X successes out of N trials. So while one might prefer to use a sophisticated offensive production metric, I will restrict my study to looking at modified on-base averages, in which a plate appearance can have a successful or unsuccessful result. Second, in order to have the largest data samples possible, I have chosen to use career stats for players broken only into two categories: clutch OBP and non-clutch OBP. Again, there might be arguments for breaking stats into smaller subsets, but I believe that by doing so you throw away more statistical sensitivity than you gain.

Data were obtained from the Retrosheet event archive, which lists virtually all events in MLB games in 1969 and 1972-1992, as well as in AL games for 1963, 1967, and 1968. To analyze only plate appearances where the pitcher pitched to the batter and the batter tried to get on base, I removed bunts, intentional walks, and hit batsmen from the list of events, as well as all non-batter related events such as steals, pickoffs, etc. While there is probably a non-zero amount of data contained in the removed categories of events, they are sufficiently uncommon and different from other categories that removing them takes very little away from the statistical accuracy while removing a potentially large source of confusion. (Besides which, statheads could argue endlessly over whether or not a sacrifice bunt constituted a successful or unsuccessful event.)

Of the remaining event types, a walk or hit was defined to be successful; outs (strikeout or out in play) and reached on error were defined to be unsuccessful. Two notes. First, reaching base on error is obviously "successful" in terms of increasing the batting team's odds of victory. However, because the batter's contribution to the play was perhaps to hit an easy ground ball to the shortstop, this cannot be used as evidence of clutch performance. Second, a more difficult decision was to call sacrifice flies unsuccessful, something necessitated by the fact that the base states go into the definition of "clutch situations", a selection effect that could lead to bias. Lest the reader be too concerned over my treatment of errors and sacrifice flies, I have carried out the study with them omitted altogether and measured the same final result.
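The success/failure rules above can be sketched in a few lines of Python. This is only an illustration: the event labels are hypothetical names chosen for readability, not actual Retrosheet event codes, which would first have to be mapped onto these categories.

```python
from typing import Iterable, Optional

# Hypothetical event labels (assumption); Retrosheet uses its own codes.
SUCCESS = {"single", "double", "triple", "home_run", "walk"}
FAILURE = {"strikeout", "out_in_play", "sacrifice_fly", "reached_on_error"}
EXCLUDED = {"bunt", "intentional_walk", "hit_by_pitch",
            "stolen_base", "caught_stealing", "pickoff"}


def classify(event: str) -> Optional[bool]:
    """Return True for a success, False for a failure, None if the event is dropped."""
    if event in SUCCESS:
        return True
    if event in FAILURE:
        return False
    if event in EXCLUDED:
        return None
    raise ValueError(f"unknown event label: {event!r}")


def modified_obp(events: Iterable[str]) -> float:
    """Modified OBP = successes / (successes + failures); excluded events are ignored."""
    counted = [o for o in (classify(e) for e in events) if o is not None]
    return sum(counted) / len(counted) if counted else float("nan")
```

For example, a batter with a single, a walk, a strikeout, a reached-on-error, and a bunt has two successes in four counted plate appearances.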

With a modified OBP defined using unintentional walks and hits as "successes" and non-bunt outs and reached on error as "failures", the next choice was the division of the data into clutch and non-clutch situations so that differences in the modified OBP could be searched for. A temptation is to consider only very high pressure situations, such as ninth inning with two outs, a runner in scoring position, and the team trailing by one run. Doing so will lead to such small sample sizes that no worthwhile measurement can be made. Likewise, one might choose to use sabermetrically-determined "high-leverage" situations, but the goal here is to determine performances in situations that the players deem to be significant. Instead, I took a more liberal approach and defined clutch situations as 6th inning or later, and either a tie game or the tying run on base, at bat, or on deck. (This includes 27% of plate appearances.) All other plate appearances in the 6th inning or later were thrown out to eliminate "zero-clutch" situations in which the game may be out of hand (of course, the tendency of players to ease up in such situations would be an interesting study in its own right). Of all batters appearing in the Retrosheet archive, 612 had 1000 or more plate appearances in "non-clutch" situations and 250+ in "clutch" situations. These 612 players form the basis set for this study.
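One way to read the clutch definition above is sketched below. The function name and its arguments are illustrative assumptions, as is the reading that every plate appearance before the 6th inning counts as non-clutch. The tying run is on base, at bat, or on deck exactly when the batting team's deficit is no larger than the number of baserunners plus two.

```python
from typing import Optional


def situation(inning: int, deficit: int, runners_on: int) -> Optional[str]:
    """Classify a plate appearance as "clutch", "non-clutch", or None (thrown out).

    deficit:    runs the batting team trails by (0 = tie, negative = leading).
    runners_on: number of baserunners, 0 to 3.
    """
    if inning < 6:
        # Assumed reading: all early-inning plate appearances are non-clutch.
        return "non-clutch"
    # Tie game, or tying run on base (deficit <= runners), at bat (+1),
    # or on deck (+2).
    if 0 <= deficit <= runners_on + 2:
        return "clutch"
    # 6th inning or later but not clutch: discarded as possible "zero-clutch".
    return None
```

Under this reading, a team trailing by two with the bases empty in the 8th is in a clutch situation (the tying run is on deck), while a team trailing by three with the bases empty is not.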

A quick look at performances shows that batters overall performed slightly worse in clutch situations than in non-clutch situations (modified OBPs of 0.322 and 0.331, respectively), so this was taken into account in a way that would not invalidate the use of binomial statistics. (The care taken here was probably unnecessary, since the two averages are only 2.5% different, but I felt that it was better to err on the side of caution.) The reason for this discrepancy is unimportant for the purpose of this article, but I assume it relates to the fact that clutch situations for the batting team are frequently save situations for the other team, and thus batters are more likely to face the opposition's best relievers. (Whether this adjustment is made by adding or by multiplying the OBPs is irrelevant; the final result is the same either way.)

After making the correction, I now had the necessary data to make the test using binomial statistics. The probability that a batter gets H1 hits+walks and O1 outs in P1 clutch plate appearances and H2 hits+walks and O2 outs in P2 non-clutch appearances if players perform equally well in clutch and non-clutch situations is given in the appendix. The probability of all 612 players having their actual stats if clutch hitting does not exist equals the product of each of the 612 individual probabilities.

The expectation value of the probability can be determined using a Monte Carlo test, in which a large quantity of data is simulated with identical properties as the actual data. To do this, I ran several thousand trials in which 612 players were created at random using the observed OBP talent distribution (0.329 average, 0.026 rms) and given batting stats based on their talent. The probability from the actual data was compared to that from the Monte Carlo tests to measure the probability that the actual stats might have resulted from randomness alone (i.e. clutch hitting is no different from non-clutch hitting). A second set of tests was made, giving the random players slightly different OBP talent in clutch and non-clutch situations to measure the probability that the actual stats might have resulted from randomness plus a clutch/non-clutch difference.
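A minimal sketch of such a Monte Carlo test follows, using only the Python standard library. It is a simplified illustration and not the program used for this study: it assumes Gaussian-distributed talent, ignores the small league-wide clutch/non-clutch offset discussed earlier, and uses hypothetical function names. The per-player probability is the expression derived in the appendix, and the `clutch_sd` argument is a hook for the second set of tests in which clutch talent differs from overall talent.

```python
import random
from math import lgamma


def log_prob_same_rate(h1, o1, h2, o2):
    """Log of the appendix probability that both samples share one success rate."""
    def lf(n):
        return lgamma(n + 1.0)  # log(n!)
    p1, p2 = h1 + o1, h2 + o2
    return (lf(h1 + h2) + lf(o1 + o2) + lf(p1) + lf(p2)
            - lf(1 + p1 + p2) - lf(h1) - lf(o1) - lf(h2) - lf(o2))


def dataset_log_prob(players):
    """Log of the product of the individual player probabilities."""
    return sum(log_prob_same_rate(*p) for p in players)


def simulate_player(rng, n_clutch, n_non, mean=0.329, sd=0.026, clutch_sd=0.0):
    """Draw one random player and his (H1, O1, H2, O2) line.

    Talent is drawn from a Gaussian (an assumption); clutch_sd = 0 is the
    "no clutch difference" model.
    """
    talent = rng.gauss(mean, sd)
    clutch_talent = talent + (rng.gauss(0.0, clutch_sd) if clutch_sd > 0 else 0.0)
    talent = min(max(talent, 0.0), 1.0)
    clutch_talent = min(max(clutch_talent, 0.0), 1.0)
    h1 = sum(rng.random() < clutch_talent for _ in range(n_clutch))
    h2 = sum(rng.random() < talent for _ in range(n_non))
    return (h1, n_clutch - h1, h2, n_non - h2)


def null_p_value(actual, n_trials=1000, seed=0):
    """Fraction of simulated no-clutch data sets less probable than the actual data."""
    rng = random.Random(seed)
    actual_lp = dataset_log_prob(actual)
    worse = 0
    for _ in range(n_trials):
        sim = [simulate_player(rng, h1 + o1, h2 + o2) for h1, o1, h2, o2 in actual]
        if dataset_log_prob(sim) < actual_lp:
            worse += 1
    return worse / n_trials
```

Here `actual` is a list of (H1, O1, H2, O2) tuples, one per player, with each simulated player given the same numbers of clutch and non-clutch plate appearances as his real counterpart. A small returned fraction means the no-clutch model rarely produces data as inconsistent as the real data.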

The results were surprising. The probability that data less consistent than the actual data would be created with no clutch difference was a mere 0.9%, which means that the "zero clutch" model matched the data extremely poorly. A perhaps more useful number is the ratio of the probabilities that the data were generated with and without a clutch difference, often referred to as bookmakers' odds. Odds of 1:1 mean there is no evidence of a difference between clutch and non-clutch hitting; I measured odds of 14:1 in favor of a difference.

To ensure that this was not a result of my definition of clutch situations, I ran the same calculations using a few other definitions of 'clutch' hitting. One is from the Great American Baseball Stat Book (7th inning or later with a 1-run lead, tie, or tying run on base, at bat, or on deck). Using this definition, the probability becomes 4% that comparable or worse data were produced with no inherent clutch/non-clutch difference, with bookmakers' odds of 4.2:1 that there is a difference. (Interestingly, if I remove plate appearances with 1-run leads from the GABSB definition, the clutch/non-clutch discrepancy increases, as one might expect since a 1-run lead is not a situation in which there is unusually high pressure to score.) Finally, I defined "clutch situations" as those with runners in scoring position (in any inning); this produced a probability of 0.02% and bookmakers' odds of over 400:1.

Regardless of how one chooses to define "clutch" situations, it is clear that there is indeed a statistically significant difference between how players perform in clutch and non-clutch situations.

Sanity Check

An obvious question to ask is whether my statistical treatment is in error, or whether there is a bug in the program. To address this, I created three additional "clutch" definitions, none of which corresponds to a situation in which clutch performance is expected to matter. First, "clutch" was defined as any plate appearance in the 3rd-5th innings. Applying this definition to the calculations described above, one finds a 22% chance that comparable or worse data could have been produced with no real difference between the two, and bookmakers' odds of only 1.3:1. With batters doing slightly better than average in these innings, this marginal result may indicate a difference as batters adjust to pitchers the second time through the rotation (and some batters adjust better than others). However, it is not strongly statistically significant and has a reasonably high chance of having been produced entirely randomly.

A second test defined clutch as situations in which the batting team had a 1- or 2-run lead; this produced a probability of 45% and bookmakers' odds of 1.0:1. The final test defined clutch as plate appearances with one out; this produced a probability of 46% and also bookmakers' odds of 1.0:1.

It should be noted that in all three of these "non-clutch" calculations, the number of selected at-bats was greater than the number used in my initial clutch hitting study, so the lack of significant effects in any of those is not the result of small number statistics. The clear conclusion, therefore, is that there were no significant deviations from expected random behavior in any of these three control experiments.

Clutch Hitting Skills

The fact that clutch hitting is indeed a real skill in baseball does not necessarily mean it is an important one. If it accounts for a maximum change of 0.0005 points of OBP, for example, it makes a difference at most once in 2000 clutch at-bats, which means a player with a 13-year career would likely get one additional clutch hit. So the second important question is: how large a factor is clutch hitting?

To answer this, I took the same calculations made above, but introduced a small amount of randomness in the Monte Carlo tests when calculating simulated player clutch batting stats. Assuming a Gaussian distribution of clutch hitting skills, a standard deviation of 0.0071 (OBP) was sufficient to model the actual data adequately. (Note that the standard deviation is the typical difference of a player from average, not the maximum difference.) While an OBP variation of 0.0071 does not sound like a whole lot, it is significant. Making a similar calculation for overall skills, the standard deviation of players' overall OBP rates is only 0.0258, meaning that the difference between a good and bad clutch performer is 28% of the difference between a good and bad hitter -- a fraction much larger than has generally been thought. Is clutch everything? No. Is it important? Absolutely.

One last point that should be addressed here is the importance of predictive power. It is commonly assumed that a factor is only important if it can be used to predict future performance. Over the course of a season, an average hitter will get approximately 150 clutch plate appearances, in which he will get on base 49 times with a standard deviation due to randomness of 5.7. The difference between a "one standard deviation good" and an average clutch hitter amounts to only 1.1 successful appearances, while the difference between a good and an average overall hitter amounts to 3.9 successful plate appearances. In short, any argument that clutch skills should be ignored could equally well be an argument that all batting skill should be ignored in clutch situations, given that randomness is the largest factor of all.

The size of the effect should make it clear why many attempts to measure clutch hitting have failed. With random effects accounting for +/-5.7 clutch hits and clutch hitting accounting for only +/-1.1 clutch hits, the correlation between a player's clutch hitting one year and the next should equal only 0.04, which is virtually undetectable.
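The arithmetic behind these numbers can be reproduced with a short calculation. The sketch below assumes the binomial standard deviation for the random scatter and assumes that the year-to-year correlation equals the skill variance divided by the total (skill plus random) variance; the variable names are illustrative.

```python
from math import sqrt

pa = 150             # clutch plate appearances per season (from the text)
obp = 0.329          # average modified OBP (from the text)
clutch_sd = 0.0071   # spread of clutch skill, in OBP (from the text)
overall_sd = 0.0258  # spread of overall skill, in OBP (from the text)

expected_successes = pa * obp             # about 49 times on base
noise_sd = sqrt(pa * obp * (1.0 - obp))   # binomial scatter, about 5.7
clutch_effect = pa * clutch_sd            # about 1.1 successful appearances
overall_effect = pa * overall_sd          # about 3.9 successful appearances

# Correlation between two seasons that share the same skill but have
# independent random scatter (assumed form).
year_to_year_r = clutch_effect**2 / (clutch_effect**2 + noise_sd**2)

print(round(expected_successes), round(noise_sd, 1),
      round(clutch_effect, 1), round(overall_effect, 1))
print(year_to_year_r)
```

With unrounded inputs the correlation comes out between 0.03 and 0.04; using the rounded values 1.1 and 5.7 gives a value that rounds to 0.04. Either way it is far too small to detect from season-to-season comparisons.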

Clutch Hitters and Chokers

Now comes the final question: who is a clutch performer and who is a choker? Given the uncertainties involved in measuring clutch skills, nobody can be labeled with 100% confidence. Nevertheless, there are several players with very large numbers of plate appearances (1000 in clutch situations) during the years studied that are probably clutch performers, and several that are probably chokers. First the clutch performers: Ozzie Smith, Dave Collins, Lloyd Moseby, Tony Gwynn, Ruppert Jones, Rafael Ramirez, Alfredo Griffin, Jorge Orta, Rickey Henderson, and Bill Buckner. Now the chokers: Dave Winfield, Rod Carew, Lance Parrish, Jack Clark, Reggie Jackson, Roy Smalley, Tom Herr, Fred Lynn, Carl Yastrzemski, and Julio Franco. (Note that all of these players are good because of selection effects; one has to be a regular for about 7 seasons to pick up 1000 clutch plate appearances.)

What stands out immediately about this list is that the "chokers" are largely power hitters, while the "clutch performers" are largely singles hitters. The statistics are not convincing for only ten players, so I measured the correlation between clutch performance and power hitting for all players included in the study. The relation remained intact, with the strongest correlation being between clutch performance and the slugging average (as opposed to isolated power, H/HR ratio, or other similar power stats). The correlation was also present if "clutch performance" (on-base percentage) was redefined as "clutch slugging average", "clutch OPS", etc.; in other words there is no way of tweaking the data to eliminate this conclusion.

If it were caused by how pitchers approach different hitters in clutch situations, one would expect a pitcher to pitch around high-slugging hitters, resulting in more walks and a higher on-base average. Since I measure the opposite, the more likely possibility is that power hitters tend to approach the plate differently in clutch situations than in non-clutch situations. Perhaps they tend to swing for the fences more than average; if so, this tactic is unsuccessful as their slugging percentages are also lower in clutch situations. Another plausible option is that a team's stars feel more pressure to come through. Regardless of the exact reason, the size of the correlation is such that an increase of 0.050 in slugging average generally means a loss of 0.005 in both OBP and SLG in clutch situations.

It should be noted that the correlation with slugging average contains two significant correlations. One is that power hitters (as judged by isolated power or HR/H ratio) tend to choke; the other is that good hitters (as judged by batting average or on base average) likewise tend to choke. (As always, to "choke" means to play worse than that player normally plays; it does not mean to play worse than the league average.)

This correlation accounts for half of the variation in player clutch performance, the remainder apparently being player-to-player variation, which can be significant. For example, George Brett and Eddie Murray should have been classic chokers with lifetime 0.487 and 0.476 slugging averages, respectively, but actually were better than average in the clutch. Tim Foli represents the other extreme: poor contact, poor power, and poor clutch performance. (Perhaps not coincidentally, Brett and Murray are hall of famers, while Foli was a classic defense-only shortstop.)

While the player-to-player variations are too small to measure accurately, the strong correlation with slugging average provides useful information that can be used to guess how well a player will do in clutch situations, and having half the information is better than having none at all. The expected change in on-base average in clutch situations equals:

  clutch OBA = -0.007 - 0.10*(SLG-avgSLG),

where "avgSLG" is the MLB slugging average. The expected change in slugging in clutch situations equals:

  clutch SLG = -0.017 - 0.11*(SLG-avgSLG).

In principle, one could attempt to use regression to average in a player's actual stats, but it would require 9000 clutch plate appearances to reduce the random error in clutch OBA to 0.005 (which would lower the regression amount to 50%).
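The two linear relations and the plate-appearance estimate can be written out as a short sketch. The function names are illustrative, and the league slugging average of 0.400 in the usage note is a made-up placeholder rather than a value from this study. The sample-size estimate assumes the binomial standard error sqrt(x*(1-x)/P).

```python
def expected_clutch_change(slg, avg_slg):
    """Expected change in (OBA, SLG) in clutch situations for a given slugging average."""
    diff = slg - avg_slg
    clutch_oba = -0.007 - 0.10 * diff
    clutch_slg = -0.017 - 0.11 * diff
    return clutch_oba, clutch_slg


def pa_needed(target_error, obp=0.329):
    """Plate appearances needed for the binomial standard error to equal target_error."""
    return obp * (1.0 - obp) / target_error**2
```

For a hypothetical 0.450 slugger in a league slugging 0.400, `expected_clutch_change(0.450, 0.400)` gives an expected OBA change of about -0.012. `pa_needed(0.005)` comes out a little under 9000, in line with the figure quoted above.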

Conclusions

Clutch hitting is an important skill in baseball.
The difference between a good and a bad clutch performer is about 28% of the difference between a good and a bad hitter, a much larger effect than had previously been thought from sabermetric work. So it is unlikely that any 0.250 hitters turn into 0.400 hitters in clutch situations, but there are 0.285 hitters who turn into 0.300 hitters.
Because of random effects, it is extremely difficult, if not impossible, to peg a specific player as a clutch performer or choker with a high degree of certainty. (For that matter, it is extremely difficult to ascertain much of anything about a player's batting skills to an accuracy better than 20 points of OBP based on one season's stats.)
That said, power hitters that perform better in the clutch are fairly rare, as are singles hitters that perform worse in the clutch. This can be used to make an educated guess of a player's clutch tendencies.

I don't pretend to understand exactly what makes one player a clutch hitter and another a choker. The correlation with slugging average probably gives a good clue, but half the "skill" seems to be uncorrelated with any obvious batting stats. However, it does appear that clutch hitting exists and that its importance has been generally underestimated.

Appendix: Binomial Statistics

Binomial statistics are often treated by approximating the binomial distribution as a Gaussian with a standard deviation of sqrt(x*(1-x)/P), where "x" is the probability of a successful result and "P" is the number of trials. For most work, this will suffice, but since I am looking for very minute deviations from binomial statistics, a more proper treatment is warranted.

The probability of "H" successful results out of "P" trials, if the probability of a successful result in one trial equals "x", equals:

P(H|x,P) = P! x^H (1-x)^(P-H) / (H! (P-H)!)

Since we are trying to evaluate the probability of "H1" successful results out of "P1" trials and "H2" successes out of "P2" trials, the probability of both occurring from the same "x" is the product of the above equation evaluated for each sample. Writing "O" for "P-H", the probability of both results being true is:

P(H1,H2|x,P1,P2) = P1! P2! x^(H1+H2) (1-x)^(O1+O2) / (H1! H2! O1! O2!)

We prefer to not make any assumption about a player's true value of "x". Instead, one should marginalize x by integrating the probability above over the full range of values x could be, in this case from zero to one. A useful integral here is:

integral(x=0,1) x^A (1-x)^B dx = A! B! / (1+A+B)!

Using this solution for the integral in the previous equation, the probability of a player having H1 successes and O1 failures in P1 appearances and H2 successes and O2 failures in P2 appearances, if the odds of success are the same in both samples, equals:

  (H1+H2)! (O1+O2)! P1! P2!
  --------------------------.
  (1+P1+P2)! H1! O1! H2! O2!
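Because the factorials overflow quickly for career-sized samples, this expression is easier to evaluate in logarithms. The sketch below uses the log-gamma function from the Python standard library, with log(n!) = lgamma(n+1); the function names are illustrative.

```python
from math import exp, lgamma


def log_factorial(n):
    """log(n!) computed via the log-gamma function."""
    return lgamma(n + 1.0)


def log_prob_same_rate(h1, o1, h2, o2):
    """Log-probability of (H1, O1) and (H2, O2) if both samples share one success rate.

    Implements (H1+H2)! (O1+O2)! P1! P2! / ((1+P1+P2)! H1! O1! H2! O2!).
    """
    p1, p2 = h1 + o1, h2 + o2
    return (log_factorial(h1 + h2) + log_factorial(o1 + o2)
            + log_factorial(p1) + log_factorial(p2)
            - log_factorial(1 + p1 + p2)
            - log_factorial(h1) - log_factorial(o1)
            - log_factorial(h2) - log_factorial(o2))


# Small check: one success in one trial, twice, is the integral of x*x from
# 0 to 1, which is 1/3.
print(exp(log_prob_same_rate(1, 0, 1, 0)))
```

Summing the per-player log-probabilities gives the log of the product over all players used in the main text.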

Note that the above calculations make no assumption regarding the inherent distribution of player abilities, x. A convenient form for such a prior, which can be multiplied into the probability equations above, equals:

P(x) = x^A (1-x)^B

where A and B are chosen so that the distribution's average and standard deviation equal those of the inherent distribution of player abilities.
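One way to pick A and B is to note that x^A (1-x)^B is an unnormalized beta distribution with shape parameters A+1 and B+1, and to match its mean and standard deviation by the method of moments. The sketch below is an illustration under that assumption, with a hypothetical function name; A and B generally come out non-integer, in which case the factorials are understood as gamma functions.

```python
from math import sqrt


def prior_exponents(mean, sd):
    """Return (A, B) so that x^A (1-x)^B has the given mean and standard deviation."""
    # For a beta distribution with shapes a and b:
    #   mean = a / (a + b),  variance = mean * (1 - mean) / (a + b + 1)
    total = mean * (1.0 - mean) / sd**2 - 1.0  # a + b
    a = mean * total
    b = (1.0 - mean) * total
    return a - 1.0, b - 1.0


A, B = prior_exponents(0.329, 0.026)

# Recover the moments as a consistency check.
a, b = A + 1.0, B + 1.0
check_mean = a / (a + b)
check_sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
print(A, B, check_mean, check_sd)
```

For the talent distribution quoted in the main text (0.329 average, 0.026 rms), this gives A of roughly 106 and B of roughly 217.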

Doing this changes the probability of a player's stats being observed if clutch does not exist to:

  (H1+H2+A)! (O1+O2+B)! P1! P2!
  ------------------------------,
  (1+P1+P2+A+B)! H1! O1! H2! O2!

times some constants that do not matter. I choose not to use this second equation because it also measures whether or not the overall stats meet the assumed talent distribution (something I'm not particularly interested in for this study), but I include it for the sake of completeness. It should be noted that the final results are identical regardless of which comparison statistic is used.

As a postscript, I should note that I was once told that my findings duplicated those of previous studies and that this work contains nothing new. I am unaware of any previous studies of this topic that come to this conclusion; if any indeed exist I apologize for repeating earlier work without giving proper credit.

Update, 9/27/04

The preceding article was posted on Baseball Primer in February 2004, but the discussion appears to have been erased when Primer moved to a new server. There are several points brought up during that discussion that I would like to add:

A suggestion was made that this was due to good hitters being more likely to face same-handed pitchers in clutch situations. Limiting the study to at-bats with RH batters and RH pitchers, the result was unchanged and thus this was not a cause of the discrepancy in clutch hitting performance.
Another suggestion was made that good hitters are more likely to face good pitchers in clutch situations. Correcting the OBPs for quality of pitchers faced, the result was again unchanged, meaning that this was not a cause of the discrepancy.
Still another suggestion was made that including men on base in my clutch situation definition caused changes due to men on base to be confused with changes due to pressure. Redefining clutch situations based only on inning and run difference, I found the result to be unchanged.
There was discussion that the definition of clutch hitting was too loose, including many situations in which a batter wouldn't feel an unusual amount of pressure. This was intentional on my part, as I wanted to improve the size of the sample. However, in doing so, it is likely that I watered down the result. Redoing the calculations with a stricter definition of clutch hitting, I found the same significance of the result and a much larger discrepancy between clutch and non-clutch hitting. This would indicate that the original result was indeed watered down by including low-pressure situations. In addition, the correlation between clutch hitting and slugging was weaker, meaning that a greater fraction of a batter's clutch tendency is uncorrelated with his hitting stats.
There was a great deal of discussion as to whether or not the tendencies of power hitters were evidence that what I am measuring is not "clutch", but rather tendencies of various types of hitters against the types of pitching one would find in clutch situations. It is unclear how to prove or disprove this theory without knowing what types and quality of pitches each batter saw during his at-bat, but the fact that this correlation is weaker in higher-pressure situations would seem to suggest that the important part of the variation is not a function of hitter profile.
A small error in the Retrosheet data I initially downloaded caused approximately half a season of at-bats to be duplicated. Naturally this artificially increased the significance of the result. After fixing the error, the result is still statistically significant.
Another recent study on the topic, by Tangotiger using 1999-2003 data, claimed a result similar to mine. However, his uncertainties were underestimated, thus creating a false positive. After fixing the error, his finding was not statistically significant, something verified by both Alan Jordan and me.
