The Case Against a Case Against FIP


At FanGraphs, our headline WAR number for pitchers is based on FIP. Because of that, and because people enjoy debating and arguing, there’s a yearly refrain that you’ve probably heard. “FanGraphs pitching WAR only considers (X)% of what a pitcher does, how can that be used for value?” No one would dispute that year-one FIP does a better job of estimating year-two ERA than ERA does – or at least, not many people would – but the discussion around whether FIP does a good job of assigning year-one value is alive and well.

One reason for this view is pretty obvious. FIP considers home runs, strikeouts, walks, and hit batters to estimate pitcher production on an ERA scale. Our WAR does some fancy stuff in the background – it treats infield fly balls, which virtually never fall for hits, as strikeouts, and it adjusts for park and league. In the end, though, it’s estimating pitcher value using just three (well, actually four — HBPs always draw the short straw) outcomes. There are a lot of other outcomes in baseball!

In 2021, roughly 39% of plate appearances ended in a homer, strikeout, walk, hit batter, or infield pop up. One thing you could think, in recognition of that fact, is that FIP-based WAR doesn’t consider enough of a pitcher’s production. You wouldn’t use 40% of a hitter’s plate appearances to calculate their WAR, so why do it for pitchers? But that doesn’t actually make sense, as David Appelman pointed out to me recently. Assuming “average results on balls in play” is actually going to be pretty close for every pitcher, by definition.

How so? Consider it this way. Let’s say you’re a pitcher who for whatever reason allows a .350 BABIP. That’s awful! That would be one of the 10 highest single-season BABIPs allowed since 2015 (minimum 100 innings pitched). Let’s get a little more specific and say that you, the reader, are Brady Singer, with a .350 BABIP in 2021. Oof, sorry Brady (but thanks for reading).

FIP assigns Singer league average results on balls in play. Our modified FIP-WAR does a little better by including infield fly balls. That’s not how it broke down for him in real life, though. How poor of an estimate was FIP’s “Singer will have average BABIP” assumption? It’s actually pretty easy to do the math.

After excluding infield fly balls, Singer allowed an fBABIP (I’m just calling it that for the duration of this article, but that’s not a real term) of .359. The league as a whole allowed an fBABIP of .301. Over the 368 balls in play that Singer allowed (again, excluding infield fly balls), that’s a difference of 21 hits. That’s a lot of hits – and it’s also just 3.6% of the batters Singer faced in 2021.
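For anyone who wants to check that arithmetic, here is a minimal sketch. The function name is made up for illustration, and the batters-faced total is an assumption backed out of the 3.6% figure above rather than taken from an official stat line.

```python
def babip_gap(pitcher_fbabip, league_fbabip, balls_in_play, batters_faced):
    """Return (extra_hits, share_of_batters_faced) implied by a BABIP gap."""
    extra_hits = (pitcher_fbabip - league_fbabip) * balls_in_play
    return extra_hits, extra_hits / batters_faced


# Figures from the article: .359 vs. .301 over 368 balls in play.
# ASSUMPTION: 583 batters faced is inferred from the article's "3.6%",
# not looked up from a real stat line.
extra_hits, share = babip_gap(0.359, 0.301, 368, 583)
print(f"extra hits: {extra_hits:.1f}")
print(f"share of batters faced: {share:.1%}")
```

The gap works out to about 21 hits, which is a bit under 4% of the assumed batters-faced total.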

Put another way, FIP’s estimation of Singer includes 96.4% of the batters he faced. That’s less than 100%, but if your only criterion for which statistic you want to use is what percentage of batters faced it “considers,” FIP is going to be pretty close to 100% across the board, not the 39% that it takes directly from player stats. Averages are powerful that way, and the population of pitchers who throw 100 innings in the major leagues has a near-perfect normal distribution of fBABIP.

The league average fBABIP over that time is .305, with a standard deviation of roughly 25 points. 67.8% of pitcher BABIPs are within one standard deviation of average; 96% are within two standard deviations. There’s no skew. A statistics textbook should use this as a case study.
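One way to sanity-check a claim like that is to compare the share of values within one and two standard deviations against what a perfect normal curve predicts. The sketch below uses only the Python standard library. The sample is synthetic (an assumption standing in for the real pitcher seasons, which aren't reproduced here), and both helper names are invented for illustration.

```python
from statistics import NormalDist, fmean, pstdev


def share_within(values, k):
    """Fraction of values within k standard deviations of their own mean."""
    mu, sigma = fmean(values), pstdev(values)
    return sum(abs(v - mu) <= k * sigma for v in values) / len(values)


def normal_share_within(k):
    """Fraction a perfect normal distribution places within k standard deviations."""
    std = NormalDist()
    return std.cdf(k) - std.cdf(-k)


# ASSUMPTION: synthetic stand-in for the pitcher-season sample, drawn from a
# normal curve with the article's mean (.305) and spread (25 points).
sample = NormalDist(0.305, 0.025).samples(2000, seed=1)

for k in (1, 2):
    print(k, f"{share_within(sample, k):.3f}", f"{normal_share_within(k):.3f}")
```

A perfect normal curve puts about 68.3% of values within one standard deviation and about 95.4% within two, so the observed 67.8% and 96% are very close to textbook.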

If you think of FIP as certain outcomes (the three true outcomes plus pop ups) with a BABIP assumption, you’re basically correct. As it turns out, that BABIP assumption is really close to capturing all the balls in play. For 68% of pitchers, FIP correctly pegs at least 98.5% of their batters faced. For 95% of pitchers, FIP gets 96.5% or more correct. It’s even more accurate than that in our WAR calculations, because we include park and league adjustments, which I haven’t done here; their absence causes some of the variation. The argument that FIP is ignoring 60% of what a pitcher does isn’t right – the assumption that the pitcher is average is, in itself, a pretty good description of that last 60%.
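A rough back-of-the-envelope version of that coverage math: multiply a pitcher's BABIP gap by the share of his batters faced that end in a ball in play. This is a simplified sketch, not the calculation behind the numbers above. The 61% ball-in-play share is an assumption (the complement of the roughly 39% figure cited earlier), and the function name is hypothetical.

```python
def share_pegged(babip_gap, bip_share):
    """Share of batters faced consistent with the league-average assumption.

    babip_gap: gap between a pitcher's fBABIP and league average (sign ignored).
    bip_share: share of batters faced ending in a ball in play,
               excluding infield fly balls.
    """
    return 1 - abs(babip_gap) * bip_share


# ASSUMPTION: a single league-wide ball-in-play share; real pitchers differ.
BIP_SHARE = 0.61
SD = 0.025  # the article's "roughly 25 points"

for k in (1, 2):
    print(f"{k} SD from average: {share_pegged(k * SD, BIP_SHARE):.1%}")
```

At one standard deviation this lands right around the 98.5% quoted above. At two it comes out a little higher than 96.5%, which is what you'd expect from a shortcut that gives every pitcher the same ball-in-play share instead of using actual pitcher seasons.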

You might reasonably say that BABIP isn’t the only thing that matters when it comes to balls in play. If a pitcher is giving up rocketed line drives left and right, the doubles will mount, and doubles are more damaging than singles despite looking the same in BABIP. To account for this, I calculated wOBABIP – wOBA on balls in play, although excluding infield fly balls again – for the same cohort of pitchers.

This gets away from the “but-FIP-is-40%-of-a-pitcher” argument that kicked the article off, but I think it’s still instructive. FIP does a pretty good job of estimating this part of pitching too. For every pitcher with 100 innings pitched in a single season since 2015, I calculated the difference between that season’s actual wOBABIP and the league average wOBABIP for that year. I converted that into runs, then converted those runs into runs allowed per nine innings.

The average absolute difference between the runs a pitcher would allow with league average results on balls in play and what they would allow with their actual results on balls in play is 0.43 runs per nine innings. Half the pitchers in baseball had a gap of less than 0.35 runs per nine innings. Only 7% of pitchers had a gap of one run or more. You can see the output of both my BABIP and wOBABIP calculations here. Again, I didn’t consider park adjustments — the actual random variation will be smaller.
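The conversion itself isn't spelled out above. One plausible reading, assuming the usual wOBA-to-runs approach (divide the wOBA gap by the wOBA scale, then multiply by opportunities), looks like the sketch below. The function name, the 1.2 scale, and every input are illustrative assumptions, not the values used for the numbers in this article.

```python
def bip_runs_per_nine(wobabip, league_wobabip, balls_in_play, innings, woba_scale=1.2):
    """Runs per nine innings attributable to a wOBA-on-balls-in-play gap.

    innings is a true decimal (150.333..., not the box-score style 150.1).
    woba_scale is an ASSUMED round number; the real constant varies by season.
    """
    runs = (wobabip - league_wobabip) / woba_scale * balls_in_play
    return runs / innings * 9


# ASSUMPTION: a hypothetical pitcher, 20 points of wOBABIP worse than league
# average over 400 balls in play and 150 innings.
gap = bip_runs_per_nine(0.390, 0.370, 400, 150)
print(round(gap, 2))
```

A negative result means the pitcher's balls in play were less damaging than league average. Taking the absolute value across all pitcher seasons and averaging would give a figure comparable to the 0.43 above.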

Why not use the actual wOBABIP to improve FIP, then? It’s quite unclear how much of this variation is something that a pitcher accomplishes via their own skill and how much comes down to either the breaks of the game or defense. Per Statcast’s OAA, the average variation due to defense from pitchers who allowed at least 250 balls in play in a season between 2016 and ’21 was 0.18 runs per nine innings. It would be strange to credit or debit pitchers for that in a value metric – the fielders are the ones responsible for those runs in either direction.

That doesn’t take shifts into account, or whether the wind was blowing in or out that day, or even which of the two baseballs they happened to throw. How much a pitcher does to influence their results on balls in play is an ongoing debate, but even if it weren’t, FIP captures a good chunk of those balls in play by assuming league average results.

Does that mean FanGraphs’ formulation of FIP is a flawless, unassailably perfect way to assign value to pitchers? Definitely not. Whether you want to isolate pitcher value or isolate “what happened on the field,” there are key shortcomings of using either FIP or RA9-WAR.

Both have a strained relationship with defense, and particularly with how a pitcher’s incentives change based on the defense behind them, but the possible shortcomings don’t stop there. FIP ignores sequencing – a pitcher who walks the bases loaded and then gets three outs without allowing a run might not succeed in the long run, but in that inning, he certainly didn’t allow any runs. Maybe he even got those outs in ways that didn’t tax his defense – routine grounders and lazy fly balls. Sequencing has a lot to say about how many runs a pitcher allows in a given season, even if pitchers mostly don’t show any ability to control sequencing over time.

Likewise, ERA- and RA-based WAR have faults. ERA is extremely strange – an official scorer’s decision doesn’t mean squat for how a pitcher performs, but it changes how many earned runs they allow. Even beyond that, should we give different credit to a pitcher who allows a bases-loaded screaming liner that the center fielder corrals, or to one that finds a hole in the outfield? What if the ball is hit to the exact same place and the defender just got a good or bad jump?

While we’re there, what about what happens after a pitcher leaves? RA9-based WAR will tend to flatter pitchers with better bullpens. If one pitcher leaves with the bases loaded and gets three runs for his trouble while another leaves in the same spot and gets a zero, did we learn anything about the difference in their skill? Clearly not, but the RA9 tally would tell you the two performed differently.

I won’t act like I’ve solved the debate or that one WAR is clearly superior to the other. But the complaint that FIP only considers a minority of what a pitcher does doesn’t track to me. On the low end, it’s giving an accurate accounting of 95% of the batters a pitcher faces. Those last 5% can have an outsized influence on how many runs a pitcher allows – a few extra singles here and there or a poorly-timed double can swing a pitcher’s line.

Some of that probably accrues to the pitcher, some to the defense, and some to luck. How to assign credit and blame among those three things is the central disagreement in how to assign WAR to pitchers. FIP goes to one extreme; RA9 WAR unadjusted for defense goes to the other. I don’t know the right answer to this disagreement. But I do know that “FIP only considers 40% of plate appearances” doesn’t give FIP enough credit.





Ben is a writer at FanGraphs. He can be found on Twitter @_Ben_Clemens.

119 Comments
Kevbot034
1 year ago

Funny enough, Singer has a .350 BABIP again, and I’m suddenly curious about what drives that. FIP usually tends to penalize pitch to contact types, doesn’t it? But it seems to still like Singer specifically despite him allowing a billion baserunners with a stupid high BABIP.

Ivan_Grushenko
1 year ago
Reply to  Kevbot034

xFIP, SIERA and xERA are all closer to FIP than to ERA. This seems to support the hypothesis that BABIP is mostly not Singer’s fault unless he’s a terrible fielder. This might or might not mean that FIP is enough and those other things don’t add much.

Also, since 2020 the Royals have been the 3rd best defensive team by Def but with only average range. Maybe this means they aren’t positioning well behind Singer. I don’t know.

RonnieDobbs
1 year ago
Reply to  Ivan_Grushenko

It is unthinkable that pitcher defense would play any role at all in BABIP. Do you know how few balls are fielded by a pitcher? The expectation is that he never makes a single play. If the pitcher put his hands in his pockets, the rest of the infield would be OK with that. On the contrary, BABIP is about two things – execution and luck.

Defensive metrics are worth absolute zero. Someday everyone will agree with this statement when all of the things that we currently use are thrown out or overhauled.

Francoeurstein
1 year ago
Reply to  RonnieDobbs

I have always informally considered controlling the running game as pitcher defense. I think that was one of the many reasons Julio Teheran was able to run such a high FIP-ERA for many years.

CC AFC
1 year ago
Reply to  Kevbot034

Depends on how you define “pitch to contact types.” If you just mean “guys who have low strikeout rates,” then no. To some extent, some pitchers have batted-ball profiles that can lead to sustainably low or high BABIPs, but they’re not necessarily “pitch to contact types,” see e.g. Scherzer, Max.

As regards Singer specifically, it’s been less than 6 innings this year. Not enough to draw any conclusions.

Six Ten
1 year ago
Reply to  Kevbot034

Singer isn’t really a pitch to contact pitcher. He only has two pitches and one of them is a sinker, which traditionally suggests a pitch to contact approach. But out of the 129 pitchers who threw 100+ innings last year he had the 53rd most K/9 and 18th most BB/9. So not blowing guys away, but that’s a lot of PAs ending without contact too. And he was pretty decent at avoiding home runs too.

Sinkers are good at preventing fly balls, and they are perfectly fine on balls in play as long as they don’t become fly balls. But if the batter elevates one, it gets tagged. That means home runs! Unless, of course, you play in a park with a gigantic outfield. Then it’s all about what happens to that fly ball in play. High BABIP: bad! Low BABIP: good! Well, FIP doesn’t like worrying about BABIP.

Which is to say: if you throw a sinker and you play in a really large stadium, it seems like FIP might tend to give you too much credit rather than too little. And that’s Singer.

dl80
1 year ago
Reply to  Kevbot034

But he’s not really. He’s averaged over 9 K/9 the two years he’s had the inflated BABIP (slightly above league average). And he’s given up fewer line drives than average, also.

The one thing that stands out is that he gets a lot of groundballs, which are more likely to become hits.

But all of their infielders are above average defensively this year, so it’s not like he’s being victimized by bad defense. And it was mostly true for last year, also.

So it’s either random bad luck or the team does a terrible job shifting.

Ivan_Grushenko
1 year ago
Reply to  dl80

If the problem is Royal Shifting he’d be a good trade target for a better shifting team

Jason B
1 year ago
Reply to  Ivan_Grushenko

I asked Dennis Miller for his opinion. He said “I haven’t seen this kind of Royal Shifting since the Dutch royal family fled to Canada, Chachi.”

TheGarrettCooperFanClub
1 year ago
Reply to  dl80

The Royals shift the least in all of baseball (17.3%)! Perhaps that is most of the issue.

Six Ten
1 year ago

It’s possible shifting is the issue, but if there’s a non-luck explanation I think it’s more likely the type of contact Singer allowed. In 2021 he allowed a ton of medium contact and quite a lot of line drives. Medium line drives are far and away the contact type with the highest BABIP. The Royals’ shift rate in 2021 was 27.3%, 19th rather than 30th.

I still think there’s probably a lot of luck involved. But only having two pitches makes the contact quality explanation pretty enticing.

Ivan_Grushenko
1 year ago
Reply to  Six Ten

The medium contact and line drives should be in xERA, no?

RonnieDobbs
1 year ago
Reply to  Kevbot034

BABIP punishes people with poor command and/or bad luck. The guys that throw hard but don’t have to execute very well to succeed get pummeled by BABIP. I think you have it completely backwards. People with command and poise are better at managing BABIP. Missing bats does not lead to batted ball events. Few things are more detrimental to building a better understanding than creating false generalizations.

Kevbot034
1 year ago
Reply to  RonnieDobbs

No, I don’t think I really have it backward. Pitch to contact guys tend to not have great FIP, or at least outperform their FIP, because they don’t K many. Anyone walking a ton is going to be detrimental, regardless of if they are strikeout artists or pitch to contact guys. Command and poise rely on your defense – something FIP intentionally ignores. I’m also confused how I would be creating a false generalization anyway, when it was written as a question.