This is the second in a series of posts about building a baseball projection system. If you’re new to the series, check out part one.
The first step in this process is to take every player’s stats and adjust them to estimate what their production would have been in a neutral run environment. Why do a player’s stats need to be adjusted?
- Players in different leagues face fewer than 10% of the same opponents.
- Players on teams in different divisions play a similarly uneven schedule.
- Players on different teams play half their games in a different home park.
- Players on the same team might have their home park affect them differently depending on whether they are right- or left-handed, or whether they are primarily a line-drive contact hitter or a fly-ball power hitter.
Factoring out the Park
Today we’ll go over the last two of these topics.
For example, we know that Nolan Arenado hit 41 home runs in 2016, while Freddie Freeman hit just 34. So who was the better home run hitter? Absent any other data, the obvious choice is Arenado. We do have other data, though:
- Arenado played half of his games in Colorado, the most offense-boosting park in the majors, especially for right-handed hitters, and especially for home runs.
- Freeman played half of his games in Atlanta, a park that is especially difficult to hit home runs in.
- Arenado’s home park also makes it easier for hitters to avoid striking out (less air resistance in Denver means breaking balls will move less), whereas in Freeman’s home park strikeouts are more prevalent. This means Arenado is getting his bat on the ball more often than Freeman, giving him even more chances for home runs.
Now can you tell me who was the better home run hitter? To answer this question we need to neutralize those numbers so that we can compare them.
The idea here is that we’re going to break every stat into its component rates, adjust those components based on how the park has played for the past few years, and then put the components back together into an adjusted stat.
So what is a park factor?
A park factor is an estimate of how much a ballpark influences the run-scoring environment. The first park factors were very simple, and only accounted for number of runs scored. So a ballpark with a 150 run factor would have 150% of the runs of a league average park. Over time many different park factor calculations have emerged, with varying levels of accuracy and detail.
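To make the index arithmetic concrete, here is a minimal sketch of how a park factor on a 100-based scale can translate a park-inflated total back to a neutral one. The function name `neutralize` and the 900-run figure are illustrative assumptions, not values from any published park factor.

```python
def neutralize(observed: float, park_factor: float) -> float:
    """Scale a value observed in one park to its neutral-park equivalent.

    park_factor is an index where 100 means league average, so a
    factor of 150 means the park produces 150% of the neutral amount.
    """
    return observed / (park_factor / 100.0)


# Made-up example: 900 runs scored in a park with a 150 run factor.
neutral_runs = neutralize(900, 150)
print(neutral_runs)
```

Dividing by the factor is the same operation used later in this post on each component rate.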
Creating a reliable park factor metric in itself is a task worthy of weeks of research and implementation, so this is going to be the first step in the process where I defer some of the calculations to someone else’s research. In this case, we’ll be using Matthew Carruth’s park factors, published at StatCorner. Details about his methodology can be found on this page on his site.
Taking the Colorado out of Arenado’s Home Run Totals, step by step
We want to find out how many home runs Arenado would have hit in a neutral park. To do this we will take Arenado’s HR/PA rate and adjust it for the park factors.
Home runs are a bit complicated, as the factor for home runs is for home runs per ball hit in the air (fly balls and line drives). Fly ball and line drive rate is on a per ball-in-play basis, and ball-in-play rate is calculated by 100% - strikeout rate - walk rate, where strikeout and walk rates are also affected by park.
So let’s take an inventory of what Arenado did in 2016. We’ll go through all the base component rates based on what he actually did, then show the adjustments, and work them back out to an estimated number of neutral home runs.
- HR/PA rate: 41/696 = 5.9%
- K/PA rate: 103/696 = 14.8%
- BB/PA rate: 68/696 = 9.8%
- BIP rate: (696-103-68)/696 = 75.4%
- FB/PA: 244/696 = 35.1%
- LD/PA: 94/696 = 13.5%
- Ball in Air/PA: 338/696 = 48.6%
- BiA/BIP rate: 338/525 = 64.4%
- HR/BiA: 41/338 = 12.1%
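The inventory above can be reproduced from the raw counts with a short sketch. The `BattingLine` class, its field names, and `component_rates` are hypothetical names chosen for illustration; the counts are the 2016 Arenado figures quoted in this post, and balls in play follow the post's definition (PA minus strikeouts minus walks).

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Raw season counts (field names are illustrative)."""
    pa: int  # plate appearances
    so: int  # strikeouts
    bb: int  # walks
    fb: int  # fly balls
    ld: int  # line drives
    hr: int  # home runs


def component_rates(line: BattingLine) -> dict:
    """Break a batting line into the component rates used in this post."""
    bip = line.pa - line.so - line.bb  # balls in play, as defined in the post
    bia = line.fb + line.ld            # balls in air: fly balls + line drives
    return {
        "hr_pa": line.hr / line.pa,
        "k_pa": line.so / line.pa,
        "bb_pa": line.bb / line.pa,
        "bip_pa": bip / line.pa,
        "fb_pa": line.fb / line.pa,
        "ld_pa": line.ld / line.pa,
        "bia_pa": bia / line.pa,
        "bia_bip": bia / bip,
        "hr_bia": line.hr / bia,
    }


arenado_2016 = BattingLine(pa=696, so=103, bb=68, fb=244, ld=94, hr=41)
rates = component_rates(arenado_2016)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```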
Now let’s adjust all these rates for the park factors. The K/PA factor for right-handed batters in Coors Field is 90, which means that strikeouts there occur at 90% of the rate of a neutral park. Since Arenado only plays half of his games in Coors, we cut that effect in half, which gives an effective factor of 95. The factors used below have been halved in the same way.
- K/PA rate: 14.8% / 95% = 15.6%
- BB/PA rate: 9.8% / 99% = 9.9%
- BIP rate: 100% - 9.9% - 15.6% = 74.5%
- FB rate: 35.1% / 97.5% = 36.0%
- LD rate: 13.5% / 107% = 12.6%
- Ball in Air rate: 36.0% + 12.6% = 48.6%
- BiA/BIP rate: 48.6% / 74.5% = 65.2%
- HR/BiA rate: 12.1% / 109% = 11.1%
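Here is a rough sketch of the same adjustment in code. It assumes the simplified setup used here (half of the games at home, the other half in a neutral park), takes the effective factors 95, 99, 97.5, 107 and 109 directly from the lines above, and uses hypothetical helper names `half_home_factor` and `adjust_rate`. Because it carries unrounded values through each step, a result can differ from the hand-rounded figures in the last digit.

```python
def half_home_factor(park_factor: float) -> float:
    """Blend a home park factor with a neutral 100 for the road half."""
    return (park_factor + 100.0) / 2.0


def adjust_rate(rate: float, effective_factor: float) -> float:
    """Divide an observed rate by an index-style factor (100 = neutral)."""
    return rate / (effective_factor / 100.0)


PA = 696
k_factor = half_home_factor(90)  # Coors K/PA factor for RHB -> 95.0

adjusted = {}
adjusted["k_pa"] = adjust_rate(103 / PA, k_factor)
adjusted["bb_pa"] = adjust_rate(68 / PA, 99)
# Ball-in-play rate is whatever is left after strikeouts and walks.
adjusted["bip_pa"] = 1.0 - adjusted["k_pa"] - adjusted["bb_pa"]
adjusted["fb_pa"] = adjust_rate(244 / PA, 97.5)
adjusted["ld_pa"] = adjust_rate(94 / PA, 107)
adjusted["bia_pa"] = adjusted["fb_pa"] + adjusted["ld_pa"]
adjusted["bia_bip"] = adjusted["bia_pa"] / adjusted["bip_pa"]
adjusted["hr_bia"] = adjust_rate(41 / 338, 109)

for name, value in adjusted.items():
    print(f"{name}: {value:.1%}")
```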
Now that we have the adjusted rates, we’ll throw them back together to find what his home run total would have been had all his Coors games been in a neutral park instead.
- Balls in Air: 48.6% × 696 = 338 (this hasn’t changed, due to the effects of Coors on Fly Ball and Line Drive rates canceling out)
- Home runs: 338 × 11.1% = 37.5
From 41 down to 37.5 means he loses 3.5 home runs due to park factors!
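The recombination step can be sketched as a small function. The name `neutral_home_runs` is hypothetical, and the inputs are the rounded adjusted rates quoted in this post (48.6% balls in air per PA, 11.1% home runs per ball in air), so the result only matches the hand calculation to about one decimal place.

```python
def neutral_home_runs(pa: int, bia_per_pa: float, hr_per_bia: float) -> float:
    """Rebuild a home run total from park-adjusted component rates."""
    balls_in_air = bia_per_pa * pa
    return balls_in_air * hr_per_bia


actual_hr = 41
estimate = neutral_home_runs(pa=696, bia_per_pa=0.486, hr_per_bia=0.111)
print(f"neutral HR: {estimate:.1f}, lost to park: {actual_hr - estimate:.1f}")
```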
By now you may have noticed that we’re assuming that the half of Arenado’s games that aren’t in Coors Field are in a neutral (100 park factor across the board) park, but we know that isn’t the case. I used this just to illustrate the point more clearly. Our actual adjusted numbers will take into account the uneven schedule and be adjusted based on where Arenado actually played in 2016.
After adjusting for each park individually, Arenado’s adjusted home run total bumps back up to 37.8, 0.3 more than the naive number. The reason is that the other parks in his division include more home run suppressing parks (SF, LA) than home run boosting parks (AZ). (SD is neutral for home runs for right-handed batters.)
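One plausible way to fold the real schedule into a single effective factor is a weighted average over every park a player appeared in. The sketch below assumes weighting by plate appearances, which is my assumption rather than a documented detail of the method; `schedule_weighted_factor` is a hypothetical name, the 348/348 split assumes exactly half of the 696 plate appearances came at home, and the second schedule uses invented placeholder factors rather than real StatCorner values.

```python
def schedule_weighted_factor(parks: dict) -> float:
    """Average park factors weighted by plate appearances in each park.

    parks maps a park label to (plate_appearances, park_factor),
    where park_factor is an index with 100 meaning neutral.
    """
    total_pa = sum(pa for pa, _ in parks.values())
    weighted = sum(pa * factor for pa, factor in parks.values())
    return weighted / total_pa


# The simplified case from this post: half at Coors (K factor 90), half neutral.
simple = schedule_weighted_factor({"COL": (348, 90), "neutral": (348, 100)})

# Placeholder schedule with invented factors, only to show the mechanics.
placeholder = schedule_weighted_factor({
    "home": (348, 120),
    "road_a": (174, 90),
    "road_b": (174, 100),
})
print(simple, placeholder)
```

The half-and-half case reduces to the effective factor of 95 used earlier, which is a quick sanity check that the weighting generalizes the simplified version.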
Running the same formula on Freddie Freeman, we end up with 38.3 home runs! That’s 0.5 more than Nolan Arenado. So while the results tell us that Arenado hit 7 more home runs, adjusting for park tells us that they were actually virtual equals as home run hitters.
Coming up next
The next step of neutralizing data is to adjust for strength of schedule, an oft-neglected part of player analysis when it comes to things like MVP or Cy Young discussion. I was originally going to cover it in this post, but as I go through this process writing code alongside these articles, strength of schedule looks to be an even more nuanced topic than park factors.