Since the publication of Michael Lewis's Moneyball in 2003, there has been an explosion of interest in the field of sabermetrics. Sabermetrics involves the application of statistical analysis on baseball data to identify players with high-performance potential who may have otherwise gone unnoticed, thus will not command a salary as high as that of more well-known names.
Data Source: Lahman Baseball Database is a comprehensive record of batting and pitching statistics from 1871 to 2016. It also contains fielding statistics, standings, team stats, managerial records, post-season data, and some other data points.
Taking inspiration from Moneyball, this study aims to mine the relationship between various player statistics, salaries, and team performance. This report is divided into the following sections.
What is the relationship between On-Base Percentage (OBP), Slugging Percentage (SLG), Walks plus Hits per Innings Pitched (WHIP), Payroll and Team Performance?
The response variable, team performance is measured as the percentage of games won per season. The explanatory variables of choice are OBP, SLG, WHIP, and Payroll.
A chosen model for this statistical problem is Linear Regression (which has been summarized below).
If there are $k$ explanatory variables, the regression equation model is $y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k}x_{k} + \epsilon.$
The $x_{1}, x_{2}, ..., x_{k}$ represent the $k$ explanatory variables. $\beta_{0}$ is the y-intercept, or constant, $\beta_{1}$ is the coefficient on the first explanatory variable, $\beta_{2}$ is the coefficient on the second explanatory variable, and so on. $\epsilon$ is the error term or the residual that can't be explained by the model.
Based on the above, four null hypotheses are formulated, each stating that the coefficient of the respective variable equals zero, i.e., this explanatory variable does not have an impact on the response variable, Winning Percentage.
$H_{0}$ : $\beta_{OBP}= 0$
$H_{0}$ : $\beta_{SLG}= 0$
$H_{0}$ : $\beta_{WHIP}= 0$
$H_{0}$ : $\beta_{Payroll}= 0$
The alternative hypotheses then are:
$H_{a}$ : $\beta_{OBP}\neq 0$
$H_{a}$ : $\beta_{SLG}\neq 0$
$H_{a}$ : $\beta_{WHIP}\neq 0$
$H_{a}$ : $\beta_{Payroll}\neq 0$
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.seterr(invalid='ignore');
# Function to load CSVs
def load(data_file):
"""This function loads the CSVs as a dataframe.
The CSVs should be stored in a folder named 'data'.
"""
return pd.read_csv('data/' + data_file)
# Creating a master DataFrame
master = load('Teams.csv')
# Subsetting - Salary data is only available after 1985
master = master[master['yearID'] >= 1985]
# Adding winning_percentage to DF
master['winning_percentage'] = (master['W']/master['G']) * 100
# Resetting the index to integer index
master = master.reset_index(drop=True)
On-Base Percentage: On-Base Percentage (OBP) measures the how much a player can make a not-out. Players with high on-base percentages tend to reach base at a high rate, prolonging games and giving their team more opportunities to score.
Source: fangraphs.com
master['OBP'] = (master['H'] + master['BB'] + master['HBP']) \
/(master['AB'] + master['BB'] + master['HBP'] + master['SF'])
master.info()
By looking at master_df.info(), it can be noted that there are only 510 values for HBP and SF, and thus for OBP too.
This can be explained via the following.
HBP_SF_missing = master[(master['HBP'].isnull()) | \
(master['SF'].isnull())]['yearID']
print("Number of missing values: " + str(HBP_SF_missing.size))
print(set(HBP_SF_missing))
From 1985 to 1992, MLB had 26 teams.
From 1993 to 1997, 28 teams.
From 1998, 30 teams.
Hence, 826 + 528 + 2*30 = 408 missing values. This logic exactly matches the number of missing values.
Batting.csv consists of individual batting statistics. These can be used to fill-in the missing values.
batting = load('Batting.csv')
# Values for 1985 to 1999
batting = batting[(batting['yearID'] >=1985) & (batting['yearID'] < 2000)]
# Group by team and year
batting_totals = batting.groupby(['teamID', 'yearID']).sum()
#print(len(batting_totals))
# Calculate On-Base Percentage
batting_totals['OBP'] = (batting_totals['H'] + batting_totals['BB'] + batting_totals['HBP'])\
/(batting_totals['AB'] + batting_totals['BB'] + batting_totals['HBP'] + batting_totals['SF'])
# Reshaping the DataFrame
batting_totals.drop(['G', 'stint'], 1, inplace=True)
batting_totals = batting_totals.reset_index()
batting_totals.sort_values(['yearID', 'teamID'], inplace=True)
batting_totals = batting_totals.reset_index(drop=True)
# Merge the DataFrame
master = master.combine_first(batting_totals)
master.info()
There are still missing values for some of the attributes. Although, it is not necessary to work them out as they are not required for this analysis. Hence, master DataFrame can be left as it is.
Slugging Percentage: Slugging Percentage (SLG) represents the total number of bases a players records per at-bat.
Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation. Slugging percentage exists mainly as a response to a noticeable flaw in batting average: not all hits are created equal. In batting average, a single and a home run are valued the same. Slugging percenage counts home-runs four times as much as a single.
Source: mlb.com
# Calculating and adding SLG to master DataFrame
master['SLG'] = (master['H'] + master['2B'] + 2*master['3B'] + 3*master['HR'])\
/(master['AB'])
WHIP: WHIP is one of the most commonly used statistic for evaluating a pitcher's performance. The statistics shows how well a pitcher has kept runners off the basepaths, one his main goals.
The formula is described as: the sum of pitcher's walks and hits, divided by his totals innings pitched.
Source: mlb.com
# Calcuating and adding WHIP to master DataFrame
master['WHIP'] = (master['BBA'] + master['HA'])/(master['IPouts']/3)
# Innings Pitched = IPouts (Innings Pitched Outs)/3
master.head()
Payroll: These statistics are related to player salaries from the year 1985 onwards.
# Salaries.csv contains individual salary data from 1985
salaries = load('Salaries.csv')
salaries.info()
# Get payroll of each team
# Sum up the individual salaries for each season
payroll = salaries.groupby(['teamID', 'yearID']).sum()
payroll.head(10)
# Join the obtained DF with master DF
master = master.join(payroll, on=['teamID', 'yearID'])
Winning games with a payroll of a million dollars had a very different significance in 1985 than in 2017. To get comparable data, the data has been normalized using the approach: http://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
An alternative approach is to use preprocessing scaling via sk-learn. Although, it hasn't been implemented.
# Renaming salary column
master.rename(columns={'salary':'payroll'}, inplace=True)
# Adding normalized payroll to master DF
master['norm_payroll'] = pd.Series()
grouped = master.groupby('yearID')
for name, group in grouped:
master['norm_payroll'].update((group['payroll'] - group['payroll'].min())\
/(group['payroll'].max() - group['payroll'].min()))
# Correlation Plots
def corr_plot(xv, yv, title):
runs_wins_plot = sns.regplot(x=xv, y=yv, scatter_kws={'alpha':0.25})
runs_wins_plot.set_title(title, fontsize=13)
plt.show()
print("Pearson's Correlation Coefficient (r) = " + str(xv.corr(yv)))
print('\n')
# Plots as desribed below
corr_plot(master['OBP'], master['winning_percentage'], 'OBP vs. Winning-Percentage')
corr_plot(master['SLG'], master['winning_percentage'], 'SLG vs. Winning-Percentage')
corr_plot(master['WHIP'], master['winning_percentage'], 'WHIP vs. Winning-Percentage')
corr_plot(master['norm_payroll'], master['winning_percentage'], 'Payroll vs. Winning-Percentage')
OBP and SLG are positively correlated with winning percentage. The strength of the correlation is decent. WHIP is negatively correlated with a high strength of association. These meet the intuitive expectations.
Payroll is mildly associated with winning games. This inference might be counter-intuitive (or intuitive, if familiar with the ideology of Moneyball). However, money has always been a significant factor in team building. Hence, further examination has been done for this variable.
# A Ordinary Least Squares (OLS) regression is applied.
# All variables are included in the regression model.
import statsmodels.api as sm
import statsmodels.formula.api as smf
result = smf.ols(formula='winning_percentage ~ OBP + SLG + WHIP + norm_payroll',\
data=master).fit()
print(result.summary())
There's more to multiple regression (e.g., verification of the fit by checking the residual plots) that could've been performed. As the statistical analysis is not the key aspect of this exercise, it has been kept minimal.
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (the variable does not affect the response variable). A low p-value (< 0.05) indicates that the null hypothesis can be rejected.
Hence, all explanatory variables are meaningful additions to the model - all their p-values are 0.000. Even at $\alpha$ = 0.01 level of significance, the null hypothesis for each explanatory variable can be rejected.
An r-squared value of 0.779 indicates that our model explains 78% of the variability of the response data. It does a decent job of explaining wins.
# Helper functions
def corrected_standard_error(x):
return (x-x.mean())/x.std(ddof=0)
def correlation(x, y):
std_x = corrected_standard_error(x)
std_y = corrected_standard_error(y)
return (std_x*std_y).mean()
def add_win_percentage_to_df(df):
"""Calculate win-%age & add attribute to data frame.
"""
df['winPercent'] = df['W']/(df['W'] + df['L'])
Teams.csv contains team level statistics. This dataset will be useful in answering the questions about the correlation between losses and errors. In addition, winning statistics can be combined with Salaries.csv dataset to be able to answer the objective questions.
Salaries.csv shows the player's salary, year and the team they were associated with. This data however is available only from the year 1985 onwards. Therefore, data prior to 1985 has been ignored.
teams = load('Teams.csv')
add_win_percentage_to_df(teams)
win_percent = teams[['yearID', 'teamID', 'winPercent']]
win_percent.describe()
The Teams.csv file has a number of fields that are empty for strike-outs, or set to 0. Those rows have been excluded for this part of the analysis.
teams_with_strikeouts_recorded = teams[teams['SO'] > 0]
strikeouts = teams_with_strikeouts_recorded['SO']
win_percent = teams_with_strikeouts_recorded['winPercent']
print (correlation(win_percent, strikeouts))
Strike-outs do not appear to have a correlation with wins. This is contrary to the intuitive judgement mentioned earlier i.e. teams with more strikeouts are hitting the ball less on average which in turn would result in less runs being scored further resulting in less overall wins. However, this does not appear to be the case.
Source: https://en.wikipedia.org/wiki/Strikeout
The correlation coefficient is negative and at the same time is very close to zero, hence nothing conclusive can be said.
win_percent = teams['winPercent']
runs = teams['R']
earned_runs_allowed = teams['ER']
# Distribution of metrics
fig, ((ax1, ax2)) = plt.subplots(nrows=2)
sns.distplot(runs, ax=ax1, kde=False)
sns.distplot(earned_runs_allowed, ax=ax2, kde=False)
ax1.set_title('Run Scored')
ax2.set_title('Earned Runs-Allowed')
plt.tight_layout()
plt.show()
corr_plot(runs, win_percent, 'Win/Runs Correlation - Standardized')
corr_plot(earned_runs_allowed, win_percent, 'Wins-Percentage/Earned-Runs-Allowed')
The correlation between the win-percentage to the runs scored is a moderate correlation in the positive direction. Also, the correlation between win-percentage and to earned-runs is a weak negative correlation.
Overall, it appears that if teams focus on offense, i.e., to score more runs, it is more likely for them to win more games. There is a caveat to this analysis though. These findings are very preliminary incorporating of only correlation measures. Multiple factors and variables are likely to cause changes to the results.
The prior sections emphasized primarily on batting statistics. Hence, this section aims to compensate for pitcher related statistics to determine which statistics determine a pitcher's salary. For this, correlation tests are done with Salary as a response variable and other pitching records as independent variables.
salary = load('Salaries.csv')
pitching_record = load('Pitching.csv')
salary.describe()
pitching_record.describe()
The 'yearID' in salary DataFrame ranged from 1985, while 'yearID' in pitching_record is available from 1871. Hence, when inner merging, a lot of data for pitching prior to 1985 gets removed, which is acceptable as there are still enough observations to conduct the analysis.
pitchers = salary.merge(pitching_record, \
on = ['playerID', 'yearID', 'teamID', 'lgID'], how = 'inner')
# Displays all attributes with .describe()
from IPython.display import display
with pd.option_context('display.max_columns', None):
display(pitchers.describe())
The counts of GIDP, SH, SF only amounts to 6481, which is low if compared to the data available for other columns with 11797 observations. This is most likely to be because of unavailability of data for a certain time period.
pitchers.groupby('GIDP')['yearID']
print(pitchers.groupby('SH').agg({'yearID':'min'}).min())
print(pitchers.groupby('SF').agg({'yearID':'min'}).min())
print(pitchers.groupby('GIDP').agg({'yearID':'min'}).min())
Thus, all data is available only after the year 2000. For this analysis, these columns are removed from the DataFrame.
pitchers = pitchers.drop('GIDP', axis=1)
pitchers = pitchers.drop('SH', axis=1)
pitchers = pitchers.drop('SF', axis=1)
As for the rest of the columns, 'BAOpp' and 'ERA' have missing values as indicated by the code below. Considering that the number of missing values are relatively small compared to total observations , the missing values are imputed with the median of each attribute.
[Rows: 11,531; NaN-BAOpp: 461; NaN-ERA: 5]
print("Missing values for ERA: " + str(len(pitchers[pd.isnull(pitchers.ERA)])))
print("Missing values for BAOpp: " + str(len(pitchers[pd.isnull(pitchers.BAOpp)])))
names = list(pitchers)
names.remove('teamID')
names.remove('lgID')
names.remove('playerID')
sns.set(style="white")
# Compute correlation matrix
corr_mat = pitchers.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr_mat, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
plot_corr = sns.heatmap(corr_mat, mask=mask, cmap=cmap, vmax=1,
square=True, xticklabels=names, yticklabels=names,
linewidths=.5, cbar_kws={"shrink": .6}, ax=ax)
plot_corr.set_title('Correlation Plot of Various Metrics of Pitchers', fontsize=14)
plt.show()
pitchers.corr()['salary'].sort_values()
yearID seems to be an interesting feature. Hence, a plot representing the relationship between yearID and salary is made.
# To see change in salaries over the years
yearly_salary = salary.groupby(['yearID'], as_index=False).mean()
salary_trend_plot = plt.plot(yearly_salary['yearID'], yearly_salary['salary'])
plt.title("Annual Salary - Trend")
plt.xlabel("Year")
plt.ylabel("Salary")
plt.show()
From the relationship obtained above, the year and average salary, there is a clear upward trend. Based on the computed correlation, it can be noted that there is no significant correlation between the average salaries and the number of wins in a season. Hence, it can be concluded that spending a lot of money does not play a significant role in the team performance.
Investigating real performance metrics of players, such as ERA and SO.
ra_salary_plot = sns.regplot(x=pitchers['ERA'], y=pitchers['salary'],\
scatter_kws={'alpha':0.25}, fit_reg=False, marker="*")
ra_salary_plot.set_title('Relationship between ERA and Salary', fontsize=14)
plt.show()
The above scatterplot of salary and ERA (negative correlation) makes sense, as people normally perceive a pitcher with low ERA a good pichter, and thus deserve higher salary.
ra_salary_plot = sns.regplot(x=pitchers['SO'], y=pitchers['salary'],\
scatter_kws={'alpha':0.25}, fit_reg=False, marker="*")
ra_salary_plot.set_title('Relationship between SO and Salary', fontsize=14)
plt.show()
Finding whether a salary is a factor to win games or not. For this, 3-variable scatter plot is made with wins, loses & salary.
# Plot. Salary is scaled by 300000 for better visualizaion
pitchers.plot.scatter(x= 'W', y='L', s=pitchers['salary']/300000, title = 'The Relationship Between W, L, and Salary')
plt.show()
It seems that the capability of winning a game is not related to the salary. On the other hand, the games lost by a pitcher does not have a clear relationship with salary as well. One plausible explanation could be that pitchers who can be responsible for many games would be perceived as a member of contribution. There are around 160 games a year. Pitchers who can share the workload and maintain win/loss ratio at a certain level might be appreciated.
For this, W/L ratio is calculated for further exploration.
pitchers = pitchers.assign(win_ratio = pitchers.W/ pitchers.L)
pitchers.plot.scatter(x = 'GS', y = 'win_ratio', s=pitchers['salary']/900000, title = 'The Relationship Between GS, W/L, and Salary')
plt.show()
It looks win_ratio still matters. It can be seen that the bubble size for the points with win_ratio above 2 is relatively more abundant. The salary has a high correlation with games started (GS). The seeming relationship between L and salary could be GS. Even though the players lost games, they can still contribute to the team by sharing the workload, and thus have a high reputation among team management.
The performance metrics such as higher Strikeouts per game (SO_avg), Ratio of Winning (W_avg), Average Outs Pitched (IPouts_avg), Game Started (GS), and Batters faced by Pitcher (BFP), and lower Earned Run Average (ERA) are related to high salary. In other words, a pitcher who has a better capability of dominating the game (reflected by SO_avg, IPouts_avg, W_avg, ERA) or can be a starter that deals with more batters (reflecting by GS and BFP), tends to earn more.
Several conclusions can be drawn from the various sections of this analysis, but there are also numerous caveats that must be taken care of to prevent these conclusions from being accepted as fact. Since the analysis only involves correlation testing, it cannot imply causation.
Many aspects were not accounted for in the study; they are listed below:
The statistical conclusions are presented at the end of each section.
Overall, this project demonstrated the entire process of investigating a dataset including posing questions about a dataset, wrangling the data into a usable format, analyzing the data, and extracting appropriate conclusions.
There are still multiple possibilities to test the inferences drawn from this analysis which might further strengthen the deductions or provide an alternative conclusion. The field of sabermetrics is fascinating; I wish to further try more analytical techniques to baseball and other sports.