Winning and NHL Salaries: Technical Note

by Jose Mena

I’d like to take some space here to discuss the methodology we used to analyze NHL salaries and their relation to success over the past four years.  That way, it’s clear where I’m coming from with all of my arguments, and it’s easier for people to start a discussion in case I’ve made a mistake.

We parsed our data from the HTML pages of the nhlnumbers.com website using a combination of bash and Perl scripts, and we used the site’s annotations for player positions.  They're awesome, and they acknowledge that their data may be imperfect; because we're looking at larger trends, we assume that any inaccuracies average out over the course of the sample.  We took the pro-rated cap hit of each player paid by the team that year, which means that we have accounted for roster changes made by trade, injury or minor-league call-up.


In order to compensate for differences in the distribution of spending on a year-to-year basis, we converted all payroll figures into Z-scores using the formula:

$$Z_{t,y,p} = \frac{S_{t,y,p} - \mu_{y,p}}{\sigma_{y,p}}$$

Here, Z is the Z-score, S is the raw yearly salary for team t in year y at position p, µ is the mean salary of all teams at position p in year y and σ is the standard deviation of the salaries of all teams at position p in year y.  We argue that points are up for grabs and salaries are determined on a season-by-season basis, so the natural population to compare each data point to is restricted to data points from the same year.  We subdivided position p into LW, C, RW, D and G, and analyzed several subsets of these data.  We could find data from 2007-08 to the most recent season; we wished for more.  Z-scores for points were calculated in a similar fashion to the Z-scores for salaries.
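For readers who want to reproduce the normalization, here is a minimal Python sketch of the per-year, per-position Z-score described above.  It is not the code we ran (we worked in Excel); the function name, the tuple layout and the payroll figures are made up for illustration, and the choice of the population standard deviation (rather than the sample one) is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores_by_group(rows):
    """Z-score each value against all teams in the same (year, position).

    rows: list of (team, year, position, value) tuples.
    Returns a dict mapping (team, year, position) to its Z-score.
    """
    groups = defaultdict(list)
    for team, year, pos, value in rows:
        groups[(year, pos)].append(value)

    # Assumption: population standard deviation; swap in statistics.stdev
    # if the sample convention is preferred.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    z = {}
    for team, year, pos, value in rows:
        mu, sigma = stats[(year, pos)]
        z[(team, year, pos)] = (value - mu) / sigma
    return z


# Made-up payroll figures in millions of dollars, NOT real NHL data.
rows = [
    ("A", 2008, "C", 10.0),
    ("B", 2008, "C", 14.0),
    ("C", 2008, "C", 18.0),
    ("A", 2009, "C", 20.0),
    ("B", 2009, "C", 22.0),
    ("C", 2009, "C", 24.0),
]
z = zscores_by_group(rows)
```

In this toy data every team spends more in 2009 than in 2008, yet each team keeps the same Z-score in both years because it holds the same place relative to that year's mean and spread.  That is the year-to-year compensation the Z-scores are meant to provide.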

To find the correlation between total spending and points, we computed the Pearson correlation between the Z-scores for points and the Z-scores for spending.  The error bars are the standard error of the estimated slope of a linear regression of points on spending; because both variables are Z-scores, the slope of the regression is the same as the Pearson correlation.
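The claim that the slope equals the correlation is easy to check numerically.  The sketch below is a hedged illustration in Python rather than our actual spreadsheet: the helper names and the payroll and points figures are invented, and the standard error uses the usual simple-regression formula with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of numbers (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope_and_stderr(x, y):
    """Ordinary least squares slope of y on x and its standard error."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    stderr = sqrt(sse / (n - 2) / sxx)
    return slope, stderr


# Made-up team payrolls (millions) and season points, NOT real NHL data.
payroll = [40.0, 45.0, 50.0, 55.0, 60.0, 52.0]
points = [80.0, 78.0, 95.0, 90.0, 104.0, 88.0]

zx, zy = zscore(payroll), zscore(points)
r = pearson(zx, zy)
slope, stderr = slope_and_stderr(zx, zy)
```

Here `slope` and `r` agree up to floating-point error, and `stderr` plays the role of the error bar.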

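In order to analyze the positional breakdown of relative importance to success on a year-to-year basis, we performed a multivariate linear regression of the Z-scores of the points gained on the Z-scores of the positional salaries.  All regressions were performed with Excel’s LINEST function.  We assumed a simple linear model of the form:

$$Z^{\mathrm{pts}}_{t,y} = w_{LW,y} Z_{t,y,LW} + w_{C,y} Z_{t,y,C} + w_{RW,y} Z_{t,y,RW} + w_{D,y} Z_{t,y,D} + w_{G,y} Z_{t,y,G}$$

where Z^pts is the points Z-score of team t in year y and w is the weight on spending at position p in year y.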

The coefficients of this model represent the relative weights of each type of spending as it relates to points earned.  We used all 30 teams in each year as data points for the regression.  In order to provide context for the model weights obtained for each year, we assumed a reference model in which each position contributes equally to points earned.  The coefficient obtained from a linear regression against such a model is simply the correlation between total spending and points earned; we seek to control for the inflation of the positional weights due simply to an increased influence of the marginal dollar, without regard to where it is spent. 
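To make the regression step concrete, here is a rough Python sketch of a five-position least-squares fit for a single season.  It is only an illustration under stated assumptions: we used Excel's LINEST, not this code; the helper names are hypothetical; the eight "teams" and all of their numbers are invented; and the fit omits an intercept on the assumption that it is unnecessary once both sides are Z-scored within the season.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of numbers (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_weights(columns, target):
    """Least-squares weights (no intercept) via the normal equations."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], target)) for i in range(k)]
    return solve(xtx, xty)


# Made-up positional payrolls (millions) for eight teams, NOT real NHL data.
positions = ["LW", "C", "RW", "D", "G"]
payroll = {
    "LW": [8, 10, 7, 12, 9, 11, 6, 13],
    "C": [15, 12, 18, 14, 20, 11, 16, 13],
    "RW": [9, 7, 11, 8, 6, 12, 10, 5],
    "D": [18, 22, 16, 20, 19, 15, 23, 17],
    "G": [5, 7, 4, 6, 8, 3, 6, 5],
}
points = [88, 95, 80, 101, 99, 76, 92, 90]

z_cols = [zscore(payroll[p]) for p in positions]
z_points = zscore(points)
weights = dict(zip(positions, fit_weights(z_cols, z_points)))
```

Each entry of `weights` is in units of standard deviations of points per standard deviation of payroll at that position.  With only eight invented teams the numbers themselves mean nothing; the real fits used all 30 teams per season.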

We interpret the coefficients of our multivariate regression to be essentially marginal points per dollar (in different units), so the normalized coefficient we show in the chart represents the marginal points per dollar spent in that position above the background relationship between spending and points earned.  We used the formula below:

The x-axis of the last chart in the main post is therefore to be interpreted as "marginal standard deviation of points earned per standard deviation of payroll".  We feel that this metric provides us with a robust way to compare regression coefficients across the multiple years we have in the data.
