Using unsupervised learning and logistic regression on real NHL data to determine what it takes to win in pro hockey
I’ve been working on a new side project, and I’ve posted some of this on Kaggle. I’m trying to use NHL player and team data to determine how to build a winning team. I used k-means clustering, PCA, and logistic regression to accomplish this. My sense is that NHL GMs still determine how to build a team in a smoke-filled room, without a lot of analytics, although maybe I’m wrong. But looking on Kaggle or even some web sites dedicated to hockey analytics, I’ve never seen something similar, so I think it’s pretty innovative work (if I do say so myself).
The idea of this analysis is to use NHL player and team performance data to determine the characteristics of a high-performing NHL team, i.e., what mix of players do they have? With a model like this, we can look at the roster of a team, and determine their chances of going deep into the playoffs, and we can also make suggestions about what kinds of player moves they must make in order to become a great team.
I decided to use regular season player statistics, because this ensures we have a lot of data on every player in the league. However, to determine if a team is a high-performer, I used playoff performance, not regular season performance. It might be worth trying to use regular season data, but playoff performance is the ultimate test, and arguably the nature of the game changes in the playoffs. So I used playoff wins as the performance measure for a good team.
My approach was as follows:
- Use k-means clustering to create clusters of similar players. I used three different groupings–forwards, defenders, and goalies, and created clusters within each grouping
- Use those clusters to characterize the 32 NHL teams for each season between 2008/9 and 2021/22. How many players from each cluster do the teams have?
- Then use logistic regression to see if we can find relationships between teams having certain mixes of players, and going “deep” in the playoffs. e.g., What’s the relationship between having a forward from cluster 3 and going deep in the playoffs? If we know that, we know how important it is to have such players on your roster, and we can draw conclusions regarding the player moves teams should attempt to make in order to improve.
You can look at my Python notebook on Kaggle here, or paste this into your browser: https://www.kaggle.com/code/mesadowski/what-makes-an-nhl-winner