Where there is “Madness” there is an opportunity to model. Every March at eCapital Advisors, when the NCAA Tournament rolls around, we run a company pool like most companies do. For the past two years, however, some of my data science colleagues and I have built predictive models to help us fill out our brackets. Why? Because sports are a fascinating example for applying data science, given their seeming unpredictability. Here is the non-technical process behind how a data scientist applies a machine learning model to a March Madness bracket.

Data and Target Variable

The data all came from Kaggle, a website with data sets for machine learning competitions. We primarily used historic NCAA Tournament game outcomes, along with season-average statistics for each team going back to 2003. The target variable we are trying to predict is whether a team won or lost an NCAA Tournament game.
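To make that setup concrete, here is a minimal sketch of how such a modeling table could be assembled with pandas. The file and column names follow Kaggle's March Madness data sets, but treat them as assumptions rather than our exact pipeline.

```python
import pandas as pd

# Historic tournament results: one row per game with winning and losing team IDs
# (file and column names are assumptions based on the Kaggle March Madness data).
tourney = pd.read_csv("NCAATourneyCompactResults.csv")
tourney = tourney[tourney["Season"] >= 2003]

# Reshape so each row is one (season, team, opponent) observation with a 0/1 target:
# did this team win its tournament game?
wins = tourney[["Season", "WTeamID", "LTeamID"]].rename(
    columns={"WTeamID": "TeamID", "LTeamID": "OppTeamID"}
).assign(won=1)
losses = tourney[["Season", "LTeamID", "WTeamID"]].rename(
    columns={"LTeamID": "TeamID", "WTeamID": "OppTeamID"}
).assign(won=0)
games = pd.concat([wins, losses], ignore_index=True)

# Season-average team statistics can then be merged on (Season, TeamID).
```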

Exploratory Data Analysis: General Trends

Before jumping into the modeling, it is always fun to explore interesting relationships in the data with visualizations. Below we plotted the number of first-round wins by seed since 1985, as the first round is typically the round with the most upsets. As you can see, there is a significant drop-off in wins from 4 seeds to 5 seeds, since the 5 seed plays in the 5 vs. 12 matchups, which have historically produced a lot of upsets.

The number of NCAA Tournament upsets per season is shown in the next graph, varying between 12 and 23 upsets per tournament. This graph shows the variability of the “madness” over the years.

The last visualization shows the number of championships per seed since 1985. As expected, most championships have been won by a top seed. When seed is used as a feature, this history biases the machine learning algorithm toward giving higher seeds a better chance of winning than lower seeds.
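As an illustration, a chart like the first-round-wins-by-seed plot can be produced with a few lines of pandas and matplotlib. This is only a sketch: the file names, columns, and first-round day numbers are assumptions based on the Kaggle data rather than our exact analysis code.

```python
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("NCAATourneyCompactResults.csv")   # assumed file name
seeds = pd.read_csv("NCAATourneySeeds.csv")              # assumed file name
seeds["SeedNum"] = seeds["Seed"].str.extract(r"(\d+)").astype(int)  # e.g. "W05" -> 5

# Attach the winning team's numeric seed to each game.
games = results.merge(
    seeds.rename(columns={"TeamID": "WTeamID", "SeedNum": "WSeedNum"})
         [["Season", "WTeamID", "WSeedNum"]],
    on=["Season", "WTeamID"],
)

# Keep only first-round games (days 136/137 in the Kaggle data; an assumption here).
first_round = games[games["DayNum"].isin([136, 137])]

first_round.groupby("WSeedNum").size().plot(kind="bar")
plt.xlabel("Seed")
plt.ylabel("First-round wins since 1985")
plt.show()
```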

Features: Advanced Basketball Statistics

The features (predictive variables) at our disposal consisted of game-by-game simple box score statistics for each team: field goals made, field goals attempted, three-pointers attempted, free throws made, free throws attempted, offensive and defensive rebounds, assists, turnovers, steals, blocks, and personal fouls. Ten years of basketball research tells us that “advanced statistics,” like assist-to-turnover ratio, are more highly correlated with winning percentage than simple counts such as assists or turnovers on their own.

We calculated more advanced features such as offensive rating (off_rtg), defensive rating (def_rtg), team impact estimate (ie), effective field goal percentage (efg_pct), true shooting percentage (ts_pct), free throw attempt rate (ft_rate), assist-to-turnover ratio (ast_rtio), turnover-to-possession ratio (to_poss), net efficiency rating, also referred to as strength of schedule (sos), and more. These were averaged by season and team and merged with the NCAA Tournament game data for each season. Next, we examined a correlation matrix to see which of these variables correlated highly with winning percentage, represented in the first column.
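For the curious, here is roughly how a few of those advanced statistics can be derived from box score columns. The formulas are the standard basketball-analytics definitions; the data frame and column names are illustrative assumptions, not our exact code.

```python
import pandas as pd

def add_advanced_stats(box: pd.DataFrame) -> pd.DataFrame:
    """Add a handful of advanced statistics to a per-game box score data frame.

    Assumes columns fgm, fga, fgm3, ftm, fta, oreb, tov, ast (names are illustrative).
    """
    box = box.copy()
    pts = 2 * box["fgm"] + box["fgm3"] + box["ftm"]                   # threes add one extra point
    poss = box["fga"] - box["oreb"] + box["tov"] + 0.44 * box["fta"]  # standard possession estimate
    box["efg_pct"] = (box["fgm"] + 0.5 * box["fgm3"]) / box["fga"]
    box["ts_pct"] = pts / (2 * (box["fga"] + 0.44 * box["fta"]))
    box["ft_rate"] = box["fta"] / box["fga"]
    box["ast_rtio"] = box["ast"] / box["tov"]
    box["to_poss"] = box["tov"] / poss
    box["off_rtg"] = 100 * pts / poss
    return box

# Average by season and team, then inspect correlations with winning percentage
# (assumes a win_pct column has been added to the season averages).
season_avgs = add_advanced_stats(box).groupby(["Season", "TeamID"]).mean(numeric_only=True)
print(season_avgs.corr()["win_pct"].sort_values(ascending=False))
```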

In addition to the advanced statistics listed above, we also added ordinal rank, which is essentially an average of all the rankings produced by statistical gurus like Ken Pomeroy and Kenneth Massey, as well as the distance traveled from each team’s campus to the game site location.
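Both features are straightforward to compute. As a hypothetical sketch, averaging Kaggle's Massey ordinals per team and season gives the ordinal rank, and a haversine formula gives the campus-to-venue distance (the file name and the latitude/longitude inputs are assumptions).

```python
import numpy as np
import pandas as pd

# Average each team's rank across all ranking systems for the season.
ordinals = pd.read_csv("MasseyOrdinals.csv")  # assumed Kaggle file name
ordinal_rank = ordinals.groupby(["Season", "TeamID"])["OrdinalRank"].mean()

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between campus and game site coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3959 * np.arcsin(np.sqrt(a))  # Earth radius ~3,959 miles
```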

Model Building and Testing

In machine learning, you typically divide your dataset into two partitions: a training set and a test set. We used the training set to build a model and evaluated its performance on unseen data, i.e. the test set. This gave us a sense of how the predictions would fare on future games, in our case the 2019 NCAA Tournament. We built a series of regression models on the training set, learning coefficients for each of the advanced statistics to predict whether a team won or lost an NCAA Tournament game.
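In code, that split-train-evaluate loop looks something like the sketch below. It assumes a feature matrix X of the season-average advanced statistics and a 0/1 target y; logistic regression stands in as one reasonable choice of regression model for a win/loss target.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: one row per team per tournament game with the advanced-stat features (assumed).
# y: 1 if that team won the game, 0 otherwise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```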

The highest accuracy model correctly predicted the outcome of an NCAA Tournament game in the test set ~74% of the time. Its top ten features were ordinal rank, distance traveled, strength of schedule, true shooting percentage, free throw rate, team impact estimate, assist-to-turnover ratio, offensive rating, steal percentage, and block percentage. The bracket it produced is below, with probabilities. As you can see, most of the time it favored the higher seed, which doesn’t really resemble a lot of the “madness” you’d expect. This year’s tournament was pretty “chalk,” and the model got 15 of 16 Sweet 16 teams correct. However, we also ran a model that did not use ranking or seeding, but overall it wasn’t as accurate, correctly predicting the test set winners only ~69% of the time.
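For a sense of how a feature ranking like this can be read off a fitted model, the sketch below sorts the standardized logistic regression coefficients by magnitude and pulls win probabilities from predict_proba, continuing the assumed pipeline above (not necessarily how we ranked features).

```python
import pandas as pd

# Rank features by the magnitude of their standardized coefficients.
logreg = model.named_steps["logisticregression"]
coefs = pd.Series(logreg.coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False).head(10))

# Win probabilities for each matchup row come from the positive-class column.
win_probs = model.predict_proba(X_test)[:, 1]
```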

Its top features were schedule strength, offensive rating, effective field goal percentage, distance, block percentage, true shooting percentage, and free throw rate. It produced the following bracket. As you can see, the model without rankings or seedings correctly predicted some of the first-round upsets: UC Irvine, Murray State, and Oregon.

Conclusion

While our brackets weren’t perfect, we did well nationally on ESPN, ranking in the 90th percentile. Our data science team was also at the top of the office leaderboard, finishing in 2nd and 3rd place. Let’s pretend a last-minute 3-pointer didn’t push me out of 1st place. I’m still trying to get over it!