SPSS Modeler is statistical analysis software used for data analysis, data mining and forecasting. Statistical analysis allows us to use a sample of data to make predictions about a larger population. Predictive models use the information currently at your fingertips to estimate which decisions will affect your future success. Predictive analytics is important because it lets you base long-term planning on data rather than guesswork.

Decision tree analyses are popular models because they indicate which predictors are most strongly related to the target. The purpose of a decision tree is to model a series of events and look at how they affect an outcome. This type of model calculates a set of conditional probabilities based on different scenarios.

This blog will detail how to create a simple predictive model using a CHAID analysis and how to interpret the decision tree results. In this example I will be predicting student enrollment, which has two categories: Yes, meaning students who did enroll in the university, and No, meaning students who did not enroll.

Creating the Model:

Starting from the Sources tab, I’m going to drag in a Statistics File node and import the .sav file from my local machine.

Decision tree analysis - selecting sources

To view the data, drag in a Table node and attach it to the Statistics File node already on the canvas.

Click Run, then double-click to view the table output.

The next step in the process is to read in the data using a Type node. The Type node specifies metadata and data properties for each field: the measurement level, the data values, the role and the missing value definitions. From the screenshot you can see that the field enroll is our target.

Next, drag in a CHAID node and attach it to the existing Type node. CHAID stands for Chi-squared Automatic Interaction Detection and is one of the more popular decision tree models. Behind the scenes, the model runs the chi-square test many times, once for each candidate predictor at each split. This will make more sense in just a little bit, but essentially the model picks the predictor with the strongest relationship with the outcome field, which is the field whose chi-square test against the target is most significant (loosely, the one with the highest chi-square statistic).
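To make the chi-square idea concrete, here is a minimal Python sketch, run outside of SPSS Modeler, that scores two candidate predictors against a target and picks the stronger one. The records and the field names (overnight_visit, gender) are made up for illustration and are not the enrollment data from this example. It is also a simplification: a full CHAID implementation compares adjusted p-values and merges similar categories before splitting, which this sketch does not attempt.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic for two equal-length categorical lists."""
    n = len(target)
    observed = Counter(zip(predictor, target))  # cell counts
    row_totals = Counter(predictor)             # counts per predictor category
    col_totals = Counter(target)                # counts per target category
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat


# Hypothetical toy records (NOT the data used in this post).
enroll = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
candidates = {
    "overnight_visit": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
}

# Score every candidate predictor against the target.
scores = {name: chi_square(values, enroll) for name, values in candidates.items()}

# The predictor with the largest statistic would be used for the first split.
best = max(scores, key=scores.get)
print(scores)
print("Split on:", best)
```

In this toy data, overnight_visit lines up with enrollment in six of eight records while gender is split evenly, so overnight_visit gets the larger statistic and would be chosen for the split.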

Edit the CHAID node and make sure the target is set correctly. You can also remove any inputs or predictors that you don’t want included in the model. When you’re ready, click the Run icon in the lower left-hand corner to create the model.

If all goes well, you will get the golden nugget, which is the generated model. Double-click the nugget to see the results.

What we have here are the top predictors of enrollment. The top three predictors are financial aid, overnight visit and alumni meeting. In total there were six predictors that the model deemed important. To view the decision tree, click on the Viewer tab.

Interpreting the Results:

The decision tree starts with the root node, which simply shows the distribution of the outcome field, which as we know is enrollment. The data is then split, based on statistical significance, by the predictor with the strongest relationship with the target field, which is financial aid in this case. You can see that financial aid has been split into five “buckets” (0%, 1%-25%, 26%-50%, 51%-75% and 75%+). Looking at those students who were offered a 51%-75% financial aid package, the model predicted that they would enroll, and roughly 54% of them did. This prediction applied to 416 students and the model was accurate 224 times.

As we continue to work our way down the tree, we see that the next most important variable is an overnight visit. If a student was offered a financial aid package of 51%-75% and also took an overnight visit, the model predicted that they would enroll and was right around 93% of the time. Alternatively, if students did not take an overnight visit, the model predicted that they would not enroll and was right about 63% of the time. This rule applied to 289 students and we were accurate about 183 times.
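The percentages quoted for these nodes are just the number of correct predictions divided by the number of students in the node. The short Python sketch below reproduces them from the counts given above. The last part is a derived cross-check, not something read off the tree: it assumes the 416 students in the 51%-75% node are split into exactly two child nodes (visit and no visit) and that "accurate" in the parent node means the student enrolled.

```python
def rate(correct, total):
    """Share of students in a node for whom the prediction was correct."""
    return correct / total


# Counts quoted in the text.
aid_total, aid_enrolled = 416, 224       # financial aid 51%-75%
no_visit_total, no_visit_not_enrolled = 289, 183  # ...and no overnight visit

aid_rate = rate(aid_enrolled, aid_total)
no_visit_rate = rate(no_visit_not_enrolled, no_visit_total)

# Derived under the two-child assumption stated above.
visit_total = aid_total - no_visit_total                 # 127 students
no_visit_enrolled = no_visit_total - no_visit_not_enrolled  # 106 students
visit_enrolled = aid_enrolled - no_visit_enrolled        # 118 students
visit_rate = rate(visit_enrolled, visit_total)

print(f"Aid 51%-75%:          {aid_rate:.1%}")       # 53.8%
print(f"Aid + no visit (No):  {no_visit_rate:.1%}")  # 63.3%
print(f"Aid + visit (Yes):    {visit_rate:.1%}")     # 92.9%
```

Under those assumptions the derived figure for the overnight-visit node comes out at about 93%, which agrees with the value reported in the tree.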

And just like that, we continue to work our way down the tree to the next most significant variable until we reach a terminal node, where no further splits are made and the final prediction is given.
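Walking down the tree to a terminal node can be sketched in a few lines of Python. The nested dictionary below is a hypothetical representation, not how SPSS Modeler stores its model, and it contains only the branches described in this post; any other path returns None rather than a guessed prediction. The field names financial_aid and overnight_visit are illustrative labels.

```python
# Partial tree: only the branches discussed in the text are filled in.
tree = {
    "field": "financial_aid",
    "children": {
        "51%-75%": {
            "field": "overnight_visit",
            "children": {
                # Terminal nodes hold a prediction and its approximate accuracy.
                "Yes": {"prediction": "Yes", "confidence": 0.93},
                "No": {"prediction": "No", "confidence": 0.63},
            },
        },
    },
}


def predict(node, record):
    """Follow the record's field values down the tree to a terminal node."""
    while "children" in node:
        value = record[node["field"]]
        child = node["children"].get(value)
        if child is None:
            return None  # branch not described in this post
        node = child
    return node["prediction"], node["confidence"]


student = {"financial_aid": "51%-75%", "overnight_visit": "Yes"}
print(predict(tree, student))  # ('Yes', 0.93)
```

Each pass through the loop corresponds to one level of the tree in the Viewer tab, and the loop stops as soon as it reaches a node with no further splits.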

This was a simple decision tree aimed at showing which variables help us to accurately predict student enrollment. Keep in mind that predictive analytics can be applied in a variety of industries, including education, retail, healthcare and finance, just to name a few.