Data science is a big area to tackle, but a core part of it is predictive analysis. An accurate prediction, in a business setting, could fairly be described as priceless. So where does one start? For many the choice comes down to R or Python; in this post I’m using R. I’m not going to talk about the history of R – you can look that up yourself. I only started using R fairly recently and I found the learning curve quite difficult, and more than frustrating. Why? Well, I found an article that goes into extensive detail and sums up many of the gripes that I personally have:
http://r4stats.com/articles/why-r-is-hard-to-learn/
OK, so now we know you’re not the only one with R issues, which brings me to Caret. Isn’t that spelt ‘carrot’? Nope, it is definitely Caret, and it stands for Classification And REgression Training. Caret is a wrapper that standardises the interface to many of R’s predictive analysis packages.
http://topepo.github.io/caret/index.html
Caret’s author is Max Kuhn, who is currently a software engineer at RStudio Inc. Check out his LinkedIn profile for his full bio:
https://www.linkedin.com/in/max-kuhn-864a9110
There are quite a few Caret tutorials out there, but I am going to share my own, which is a very quick and dirty introduction. I am by no means a data scientist, nor do I have a statistical background, but we all have to start somewhere, right?
This tutorial specifically uses Caret to do what is called ‘supervised learning’. To do this you need a dataset that has the following:
(a) an outcome or ‘dependent’ variable
and
(b) a set of inputs or predictors, known as ‘independent’ variables
The ‘supervision’ means: can a model be built using a portion of the data, also known as the training data, that then accurately predicts the outcome on the rest of the data? By comparing the predicted outcome to the real outcome, we can see how well the model performed.
This tutorial uses data from direct marketing campaigns of a Portuguese banking institution – which is apparently real; anyway, as you might be aware, obtaining good data is half the battle. This data provides the results of a direct marketing campaign that sold bank term deposits. I found this data at the Center for Machine Learning and Intelligent Systems at the University of California, Irvine (UCI), and it can be downloaded from here:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
So let’s use dataset 4, ‘bank.csv’. It is described as – “bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs)”. In this example we’ll be using the Caret library to interface with the randomForest library. ‘Random Forest’ is the algorithm that we’ll be using to build the model, but of course Caret is designed to give a standard interface to many different algorithms. For more on Random Forests I really liked this blog post by Edwin Chen:
http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests
Now let’s get into the R code. First things first, we need to load the Caret library and the randomForest library.
> # Step 1: Load libraries
> # install.packages("caret")
> # apparently Caret has a dependency on 'e1071'
> # install.packages("e1071")
> # install.packages("randomForest")
> library(caret)
> library(e1071)
> library(randomForest)
Next we load the data.
> # Step 2: Load data
> bankData <- read.csv("/Users/leighroy-mbp/Documents/R Scripts/Caret Example/bank.csv", header = TRUE, sep = ';')
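If you want a quick sanity check that the file loaded correctly – this step isn’t in the original script, it’s just a suggestion – you can inspect the structure of the data frame and the outcome column:

> # Optional: a quick look at what was loaded
> str(bankData)      # column names, types and a preview of the values
> table(bankData$y)  # the outcome column 'y' should contain 'no' and 'yes'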
The next step is to create a training dataset for the supervised learning, using the ‘createDataPartition()’ function from the Caret library.
> # Step 3: Create training and test datasets
> # create the training data
> trainingIndex <- createDataPartition(bankData[,17], p = .75, list = FALSE, times = 1)
> trainingData <- bankData[trainingIndex,]
> # create the test data
> fullIndex <- 1:nrow(bankData)
> # setdiff is used to select the remaining rows that weren't selected in the training data
> testIndex <- setdiff(fullIndex, trainingIndex)
> testData <- bankData[testIndex,]
The ‘p = .75’ means randomly take 75% of the rows (sampled within each level of the outcome – here column 17, ‘y’ – so the ‘yes’/‘no’ proportions are preserved), and ‘times’ is how many partitions to create.
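As an aside – and this isn’t part of the original script – R’s negative indexing builds the same test set in one line, so the fullIndex/setdiff step can be skipped if you prefer:

> # Equivalent, shorter way to build the test set:
> # keep the rows whose index is NOT in trainingIndex
> testData <- bankData[-trainingIndex,]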
Now that we have our data ready it is time to build the model. To do this we use the Caret function called ‘train()‘.
> # Step 4: Build the model
> model <- train(y ~ duration + pdays + age,
>                data = trainingData, # Use the trainingData dataframe as the training data
>                method = 'rf', # 'rf' is Random Forests
>                importance = TRUE, # so Random Forests calculates variable importance (used in Step 5)
>                trControl = trainControl(method = 'repeatedcv', number = 5) # Use 5 folds for cross-validation
>                )
The ‘train()’ function takes the outcome (or dependent) variable first, followed by the ‘~’ (tilde), then each input variable you would like to add to the model, separated by ‘+’ symbols. The next arguments are ‘data’, which is the training data, and ‘method’ (the algorithm); ‘rf’ is Random Forests. For a list of all algorithms that can be used with Caret see:
https://topepo.github.io/caret/train-models-by-tag.html
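As a side note, if you don’t want to hand-pick predictors yet, the formula interface accepts a ‘.’ to mean ‘every other column’. This variant isn’t in the original script (and ‘modelAll’ is just a name I’ve made up for it); training on all the inputs will likely take noticeably longer:

> # Hypothetical variant: use every column except 'y' as a predictor
> modelAll <- train(y ~ .,
>                   data = trainingData,
>                   method = 'rf',
>                   importance = TRUE,
>                   trControl = trainControl(method = 'repeatedcv', number = 5))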
‘importance’ needs to be set to ‘TRUE’ so that Random Forests calculates variable importance, which we use in the next step. We then assign the result of the ‘trainControl()’ function to the ‘trControl’ argument. This basically tells ‘train()’ how to resample the training data – it is split into 5 folds, and each candidate model is trained and validated across those folds – which helps guard against overfitting.
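Since ‘repeatedcv’ stands for repeated cross-validation, you can also state the number of repeats explicitly. The call below isn’t in the original script; it is just a sketch of the ‘repeats’ argument that ‘trainControl()’ accepts:

> # Sketch: 5-fold cross-validation, repeated 3 times
> ctrl <- trainControl(method = 'repeatedcv', number = 5, repeats = 3)
> # then pass it in via train(..., trControl = ctrl)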
Now that there is a model, let’s look at which inputs, or predictors, have the most influence on predicting the outcome. With a little trial and error I found that the predictors ‘duration’, ‘pdays’ and ‘age’ were the most influential. I’m sure you could automate this step – Caret itself ships a function for searching predictor subsets (see the sketch after the code below). Predictors with negative importance values actually make the model less accurate, so don’t use those.
> # Step 5: Check variable importance
> varImportance <- varImp(model, scale = FALSE)
> print(varImportance)
> plot(varImportance)
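If you would rather not hunt for the best predictors by hand, Caret’s recursive feature elimination function, ‘rfe()’, can search predictor subsets for you. The call below is only a rough sketch – the subset sizes are arbitrary, the outcome is assumed to be column 17 (‘y’), and it can take a while to run:

> # Sketch: let Caret search for a good predictor subset via recursive feature elimination
> rfeCtrl <- rfeControl(functions = rfFuncs, method = 'cv', number = 5)
> rfeResult <- rfe(x = trainingData[, -17],            # all columns except the outcome
>                  y = as.factor(trainingData[, 17]),  # the outcome column 'y', as a factor
>                  sizes = c(2, 3, 5, 8),              # predictor subset sizes to try
>                  rfeControl = rfeCtrl)
> print(rfeResult)
> predictors(rfeResult) # the selected predictors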
The last step is to run the model against the remaining 25% of the data – the test data that the model hasn’t seen. First we use the ‘predict()’ function to add a new column to our test data containing the predicted result.
> # Step 6: Test the model against data it hasn't seen
> testData$myPrediction <- predict(model, newdata = testData)
Now we can look at how well the model predicted. Initially my model was apparently achieving great results, but then I realised that the large number of true negatives was skewing the results. This is known as the ‘Accuracy Paradox’. So after some head-twisting reading about the ‘Confusion Matrix’, I settled on a measure in the spirit of the true positive rate (TPR). In this example it keeps things simple: it compares the number of ‘yes’ outcomes the model predicted against the number of ‘yes’ outcomes that actually occurred in the test data.
> # Step 7: View the results
> actualResult <- table(testData$y)
> actualResultNumOfYes <- actualResult[names(actualResult) == "yes"]
> predictedResult <- table(testData$myPrediction)
> predictedResultNumOfYes <- predictedResult[names(predictedResult) == "yes"]
> modelAccuracy <- predictedResultNumOfYes / actualResultNumOfYes * 100
> result <- paste("The model was :", toString(modelAccuracy), "% accurate for 'yes' predictions")
> print(result)
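For a cross-check, Caret also provides a ‘confusionMatrix()’ function that reports the true positive rate directly (as ‘Sensitivity’), along with the full confusion matrix. This isn’t part of the original script, just an alternative view of the same results (both columns need to be factors with matching levels):

> # Alternative: full confusion matrix, treating 'yes' as the positive class
> confusionMatrix(data = testData$myPrediction,
>                 reference = as.factor(testData$y),
>                 positive = "yes")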
Now the interesting thing is – because each run of the script randomly picks a different training/test split, the model accuracy varied between 60% and 80%. Still, in the real world that ain’t too bad. Maybe try running the model on the full dataset and see what you get.
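If you want to try that last suggestion, the same UCI page also has a ‘bank-full.csv’ file containing all of the examples. A rough sketch, assuming the file has been downloaded to your working directory and uses the same semicolon separator:

> # Sketch: score the model against the full dataset
> bankFull <- read.csv("bank-full.csv", header = TRUE, sep = ';')
> bankFull$myPrediction <- predict(model, newdata = bankFull)
> table(predicted = bankFull$myPrediction, actual = bankFull$y)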