A Nature Walk Into Machine Learning
A simple tutorial on how to use random forests in R programming.
The Power of Machine Learning and Random Forests
If you want to know how to implement random forests in R, then you have come to the right place! Machine learning is a powerful tool that helps us accurately predict certain outcomes by creating a model with the information we give it. In other words, it ‘learns’ from what it is given and predicts on data that hasn’t been seen yet. Isn’t that amazing?! In this article, you will get the chance to wield this power by making a random forest model using a classic Kaggle dataset called Titanic, predicting which passengers lived and died based on their background information. To grab the datasets, and if you want to learn more about Kaggle, click this link.
Steps To Making a Successful Random Forest
Load in Libraries and Datasets
The first step we must take is installing and loading the libraries we will use for this random forest example. We won’t go into depth about what each of the libraries does, but if you are curious, you can follow this link to the R documentation website to learn more.
library(tidyverse)
library(viridis)
library(vroom)
library(ranger)
library(missForest)
library(DataExplorer)  # provides plot_missing(), used below
# Note: the caret package must also be installed; we call caret::createFolds() later
Next, we will load the two datasets of Titanic passengers. The difference between the two is that the training dataset will be used to build our model, and the test dataset will be plugged into the finished model. A good way of telling them apart is that the test dataset does not have the response variable Survived; the random forest model will produce those predictions for the test data.
# Loading The Data --------------------------------------------------------
train <- vroom("~/Desktop/RA /RA/titanic/train.csv") %>%
  mutate(Survived = as.factor(Survived))
test <- vroom("~/Desktop/RA /RA/titanic/test.csv")
Combining The Datasets
Before we can train a random forest model, we need to do some data cleaning on both datasets. We must make sure that there is no missing data, or our model cannot be created. Using the function plot_missing(), we can produce a graph that shows us which columns have missing data. To make our lives easier, we will combine the two datasets and clean them at the same time, which saves lines of code.
# Drop Survived so the training columns match the test columns before stacking
train1 <- select(train, -c(Survived))
combine <- rbind(train1, test)
plot_missing(combine)
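If you would rather see a quick numeric summary than a plot, one line of base R reports the same information:
# Base R alternative: count of missing values per column
colSums(is.na(combine))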
Removing Columns and Factoring Columns
For the sake of simplicity, we are going to delete some of the columns. They do contain useful information that could help train our model, and pulling that information out of them is known as feature engineering. This article will not cover that topic, but you can learn more in this article here. As a quick taste, the sketch just below shows one feature we could engineer from the Name column before dropping it.
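This sketch is purely illustrative and is not used in the rest of this tutorial; it relies on the stringr functions that load with the tidyverse.
# Hypothetical sketch (not used below): pull each passenger's title
# (Mr, Mrs, Miss, ...) out of the Name column as an extra feature
with_titles <- combine %>%
  mutate(Title = str_remove(str_extract(Name, "[A-Za-z]+\\."), "\\."))
table(with_titles$Title)
With that aside, here is the column removal and factoring we will actually do.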
combine <- combine %>%
  mutate(Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked)) %>%
  select(-c(Ticket, Name, Cabin))
Another thing we will do is turn some of our discrete variables into factors so that they are treated as categories rather than as numbers. We will also convert the binary categorical variables into columns of 1s and 0s (dummy variables), which can make them easier for the model to work with.
#Male and female
combine$Male[combine$Sex == 'male'] <- 1
combine$Male[combine$Sex != 'male'] <- 0
combine$Male <- as.factor(combine$Male)
combine$Female[combine$Sex == 'female'] <- 1
combine$Female[combine$Sex != 'female'] <- 0
combine$Female <- as.factor(combine$Female)
#Embarked
combine$Emb_C[combine$Embarked == "C"] <- 1
combine$Emb_C[combine$Embarked != "C"] <- 0
combine$Emb_C <- as.factor(combine$Emb_C)
combine$Emb_S[combine$Embarked == "S"] <- 1
combine$Emb_S[combine$Embarked != "S"] <- 0
combine$Emb_S <- as.factor(combine$Emb_S)
combine$Emb_Q[combine$Embarked == "Q"] <- 1
combine$Emb_Q[combine$Embarked != "Q"] <- 0
combine$Emb_Q <- as.factor(combine$Emb_Q)
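As a side note, the dummy-variable block above could be written more compactly with ifelse(). This is just an equivalent sketch, not an extra step: it produces the same columns (including the NAs for passengers missing an Embarked value, which the imputation step below fills in), so running it after the block above is harmless.
# Equivalent, more compact version of the dummy-variable code above
combine <- combine %>%
  mutate(Male   = as.factor(ifelse(Sex == "male", 1, 0)),
         Female = as.factor(ifelse(Sex == "female", 1, 0)),
         Emb_C  = as.factor(ifelse(Embarked == "C", 1, 0)),
         Emb_S  = as.factor(ifelse(Embarked == "S", 1, 0)),
         Emb_Q  = as.factor(ifelse(Embarked == "Q", 1, 0)))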
Imputing Dataset and Uncombining Datasets
The last piece of cleaning is an imputation function that fills in the missing ages for some of the passengers, after we first give the missing Fare value the mean of that column. Using the same plotting function, we can then see that there is no more missing data! Finally, we separate the test and train datasets back into their own objects.
#Get rid of old columns
combine <- combine %>%
  select(-c(Sex, Embarked)) %>%
  as.data.frame()
# Impute the missing Fare value with the column mean
combine$Fare[is.na(combine$Fare)] <- mean(combine$Fare, na.rm = TRUE)
# Impute the remaining missing values (such as Age) with missForest
imputed <- missForest(combine, ntree = 100)
combine <- imputed$ximp
plot_missing(combine)
# Separating the data out again
new_test <- combine[892:nrow(combine),]
new_train <- combine[1:891,] %>% mutate(Survived = train$Survived)
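Before training, a quick optional sanity check (plain base R) confirms that the re-split data has the right number of rows in each piece and no missing values left over:
# Optional sanity check on the re-split data
stopifnot(nrow(new_train) == nrow(train),
          nrow(new_test)  == nrow(test),
          !anyNA(new_train), !anyNA(new_test))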
Training a Random Forest
The code below conducts cross-validation, which means trying out parameter settings to discover the best ones to use in our random forest model. mtry is the number of variables randomly considered when splitting a node of a tree, n_trees is how many trees to grow, and min_n is the minimum number of observations allowed in a terminal leaf of a tree.
##############
## Train a Random forest
#############
## Set Possible Tuning Parameters
mtry.grid <- 1:2
n_trees <- 100
min_n <- c(1:10)
tune.grid <- expand.grid(mtry=mtry.grid, min_n=min_n)
## Split dataset into K-pieces for K-fold cross validation
K <- 5
folds <- caret::createFolds(new_train$Survived, k=K, list=FALSE)
## Run Cross-validation
misclass <- matrix(0, nrow=nrow(tune.grid), ncol=K)
for(p in 1:nrow(tune.grid)){
  for(cv in 1:K){
    ## Fit the RF on all folds except the held-out one
    rf <- ranger(formula = Survived~.,
                 data=new_train[folds!=cv,],
                 num.trees=n_trees,
                 mtry=tune.grid$mtry[p],
                 min.node.size=tune.grid$min_n[p])
    ## Predict the held-out fold
    preds <- predict(rf, data=new_train[folds==cv,])
    ## Calculate the misclassification rate
    misclass[p,cv] <- mean(preds$predictions != new_train$Survived[folds==cv])
  }
}
misclass <- rowMeans(misclass)
ggplot() + geom_raster(aes(x=tune.grid$mtry, y=tune.grid$min_n, fill= misclass)) +
scale_fill_viridis()
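If you would rather read the results off a table than a heatmap, a couple of extra lines of base R sort the parameter settings by their cross-validated error:
## View the cross-validation results as a table, best settings first
cv_results <- cbind(tune.grid, misclass)
head(cv_results[order(cv_results$misclass), ])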
Using Best Parameters and Predicting
Now that we have discovered our best parameters, we will plug them into a final random forest model trained on the full training set and then predict using our test dataset!
## Retrain the RF using the best setting
mtry <- tune.grid$mtry[which.min(misclass)]
min_n <- tune.grid$min_n[which.min(misclass)]
rf <- ranger(formula = Survived~.,
             data=new_train,
             num.trees=n_trees,
             mtry=mtry,
             min.node.size=min_n)
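## (Optional) ranger also stores an out-of-bag error estimate for the fitted
## model, which gives a rough sense of accuracy without touching the test set
rf$prediction.error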
## Using the test data set
ship_preds <- predict(rf, data = new_test)
## Put the predictions into a dataframe and create a submission file
predict_data <- data.frame(PassengerId = new_test$PassengerId,
                           Survived = ship_preds$predictions)
predict_data
write_csv(predict_data,"predict_titanic.csv")
The Tree Line Doesn’t Stop Here!
You have learned how to impute data, tune parameters, and predict using a trained random forest model in R! However, this is ONLY a basic tutorial on how to harness the power of random forests. There is so much more that can be done with this particular dataset to build a more accurate model. If you are really interested in enhancing your data science skills, I encourage you to learn more about the power of random forests!