Training a Random Forest, testing, model evaluation, and validation.
To follow along, complete the steps in the previous posts before moving on:
The first thing we’ll do is split our data into a training set and a testing set. The training set will be used to “train” our model, and the testing set will be used to ensure that we haven’t overfit that model. A good model generalizes well: it can be applied to unseen data while providing similar performance, hence the separate testing set to validate this notion.
A widely used machine learning library, scikit-learn, includes functionality to randomly split our dataframe:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
train, test = train_test_split(df, test_size=0.20)
```
This created two new dataframes with 80% of the data in train and 20% in test, which we’ll now convert to arrays using the Pandas values method:
```python
train_data = train.values
test_data = test.values
test_data
```

```
array([[ 0.        ,  1.        ,  0.        , ...,  0.        ,  0.        ,  0.03125   ],
       [ 0.        ,  1.        ,  0.        , ...,  0.        ,  0.        ,  0.02678571],
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,  0.        ,  0.02564103],
       ...,
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,  0.        ,  0.04878049],
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,  0.        ,  0.05284553],
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,  0.        ,  0.03448276]])
```
Definitely not as pretty nor digestible as the dataframe representation, but this array conversion is necessary before fitting our model.
A random forest is an ensemble of decision trees, used here for classification, which is particularly useful for our purpose since we’re trying to determine a 0/1 response (software developer vs. non-developer). I highly recommend taking a look at this Visual Introduction to Machine Learning before moving on to see exactly how a decision tree and a random forest work.
We’ll use scikit-learn’s RandomForestClassifier to create a forest containing 100 trees and pass our training data through:
```python
from sklearn.ensemble import RandomForestClassifier

x_train = train_data[0::, 1::]  # Features
y_train = train_data[0::, 0]    # IsSoftwareDev

# Random forest object with 100 trees
forest = RandomForestClassifier(n_estimators=100)

# Fit the training data to the IsSoftwareDev labels and create the decision trees
forest = forest.fit(x_train, y_train)
```
Using the score method, we can determine the accuracy of the random forest predictions:
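On our data the call would be `forest.score(x_train, y_train)`. As a self-contained illustration of the pattern, here is a minimal sketch on synthetic stand-in data (the features and labels below are illustrative, not from the survey):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in data: 200 rows, 5 features, binary labels
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)  # label depends only on the first feature

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest = forest.fit(X, y)

# Accuracy on the same data the forest was fit on -- expect it to be (near) perfect
train_accuracy = forest.score(X, y)
print(train_accuracy)
```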
The training accuracy is 100%, i.e. all of the observations were classified as software developers (or not) correctly. This shouldn’t be shocking since our model was fit on the training data. Let’s see how it does on the unseen test set:
```python
x_test = test_data[0::, 1::]
y_test = test_data[0::, 0]
forest.score(x_test, y_test)
```
The test accuracy is 74% vs. the ~50% baseline we could have achieved by guessing randomly. The gap between the 100% training accuracy and the 74% test accuracy is a sign of overfitting, but by tweaking the model, assumptions, and features we may be able to improve the accuracy.
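Rather than eyeballing the ~50% baseline, scikit-learn’s `DummyClassifier` can compute the majority-class reference score directly. A minimal sketch on synthetic stand-in data (not the survey data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative stand-in data with roughly balanced 0/1 labels
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = rng.randint(0, 2, size=100)

# "most_frequent" always predicts the majority class -- the floor any real model should beat
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)  # close to 0.5 for balanced classes
```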
Here’s the receiver operating characteristic (ROC) curve:
```python
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Predicted probability of the positive class (IsSoftwareDev = 1)
y_pred_rf = forest.predict_proba(x_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.style.use('ggplot')
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random guessing
plt.plot(fpr_rf, tpr_rf)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.show()
```
The area under the curve (AUC) is a common evaluation metric for binary classification models:
```python
from sklearn import metrics
metrics.auc(fpr_rf, tpr_rf)
```
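The same number also comes straight from `roc_auc_score`, which skips the intermediate `roc_curve` step. A quick self-contained sketch on toy labels and scores (illustrative values, not our model’s output):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_score)
curve_auc = auc(fpr, tpr)            # area computed from the curve points
direct_auc = roc_auc_score(y_true, y_score)  # same value in one call
print(curve_auc, direct_auc)
```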
And using the breakouts below we can give our model a grade:
- .90-1.0 = Excellent (A)
- .80-.90 = Good (B)
- .70-.80 = Fair (C)
- .60-.70 = Poor (D)
- .50-.60 = Fail (F)
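A tiny helper can turn that scale into code. The cutoffs below mirror the table above; how to bucket an exact boundary value like 0.80 is a choice, and here it gets the higher grade:

```python
def auc_grade(auc_value):
    """Map an AUC in [0.5, 1.0] to a letter grade per the scale above."""
    if auc_value >= 0.90:
        return "A"
    elif auc_value >= 0.80:
        return "B"
    elif auc_value >= 0.70:
        return "C"
    elif auc_value >= 0.60:
        return "D"
    return "F"

print(auc_grade(0.74))  # -> C
```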
Overall, not bad for our first attempt but there’s still some room for improvement.
Finally, let’s take a look at which features provided the greatest predictive value. The feature_importances_ attribute holds these values, and we’ll use matplotlib to create a bar plot to visualize them:
```python
%matplotlib inline
import matplotlib.pyplot as plt

feature_names = df.columns.values[1:]
importances = forest.feature_importances_
imp = pd.DataFrame(importances, index=feature_names).sort_values(0)
imp.plot.barh(figsize=(8, 6), fontsize=12)
```
Without getting into what the numbers actually mean, the feature we created from MonthsProgramming and AgeFill ended up having the highest relative importance. Finishing the bootcamp program, using multiple learning resources, going to different types of code events, and listening to coding podcasts all ranked highly as well.
Ideally, we’d also do an out-of-time validation. If FreeCodeCamp conducts the same survey again next year, we could really test how well our random forest model predicts software developers.
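In code, out-of-time validation is just fitting on one period and scoring on the next, with no shuffling across the time boundary. A sketch with a hypothetical `year` column (the survey data has no such column; the dataset here is purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical two-wave dataset; "year" marks the survey wave
rng = np.random.RandomState(2)
df = pd.DataFrame(rng.rand(300, 3), columns=["f1", "f2", "f3"])
df["label"] = (df["f1"] > 0.5).astype(int)
df["year"] = np.where(np.arange(300) < 200, 2017, 2018)

# Fit on the earlier wave, score on the later one
train = df[df["year"] == 2017]
holdout = df[df["year"] == 2018]
features = ["f1", "f2", "f3"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train[features], train["label"])
holdout_accuracy = forest.score(holdout[features], holdout["label"])
print(holdout_accuracy)
```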
A lot of material has been covered over the last few posts, and I’d highly encourage going back to see whether anything could have been done differently to yield better results, e.g. trying a different machine learning technique, selecting a different set of features, or transforming the features.