Cleaning data and engineering features in preparation for building a machine learning model.

Now that we’ve imported and explored the data, we’re only one step away from building our machine learning model: cleaning the data. If you haven’t done so already, refer to the Data Exploration post to catch up to this point.

Let’s continue using our previously opened notebook and focus our efforts on the following variables: Age, BootcampName, BootcampFinish, BootcampRecommend, MonthsProgramming, CodeEvents, PodcastsListen, and ResourcesUse.

Missing Values

In the previous post we noticed that some of the variables contained “NaN” values. These correspond to survey questions that, for one reason or another, the respondent chose not to answer, leaving the value null (missing):

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890 entries, 0 to 889
Data columns (total 15 columns):
ID                   890 non-null object
IsSoftwareDev        890 non-null int64
Age                  777 non-null float64
Gender               777 non-null object
BootcampName         890 non-null object
BootcampFinish       890 non-null int64
BootcampLoan         890 non-null int64
BootcampRecommend    890 non-null int64
HoursLearning        878 non-null float64
MonthsProgramming    890 non-null int64
CityPopulation       780 non-null object
CodeEvents           890 non-null int64
PodcastsListen       890 non-null int64
ResourcesUse         890 non-null int64
SchoolDegree         787 non-null object
dtypes: float64(2), int64(8), object(5)
memory usage: 104.4+ KB

Out of the list of variables we’re concerned about, Age is the only one that has missing values. There are many ways we can go about imputing these missing values, but for simplicity let’s fill them in with the median age across the entire population.

Make a copy of Age so we can retain the original values:

df['AgeFill'] = df['Age']

Calculate and store the median age:

median_age = df['Age'].median()
median_age
29.0

Impute the median age for the missing values only:

df.loc[df['Age'].isnull(), 'AgeFill'] = median_age
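As an aside, the copy-and-fill can also be done in a single step with `fillna`. A minimal sketch on a toy frame (the column names mirror ours, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values are made up)
df = pd.DataFrame({'Age': [31.0, np.nan, 27.0, np.nan, 34.0]})

# Copy Age and fill its gaps with the median in one step
df['AgeFill'] = df['Age'].fillna(df['Age'].median())

print(df['AgeFill'].tolist())  # the two NaNs become 31.0 (the median)
```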

Additionally, the fact that age was left blank could itself be predictive, so let’s capture that in an indicator variable:

df['AgeIsNull'] = pd.isnull(df.Age).astype(int)

To verify we did all of this correctly:

df[df['Age'].isnull()][['Age','AgeFill','AgeIsNull']].head()
Age AgeFill AgeIsNull
17 NaN 29.0 1
19 NaN 29.0 1
21 NaN 29.0 1
27 NaN 29.0 1
32 NaN 29.0 1

And finally here’s our dataframe with the newly added variables:

df.head()
ID IsSoftwareDev Age Gender BootcampName BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming CityPopulation CodeEvents PodcastsListen ResourcesUse SchoolDegree AgeFill AgeIsNull
0 fcec97ea81a48afefd45fdaa0ba38ffb 0 31.0 male General Assembly 0 1 1 40.0 3 100,000 - 1 million 1 0 1 Bachelor's and Higher 31.0 0
1 fedcfbfd105c8f6a5242bd99355eefca 1 27.0 male Flatiron School 1 0 1 15.0 36 >1 million 2 1 4 Bachelor's and Higher 27.0 0
2 fe77569c98663547019c8cc265d77527 1 34.0 male App Academy 1 0 1 5.0 24 >1 million 2 0 1 Bachelor's and Higher 34.0 0
3 ffe5c4e4932babee53c26fa49f2a409c 0 33.0 male Other 1 0 1 18.0 36 100,000 - 1 million 1 0 7 Bachelor's and Higher 33.0 0
4 ffb4b6e4b0d1852b5c15144f3ea50f3d 0 21.0 male Other 0 0 1 10.0 7 <100,000 0 0 10 Less than Bachelor's 21.0 0

Categorical Values

BootcampName is categorical, so we’ll have to convert it to a numerical representation by creating “dummy” variables that indicate whether or not (0 or 1) the respondent attended a particular bootcamp. The Pandas get_dummies function makes this fairly quick and easy (the rename simply strips the spaces out of the resulting column names):

dummies = pd.get_dummies(df['BootcampName']).rename(columns=lambda x: x.replace(' ', ''))

This created a new dataframe with 10 dummy variables, one for each unique BootcampName:

dummies.head()
AppAcademy DevBootcamp FlatironSchool GeneralAssembly HackReactor HackbrightAcademy Other PrimeDigitalAcademy TheIronYard Turing
0 0 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 1 0 0 0

Let’s concatenate this dataframe to our existing one:

df = pd.concat([df,dummies], axis=1)

Done! Here’s the updated dataframe with the newly created dummy variables:

df.head()
ID IsSoftwareDev Age Gender BootcampName BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming ... AppAcademy DevBootcamp FlatironSchool GeneralAssembly HackReactor HackbrightAcademy Other PrimeDigitalAcademy TheIronYard Turing
0 fcec97ea81a48afefd45fdaa0ba38ffb 0 31.0 male General Assembly 0 1 1 40.0 3 ... 0 0 0 1 0 0 0 0 0 0
1 fedcfbfd105c8f6a5242bd99355eefca 1 27.0 male Flatiron School 1 0 1 15.0 36 ... 0 0 1 0 0 0 0 0 0 0
2 fe77569c98663547019c8cc265d77527 1 34.0 male App Academy 1 0 1 5.0 24 ... 1 0 0 0 0 0 0 0 0 0
3 ffe5c4e4932babee53c26fa49f2a409c 0 33.0 male Other 1 0 1 18.0 36 ... 0 0 0 0 0 0 1 0 0 0
4 ffb4b6e4b0d1852b5c15144f3ea50f3d 0 21.0 male Other 0 0 1 10.0 7 ... 0 0 0 0 0 0 1 0 0 0
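For reference, `get_dummies` can also take the whole dataframe plus a `columns` list, replacing the categorical column with its dummies in one call. A small sketch with made-up data (the empty `prefix` and `prefix_sep` keep the bare bootcamp names):

```python
import pandas as pd

# Made-up miniature of the survey frame
df = pd.DataFrame({'ID': [0, 1, 2],
                   'BootcampName': ['App Academy', 'Flatiron School', 'App Academy']})

# One call: drop BootcampName and append one dummy column per bootcamp,
# then strip spaces from the new column names as before
df = pd.get_dummies(df, columns=['BootcampName'], prefix='', prefix_sep='')
df = df.rename(columns=lambda x: x.replace(' ', ''))

print(list(df.columns))  # ['ID', 'AppAcademy', 'FlatironSchool']
```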

Feature Engineering

So far we’ve created 12 new variables: 10 from BootcampName and 2 from Age. By doing this, we engineered features (variables) for our model. Let’s engineer one more: the fraction of their life a respondent has spent programming, i.e. months programming divided by age in months:

df['MonthsProgramming/AgeFill'] = df.MonthsProgramming / (df.AgeFill * 12)

We’ll see in the next post if this actually turns out to be a good predictive feature.
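Since this is a ratio, it’s worth a quick sanity check that the values land where we’d expect — between 0 and 1, since nobody has been programming longer than they’ve been alive. A sketch using the first three rows shown above:

```python
import pandas as pd

# First three rows of the survey data from the tables above
df = pd.DataFrame({'MonthsProgramming': [3, 36, 24],
                   'AgeFill': [31.0, 27.0, 34.0]})

# Fraction of the respondent's life spent programming (age in months)
df['MonthsProgramming/AgeFill'] = df.MonthsProgramming / (df.AgeFill * 12)

# Sanity check: a lifetime fraction should fall between 0 and 1
assert df['MonthsProgramming/AgeFill'].between(0, 1).all()
print(df['MonthsProgramming/AgeFill'].round(6).tolist())
```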

Final Preparation

As a final step, let’s drop the variables that we decided not to include in our model, as well as any redundant variables:

df = df.drop(['ID','Age','Gender','BootcampName','BootcampLoan','HoursLearning','CityPopulation','SchoolDegree'], axis=1) 

This should leave us with 19 features to build from (plus IsSoftwareDev, which will be our target):

df.head()
IsSoftwareDev BootcampFinish BootcampRecommend MonthsProgramming CodeEvents PodcastsListen ResourcesUse AgeFill AgeIsNull AppAcademy DevBootcamp FlatironSchool GeneralAssembly HackReactor HackbrightAcademy Other PrimeDigitalAcademy TheIronYard Turing MonthsProgramming/AgeFill
0 0 0 1 3 1 0 1 31.0 0 0 0 0 1 0 0 0 0 0 0 0.008065
1 1 1 1 36 2 1 4 27.0 0 0 0 1 0 0 0 0 0 0 0 0.111111
2 1 1 1 24 2 0 1 34.0 0 1 0 0 0 0 0 0 0 0 0 0.058824
3 0 1 1 36 1 0 7 33.0 0 0 0 0 0 0 0 1 0 0 0 0.090909
4 0 0 1 7 0 0 10 21.0 0 0 0 0 0 0 0 1 0 0 0 0.027778
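An equivalent approach is to select a keep-list instead of dropping, which makes the final feature set explicit in one place. A minimal sketch with a couple of toy columns (made-up values):

```python
import pandas as pd

# Toy frame with a mix of keep and drop columns (values are made up)
df = pd.DataFrame({'ID': ['abc123'], 'Age': [31.0],
                   'IsSoftwareDev': [0], 'AgeFill': [31.0]})

# Keep-list form of the same step: name the columns that stay
keep = ['IsSoftwareDev', 'AgeFill']
df = df[keep]

print(list(df.columns))  # ['IsSoftwareDev', 'AgeFill']
```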

Wrap up

We’ve dealt with missing values, converted categorical variables to numeric, engineered our own features, and are now ready to apply machine learning. In the next post we’ll do exactly that by building a Random Forest model to predict which survey respondents became software developers.

Additional Reading

Pandas Documentation