Importing data, exploring with Pandas, and visualizing with Seaborn.

Import Data

For this and the rest of the posts in the series, we’ll explore survey responses, provided by Free Code Camp, from 15,000 people who are actively learning to code. I’ve extracted a subset of the data (respondents who have attended a coding bootcamp), which you can download in .csv format here. Ultimately, we’ll build a machine learning model to predict which respondents became software developers.

Open up a new Jupyter notebook (refer to Getting Setup if you need help) and import the .csv file using the built-in functionality of Pandas (change the pathname to where you have it stored locally):

import pandas as pd

df = pd.read_csv('/yourpathname/bootcamp.csv', header=0)

Explore

That was easy! Now that we’ve imported the data, let’s take a look at the first few rows:

df.head(3)
ID IsSoftwareDev Age Gender BootcampName BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming CityPopulation CodeEvents PodcastsListen ResourcesUse SchoolDegree
0 fcec97ea81a48afefd45fdaa0ba38ffb 0 31.0 male General Assembly 0 1 1 40.0 3 100,000 - 1 million 1 0 1 Bachelor's and Higher
1 fedcfbfd105c8f6a5242bd99355eefca 1 27.0 male Flatiron School 1 0 1 15.0 36 >1 million 2 1 4 Bachelor's and Higher
2 fe77569c98663547019c8cc265d77527 1 34.0 male App Academy 1 0 1 5.0 24 >1 million 2 0 1 Bachelor's and Higher

Similarly, for the last five rows we can run:

df.tail() # Defaults to five when argument left blank
ID IsSoftwareDev Age Gender BootcampName BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming CityPopulation CodeEvents PodcastsListen ResourcesUse SchoolDegree
885 01231cbb610953d802e1e0a591a66053 1 25.0 female Dev Bootcamp 1 1 1 10.0 10 >1 million 4 0 1 Bachelor's and Higher
886 00c4ee169a6097f617732072b5304ba3 0 32.0 male Other 1 0 1 10.0 4 100,000 - 1 million 0 0 3 Bachelor's and Higher
887 00c33543e86585235b2556a654e33906 0 NaN female Hackbright Academy 1 1 1 35.0 12 >1 million 4 0 3 Bachelor's and Higher
888 02590cc39b94751bc6759aab7f0c93b6 1 30.0 NaN The Iron Yard 1 0 1 12.0 18 >1 million 1 0 5 Bachelor's and Higher
889 01e25d485ad926725172f88917e755f3 0 NaN NaN General Assembly 1 1 1 30.0 11 NaN 4 2 5 NaN

You’ll notice there are some null values (NaN), which we’ll handle in the next post. To see the column details, run:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890 entries, 0 to 889
Data columns (total 15 columns):
ID                   890 non-null object
IsSoftwareDev        890 non-null int64
Age                  777 non-null float64
Gender               777 non-null object
BootcampName         890 non-null object
BootcampFinish       890 non-null int64
BootcampLoan         890 non-null int64
BootcampRecommend    890 non-null int64
HoursLearning        878 non-null float64
MonthsProgramming    890 non-null int64
CityPopulation       780 non-null object
CodeEvents           890 non-null int64
PodcastsListen       890 non-null int64
ResourcesUse         890 non-null int64
SchoolDegree         787 non-null object
dtypes: float64(2), int64(8), object(5)
memory usage: 104.4+ KB

There are 890 observations (entries) and 15 variables (columns) that exist in the dataset. We are also given the data type and number of non-null values for each variable.
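The non-null counts reported by info() imply that some columns have gaps. As a minimal sketch, here is one way to count the missing values directly. The small `sample` frame is a made-up stand-in for `df` so the snippet runs on its own; the same two lines should work on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "Age": [31.0, 27.0, np.nan, 30.0],
    "Gender": ["male", "male", "female", None],
    "BootcampFinish": [0, 1, 1, 1],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()

# The same information as a percentage of all rows
missing_pct = missing_counts * 100 / len(sample)

print(missing_counts)
print(missing_pct)
```

Swapping `sample` for `df` gives the per-column gaps we’ll need to deal with when cleaning the data.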

Let’s get a sense of what values these variables contain. Here are all of the different bootcamps that survey respondents attended:

df.BootcampName.unique()
array(['General Assembly', 'Flatiron School', 'App Academy', 'Other',
       'The Iron Yard', 'Prime Digital Academy', 'Turing',
       'Hackbright Academy', 'Dev Bootcamp', 'Hack Reactor'], dtype=object)

There are 10 unique values in BootcampName. How many respondents fell into each of these?

df.groupby('BootcampName').size()
BootcampName
App Academy               21
Dev Bootcamp              48
Flatiron School           52
General Assembly          89
Hack Reactor              28
Hackbright Academy        21
Other                    535
Prime Digital Academy     30
The Iron Yard             39
Turing                    27
dtype: int64

Most were categorized as “Other”, with the second most attending “General Assembly” (89, or 10% of the total):

df.groupby('BootcampName').size() * 100 / len(df) # Percent of total
BootcampName
App Academy               2.359551
Dev Bootcamp              5.393258
Flatiron School           5.842697
General Assembly         10.000000
Hack Reactor              3.146067
Hackbright Academy        2.359551
Other                    60.112360
Prime Digital Academy     3.370787
The Iron Yard             4.382022
Turing                    3.033708
dtype: float64

Try running the same set of queries for the other categorical (object) variables, such as Gender, CityPopulation, and SchoolDegree.
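If you’d rather not repeat the queries by hand, a short loop can do it for several columns at once. This is only a sketch: the `sample` frame and its values (including the "No Degree" label) are invented for illustration, and value_counts is used in place of groupby(...).size() because it can also report missing values.

```python
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "Gender": ["male", "female", "male", None],
    "SchoolDegree": ["Bachelor's and Higher", "Bachelor's and Higher",
                     "No Degree", "Bachelor's and Higher"],
})

summaries = {}
for col in ["Gender", "SchoolDegree"]:
    # Count each category; dropna=False keeps missing values as their own row
    summary = sample[col].value_counts(dropna=False).to_frame("count")
    # Percent of total, as in the BootcampName example
    summary["percent"] = summary["count"] * 100 / len(sample)
    summaries[col] = summary
    print(summary, end="\n\n")
```

On the real data, replace `sample` with `df` and list whichever categorical columns you want to inspect.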

For all of the numerical (float64, int64) variables, we can quickly view the summary statistics using describe:

df.describe()
IsSoftwareDev Age BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming CodeEvents PodcastsListen ResourcesUse
count 890.000000 777.000000 890.000000 890.000000 890.000000 878.000000 890.000000 890.000000 890.000000 890.000000
mean 0.471910 31.104247 0.697753 0.334831 0.784270 24.781321 13.023596 1.884270 0.624719 3.489888
std 0.499491 7.860786 0.459490 0.472197 0.411559 20.147009 9.693881 1.613037 0.977046 1.977491
min 0.000000 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 26.000000 0.000000 0.000000 1.000000 10.000000 6.000000 1.000000 0.000000 2.000000
50% 0.000000 29.000000 1.000000 0.000000 1.000000 20.000000 11.000000 2.000000 0.000000 3.000000
75% 1.000000 34.000000 1.000000 1.000000 1.000000 40.000000 18.000000 3.000000 1.000000 5.000000
max 1.000000 60.000000 1.000000 1.000000 1.000000 100.000000 36.000000 9.000000 7.000000 12.000000

The average bootcamp attendee is 31 years old and dedicates about 25 hours a week to learning. About 70% finished the program, and about 47% became employed as software developers. To see the same statistics for the software developer group:

df[df['IsSoftwareDev'] == 1].describe()
IsSoftwareDev Age BootcampFinish BootcampLoan BootcampRecommend HoursLearning MonthsProgramming CodeEvents PodcastsListen ResourcesUse
count 420.0 368.000000 420.000000 420.000000 420.000000 409.000000 420.000000 420.000000 420.000000 420.000000
mean 1.0 29.739130 0.890476 0.309524 0.826190 17.320293 16.714286 2.052381 0.659524 3.500000
std 0.0 6.790151 0.312668 0.462849 0.379398 17.266725 9.742718 1.607595 0.967629 1.983825
min 1.0 15.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.0 25.750000 1.000000 0.000000 1.000000 5.000000 9.000000 1.000000 0.000000 2.000000
50% 1.0 28.000000 1.000000 0.000000 1.000000 10.000000 15.000000 2.000000 0.000000 3.000000
75% 1.0 32.000000 1.000000 1.000000 1.000000 20.000000 24.000000 3.000000 1.000000 5.000000
max 1.0 60.000000 1.000000 1.000000 1.000000 100.000000 36.000000 8.000000 5.000000 11.000000

Of those that became software developers, a higher percentage finished (89%) and recommend the bootcamp they attended, have been programming for longer, and have participated in a greater variety of coding events than the overall population. There’s a lot of good information here, but it would be much easier to glean insights if we pull these measures into a visual format.
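Rather than comparing two describe() tables by eye, we can put the group averages side by side with a groupby. The snippet below is a sketch on a tiny invented frame (`sample` and its values are assumptions, not the survey data), but the column names match the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "IsSoftwareDev": [1, 1, 0, 0, 1, 0],
    "BootcampFinish": [1, 1, 0, 1, 1, 0],
    "MonthsProgramming": [24, 36, 3, 6, 18, 9],
})

# One row per group (0 = not a developer, 1 = developer), one column per measure
group_means = (
    sample.groupby("IsSoftwareDev")[["BootcampFinish", "MonthsProgramming"]].mean()
)

print(group_means)
```

Because BootcampFinish is coded 0/1, its mean is simply the share of each group that finished.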

Seaborn

Matplotlib is the main library for plotting in Python, but just as Pandas is an easy-to-use interface built on top of NumPy, Seaborn is a high-level interface built on top of Matplotlib.

Unfortunately, Seaborn does not come pre-installed with Anaconda, but it can be added by entering “conda install seaborn” in a terminal window.

If you get prompted to update any dependent packages, enter “y” to continue. Once the install completes, let’s build our first discrete plot:

%matplotlib inline
import seaborn as sns

# Set up a grid to plot "software developer" probability by bootcamp
# (size= was renamed to height= in Seaborn 0.9)
g = sns.PairGrid(df, y_vars="IsSoftwareDev", x_vars="BootcampName", height=4, aspect=3.25)

# Draw a seaborn pointplot onto each Axes
g.map(sns.pointplot)
g.set(ylim=(0, 1))
sns.despine(fig=g.fig, left=True)


Clearly some bootcamps are much better at producing software developers than others. App Academy is at the higher end with around 70% of their attendees becoming software developers vs. Turing, which is hovering at 20%.
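Each point in a pointplot is the mean of IsSoftwareDev for that category, so the same rates can be tabulated directly, along with how many respondents stand behind each one. The sketch below uses a made-up `sample` frame, and the output column names `dev_rate` and `respondents` are my own labels.

```python
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "BootcampName": ["App Academy", "App Academy", "Turing",
                     "Turing", "Turing", "Other"],
    "IsSoftwareDev": [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column = share of developers; size = respondents per bootcamp
dev_rate = (
    sample.groupby("BootcampName")["IsSoftwareDev"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "dev_rate", "size": "respondents"})
    .sort_values("dev_rate", ascending=False)
)

print(dev_rate)
```

Keeping the respondent count next to each rate is a useful sanity check, since a rate based on a couple of dozen people is much noisier than one based on several hundred.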

We can also look at discrete plots side-by-side:

# size= was renamed to height= in Seaborn 0.9
g1 = sns.PairGrid(df, y_vars="IsSoftwareDev", x_vars=["Gender","SchoolDegree","CityPopulation"], height=4, aspect=1)
g2 = sns.PairGrid(df, y_vars="IsSoftwareDev", x_vars=["BootcampFinish","BootcampLoan","BootcampRecommend"], height=4, aspect=1)

g1.map(sns.pointplot)
g1.set(ylim=(0, 1))
sns.despine(fig=g1.fig, left=True)  # despine the first grid's own figure

g2.map(sns.pointplot)
g2.set(ylim=(0, 1))
sns.despine(fig=g2.fig, left=True)


The difference between genders is fairly flat, but having a Bachelor’s degree or higher, as well as living in a city with more than 100k people, carries a higher incidence of software developers. Those who finished and recommend the bootcamp program they attended are also more likely to be developers.

Let’s dig a bit deeper into Gender and also pull in Age. Faceted histograms are a great way to visualize distributions by different variable combinations:

import numpy as np
import matplotlib.pyplot as plt
sns.set(style="darkgrid")

g = sns.FacetGrid(df, row="Gender", col="IsSoftwareDev", margin_titles=True)
bins = np.linspace(0, 60, 13)
g.map(plt.hist, "Age", color="steelblue", bins=bins, lw=0)


Even though the percent of software developers by gender is about the same, more males than females attended bootcamps. For both genders, we see a spike at the 25-30 age range in the developer group, but less so in the non-developer group, which is more evenly distributed.
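To put numbers behind what the histograms show, a cross-tabulation gives both the raw counts and the within-gender shares. This is a sketch on invented rows (`sample` is a stand-in for `df`), chosen so the two genders have the same developer share but different head counts.

```python
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "Gender": ["male", "male", "male", "female", "female", "male"],
    "IsSoftwareDev": [1, 0, 1, 1, 0, 0],
})

# Rows: gender, columns: developer flag; margins=True adds "All" totals
counts = pd.crosstab(sample["Gender"], sample["IsSoftwareDev"], margins=True)

# Share of developers within each gender (each row sums to 1)
shares = pd.crosstab(sample["Gender"], sample["IsSoftwareDev"], normalize="index")

print(counts)
print(shares)
```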

Another awesome and easy-to-use Seaborn function is jointplot. With kind='reg', it draws a scatterplot of two variables with a fitted linear regression line, along with the marginal distribution of each. Let’s take a look at Age and MonthsProgramming:

sns.jointplot(x="Age", y="MonthsProgramming", data=df, kind='reg');


The relationship between Age and MonthsProgramming is very weak (r = 0.055). On its own this isn’t very informative, but it may tell us more if we separate the developers and non-developers:

sns.lmplot(x="Age", y="MonthsProgramming", data=df, col="IsSoftwareDev");


The relationship is slightly stronger for developers (r = 0.13; re-run the first jointplot for the developer group to verify). Regardless of age, developers have been programming for a greater number of months. We can overlay these plots to see the difference more clearly:
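The correlation coefficients can also be checked without a plot, since Pandas computes Pearson’s r directly. Here is a minimal sketch on invented numbers (the `sample` frame is an assumption, so its r values will not match the ones quoted above); run the same lines on `df` to verify the real figures.

```python
import pandas as pd

# Hypothetical stand-in for the bootcamp DataFrame (values are made up)
sample = pd.DataFrame({
    "Age": [22, 25, 28, 31, 35, 40, 24, 29],
    "MonthsProgramming": [6, 12, 18, 24, 18, 30, 3, 8],
    "IsSoftwareDev": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pearson correlation across all respondents (rows with NaN are skipped)
overall_r = sample["Age"].corr(sample["MonthsProgramming"])

# The same correlation computed separately for each group
r_by_group = {
    flag: group["Age"].corr(group["MonthsProgramming"])
    for flag, group in sample.groupby("IsSoftwareDev")
}

print(overall_r)
print(r_by_group)
```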

sns.lmplot(x="Age", y="MonthsProgramming", hue="IsSoftwareDev", data=df)


Wrap up

We’ve uncovered some interesting insights and identified a handful of predictive variables to include in our model. If you want to try out some more visualizations, check out the documentation for Matplotlib and Seaborn below. In the next post we’ll munge (clean and shape) our data in order to make it machine learning ready.

Additional Reading

Pandas Documentation
Matplotlib Documentation
Seaborn Documentation