Exploring IBM HR data using Python

Intro

This tutorial series explores the IBM HR data set. This data set is typically used to demonstrate various machine learning algorithms applied to HR data.

In this series, I'll use it to demonstrate the awesome power Python can bring to HR data.

Sections

  • Statistics
  • Matplotlib
  • Pandas
  • Seaborn
  • Plotly
  • Findings
In [1]:
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
In [2]:
# imports 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [3]:
# if continuing on from the previous section, read the data from saved file

# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
In [4]:
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)

# save data for later
# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

seaborn

In this section, we'll continue with visualizations using the seaborn library.

Seaborn aims to use sensible defaults for style and color choices. Like the pandas .plot methods, Seaborn builds on Matplotlib, which is where the plotting actually happens. Seaborn makes that plotting easier and more effective.

We'll begin our analysis in this section by looking at Education.

In [5]:
# matplotlib.pyplot

# explicitly view default of matplotlib
plt.style.use('default') 

# plot Education count
plt.bar(sorted(empl_data['Education'].unique()),empl_data.groupby('Education')['EmployeeCount'].count())
Out[5]:
<BarContainer object of 5 artists>
In [6]:
# seaborn

# explicitly view default of seaborn
sns.set()

# plot Education count
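# name the column explicitly; recent seaborn versions treat a lone positional argument as `data`
sns.countplot(x='Education', data=empl_data)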
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a36c02e8>

Seaborn provides sensible defaults that improve the readability of the visualizations.
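
Because Seaborn draws through Matplotlib, the object a plotting call returns is an ordinary Matplotlib Axes that can be customized further. The following is a minimal sketch of that idea, not part of the original notebook: the `toy` frame is made-up stand-in data, and the `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the Education column (not the real data)
toy = pd.DataFrame({"Education": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

sns.set()  # apply seaborn's default theme
ax = sns.countplot(x="Education", data=toy)

# The returned object is a regular Matplotlib Axes, so the usual methods apply
ax.set_title("Employees by education level")
ax.set_ylabel("Employees")
```

The same pattern should apply to the other axes-level Seaborn functions used in this section.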

In [7]:
empl_data.groupby('Education')['EmployeeCount'].count()
Out[7]:
Education
1    170
2    282
3    572
4    398
5     48
Name: EmployeeCount, dtype: int64
In [8]:
sorted(empl_data['Education'].unique())
Out[8]:
[1, 2, 3, 4, 5]
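
The Matplotlib bar chart in In [5] takes its x positions and its bar heights from two separate expressions (the ones shown in In [8] and In [7]). One possible simplification, sketched here on a column rebuilt from the counts in Out[7] rather than on the real file, is to let a single `value_counts().sort_index()` call supply both.

```python
import numpy as np
import pandas as pd

# Rebuild an Education column from the counts shown in Out[7]
levels = [1, 2, 3, 4, 5]
counts = [170, 282, 572, 398, 48]
education = pd.Series(np.repeat(levels, counts), name="Education")

# One call yields both the x positions (index) and the bar heights (values),
# so the two can never fall out of alignment
ed_counts = education.value_counts().sort_index()
print(ed_counts)

# In the notebook this could feed the bar chart directly:
# plt.bar(ed_counts.index, ed_counts.values)
```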

Education appears to be encoded as numeric levels. We'll take a guess at what these values represent and store the labels in a dictionary.

In [9]:
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree', 4: 'Masters Degree', 5: 'PhD'}
In [10]:
ed_ranking = empl_data['Education'].value_counts()
ed_ranking.rename(index=ed_level_desc)
Out[10]:
Bachelors Degree       572
Masters Degree         398
High School Diploma    282
GED                    170
PhD                     48
Name: Education, dtype: int64
In [11]:
# one more time, as percentages
ed_ranking = round(empl_data['Education'].value_counts(normalize=True)*100,0)
ed_ranking.rename(index=ed_level_desc)
Out[11]:
Bachelors Degree       39.0
Masters Degree         27.0
High School Diploma    19.0
GED                    12.0
PhD                     3.0
Name: Education, dtype: float64

Just 3% of the employees in this dataset have a PhD (assuming that a '5' in Education translates to 'PhD').
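
Note that `value_counts` orders the result by frequency, which is why the labels above are not in education order. As a small sketch under the same labeling assumption, the percentages can also be computed from the counts in Out[7] with one decimal place and kept in ordinal order.

```python
import pandas as pd

# Counts per education level, copied from Out[7]
ed_counts = pd.Series({1: 170, 2: 282, 3: 572, 4: 398, 5: 48}, name="Education")

# The notebook's guessed labels (an assumption, as noted in the text)
ed_level_desc = {1: 'GED', 2: 'High School Diploma', 3: 'Bachelors Degree',
                 4: 'Masters Degree', 5: 'PhD'}

# Share of each level in percent, rounded to one decimal, in ordinal order
ed_pct = (ed_counts / ed_counts.sum() * 100).round(1).rename(index=ed_level_desc)
print(ed_pct)
```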

In [12]:
# explore education against Job Level
ed_job = empl_data.pivot_table(values='Age',index='Education', columns='JobLevel', aggfunc='count')
ed_job
Out[12]:
JobLevel     1    2   3   4   5
Education
1           89   47  20   8   6
2           94  125  33  17  13
3          231  171  98  44  28
4          121  171  58  28  20
5            8   20   9   9   2
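
The pivot table in In [12] counts the 'Age' column only as a convenient, always-populated column; the ages themselves are not used. A table of counts like this can also be built with `pd.crosstab`, which the later cells use. The sketch below compares the two on a small made-up frame (not the real data).

```python
import pandas as pd

# Made-up rows standing in for the real data; every Education/JobLevel pair occurs
toy = pd.DataFrame({
    "Age":       [25, 31, 40, 28, 52, 36, 45, 29],
    "Education": [1, 1, 1, 2, 2, 2, 1, 2],
    "JobLevel":  [1, 2, 1, 1, 2, 2, 1, 2],
})

# Count a populated column per cell, as in In [12]
by_pivot = toy.pivot_table(values="Age", index="Education",
                           columns="JobLevel", aggfunc="count")

# crosstab counts co-occurrences directly, with no helper column needed
by_xtab = pd.crosstab(toy["Education"], toy["JobLevel"])

print(by_pivot)
print(by_xtab)
```

One difference to keep in mind: if a combination never occurs, `pivot_table` leaves that cell as NaN, while `crosstab` reports 0.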

We've now calculated enough data that it takes us, as humans, time and effort to process this 5x5 grid. Let's visualize it with Seaborn to make it easier to understand.

In [13]:
sns.heatmap(ed_job)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a38ba940>

In this colormap, darker colors represent lower values and lighter colors represent higher values. The near-white square contains the highest value. We can improve and customize this further to our liking.

In [14]:
plt.figure(figsize=(9,4))

# add annotations, border, and change colormap
# `linewidths` is the heatmap parameter for the border between cells
sns.heatmap(ed_job, annot=True, linewidths=0.4, fmt='d', cmap='YlOrRd')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3cb5a90>

The greatest number of associates (231) have an education level of 3 and a job level of 1. The annotations also make it easy to see that most employees are concentrated in education levels 2 to 4 and job levels 1 and 2.
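
The largest cell can also be confirmed without reading the plot. This is a small sketch that re-enters the counts from Out[12] by hand instead of loading the file; `unstack()` flattens the grid into one long Series so that `idxmax` can report the position of the maximum.

```python
import pandas as pd

# The counts from Out[12]: rows are Education 1-5, columns are JobLevel 1-5
ed_job = pd.DataFrame(
    [[89, 47, 20, 8, 6],
     [94, 125, 33, 17, 13],
     [231, 171, 98, 44, 28],
     [121, 171, 58, 28, 20],
     [8, 20, 9, 9, 2]],
    index=pd.Index([1, 2, 3, 4, 5], name="Education"),
    columns=pd.Index([1, 2, 3, 4, 5], name="JobLevel"),
)

# unstack() gives a Series indexed by (JobLevel, Education)
cells = ed_job.unstack()
job_level, education = cells.idxmax()
largest = cells.max()
print(f"Education {education}, JobLevel {job_level}: {largest} employees")
```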

In [15]:
# by Job Role
plt.figure(figsize=(9,4))
role_ed_xtab = pd.crosstab(empl_data['JobRole'], empl_data['Education'], normalize='index')
sns.heatmap(role_ed_xtab, annot=True, fmt='0.0%', cmap='YlOrRd')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3cc8588>

Level 3 is clearly the dominant education level across all job roles, but the figure offers some further insights:

  • Healthcare Representatives and Sales Executives have the highest frequencies of level 4 education.
  • Sales Representatives skew lowest on the scale, with the highest concentration of level 1 and 0% at level 5.
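
The percentages in this heatmap come from `normalize='index'`, which divides each row of the crosstab by that row's total. The sketch below illustrates this on a few made-up rows (not the real data), borrowing two role names from the findings above.

```python
import pandas as pd

# Made-up rows (not the real data); role names are taken from the text above
toy = pd.DataFrame({
    "JobRole":   ["Sales Representative"] * 4 + ["Sales Executive"] * 5,
    "Education": [1, 1, 2, 3, 2, 3, 3, 4, 4],
})

# normalize='index' divides each row by its own total
shares = pd.crosstab(toy["JobRole"], toy["Education"], normalize="index")
print(shares.round(2))

# Each role's shares add up to 1, so percentages read across a row
row_totals = shares.sum(axis=1)
print(row_totals)
```

Since every row is scaled separately, a high percentage in a small job role may still represent only a handful of people.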

Finally, we'll look at what the employees studied, by Job Role.

In [16]:
plt.figure(figsize=(9,4))
# let's try blue
role_field_xtab = pd.crosstab(empl_data['JobRole'], empl_data['EducationField'], normalize='index')
sns.heatmap(role_field_xtab, annot=True, fmt='0.0%', cmap='Blues')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3e79550>
  • Much like Education level, there is a dominant education field - Life Sciences.
  • Those who studied HR landed in HR.
  • Marketing majors landed in Sales roles.
In [17]:
# does education level determine hourly rate?

sns.boxplot(x='Education', y='HourlyRate', data=empl_data)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3e61c88>

Possibly, but judging from this plot the effect does not appear significant. Level 5 has a higher median, while the other four levels appear similar.
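
To back up a visual reading like this with numbers, the quartiles that a boxplot draws can be computed with a `groupby`. The following is only a sketch on randomly generated rates, so its values say nothing about the real data; in the notebook, `toy` would be replaced by `empl_data`.

```python
import numpy as np
import pandas as pd

# Made-up hourly rates (not the real data), 40 per education level
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Education": np.repeat([1, 2, 3, 4, 5], 40),
    "HourlyRate": rng.integers(30, 101, size=200),
})

# The median is the line inside each box; the quartiles are the box edges
summary = (toy.groupby("Education")["HourlyRate"]
              .quantile([0.25, 0.5, 0.75])
              .unstack())
print(summary)
```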

In [18]:
# include gender

# default Seaborn styling
sns.set_style('darkgrid')

sns.boxplot(x='Education', y='HourlyRate', data=empl_data, hue='Gender')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
Out[18]:
<matplotlib.legend.Legend at 0x1a2a3e1fd30>

Seaborn's sensible defaults make visualizing data easier. The plots generated so far are aesthetically pleasing, but bar plots and box plots are also available directly from Matplotlib. Seaborn additionally offers more advanced plot types, such as the heatmaps above. These are still drawn by Matplotlib, but Seaborn does the heavy lifting for us.

In [19]:
# stripplot
sns.stripplot(x='Education', y='HourlyRate', data=empl_data, jitter=True, hue='Gender', dodge=True)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
Out[19]:
<matplotlib.legend.Legend at 0x1a2a36c01d0>
In [20]:
# violinplots
sns.violinplot(x='Education', y='HourlyRate', data=empl_data)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3575e48>
In [21]:
# swarmplots
sns.swarmplot(x='Education', y='HourlyRate', data=empl_data)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a40407b8>
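
To compare these three plot types directly, they can be drawn side by side in one figure by passing each call its own Axes through the `ax` parameter. This is a minimal sketch on made-up data (not the real file); the `Agg` backend line is only there so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up hourly rates (not the real data), 12 per education level
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Education": np.repeat([1, 2, 3, 4, 5], 12),
    "HourlyRate": rng.integers(30, 101, size=60),
})

# One Matplotlib figure with three Axes; each seaborn call draws into its own
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.stripplot(x="Education", y="HourlyRate", data=toy, ax=axes[0])
sns.violinplot(x="Education", y="HourlyRate", data=toy, ax=axes[1])
sns.swarmplot(x="Education", y="HourlyRate", data=toy, ax=axes[2])

for ax, title in zip(axes, ["stripplot", "violinplot", "swarmplot"]):
    ax.set_title(title)
```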

Finally, pair plots: a great, but computationally expensive, way to see all of your data at once.

In [23]:
sns.pairplot(empl_data)
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1a2a40b2f98>