Exploring IBM HR data using Python

Intro

This tutorial series explores the IBM HR data set. This data is typically used to demonstrate the ability of various machine learning algorithms applied to HR data.

In this series, I'll use it to demonstrate the awesome power Python can bring to HR data

Sections

  • Statistics
  • Matplotlib
  • Pandas
  • Seaborn
  • Plotly
  • Findings
In [1]:
__author__ = "adam"
__version__ = "1.0.0"
__maintainer__ = "adam"
__email__ = "adam@datapluspeople.com"
In [2]:
# imports 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [3]:
# read the data directly from IBM Watson Analytics
# using pandas read excel file into dataframe
url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
empl_data = pd.read_excel(url)
In [4]:
# alternatively, save the file for repeated use
# we'll reference the saved file in the future portions of this analysis

# empl_data.to_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")
# empl_data = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

pandas

pandas provides a number of built-in methods allowing us to quickly and easily explore our data. When reading our data, we stored it in a DataFrame. If that it is a new term, for now think of it sort like an Excel sheet - but way better.

Let's see what we've got...

In [5]:
# view the first 5 rows
empl_data.head()
Out[5]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

In [6]:
# how many rows, columns are in the DataFrame?
empl_data.shape
Out[6]:
(1470, 35)
In [7]:
# how many different data points do we have?
empl_data.size
Out[7]:
51450
In [8]:
# what are the names of all the columns?
empl_data.columns
Out[8]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [9]:
# info about the dataframe
empl_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EmployeeNumber              1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvement              1470 non-null int64
JobLevel                    1470 non-null int64
JobRole                     1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome               1470 non-null int64
MonthlyRate                 1470 non-null int64
NumCompaniesWorked          1470 non-null int64
Over18                      1470 non-null object
OverTime                    1470 non-null object
PercentSalaryHike           1470 non-null int64
PerformanceRating           1470 non-null int64
RelationshipSatisfaction    1470 non-null int64
StandardHours               1470 non-null int64
StockOptionLevel            1470 non-null int64
TotalWorkingYears           1470 non-null int64
TrainingTimesLastYear       1470 non-null int64
WorkLifeBalance             1470 non-null int64
YearsAtCompany              1470 non-null int64
YearsInCurrentRole          1470 non-null int64
YearsSinceLastPromotion     1470 non-null int64
YearsWithCurrManager        1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.0+ KB

Python performs it's work 'in-memory' meaning your computer is holding all of this data in memory. HR data isn't typically 'big data', but it's import to have an understanind of the resource demands of your data analysis.

In [10]:
empl_data.memory_usage()
Out[10]:
Index                          80
Age                         11760
Attrition                   11760
BusinessTravel              11760
DailyRate                   11760
Department                  11760
DistanceFromHome            11760
Education                   11760
EducationField              11760
EmployeeCount               11760
EmployeeNumber              11760
EnvironmentSatisfaction     11760
Gender                      11760
HourlyRate                  11760
JobInvolvement              11760
JobLevel                    11760
JobRole                     11760
JobSatisfaction             11760
MaritalStatus               11760
MonthlyIncome               11760
MonthlyRate                 11760
NumCompaniesWorked          11760
Over18                      11760
OverTime                    11760
PercentSalaryHike           11760
PerformanceRating           11760
RelationshipSatisfaction    11760
StandardHours               11760
StockOptionLevel            11760
TotalWorkingYears           11760
TrainingTimesLastYear       11760
WorkLifeBalance             11760
YearsAtCompany              11760
YearsInCurrentRole          11760
YearsSinceLastPromotion     11760
YearsWithCurrManager        11760
dtype: int64

the memory_usage method provides us with the memory used by each column in our new DataFrame. To get the entire amount of consumed memory, we can just add each item together.

In [11]:
memory_used = empl_data.memory_usage().sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
The employee DataFrame is using 411680 bytes in memory.

We are using approximately 411kb of system memory to store the DataFrame. Scroll back up to the output of the .info() method. This actually contained the memory usage as well, though the value is slightly less. Why? The memory_usage method includes the memory used by the index of the DataFrame.

But there's still more, and to enable a more accurate summary, we'll pass an optional item that accounts for the full usage of contained objects.

In [12]:
memory_used = empl_data.memory_usage(deep=True).sum()
print(f'The employee DataFrame is using {memory_used} bytes in memory.')
The employee DataFrame is using 1177169 bytes in memory.

Nearly 3x the original result. You should have no problem holding this amount, but with much larger data sets and complex analysis, you may want to be aware of memory usage when processing on a local computer.

Summary Statistics

Enough with the background info, let's see the data!

In [13]:
# statistics about the DataFrame's numerical fields
empl_data.describe()
Out[13]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 ... 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 802.485714 9.192517 2.912925 1.0 1024.865306 2.721769 65.891156 2.729932 2.063946 ... 2.712245 80.0 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.135373 403.509100 8.106864 1.024165 0.0 602.024335 1.093082 20.329428 0.711561 1.106940 ... 1.081209 0.0 0.852077 7.780782 1.289271 0.706476 6.126525 3.623137 3.222430 3.568136
min 18.000000 102.000000 1.000000 1.000000 1.0 1.000000 1.000000 30.000000 1.000000 1.000000 ... 1.000000 80.0 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 1.0 491.250000 2.000000 48.000000 2.000000 1.000000 ... 2.000000 80.0 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 1.0 1020.500000 3.000000 66.000000 3.000000 2.000000 ... 3.000000 80.0 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 1.0 1555.750000 4.000000 83.750000 3.000000 3.000000 ... 4.000000 80.0 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 1.0 2068.000000 4.000000 100.000000 4.000000 5.000000 ... 4.000000 80.0 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000

8 rows × 26 columns

Describe allows us to see descriptive statistics for the numerical Series in the DataFrame Note that non-numeric fields, such as 'Department' are not included in the results.

Very helpful, but I don't want to scroll horizontally. Let's fix that.

In [14]:
# transpose the DataFrame
empl_data.describe().transpose()
Out[14]:
count mean std min 25% 50% 75% max
Age 1470.0 36.923810 9.135373 18.0 30.00 36.0 43.00 60.0
DailyRate 1470.0 802.485714 403.509100 102.0 465.00 802.0 1157.00 1499.0
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.00 7.0 14.00 29.0
Education 1470.0 2.912925 1.024165 1.0 2.00 3.0 4.00 5.0
EmployeeCount 1470.0 1.000000 0.000000 1.0 1.00 1.0 1.00 1.0
EmployeeNumber 1470.0 1024.865306 602.024335 1.0 491.25 1020.5 1555.75 2068.0
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.00 3.0 4.00 4.0
HourlyRate 1470.0 65.891156 20.329428 30.0 48.00 66.0 83.75 100.0
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.00 3.0 3.00 4.0
JobLevel 1470.0 2.063946 1.106940 1.0 1.00 2.0 3.00 5.0
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.00 3.0 4.00 4.0
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.00 4919.0 8379.00 19999.0
MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.00 14235.5 20461.50 26999.0
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.00 2.0 4.00 9.0
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.00 14.0 18.00 25.0
PerformanceRating 1470.0 3.153741 0.360824 3.0 3.00 3.0 3.00 4.0
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.00 3.0 4.00 4.0
StandardHours 1470.0 80.000000 0.000000 80.0 80.00 80.0 80.00 80.0
StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.00 1.0 1.00 3.0
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.00 10.0 15.00 40.0
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.00 3.0 3.00 6.0
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.00 3.0 3.00 4.0
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.00 5.0 9.00 40.0
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.00 3.0 7.00 18.0
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.00 1.0 3.00 15.0
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.00 3.0 7.00 17.0

Transposing the output of describe allows us to view without scrolling. This works well for this DataFrames with many columns.

Working with DataFrames

We've managed to get the data in to the DataFrame, but how do we now get data out of the DataFrame?

We don't always want the full DataFrame, we may want to select items that only meet certain criteria.

Let's have a look at how to do just that.

In [15]:
# select all the Training counts from last year
empl_data['TrainingTimesLastYear']
Out[15]:
0       0
1       3
2       3
3       3
4       3
5       2
6       3
7       2
8       2
9       3
10      5
11      3
12      1
13      2
14      4
15      1
16      5
17      2
18      3
19      3
20      5
21      4
22      4
23      6
24      2
25      3
26      5
27      2
28      4
29      2
       ..
1440    3
1441    2
1442    3
1443    2
1444    4
1445    3
1446    2
1447    4
1448    5
1449    4
1450    2
1451    1
1452    3
1453    2
1454    3
1455    2
1456    2
1457    2
1458    5
1459    2
1460    3
1461    3
1462    2
1463    2
1464    2
1465    3
1466    5
1467    0
1468    3
1469    3
Name: TrainingTimesLastYear, Length: 1470, dtype: int64

The selection returned to us a Series containing all of the Training data. The '..' applied by the Jupyter Notebook condensed the output so we didn't have to see all 1,469 rows.

We'll now store this Series in a new 'training' variable and generate some statistics to help us understand the Training in the organization.

In [16]:
# store the result
trainings = empl_data['TrainingTimesLastYear']

As the result is now stored in memory, it is not returned to the screen. To view it, we'll have to explicitly ask for it.

In [17]:
# view the training variable contents
trainings
Out[17]:
0       0
1       3
2       3
3       3
4       3
5       2
6       3
7       2
8       2
9       3
10      5
11      3
12      1
13      2
14      4
15      1
16      5
17      2
18      3
19      3
20      5
21      4
22      4
23      6
24      2
25      3
26      5
27      2
28      4
29      2
       ..
1440    3
1441    2
1442    3
1443    2
1444    4
1445    3
1446    2
1447    4
1448    5
1449    4
1450    2
1451    1
1452    3
1453    2
1454    3
1455    2
1456    2
1457    2
1458    5
1459    2
1460    3
1461    3
1462    2
1463    2
1464    2
1465    3
1466    5
1467    0
1468    3
1469    3
Name: TrainingTimesLastYear, Length: 1470, dtype: int64

Let's now use this to answer the following questions:

  • How many total trainings took place last year?
  • What is the average number of trainings employees received?
  • What is the maximum number of trainings received?
  • What is the minimum number of trainings received?
  • For each number of trainings, how many employees are in each group?
  • How many employees received the minimum number of trainings?
In [18]:
# total number of trainings
trainings.sum()
Out[18]:
4115
In [19]:
# total number of employees
trainings.count()
Out[19]:
1470
In [20]:
# average number of trainings
trainings.mean()
Out[20]:
2.7993197278911564
In [21]:
# maximum number of trainings
trainings.max()
Out[21]:
6
In [22]:
# minimum number of trainings
trainings.min()
Out[22]:
0
In [23]:
# number of employees in each group
trainings.value_counts()
Out[23]:
2    547
3    491
4    123
5    119
1     71
6     65
0     54
Name: TrainingTimesLastYear, dtype: int64
In [24]:
# number of employees not receiving training last year
trainings.value_counts()[0]
Out[24]:
54
In [25]:
# percentage of employees not receiving training
print('{:.0f}% of employees did not receive training last year.'.format(((trainings.value_counts()[0])/trainings.count())*100))
4% of employees did not receive training last year.

Section Recap

In this section we covered a lot of ground after just a few simple steps:

  • Read in a dataset directly from a website
  • Explored the dataset for understanding of it's contents
  • Analyzed the amount of memory consumed by this dataset
  • Generated Summary Statistics of all the numerical fields
  • Extracted the training column, returning a Series
  • Evaluated training activity within the organization
  • Discovered our first insight - 4% haven't received any training!