GENERATING FATA

Fake Data

Intro

IBM Watson Data

For many of the early workbooks here, we've stood on the shoulders of others. We simply imported a dataset that was created for the Watson HR Analytics work.

In [125]:
# imports
import pandas as pd

# updated 2019-08-13
# IBM has removed the file from their server

# deprecated code
# read the file 
# url = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
# empl_data = pd.read_excel(url)

# read local file for demonstration
file = 'Dropbox/WFA/data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'
empl_data = pd.read_excel(file)
empl_data.head()
Out[125]:
Unnamed: 0 Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 ... 1 80 0 8 0 1 6 4 0 5
1 1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 ... 4 80 1 10 3 3 10 7 1 7
2 2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 ... 2 80 0 7 3 3 0 0 0 0
3 3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 ... 3 80 0 8 3 3 8 7 3 0
4 4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 36 columns

This is great to get us started, and gave us a dataset that many others had used - in blog posts, Kaggle competitions, and otherwise. Now, we're ready for more and would like to generate our own dataset for continued development and exploration.

pydbgen

Random Database/Dataframe Generator

github: https://github.com/tirthajyoti/pydbgen read the docs: https://pydbgen.readthedocs.io/en/latest/

From the pydbgen documentation: "Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?

After all, databases break every now and then and it is safest to practice with a randomly generated one :-)"

That sums it up very well - we need data to practice on, and in a safe way. Especially when we're dealing with PII and sensitive data, as we regularly are in HR. Handling this data is so commonplace that some, unfortunately, become desensitized to its sensitive nature and requirements, and make a blunder such as posting it to a public S3 bucket - a simple, but disastrous mistake.

Generating our own fake data protects us from ourselves. pydbgen allows us to do this very quickly, and generates very realistic data.

Installing pydbgen

As of this writing in August 2019, pydbgen is not available on conda (my preferred installation method).

On both Windows and Linux, use pip:

pip install pydbgen
In [126]:
# load pydbgen

import pydbgen
from pydbgen import pydbgen
In [127]:
db = pydbgen.pydb()

df = db.gen_dataframe(num=100, fields=['name', 'street_address', 'city', 'state', 'zipcode', 'country', 'company', 'job_title', 'phone', 'ssn', 'email', 'month', 'year', 'weekday', 'date', 'time', 'latitude', 'longitude', 'license_plate'], )
In [128]:
df.head()
Out[128]:
name street_address city state zipcode country company job_title phone-number ssn email month year weekday date time latitude longitude license-plate
0 William Anderson 7953 Mandy Turnpike Swain Texas 78280 Myanmar Brown-Vasquez Scientific laboratory technician 352-368-1239 228-58-3135 WAnderson@datapluspeople.com None 1999 Monday 2002-11-17 10:39:23 -6.9832325 -30.181752 BIH-274
1 Dawn Molina 554 Heather Turnpike Apt. 311 Pepin Oklahoma 75571 Malaysia Martinez, Thomas and Henry Chartered accountant 245-361-8447 252-39-2457 Dawn.Molina@datapluspeople.com None 1990 Friday 2015-10-26 01:48:58 41.9638895 -33.070358 EYV-268
2 Timothy Alexander 6693 Donald Plain Moore Delaware 94146 Chad Diaz-Bruce Camera operator 701-463-6626 602-26-0601 Alexander_Timothy67@datapluspeople.com None 1991 Saturday 2009-12-31 01:51:05 -46.888624 -32.441572 AAN-6293
3 Bradley Walter 24543 Adams Fort Sydney Indiana 33266 Ecuador Jackson-Lang Company secretary 420-550-7054 563-67-3139 BradleyWalter94@datapluspeople.com None 1970 Thursday 1979-02-10 22:24:59 -7.668391 -166.274743 8QSM719
4 Daniel Allen 9189 Cynthia Ramp Noblestown Kentucky 76651 France Glass PLC Biochemist, clinical 538-078-0566 533-98-1206 Daniel.Allen@datapluspeople.com None 1978 Sunday 1978-06-02 15:35:34 -24.511857 -35.220806 CTZ-3918

pydbgen Summary

Overall, pydbgen is a great, quick way to generate any amount of data. The limitation is primarily the data types that are supported currently. The fields shown above in this example are the extent of fields available as of this writing. These are a great start and for certain situations, these are more than you need. A field such as License Plate is a nice inclusion.

The documentation is a little lacking. For example, the documentation (as of this writing) does not mention the 'Domains.txt' file required to generate email addresses. The documentation, however, does point us to Faker - which pydbgen builds upon to generate the fata (fake data). We'll explore Faker in the next section.
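As a workaround for the missing documentation, a 'Domains.txt' file is just a plain text list of email domains, one per line. The sketch below writes one out - note that the exact location pydbgen expects (the working directory vs. the package's install directory) may vary by version, so this is an assumption to verify against your installation.

```python
# Hypothetical Domains.txt for pydbgen's email generation:
# one domain per line. The domain names here are placeholders.
domains = ["datapluspeople.com", "example.com", "example.org"]

with open("Domains.txt", "w") as f:
    f.write("\n".join(domains))

# confirm the file round-trips as expected
with open("Domains.txt") as f:
    print(f.read().splitlines())
```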

Faker

In [129]:
from faker import Faker
In [130]:
fake = Faker()
In [131]:
fake.name()
Out[131]:
'Michael Morris'
In [132]:
fake.address()
Out[132]:
'6928 Richard Fort Suite 784\nEast Nicole, SC 52141'
In [ ]:
# Tip: type fake. and press Tab in the notebook to list
# all of the generator methods (providers) available
In [ ]:
fake_df = pd.DataFrame(columns = ['name', 'ssn'])
In [ ]:
name_list = []
ssn_list = []
dob_list = []
address_list = []
city_list = []
state_list = []
country_list = []
postal_list = []
id_list = []
email_list = []
username_list = []


for i in range(1000):
    name_list.append(fake.name())
    ssn_list.append(fake.ssn())
    dob_list.append(fake.date_of_birth())
    address_list.append(fake.street_address())
    city_list.append(fake.city())
    state_list.append(fake.state_abbr())
    country_list.append(fake.country_code())
    postal_list.append(fake.postalcode())
    email_list.append(fake.email())
    id_list.append(fake.random_int())
    username_list.append(fake.user_name())
    
    
    
fake_df['name'] = name_list
fake_df['ssn'] = ssn_list
fake_df['dob'] = dob_list
fake_df['address'] = address_list
fake_df['city'] = city_list
fake_df['state'] = state_list
fake_df['country'] = country_list
fake_df['postal'] = postal_list
fake_df['id'] = id_list
fake_df['email'] = email_list
fake_df['username'] = username_list
In [ ]:
fake_df

Customizing with Faker

Faker allows for the creation of your own providers.

In [ ]:
from faker.providers import BaseProvider
import random

# create the provider. The class name for Faker must be 'Provider'
class Provider(BaseProvider):
    def gender(self):
        num = random.randint(0,1)
        if num == 0:
            return 'Male'
        else:
            return 'Female'
In [ ]:
fake.add_provider(Provider)
In [ ]:
fake.gender()
In [ ]:
# add this to the DataFrame
gender_list = []

for i in range(1000):
    gender_list.append(fake.gender())

fake_df['gender'] = gender_list

fake_df['gender'].head()
In [ ]:
fake_df.info()
In [ ]:
# convert gender column to category
fake_df['gender'] = fake_df['gender'].astype('category')

fake_df.info()
In [ ]:
fake_df['gender'].head()
In [ ]:
#export the file
fake_df.to_csv('~/Downloads/FATA.csv')

Conclusion

This tutorial showed a few ways in which we can generate fake data - FATA - to allow us to continue to explore and analyze HR data. You could also use these techniques to anonymize your real HR data, letting you include names, SSNs, and the like, all without compromising one of the most fundamental parts of working with HR data - privacy and respect for people's information.