Happy New Year, all! I cannot believe it is already January 2018!
This post is the second part of my series on “Working with Incomplete Data,” and I’ll be going through another computational method for filling in the gaps in our data. This time around I will be talking about using machine learning to impute individuals’ gender (male or female) using only their first names. Going forward, I assume that you have some basic Python knowledge and can run Python programs on your own machine.
** As a sociologist, I fully acknowledge and recognize the socially constructed nature of “gender.” In using these methods, we should recognize that we are making very significant assumptions about gender that may or may not be true to “reality.” That said, names are often perceived as being intrinsically linked to either side of the male/female binary. Using these methods to account for gender differences may be useful, especially in situations where perceptions about individuals’ gender category based on their names may influence outcomes, as seen in many audit studies. **
These algorithms use probabilistic methods to determine how likely an individual is to be male or female based on their first name. Much like Part 1 of this series, this method is ideal for scenarios in which you lack demographic data for the individuals in your study.
Unlike last time, the algorithms for accomplishing this task are all open-source (yay!), meaning anyone can access and use them. While I use Python here, I know similar algorithms/packages exist for other popular languages such as R.
First, you will need a list of names. This can be a simple Excel spreadsheet saved as a CSV (comma separated values) file. You can simply place the first names (e.g. Colin) in the first column of the spreadsheet (call this column “first”) and the last names (e.g. Burke) in the second column (call this column “last”).
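If you would rather build the file in Python than in Excel, here is a minimal sketch using the standard library's csv module. The filename blognames.csv matches the one loaded in the program later in this post; "Colin Burke" is the example from above, and the second row is just a placeholder I made up.

```python
import csv

# Example rows: the first is the example from the post,
# the second is a made-up placeholder.
rows = [
    {"first": "Colin", "last": "Burke"},
    {"first": "Jane", "last": "Doe"},
]

# Write a CSV with a header row naming the two columns.
with open("blognames.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["first", "last"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the layout.
with open("blognames.csv", newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

The header row matters: the column names are how the program picks out the first names later on.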
Next, you will download the files from Stephen Holiday’s GenderPredictor repository (two of these should end with “.py”). For the purposes of this post, we will use GenderPredictor, which relies on NLTK’s Naive Bayes classifier. It claims an accuracy of about 82% for American names, but this can vary depending on the names you are using. If you are using this for research or something fairly important, it is worth pointing out that several classifiers of this type are more robust and more accurate (easy to find with a quick search through GitHub).
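To give a rough sense of what a Naive Bayes classifier does with a name, here is a small pure-Python sketch. This is not GenderPredictor's code and does not use NLTK: the single feature (the name's last letter), the tiny training list, and the function names are all my own assumptions, chosen only to illustrate the idea.

```python
from collections import Counter, defaultdict

# Toy training data, made up for illustration only.
# A real classifier is trained on a much larger list of names.
TRAIN = [
    ("Anna", "F"), ("Maria", "F"), ("Julia", "F"), ("Emma", "F"),
    ("John", "M"), ("Kevin", "M"), ("Colin", "M"), ("Mark", "M"),
]


def last_letter(name):
    """The only feature in this sketch: the final letter of the name."""
    return name.strip().lower()[-1]


def train(examples):
    """Count how often each label and each (label, feature) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for name, label in examples:
        label_counts[label] += 1
        feature_counts[label][last_letter(name)] += 1
    return label_counts, feature_counts


def posterior(name, model):
    """Return P(label | last letter) for each label via Bayes' rule."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    feat = last_letter(name)
    scores = {}
    for label, n in label_counts.items():
        prior = n / total
        # Add-one (Laplace) smoothing over the 26 letters of the alphabet
        likelihood = (feature_counts[label][feat] + 1) / (n + 26)
        scores[label] = prior * likelihood
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


model = train(TRAIN)
probs = posterior("Sophia", model)
print(probs)
```

The output is a probability for each category rather than a hard answer, which is why the accuracy depends so heavily on the names used for training and on the names you feed in.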
After creating your list of names and downloading the Python files, we are going to create a simple program (below) to run the GenderPredictor and output a new CSV file with the names and their genders as determined by the classifier:
# import predictor module
from genderPredictor import genderPredictor
# import pandas module
import pandas as pd

# load in name CSV file
df = pd.read_csv("blognames.csv")

# run the gender predictor
if __name__ == "__main__":
    gp = genderPredictor()
    # train the classifier and record its accuracy on held-out names
    # (trainAndTest is the call used in the repository's own example)
    accuracy = gp.trainAndTest()
    # run predictor on list of names
    for name in df['first']:
        gender = gp.classify(name)
        print('\n%s is classified as %s' % (name, gender))
        df.loc[df['first'] == name, 'gender'] = gender
    # print approximate accuracy of predictor
    print('Accuracy: %f' % accuracy)
    # Create new CSV file with names and genders
    # (the output filename here is an arbitrary choice)
    df.to_csv("blognames_gender.csv", index=False)
After running your program, you should have a new file with columns for first name, last name, and gender! While this is a simple way to impute gender data using names, the short snippet of code we put together here can be used as a blueprint for running any number of similar programs. With a really basic knowledge of Python, we can employ algorithms like GenderPredictor to enhance our data and add new variables that may lead to further insights. Considering the role of things like gender and ethnicity can be crucial to furthering our understanding of the phenomenon under study.
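As one way to picture that blueprint, here is a generic sketch that reads a CSV, applies any function to one column, and writes the result as a new column. The helper name add_column, the demo filenames, and the stand-in function (name length instead of a real classifier) are all hypothetical, and I use the standard library's csv module here rather than pandas to keep the example self-contained.

```python
import csv


def add_column(in_path, out_path, source, target, fn):
    """Read in_path, set row[target] = fn(row[source]), write to out_path."""
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames) + [target]
        rows = [dict(row, **{target: fn(row[source])}) for row in reader]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Small demo input so the sketch runs on its own.
with open("demo_names.csv", "w", newline="") as f:
    f.write("first,last\nColin,Burke\n")

# len is a stand-in; any function that takes a name would work here.
result = add_column("demo_names.csv", "demo_out.csv", "first", "name_length", len)
print(result)
```

Swapping len for something like a classifier's classify method would give the same workflow as the program above, with the new variable written out alongside the original columns.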
I hope you found this helpful!