Measuring the Accuracy of Gender-Identification Programs


I’m proud to say that the first phase of my project is well underway. With the help of my lab, I’ve made a program that infers the gender of authors of Tweets. The program compares each Tweet’s handle to census data on male and female names. If a handle contains a conventionally female name, then the author is categorized as female. If the handle contains a conventionally male name, then the author is categorized as male.
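To make the matching step concrete, here is a minimal sketch of what a handle-based classifier along these lines might look like. The name lists are tiny placeholders rather than real census data, the function name `classify_handle` is hypothetical, and checking the female list before the male list is an assumption made only for illustration.

```python
# Placeholder name lists; a real run would load these from census data.
FEMALE_NAMES = {"elsa", "emily"}
MALE_NAMES = {"joel", "derek"}


def classify_handle(handle):
    """Label a handle by looking for a known first name inside it."""
    text = handle.lstrip("@").lower()
    # Assumption: the female list is checked first, so it wins any tie.
    if any(name in text for name in FEMALE_NAMES):
        return "female"
    if any(name in text for name in MALE_NAMES):
        return "male"
    return "unknown"
```

Because this is a plain substring search, any name hidden inside a longer string counts as a match, so the order in which the lists are checked can decide the result.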

Although the program works, I still have concerns about its accuracy. After analyzing a small sample of Tweets, I’ve noticed that many handles contain both male and female names. One of the (many) pitfalls of using first names as a basis for assigning gender to Tweet authors is that many female first names contain a male name and vice versa. For example, take a handle like @Joelsaw. When I was creating a small sample of Tweets on which to test my program, I included Tweets by @Joelsaw because it seemed like an unequivocally male handle, since it contains the first name “Joel”. Little did I realize that “@Joelsaw” also contains the female name “Elsa”. As a result, the program coded “@Joelsaw” as female.
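One way to surface handles like this is to collect every name that matches, rather than stopping at the first hit, and flag any handle that matches both lists. The sketch below assumes simple substring matching; the function names `matched_names` and `is_ambiguous` are hypothetical and not part of my actual program.

```python
def matched_names(handle, female_names, male_names):
    """Return (female_hits, male_hits): every listed name found in the handle."""
    text = handle.lstrip("@").lower()
    female_hits = sorted(name for name in female_names if name in text)
    male_hits = sorted(name for name in male_names if name in text)
    return female_hits, male_hits


def is_ambiguous(handle, female_names, male_names):
    """True when the handle contains at least one name from each list."""
    female_hits, male_hits = matched_names(handle, female_names, male_names)
    return bool(female_hits) and bool(male_hits)


# "@Joelsaw" contains both "joel" and "elsa", so it is flagged.
print(matched_names("@Joelsaw", {"elsa"}, {"joel"}))
print(is_ambiguous("@Joelsaw", {"elsa"}, {"joel"}))
```

Flagged handles could then be set aside for manual review instead of being assigned a gender automatically.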

To address this accuracy issue, I looked at how other researchers have handled the same problem. In “What’s in a Name? Using First Names as Features for Gender Inference in Twitter”, researcher Derek Ruths created an author-gender identification program and used crowd-sourcing to gauge the accuracy of the gender-inference process. Ruths tested the accuracy of his program by having Amazon’s Mechanical Turk coders guess the gender of a Twitter user based on the user’s profile picture. For each Tweet in his dataset, a Mechanical Turk coder would view the profile picture of the user and classify the user as male, female, or unknown. Each profile picture would be classified three times, by three different coders, to make the labels more reliable. Finally, researchers would compare the gender assignment from the Mechanical Turk trials to the gender inference from the program to assess the accuracy of the program.
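The comparison step can be sketched roughly as follows. Combining the three coder labels by majority vote is my assumption, since the description above does not say how the three classifications were merged, and the function names `majority_label` and `accuracy` are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the coders, else 'unknown'."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "unknown"


def accuracy(predicted, human_votes):
    """Share of users where the program agrees with the coders' majority label.

    Users whose majority label is 'unknown', or who have no program
    prediction, are left out of the comparison.
    """
    compared = 0
    correct = 0
    for user, votes in human_votes.items():
        truth = majority_label(votes)
        if truth == "unknown" or user not in predicted:
            continue
        compared += 1
        if predicted[user] == truth:
            correct += 1
    return correct / compared if compared else None
```

Leaving out the users that coders could not agree on keeps the score from being dragged down by cases that humans cannot resolve either, although it also means the score says nothing about those cases.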

Although profile pictures are not a part of the metadata included with the Twitter dataset that I am working with, I wonder if this “crowd-sourcing” technique would be an effective way to gauge the accuracy of my gender-inference program. Instead of looking at profile pictures, Mechanical Turk coders might look at the handles and code them as male or female. For example, human coders might be effective at solving problems like the “@Joelsaw” issue.



  1. Hey Yussre! This doesn’t exactly answer your question, but there were 3,041 unique handles that were classified as “male” or “female” and had the hashtag “#politics”. I omitted all gender-neutral names in an earlier step. Before I even looked at the Tweets, I compiled a list of male and female names from some census data. Any names that were on both the male and female lists were omitted. Unfortunately, that meant that even if there were 1,000,000 male “Johns” born in 1978, if there was even a single female “John” born during that same year, then the name had to be omitted. Making name-gender a dummy variable like this is an all-or-nothing approach, but the only alternative would be to use the census data to estimate the likelihood of the name belonging to a male vs. the likelihood of it belonging to a female. If I took this approach, then the name John would read as 99.99% male, and the variable “gender” would be continuous. Down the road, making gender a continuous variable might help me hone the certainty of my findings, but for now it makes the most sense to read gender as a dummy variable.

  2. Hey, Emily! This sounds really cool. I think MT would be a cool way to combat that issue. How many Twitter users are in your data set? And just out of curiosity, are you using gender-neutral names or omitting those from your data set?
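The trade-off described in the first comment, between an all-or-nothing dummy variable and a continuous likelihood, can be sketched side by side. The counts are the hypothetical “John” figures from that comment rather than real census numbers, and the function names `dummy_gender` and `male_probability` are made up for illustration.

```python
def dummy_gender(male_count, female_count):
    """All-or-nothing rule: a name seen for both genders is omitted (None)."""
    if male_count > 0 and female_count > 0:
        return None
    if male_count > 0:
        return "male"
    if female_count > 0:
        return "female"
    return None


def male_probability(male_count, female_count):
    """Continuous alternative: share of people with this name who are male."""
    total = male_count + female_count
    if total == 0:
        return None
    return male_count / total


# Hypothetical counts: 1,000,000 male Johns and a single female John.
print(dummy_gender(1_000_000, 1))      # the name is dropped entirely
print(male_probability(1_000_000, 1))  # very close to 1.0
```

A middle ground would be to keep the dummy variable but only drop names whose probability falls below some cutoff, although choosing that cutoff is a judgment call.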