What’s in a name

I have spent the last few weeks immersed in names. As far as my project is concerned, names are the only way to determine gender. The first chunk of my project involves categorizing tweets by author gender. In order to do this, I am comparing Twitter usernames to a master list of male and female names. Usernames that contain a male name will be categorized as male, and usernames that contain a female name will be categorized as female.

There are a few considerations that make this process not-so-simple. First of all, I will have to discard tweets from all users who don’t fall neatly into my categorization scheme. For example, if a user’s Twitter handle does not contain one of the 1,000 most popular male or female names, then their tweets will be discarded from the corpus. As a result, people with non-English names will be systematically excluded, along with people who choose not to include a name in their Twitter handle. Tweets from users with androgynous names will be discarded as well. Although discarding these tweets will undoubtedly introduce bias into my study, it will increase the validity of my independent variable, gender. As long as I can say with certainty that a tweet’s author is male or female, discarding the users who don’t fit my categorization scheme is worthwhile.
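
To make the rule concrete, here is a rough sketch of how the categorization could work in Python. It is only an illustration under my own assumptions: the function name `classify_username` and the tiny name sets are hypothetical, matching is done by simple case-insensitive substring search, and a handle that matches both lists is treated as ambiguous and discarded.

```python
def classify_username(username, male_names, female_names):
    """Return "male", "female", or None when the tweet should be discarded.

    Assumption: both name collections hold lowercase names, and a name
    "counts" if it appears anywhere inside the handle.
    """
    handle = username.lower()
    has_male = any(name in handle for name in male_names)
    has_female = any(name in handle for name in female_names)

    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    # No name found, or names from both lists found: discard.
    return None


# Hypothetical miniature name lists, for illustration only.
male_names = {"john", "michael"}
female_names = {"mary", "jennifer"}

for handle in ["John_Smith82", "xX_mary_Xx", "coolcat99", "johnmary"]:
    print(handle, classify_username(handle, male_names, female_names))
```

One weakness of plain substring matching is that short names can show up inside longer, unrelated words, so a real version would probably need stricter matching rules.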

To compile the master lists of male and female names, I downloaded naming data from the Social Security Administration website. This data contains an annual list of the 1,000 most popular baby names going back to the 1800s. Since I wanted a name list that would fit the typical Twitter user, I decided to use datasets from the years surrounding the average Twitter user’s birth year. The average Twitter user’s birth year is 1978, so I compiled datasets from 1973 to 1983. I then separated the lists by gender: names given exclusively to females were sorted into one list and names given exclusively to males were sorted into another. Androgynous names were discarded.
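
The separation step can be sketched with Python sets. This is a simplified illustration, not the exact script I ran: it assumes the yearly lists have already been loaded into `(name, sex)` pairs with sex coded as `"M"` or `"F"`, and the function name `build_name_lists` and the sample records are made up for the example.

```python
def build_name_lists(records):
    """Split (name, sex) pairs into male-only and female-only name sets.

    Assumption: `records` pools the entries from every year in the chosen
    range, with sex given as "M" or "F".
    """
    male, female = set(), set()
    for name, sex in records:
        key = name.lower()
        if sex == "M":
            male.add(key)
        elif sex == "F":
            female.add(key)

    # A name that appears on both sides in any year counts as androgynous.
    androgynous = male & female
    return male - androgynous, female - androgynous


# Hypothetical sample records, for illustration only.
records = [
    ("Michael", "M"),
    ("Jennifer", "F"),
    ("Jamie", "M"),
    ("Jamie", "F"),
]
male_only, female_only = build_name_lists(records)
print(sorted(male_only), sorted(female_only))
```

Pooling all the years before taking the intersection means a name is dropped if it was given to both boys and girls at any point in the range, which is the strictest reading of "exclusively."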

The next step in this process is to isolate the usernames from the Twitter dataset and compare them to the list of names. More on that next week!