Musings on Data Cleaning: Everyone’s Favorite Process!

Hi everyone, welcome to my musings on data cleaning! I have been spending the last couple of weeks cleaning my data and getting it ready for analysis. While I don’t think data cleaning is anyone’s favorite part of research, it is an incredibly important and essential step in the process, and sitting in front of a computer all day can really put a girl in a philosophical mood.

When I first sat down to begin, I felt overwhelmed by the sheer amount of data in the AddHealth repository. Since it is a longitudinal study, I needed to sift through multiple waves of data to create my primary dataset that would include sampling weights, predictors, covariates, demographics, and outcomes. As I worked on this task, most of my thoughts were spent on the actual task itself, but part of me started thinking about how cool it was that I, a 21-year-old student, had the opportunity to work with such a large and comprehensive dataset!

Once I created my new dataset, the real work began. In order to produce unbiased estimates that could scale to the general population, I needed to include a sampling weight in my analysis (Chen & Chantala, 2014). However, in order for that sampling weight to do its job, I needed to change missing data into something the weight could recognize. The weight recognizes missing data as a blank rather than a numeric code, so I spent significant time recoding my variables so that missing data appeared as blanks. After I recoded my variables, it was time to make subpopulations. Once again, in order for the weight to do its job, it needed complete data for each variable in the analysis. To create a subpopulation, I excluded responses that had missing data and created new, mini variables that specified a complete subpopulation for each analysis I would run. For example, say someone responded to the question about binge drinking but left the question about arrests blank. That person’s data would be used for the analysis with binge drinking but be excluded from the analysis about arrests.
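
For anyone curious what these two steps look like in code, here is a minimal sketch in Python. It is an illustration only, not the code used for this project: the variable names (`binge_drinking`, `arrests`), the missing-data codes (96, 98, 99), and the helper functions are all made up for the example, and the real codes differ from item to item and have to come from the codebook.

```python
# Hypothetical reserved codes standing in for answers like "refused" or
# "don't know". Real codes vary by item and must be taken from the codebook.
MISSING_CODES = {96, 98, 99}


def recode_missing(value, missing_codes=MISSING_CODES):
    """Return None (a blank) when value is a reserved missing-data code."""
    return None if value in missing_codes else value


def subpop_flag(row, variables):
    """Return 1 if the row has complete data on every listed variable, else 0."""
    return int(all(row[v] is not None for v in variables))


# Toy respondents; the column names are invented for this example.
respondents = [
    {"id": 1, "binge_drinking": 1, "arrests": 0},
    {"id": 2, "binge_drinking": 0, "arrests": 98},
    {"id": 3, "binge_drinking": 99, "arrests": 1},
]

# Step 1: turn numeric missing codes into blanks (the id column is left alone).
cleaned = [
    {k: (v if k == "id" else recode_missing(v)) for k, v in r.items()}
    for r in respondents
]

# Step 2: one "mini variable" per analysis, marking who has complete data.
for row in cleaned:
    row["subpop_binge"] = subpop_flag(row, ["binge_drinking"])
    row["subpop_arrests"] = subpop_flag(row, ["arrests"])
```

In this toy data, respondent 2 answered the binge drinking question but has a missing code for arrests, so they are flagged in for the binge drinking analysis and flagged out of the arrests analysis. Using a flag rather than deleting rows keeps every respondent in the file, so the survey software can still see the full sample when it applies the weight.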

My last important step involved creation of variables. The AddHealth data is really cool in that it asks lots of specific questions and generates very specific responses. I needed to turn those responses into comprehensive variables that meant something for me and my analysis. For example, AddHealth contains items about exposure to and interaction with violence, such as type of violence, consequences of the violence, etc. Since one of my questions is whether or not parental incarceration has a unique impact on emerging adults (separate from exposure to violence, childhood abuse, and other risk factors), I needed to create a single “exposure to violence” variable. In order to do this accurately, I turned to literature previously published using the AddHealth data that extensively investigated violence. I followed their protocol to create a single “exposure to violence” variable (Farrell & Zimmerman, 2017). I did the same for creating a single “exposure to abuse” variable. It was fun to take a break from the data and read other studies, and super cool to be able to use the protocols of other, established researchers for my thesis!
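
To give a rough idea of what collapsing several items into one variable can look like, here is a small Python sketch. To be clear, this is not the Farrell & Zimmerman (2017) protocol and not the code used for this project. The item names are invented, and the rule (1 if any answered item is a "yes", blank only if every item is blank) is just one simple assumption picked for illustration; a published protocol may score items or treat partial missingness differently.

```python
def any_exposure(row, items):
    """Collapse several yes/no items into one indicator.

    Returns 1 if any answered item equals 1, 0 if every answered item is 0,
    and None (blank) if none of the items were answered.
    This scoring rule is an assumption made for the example.
    """
    answered = [row[i] for i in items if row[i] is not None]
    if not answered:
        return None
    return int(any(v == 1 for v in answered))


# Hypothetical item names; the real items come from the codebook.
VIOLENCE_ITEMS = ["saw_violence", "was_threatened", "was_injured"]

people = [
    {"saw_violence": 0, "was_threatened": 1, "was_injured": None},
    {"saw_violence": 0, "was_threatened": 0, "was_injured": 0},
    {"saw_violence": None, "was_threatened": None, "was_injured": None},
]

for person in people:
    person["exposure_to_violence"] = any_exposure(person, VIOLENCE_ITEMS)
```

The same helper could be pointed at a different list of items to build an "exposure to abuse" variable, as long as the scoring rule matches whatever protocol is being followed.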

After spending several days like this, I had what was likely a computer screen-induced existential moment. When I came to college, I planned to major in business and wanted to go full-time into consulting. I never thought that I would pursue research at all, much less to the extent that I am now. Even crazier, I had never considered a career where I would spend significant time with stats, or on code, or with data. This thought stayed with me as I finished up the final steps of my data cleaning and made me really excited to move into the next step of the research process: analysis!

References & Further Reading

Chen, P., & Chantala, K. (2014). Guidelines for Analyzing Add Health Data.

Farrell, C., & Zimmerman, G. M. (2017). Does offending intensify as exposure to violence aggregates? Reconsidering the effects of repeat victimization, types of exposure to violence, and poly-victimization on property crime, violent offending, and substance use. Journal of Criminal Justice, 53, 25–33.