Cooperation the Key to Clean Data
Cleaning dirty data is not just a matter of mastering the technical challenges. It requires making sure your staff is working closely with the business every step of the way.
At the U.S. Centers for Disease Control and Prevention, such real-time data validation underpins data gathering, according to CIO Jim Seligman. When laptop-wielding field workers quiz 40,000 U.S. households a year for the "National Health Interview Survey," automatic edits make sure that responses are as complete as possible while the survey is taking place. Some edits are "skip patterns," designed to prevent irrelevant questions from being asked in the first place. If the respondent is male, for example, he won’t get the question about mammograms. Other edits are consistency checks: Respondents are asked their age, but also their date of birth, and the two are compared.
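For readers who want to see what such edits might look like in practice, here is a minimal Python sketch of the two kinds of check described above. It is an illustration only, not the CDC's survey software: the question IDs, the "ask_if" field and every function name are hypothetical, and the consistency check assumes the stated age should match the age computed from the date of birth on the interview date.

```python
from datetime import date

# Hypothetical question list; "ask_if" is an assumed field holding a skip-pattern rule.
QUESTIONS = [
    {"id": "general_health"},
    {"id": "mammogram", "ask_if": lambda r: r["sex"] == "female"},
]

def applicable_questions(respondent, questions):
    """Skip pattern: return only the question IDs that apply to this respondent."""
    always = lambda r: True  # questions without a rule are asked of everyone
    return [q["id"] for q in questions if q.get("ask_if", always)(respondent)]

def age_on(birth_date, as_of):
    """Age in whole years on the given date."""
    years = as_of.year - birth_date.year
    # Subtract a year if this year's birthday has not yet occurred.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def age_is_consistent(stated_age, birth_date, as_of):
    """Consistency check: does the stated age agree with the date of birth?"""
    return stated_age == age_on(birth_date, as_of)
```

In a sketch like this, a male respondent would never be shown the "mammogram" question, and an interviewer could be prompted to re-ask as soon as age_is_consistent returns False, while the respondent is still available to correct the answer.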
It may sound trivial, but from such small foundations, clean data is built. "Any time a human being has something to do with entering data, there’s the potential for error—whether it’s misreading something, misinterpreting something or miskeying something," Seligman says. And very often, it takes humans working with machines to clean up the mess.
Malcolm Wheatley is a freelance writer living in Devon, England.



