Cooperation the Key to Clean Data

Cleaning dirty data is not just a matter of mastering the technical challenges. It requires making sure your staff is working closely with the business every step of the way.

To facilitate its work, the Bureau has developed an approach of building feedback and validation loops into each survey and questionnaire to make sure that human-generated information is as accurate and reasonable as possible, says Richard Swartz, associate director for IT and CIO at the Census Bureau. Whenever completed questionnaires are returned from businesses and individuals and scanned into the Bureau’s computers, checks called "edits" test the responses to make sure they are complete and reasonable, Swartz explains. Are the required fields complete? If not, how should nonresponses be dealt with? Should records be ignored, or should responses be "created" by estimating or putting in an average value, so as to avoid throwing out a whole record just because of one odd or missing data item? Are responses reasonable? Can a 96-year-old describe herself as unemployed, and is that 80-year-old man really the father of a new baby? Is the data consistent? Could a company with three people on the payroll really have a salary bill of more than $1 million?
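
To make the three kinds of edit concrete, here is a minimal sketch in Python. It is not the Census Bureau's code: the field names, the thresholds and the function names (`run_edits`, `impute_mean`) are all assumptions chosen for illustration. It checks completeness, reasonableness and consistency, and shows one simple way a missing value might be "created" from an average rather than discarding the record.

```python
# Illustrative only: field names and thresholds below are assumptions,
# not the Census Bureau's actual rules.
REQUIRED_FIELDS = ("age", "employment_status")
MAX_WORKING_AGE = 90            # hypothetical cutoff for a reasonableness check
MAX_PAY_PER_EMPLOYEE = 250_000  # hypothetical cutoff for a consistency check


def run_edits(record):
    """Return a list of flags for one response; an empty list means it passed."""
    flags = []

    # Completeness: are the required fields filled in?
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            flags.append(f"missing:{field}")

    # Reasonableness: "unemployed" implies looking for work, unlikely at 96.
    age = record.get("age")
    if (age is not None and age > MAX_WORKING_AGE
            and record.get("employment_status") == "unemployed"):
        flags.append("review:unemployed_at_advanced_age")

    # Consistency: does the payroll total fit the head count?
    employees = record.get("employees")
    payroll = record.get("payroll")
    if employees and payroll is not None:
        if payroll / employees > MAX_PAY_PER_EMPLOYEE:
            flags.append("review:payroll_per_employee")

    return flags


def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]
```

A flag here only marks a record for review; whether to ignore it, impute a value or go back to the respondent remains a decision for the people who own the data.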

At the U.S. Centers for Disease Control and Prevention, such real-time data validation underpins data gathering, according to CIO Jim Seligman. When laptop-wielding field workers quiz 40,000 U.S. households a year for the "National Health Interview Survey," automatic edits make sure that responses are as complete as possible while the survey is taking place. Some edits are "skip patterns," designed to prevent irrelevant questions from being asked in the first place. If the respondent is male, for example, he won’t get the question about mammograms. Other edits are consistency checks: Respondents are asked their age, but also their date of birth, and the two are compared.
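
Both kinds of edit are easy to picture in code. The Python sketch below is an illustration under stated assumptions, not the CDC's survey software: the question list, the answer keys and the helper names (`next_questions`, `age_matches_dob`) are hypothetical. A skip pattern is modeled as a rule attached to each question, and the consistency check recomputes age from the date of birth and compares it with the age the respondent gave.

```python
from datetime import date

# Hypothetical question list; each entry carries a rule saying when it applies.
QUESTIONS = [
    {"id": "sex", "applies": lambda answers: True},
    {"id": "age", "applies": lambda answers: True},
    {"id": "date_of_birth", "applies": lambda answers: True},
    # Skip pattern: only ask about mammograms if the respondent is female.
    {"id": "mammogram", "applies": lambda answers: answers.get("sex") == "female"},
]


def next_questions(answers):
    """Return the ids of unanswered questions that apply to this respondent."""
    return [
        q["id"]
        for q in QUESTIONS
        if q["id"] not in answers and q["applies"](answers)
    ]


def age_on(dob, on):
    """Age in whole years on a given date."""
    years = on.year - dob.year
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_matches_dob(reported_age, dob, interview_date, tolerance=0):
    """Consistency check: does the stated age agree with the date of birth?"""
    return abs(age_on(dob, interview_date) - reported_age) <= tolerance
```

Because the check runs during the interview, a mismatch can be put back to the respondent on the spot instead of being untangled later.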

It may sound trivial, but from such small foundations, clean data is built. "Any time a human being has something to do with entering data, there’s the potential for error—whether it’s misreading something, misinterpreting something or miskeying something," Seligman says. And very often, it takes humans working with machines to clean up the mess.

Malcolm Wheatley is a freelance writer living in Devon, England.

Copyright © 2004 IDG Communications, Inc.
