Cooperation the Key to Clean Data
Cleaning dirty data is not just a matter of mastering the technical challenges. It requires making sure your staff is working closely with the business every step of the way.
Key to the project’s completion was a decision to closely involve the business owners of the data (in this case, the marketing department) in developing the data-cleaning standards. "I’d advise [anyone working on data quality] to work closely with the business users to define the matching rule," Kellett says. "What constitutes a match? Last name, or last name and first name? Or these, plus a matching credit card? And when a duplicate is detected, what rules determine which record will be the survivor record? For instance, is Bob Smith the same as Robert Smith? And is the new address revealed by a car rental due to a house move, or the acquisition of a summer cottage? Or just the wrong address completely?"
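The questions Kellett raises can be made concrete with a small sketch. The Python below is illustrative only and does not reflect Cendant's actual rules: the `Customer` fields, the `NICKNAMES` table, the choice to require a matching card, and the "most recently updated record survives" policy are all assumptions of the kind a business owner would need to approve.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical nickname table; a real one would be agreed with the business.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}


@dataclass
class Customer:
    first_name: str
    last_name: str
    card_last4: str
    address: str
    updated: date


def canonical_first(name: str) -> str:
    """Lower-case the first name and map known nicknames to one form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def is_match(a: Customer, b: Customer) -> bool:
    """Assumed rule: last name, canonical first name and card must all agree."""
    return (
        a.last_name.strip().lower() == b.last_name.strip().lower()
        and canonical_first(a.first_name) == canonical_first(b.first_name)
        and a.card_last4 == b.card_last4
    )


def survivor(a: Customer, b: Customer) -> Customer:
    """Assumed rule: the most recently updated record survives."""
    return a if a.updated >= b.updated else b


bob = Customer("Bob", "Smith", "4242", "12 Elm St", date(2001, 3, 1))
robert = Customer("Robert", "Smith", "4242", "9 Lake Rd", date(2002, 6, 15))
other = Customer("Robert", "Smith", "9999", "9 Lake Rd", date(2002, 6, 15))

print(is_match(bob, robert))          # same person under the assumed rule
print(is_match(bob, other))           # different card, so no match
print(survivor(bob, robert).address)  # address from the newer record
```

Note that the survivor rule silently decides the summer-cottage question in favor of the newer address, which is exactly the sort of choice the marketing department would want a say in.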
The turnaround in the project’s fortunes has been so complete, Kellett says, that Cendant has been able to launch a loyalty program across nine of its chains, including Days Inn, Howard Johnson, Ramada, Super 8 Motel and Travelodge. Customers can now collect points (much like frequent flier miles) every time they stay in a Cendant hotel. Such a program would have been impossible without the single customer view that the cleaned-up data warehouse provides.
Editing Out Inaccuracies
Even better than cleaning dirty data is making sure it can’t be soiled in the first place. Organizations heavily reliant on accurate information, such as the U.S. Census Bureau, are leading the charge when it comes to building real-time validation into data as it is generated. The Bureau undertakes hundreds of surveys a year into demographics, the economy, trade data and much else. And needless to say, clean data is imperative.
To facilitate its work, the Bureau has developed an approach of building feedback and validation loops into each survey and questionnaire in order to make sure that human-generated information is as accurate and reasonable as possible, says Richard Swartz, associate director for IT and CIO at the Census Bureau. Whenever the completed questionnaires are returned from businesses and individuals, and scanned into the Bureau’s computers, checks called "edits" take place that test the responses to make sure they are complete and reasonable, Swartz explains. Are the required fields complete? If not, how should nonresponses be dealt with? Should records be ignored, or should responses be "created" by estimating or putting in an average value so as to avoid throwing out a whole record just because of one odd or missing data item? Are responses reasonable? Can a 96-year-old describe herself as unemployed, and is that 80-year-old man really the father of a new baby? Is data consistent? Could a company with three people on the payroll really have a salary bill of more than $1 million?
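The three kinds of edit Swartz describes (completeness, reasonableness, consistency) can be sketched in a few lines of Python. This is not the Bureau's code: the field names, the `run_edits` function and both thresholds are hypothetical, chosen only to mirror the examples above. The sketch flags responses for review and does not reject or impute anything.

```python
# Hypothetical thresholds; real limits would come from subject-matter experts.
MAX_PLAUSIBLE_JOBSEEKER_AGE = 90
MAX_PAY_PER_EMPLOYEE = 300_000


def run_edits(response: dict, required: tuple) -> list:
    """Return a list of flags for a reviewer; an empty list means no edit fired."""
    flags = []

    # Completeness: every required field must be present and non-empty.
    for field in required:
        if response.get(field) in (None, ""):
            flags.append(f"missing:{field}")

    # Reasonableness: a very old respondent reporting as unemployed is suspect.
    age = response.get("age")
    if (
        age is not None
        and age > MAX_PLAUSIBLE_JOBSEEKER_AGE
        and response.get("employment") == "unemployed"
    ):
        flags.append("unlikely:unemployed_at_age")

    # Consistency: payroll per head should stay under the assumed ceiling.
    employees = response.get("employees")
    payroll = response.get("annual_payroll")
    if employees and payroll is not None:
        if payroll / employees > MAX_PAY_PER_EMPLOYEE:
            flags.append("inconsistent:payroll_per_employee")

    return flags


person = {"age": 96, "employment": "unemployed"}
company = {"employees": 3, "annual_payroll": 1_200_000}
incomplete = {"age": None, "employment": "employed"}

print(run_edits(person, required=("age", "employment")))
print(run_edits(company, required=("employees", "annual_payroll")))
print(run_edits(incomplete, required=("age", "employment")))
```

Returning flags instead of raising errors leaves the nonresponse decision (ignore the record, or estimate a value) to a later step, which is where the questions in the paragraph above would be settled.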



