Cooperation the Key to Clean Data
Cleaning dirty data is not just a matter of mastering the technical challenges. It requires making sure your staff is working closely with the business every step of the way.
Law’s mission was to review all the data, but he had to concentrate his team’s energies on cleaning six critical data fields: the NATO item identifier, the NATO supply classification, the unit of issue, the supplier code, the packaging code and the hazard code. These six fields were chosen based on which ones would have the biggest impact on the supply chain if they were wrong.
"The first step was to identify homonyms and synonyms," says Paul Nettle, manager of data cleaning for TCP. Homonyms, he explains, are two or more different items with the same identifier, such as rations and radio valves. Synonyms are the same items with more than one identifier—the same radio valve kept in two places in a warehouse under two different numbers, for example.
"Synonyms are merely inefficient," Nettle observes. Overstocking and overbuying result from such data mistakes, rather than troops being shipped the wrong gear.
Next, the IT team employed data-profiling software to crawl through the data, checking it for valid NATO numbers. The troubling finding: 119,000 numbers (about one in 10) weren’t valid. The radio valve, it turned out, had a valid NATO part number, but the rations came from a satellite system where nonstandard rules had been used. Every one of the invalid numbers had to be sent to a NATO office in Glasgow for codification, and then corrected in each system in which it occurred. Nettle and his team also discovered they had quite a bit of relabeling to do at the depot, since much of the inventory sitting on the shelves was now incorrectly labeled.
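A first-pass validity check of this kind can be as simple as testing the shape of each number. The sketch below assumes the standard 13-digit NATO Stock Number layout (a four-digit supply classification, a two-digit country code and a seven-digit item number, usually written as 4-2-3-4 with hyphens). It checks format only; confirming that a number has actually been codified would need a lookup against NATO records, which is outside this example. The function names and sample values are hypothetical.

```python
import re

# 13 digits grouped 4-2-3-4; hyphens are treated as optional (an assumption).
NSN_PATTERN = re.compile(r"^\d{4}-?\d{2}-?\d{3}-?\d{4}$")


def is_well_formed_nsn(value):
    """Return True if the value has the shape of a NATO Stock Number."""
    return bool(NSN_PATTERN.match(value.strip()))


def split_valid_invalid(values):
    """Partition values into well-formed and malformed identifiers."""
    valid, invalid = [], []
    for value in values:
        (valid if is_well_formed_nsn(value) else invalid).append(value)
    return valid, invalid


# Made-up identifiers for illustration only.
candidates = ["5960-99-123-4567", "5960991234567", "RAT-0001", "5960-99-12-4567"]
valid, invalid = split_valid_invalid(candidates)
print(valid)
print(invalid)
```

Anything landing in the malformed list would be a candidate for referral and correction, much as the invalid numbers were in the TCP project.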
The next step was "fuzzy matching," using software to look for duplicates and errors introduced by keyboard entry. "The ability to ignore [minor mistakes in] punctuation and figure out when a 3 had been erroneously substituted for an 8 was important when dealing," Nettle says. Such numerical errors, after all, could change the entire meaning of the text, while punctuation mistakes merely provided Nettle’s team with much-needed amusement.
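The idea behind this kind of fuzzy matching can be sketched in a few lines of Python. The approach here is an assumption chosen for illustration, not a description of the product the team used: punctuation and spacing are stripped before comparison, and two codes of equal length are flagged as likely duplicates if they differ in at most one character, which catches a 3 typed in place of an 8. All names and sample codes are hypothetical.

```python
import re
from itertools import combinations


def normalize(code):
    """Drop punctuation and spaces so formatting differences are ignored."""
    return re.sub(r"[^0-9A-Za-z]", "", code).upper()


def hamming(a, b):
    """Count differing positions; None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def likely_duplicates(codes, max_diff=1):
    """Return (code_a, code_b, differences) for pairs that nearly match."""
    normalized = [(c, normalize(c)) for c in codes]
    pairs = []
    for (raw_a, a), (raw_b, b) in combinations(normalized, 2):
        diff = hamming(a, b)
        if diff is not None and diff <= max_diff:
            pairs.append((raw_a, raw_b, diff))
    return pairs


# Made-up codes: the second differs only in punctuation,
# the third has an 8 where the first has a 3.
codes = [
    "5960-99-123-4567",
    "5960 99 123 4567",
    "5960-99-128-4567",
    "8970-99-000-1111",
]
pairs = likely_duplicates(codes)
for pair in pairs:
    print(pair)
```

A pair with zero differences is a pure formatting duplicate. A pair with one difference is only a suspect: the two codes may be genuinely different items, so a person would still need to review it.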
By August 2001, they had completed the relatively easy (if time-consuming) task of examining item identifiers to see, for instance, if an item held a valid NATO number. Now they had to find a way to correct the other data fields. Here, the challenge was more difficult. For things such as unit-of-issue labels, packaging codes and supplier details, hard and fast rules to tell clean data from dirty data didn’t exist.
Take supplies of aircraft oil. A military unit in the Gulf might order 250 liters of oil, expecting 250 one-liter cans, only to receive 250 separate 250-liter drums of the stuff. The reason? On the Royal Air Force system responsible for ordering the oil, 250-liter drums, not one-liter cans, were the unit of issue. Neither label was technically an error, but clearly, such inconsistencies could quickly cripple a supply chain. To make sure such a disaster would never occur, the TCP team turned to a data-profiling tool, which highlighted errors and inconsistencies in the various codes. The software provides easy-to-understand, computer-generated diagrams to spot unusual data formats that could be erroneous.
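Two of the checks described here lend themselves to a short illustration. The Python sketch below is a simplified stand-in for a commercial profiling tool, under stated assumptions: it reduces each code to a "shape" (digits become 9, letters become A) and flags shapes that occur rarely, and it lists items whose unit of issue differs from one system to another. The threshold, the record layout, the system names and the sample values are all invented for the example.

```python
from collections import Counter, defaultdict


def shape(value):
    """Reduce a code to its format: digits -> 9, letters -> A, rest kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def rare_shapes(values, threshold=0.2):
    """Return shapes seen in less than `threshold` of the values."""
    counts = Counter(shape(v) for v in values)
    total = len(values)
    return {s: n for s, n in counts.items() if n / total < threshold}


def conflicting_units(records):
    """Find items whose unit of issue differs between systems.

    Each record is (system, item, unit_of_issue); the layout is assumed.
    """
    units = defaultdict(set)
    for _system, item, unit in records:
        units[item].add(unit)
    return {item: sorted(u) for item, u in units.items() if len(u) > 1}


# Hypothetical packaging codes: one does not follow the usual format.
packaging_codes = ["P01", "P02", "P03", "P04", "P05", "p-6"]
unusual = rare_shapes(packaging_codes)
print(unusual)

# Hypothetical unit-of-issue records from two systems.
records = [
    ("army", "aircraft oil", "1 L can"),
    ("raf", "aircraft oil", "250 L drum"),
    ("army", "rations", "box"),
]
conflicts = conflicting_units(records)
print(conflicts)
```

Neither check decides which value is right. As with the oil drums, both labels may be legitimate within their own systems; the point is to surface the disagreement so the business can settle on one.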



