Understand why poor data quality is no longer a little problem
Find out how costly dirty data can be
Learn how to cleanse data
For years, the Montana Department of Corrections was a prisoner of data quality problems. Aging IT systems perpetrated countless data entry offenses in reports that the prison system was required to submit to state and federal authorities. And while the department’s IS group put in hours of manual labor to try to maintain some level of reporting integrity, overall confidence in the quality of data was nonexistent and morale in the IS group was low. The situation came to a head four years ago when the department nearly lost a coveted $1 million federal grant. The culprit: information systems that, lacking business rules and a data dictionary, failed to accurately forecast how many of a particular type of offender would be incarcerated. “We had an egregious data quality problem. Not to the point where we were losing offenders, but we weren’t able to accurately portray how many we thought we’d have over the next two to five years,” says Dan Chelini, bureau chief for information services at the Helena, Mont.-based department.
With the go-ahead from the state prison’s board of directors, Chelini’s department mounted an aggressive campaign, from late 1997 to mid-1999, to turn around data quality as part of an overhaul of the prison system. The first step was to bring in a team from Information Impact International, a consultancy specializing in data quality, to evaluate organizational processes, acquaint the department with the concept of data stewardship and set up a methodology for data entry. Although some employees were leery at first of the new demands, they bought into the new standards once trained in basic data modeling and data cleansing techniques. A data validity officer was also appointed to rally support for the program and enforce the new rules.
The program officially launched in August 2000, and the department claims to see some real results. Instead of a handful of programmers holding all of the responsibility for prisoners’ information, 30 data stewards from all walks of prison life (from probation officers and attorneys to the guy who showers prisoners when they first enter a facility) now function as data quality gatekeepers. They are accountable for accurately entering information on prisoners, such as names, last known addresses and identifying scars and disfigurements. The Montana Department of Corrections’ data quality problem has been detained. “For the first time in years, we’re meeting deliverables” such as reports to federal overseers, says Data Validity Officer Lou Walters. “People are involved and excited about pushing data quality.”
Although companies deal with customers, not prisoners, the increasing need for accurate data is driving many organizations (in finance, health care, retail and other segments) to launch formal initiatives to bolster the quality of customer information in core business systems. Until recently most organizations haven’t felt a lot of urgency or enthusiasm about cleaning up dirty data; inaccurate and multiple listings of customer information were seen as trivial problems and a tolerable price of doing business. But the current trend in many industries toward data warehousing and data mining has increased the value of good data and the costs of cleaning up databases. That task is anything but trivial, and the costs, which include the direct costs of hiring people and consultants and the indirect costs of missed sales opportunities, are significant.
“Our studies in cost analysis show that between 15 percent and more than 20 percent of a company’s operating revenue is spent doing things to get around or fix data quality issues,” says Larry English, principal of Information Impact in Brentwood, Tenn. Some organizations, like the Montana Department of Corrections, are creating full-time positions around data quality and instituting homegrown methodologies to ensure that information stays consistent and is usable across different types of applications. Other companies are purchasing data cleansing services and customer identification and standardization software from companies such as Vality Technology in Boston and Innovative Systems in Pittsburgh to clean up their act.
Stage One: Denial
For many companies, dirty data remains an unknown problem. “Old systems have limped along for years basically hiding data quality problems, either through departments putting out multiple versions of reports or leaving the reconciliation work to find the real answers to people who do this stuff by hand,” says Ken Orr, founder of the Ken Orr Institute, an IT consultancy based in Topeka, Kan., and a consultant with the Cutter Technology Council, an IT think tank. Vality’s term for this syndrome is “data denial,” says Dave Stanvick, vice president of marketing communications. “Data quality has remained a closeted issue in IT because there’s little visibility at the management level that the problem is occurring. Generally, data would have gone through many days of manual rework before it’s presented in a report to senior management.”
This kind of laborious scrap and rework, as it’s called, fuels one of the most dangerous misperceptions surrounding data quality: that dirty data is all about simple inaccuracies like misspelled names, incomplete addresses and missing data fields. Throw some manpower at a database cleanup job, the theory goes, and the problem will go away. Not so, caution experts, who say that data scrubbing is only a first step. The more critical move, they say, is to create standards for how data on customers or products is represented so that it maintains its integrity, whether used for billing purposes or to drive a direct-marketing campaign. This is also the only real way companies can get a customer’s composite picture across all parts of an organization, a practice necessary for delivering the personalized service that many customers demand.
“How information is used is changing,” explains Mary Knox, senior research analyst at Gartner Financial Services in Durham, N.C. “The focus used to be on data processing, where the value of data consisted in the context of a specific application. But data that’s perfectly fine for the original application takes on new meaning and could very well cause big problems if you try to use it in a different way.”
Consider a typical scenario: Let’s say a Jon B. Smith at 123 Main St. in Lowell, Mass., exists in a bank’s mortgage origination system and a John Smith at 123 Main Drive in Lowell, Mass., comes up in the bank’s system for car loans. Without knowing for certain if the two Smiths are the same individual, companies can still process bills and get paid, albeit with some potential for duplicate work and a confused customer or two. That level of uncertainty doesn’t fly, however, if a company attempts to use that same data to pinpoint cross-selling opportunities based on a customer’s profile. And the situation worsens when the company tries to identify all possible customers in a particular household. A sales contact from this database could alienate customers with improperly targeted pitches or end up in the dead-letter office.
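To make the two-Smiths problem concrete, here is a minimal sketch of how a matching routine might decide that the two records probably describe one person. It is not how any vendor's product works; the suffix table, the field names, the equal weighting of name and address, and the 0.8 threshold are all assumptions chosen for illustration, and the similarity measure is simply the one in Python's standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; commercial tools ship far larger dictionaries.
STREET_SUFFIXES = {"st": "street", "dr": "drive", "ave": "avenue", "rd": "road"}

def normalize(text):
    """Lowercase, strip punctuation, and expand common street suffixes."""
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(STREET_SUFFIXES.get(token, token) for token in tokens)

def similarity(a, b):
    """Character-level similarity of two normalized strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_same_customer(rec_a, rec_b, threshold=0.8):
    """Average the name and address similarity and compare with a threshold.

    The equal weighting and the threshold are illustrative assumptions.
    """
    name_score = similarity(rec_a["name"], rec_b["name"])
    addr_score = similarity(rec_a["address"], rec_b["address"])
    return (name_score + addr_score) / 2 >= threshold

mortgage = {"name": "Jon B. Smith", "address": "123 Main St., Lowell, Mass."}
car_loan = {"name": "John Smith", "address": "123 Main Drive, Lowell, Mass."}
stranger = {"name": "Mary Jones", "address": "9 Elm Road, Boise, Idaho"}

print(likely_same_customer(mortgage, car_loan))
print(likely_same_customer(mortgage, stranger))
```

A score like this only flags candidates. Whether "St." versus "Drive" is a typo or a different street is a judgment that still needs a business rule or a human reviewer.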
It’s this inability to effectively uncover patterns in customer data, despite the millions of dollars now being poured into data warehouse projects, that’s starting to raise data quality red flags. But even as companies acknowledge the problem, most have yet to embark on any formal campaign to measure the hidden costs of poor data quality. “Most companies don’t have the time, energy and drive to do the kind of formal analysis it takes to evaluate the impact of dirty data on their businesses, except when a huge explosion takes place,” says Stuart Madnick, the John Norris Maguire professor of information technology at MIT’s Sloan School of Management in Cambridge, Mass. He also coheads MIT’s Total Data Quality Management Program, a research program devoted to the theory and practice of improving data quality. “The real cost of data quality has to do with how it impacts business, and that analysis is not trivial.”
What’s easy to determine, Madnick says, are the direct costs: for example, what it costs to employ personnel to manually check and correct database records and reports, or the expense of materials and postage for redundant mailings or product returns. But there are less-obvious, hard-to-measure expenses as well. These might be costs associated with warehouse space used to house excess inventory that was ordered because of faulty data, or equipment and facilities allocated to personnel who are strictly employed for the purposes of data quality workarounds. Finally, there are sales prospects neglected because data is unreliable and customers lost because of too few or too frequent marketing contacts.
Stage Two: Acceptance
How can your company establish good data quality? At a few companies, the scrap and rework mind-set is slowly being edged out by a new culture predicated on making data quality improvements a continuous process, and giving employees at many levels responsibility for data quality. This means moving data quality concerns out of the back office, and making every employee who handles customer data accountable for ensuring that it adheres to the organization’s established data guidelines. Buy-in from top management is essential to making this kind of radical organizational shift. “Management has to feel the pain of the status quo; they must understand the costs that have become an accepted way of doing business…because they’ve been so far removed from them in the past,” says Information Impact’s English.
Health care is one industry where the executive ranks have hardly been able to ignore data quality issues. In that segment, high-end data cleansing software packages are fairly common because the stakes are so high. At Saint Alphonsus Regional Medical Center in Boise, Idaho, for example, proper patient identification is the CIO’s number-one priority. Without high-quality data to make these identifications, health-care organizations like Saint Alphonsus put themselves at risk for everything from billing snafus to misdiagnoses that can endanger patients’ lives (and engender huge lawsuits).
“Everything flows from making the proper patient ID,” explains Leslie Kelly Hall, vice president and CIO of Saint Alphonsus. “We have more than 500,000 patients in our master patient index, which represents a good deal of Idaho’s population. We can’t begin to cut automation costs without first getting the correct identification of the patient, the provider or the insurer.”
Saint Alphonsus employs Healthcare.com’s EMerge master person index, which embeds Vality’s Integrity data cleansing and standardization software, to ensure that it identifies patients with the highest degree of accuracy. Once confirmed in EMerge, the patient ID is broadcast to 46 connected systems, including those running various labs, pharmacies, electronic medical records and billing. “To the degree to which we can automate the process, we eliminate human error, which leads to dramatic savings and improvements in care,” Hall says.
The Prudential Insurance Co. of America also has managed to weave data quality best practices into its day-to-day operations. But that wasn’t always the case. Problems came to light around 1996 as the insurance giant embarked on a data warehouse project to get a companywide view of customers across eight lines of business (LOB), from traditional casualty and property insurance to financial services, in pursuit of data mining opportunities. “In the process of pulling together all the LOBs into one enterprise data warehouse, we realized that we had a lot of differences across data,” says Pat Komar, Prudential’s vice president of information services in Newark, N.J. Each LOB had developed its own set of codes for describing elements like customer name and policy number. That wasn’t a problem as long as the LOB data was siloed, but the disparate terminology threatened to throw a wrench into the data warehouse. “Data was going through all kinds of transformations, and what was accurate for a line of business might not be accurate for the enterprise,” Komar says.
That realization spawned a massive campaign to standardize data across the various LOBs, orchestrated by Komar with the support of Prudential’s line-of-business CIOs and its corporate CIO. This meant garnering consensus on naming conventions for what’s now close to 3,000 terms describing things like customer, policy and claim. “Each LOB had different product codes, but we had to have agreement for the enterprise warehouse,” Komar says. During months of working lunches, Komar’s team assembled a committee of data experts from the various LOBs and got input on how core types of customer information should be labeled and modeled. The data SWAT team also appointed managers to ensure that the new standards were followed.
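The standardization Komar describes amounts to a crosswalk: every line of business keeps its local field names, and an agreed table translates each one into a single enterprise term. The sketch below illustrates the idea only. The LOB names, local codes and enterprise terms are invented for the example and are not Prudential's.

```python
# Hypothetical crosswalk from (line of business, local field) to one enterprise term.
CROSSWALK = {
    ("auto", "PH_NM"): "customer_name",
    ("life", "INSURED"): "customer_name",
    ("auto", "POL_NO"): "policy_number",
    ("life", "CONTRACT_ID"): "policy_number",
}

def to_enterprise(lob, record):
    """Rename a LOB record's fields to enterprise terms.

    Returns the standardized record plus a list of fields that have no agreed
    mapping yet, so a data steward can review them instead of losing them silently.
    """
    standardized, unmapped = {}, []
    for field, value in record.items():
        term = CROSSWALK.get((lob, field))
        if term is None:
            unmapped.append(field)
        else:
            standardized[term] = value
    return standardized, unmapped

auto_row = {"PH_NM": "J. Smith", "POL_NO": "A-1001", "VIN": "XYZ"}
life_row = {"INSURED": "J. Smith", "CONTRACT_ID": "L-2002"}

print(to_enterprise("auto", auto_row))
print(to_enterprise("life", life_row))
```

Reporting the unmapped fields rather than dropping them mirrors the audits described in the next paragraph: the gaps in the standard are what the committee needs to see.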
As a result of this standardization, Prudential was able to assemble a six-terabyte data warehouse built on a federated architecture, which flexibly combines data from a central repository and the company’s 20-plus independent data marts. Komar says the approach is far more effective in producing a cross-functional view of customers than the earlier unintegrated series of data marts. In the meantime, adherence to data quality remains ongoing and pervasive: Prudential conducts audits to ensure naming standards are strictly enforced, and the company breaks out the occasional scrubbing tool to keep data files squeaky clean. “We get together once every couple of months to talk about data problems,” Komar says. “It’s a collaborative process to get things changed if they don’t work.”
Stage Three: Leverage
Most companies getting serious about data quality share an ulterior motive: a desire to gain a more intimate portrait of their customers, the goal being better service and increased sales. In banking, the need for reliable data to fuel customer relationship management initiatives is particularly acute given the rampant consolidation and mergers of the past few years. “Today, it’s all about relationship banking,” says Mohammad Rifaie, senior manager of information resource management at Toronto-based Royal Bank of Canada, Canada’s largest bank. “We have to know all the connections a client has to the bank in order to provide meaningful services. This issue of data standardization or linking data to know who is doing what at all of the bank’s touch-points becomes critical…and that’s what’s giving prominence to data quality.”
Using Vality’s Integrity matching tool, the bank can now confidently run queries across different systems. It can determine, for example, who between the ages of 35 and 45 bought mortgages in Ontario over the past six months so that they can be targeted for mailings for additional services like equity lines or home improvement loans. Integrated, reliable data also allows customer reps to know if an individual with a personal account has, say, a commercial account as well so that the bank can tailor service accordingly. “You don’t want to tell someone calling to complain about charges on their checking account that it’s our policy that they pay the fees, when they have a commercial account with us for $20 million,” Rifaie says. “That’s a fast way to have them take their $20 million elsewhere.”
The Integrity tool also protects Royal Bank of Canada against more mundane data quality mistakes. In the past, when entering customer information into databases, bank employees would often enter garbage character strings for the postal code if they didn’t have the proper information at hand. Before Integrity was implemented in 1996, this workaround caused a bit of a stir. When the bank tried to target a particular geographic area promoting a popular Christmas loan, which annually provided an important chunk of the bank’s new assets, a notable percentage of clients in that area came up with the postal code H0H0H0, a garbage string that passed the system’s edit checks because Canada uses alphanumeric postal codes. “If the postal code’s not accurate, you’re not getting accurate information for mailings,” says Rifaie.
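The H0H0H0 episode shows why a format-only edit check is not enough: the string has the right letter-digit pattern, so it sails through. A minimal sketch of two extra checks follows, assuming a hypothetical list of known filler values and an arbitrary 5 percent share threshold. Neither is drawn from the bank's actual rules.

```python
import re
from collections import Counter

# Letter-digit-letter, optional space, digit-letter-digit: the general shape
# of a Canadian postal code. This checks shape only, not whether the code exists.
FORMAT = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")

# Hypothetical list of known filler values; an assumption for illustration.
PLACEHOLDERS = {"H0H0H0"}

def check_postal_code(raw):
    """Classify one postal code as 'bad_format', 'placeholder' or 'ok'."""
    code = raw.strip().upper()
    if not FORMAT.match(code):
        return "bad_format"
    if code.replace(" ", "") in PLACEHOLDERS:
        return "placeholder"
    return "ok"

def overused_codes(codes, max_share=0.05):
    """Return codes whose share of all records exceeds max_share.

    A value that turns up far more often than any real neighborhood could
    account for is a hint that staff are typing it as filler.
    """
    cleaned = [c.strip().upper().replace(" ", "") for c in codes]
    counts = Counter(cleaned)
    total = len(cleaned)
    return {code for code, n in counts.items() if n / total > max_share}

print(check_postal_code("h0h 0h0"))
print(check_postal_code("M5V 2T6"))
print(overused_codes(["H0H0H0", "H0H0H0", "H0H0H0", "M5V2T6"], max_share=0.5))
```

The frequency profile is the more general of the two checks, since it can surface filler values nobody thought to put on the list.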
Targeted direct mail is so critical to St. Louis-based CPI Corp., the owner and operator of Sears Portrait Studios, that data cleansing and standardization have always been part of the company’s data warehouse efforts. CPI relies on First Logic’s data quality suite, I.D. Centric, to cleanse customer information on three levels: standardize name and address data, correct addresses and flag those considered unmailable, and run names and addresses against the company’s close to 1,000 studio locations to avoid mailing promotions to its own sites, says Jerome Pion, a programmer and analyst at CPI. The presence of I.D. Centric enables CPI to avoid a repeat of an embarrassing incident that happened in Canada a few years ago, in which a customer received a follow-up mailing addressed to her and “her two brats,” language inserted by a studio employee into the CPI database. The woman angrily called CPI to say that she had become a former customer.
Another important role for the software’s matching capabilities is identifying all potential customers in a single household for demographic analysis, even if different individuals frequented different stores. This helps CPI get a full transaction history so that it can do things like send bilingual mailings, if required, lure one-time customers back with discounts and send reminders on children’s birthdays, one of CPI’s most popular photo events, Pion says.
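Householding of this kind can be pictured as grouping transactions by a shared address key. The sketch below assumes addresses have already been standardized upstream and uses made-up field names and sample records. It is a simplified stand-in for what a commercial matching tool does, not a description of I.D. Centric.

```python
from collections import defaultdict

def household_key(address, postal_code):
    """Build a crude grouping key from an already standardized mailing address."""
    return (" ".join(address.lower().split()), postal_code.replace(" ", "").upper())

def group_households(transactions):
    """Collect the customer names and store IDs seen at each address."""
    households = defaultdict(lambda: {"customers": set(), "stores": set()})
    for t in transactions:
        key = household_key(t["address"], t["postal_code"])
        households[key]["customers"].add(t["name"])
        households[key]["stores"].add(t["store_id"])
    return dict(households)

# Hypothetical sample data: two family members who visited different studios.
transactions = [
    {"name": "Ann Lee", "address": "40 Oak Street", "postal_code": "63101", "store_id": 17},
    {"name": "Bob Lee", "address": "40  oak street", "postal_code": "63101", "store_id": 52},
    {"name": "Cy Diaz", "address": "8 Pine Road", "postal_code": "63102", "store_id": 17},
]

for key, members in group_households(transactions).items():
    print(key, sorted(members["customers"]), sorted(members["stores"]))
```

Grouping on address alone will merge unrelated people in the same apartment building, so a real rule would probably add unit numbers or surnames to the key.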
Having this kind of reliable insight into customers’ purchasing behavior helps CPI build longstanding relationships. “Once we acquire a customer, it’s more efficient to retain that customer and build a relationship with him than to acquire an additional customer,” notes Tim Hufker, CPI’s chief technology officer. “Having clean data gives us a more efficient and effective way to build relationships and keep customers coming back.”
The Next Stage: Webification
The Web is quickly becoming a key driver of data cleanliness (or dirtiness) as it gains ground as a way for customers and other external parties to input and access business information. CPI plans to increase its reliance on the Web as a vehicle for promoting special offers and collecting customer information, which will make data quality tools like I.D. Centric even more important, Pion says. CPI hopes to capture information from its website’s visitors, run the data through the address standardization software and match it to customers in CPI’s data warehouse. “If we get a matching customer, we now have an e-mail address to send mailings, which is faster and cheaper [than paper mail],” says Pion.
AT&T’s CRM technologies group sees its Web efforts potentially intensifying its data quality problems, according to David Binkley, senior technical staff member of the CRM technologies group in Piscataway, N.J. “Opening up address changes from the Web is a whole new ball game,” Binkley says. “At least when you’re dealing with 2,000 to 3,000 customer reps, you can put methods and procedures in place to get names and addresses entered correctly. When you’re dealing with the Web, you have no control over what’s going on out there.” In response, the CRM technologies group has deployed Trillium 3.0 data cleansing and standardization software from Billerica, Mass.-based Trillium Software, to clean up customer data coming in from local exchange carriers.
The expediency of Web transactions is another area that’s going to give companies data-quality headaches, says Gartner’s Knox. “If you’ve got bad data on customers, they’re going to be aware of that more rapidly” because they can view their data profile online, she explains. “They’re also more aware of what you know and don’t know about them…so a company will look foolish if it doesn’t realize a customer is the same person across multiple accounts.”
Looking foolish or out of touch with its guests’ needs is the last thing Wyndham International wants in the hypercompetitive business hotel market. When Wyndham, the fourth-largest U.S. hotel chain, recently surveyed its frequent business guests, it found that, to most business travelers, staying in a hotel is no vacation. In lieu of reward programs, this important clientele preferred stays that felt more like home. They wanted to be remembered from one trip to the next, drawing comfort from amenities like having their favorite newspaper delivered each morning or keeping the fridge stocked with their preferred snack foods and beverages.
Wyndham’s response was Wyndham ByRequest, a program that leverages preference data collected from customers to give them a more personalized experience. But to build the customer info database, the hotel chain had to devise a strategy for achieving near-perfect data quality to avoid making the kind of mistakes that could alienate good patrons. “We have to make sure, for instance, that we don’t leave a bag of candy for a diabetic,” explains Daniel Pritchard, former director of systems development at Wyndham in Dallas. “When you’re telling people that you’re going to go out of your way to remember them, you don’t get too many opportunities to fail on your promise.”
What started out as a pilot project in November 1999 has become the hotel chain’s major marketing thrust, Pritchard says. Customers are asked to spend a good 10 to 15 minutes filling out preference information, either on the company’s website or through brochure cards placed in every room. With several avenues for collecting preference data, it was essential for the hotelier to put a stringent data quality program in place. Wyndham is using Innovative Systems’ Data Quality Suite to clean up customer data and match up customer profiles, and it’s even running data that exists separately from the ByRequest program through the cleansing solution to ensure the highest levels of consistency.
“Data quality is the cornerstone to the usefulness of our data warehouse,” Pritchard says. “Without the ability to make sense of poor quality data, we knew the ByRequest program wouldn’t have any merit.” Now, when a guest specifies she’d like a good bottle of cabernet and a feather pillow, the Wyndham team can rest easy that she’s not getting foam pillows and a beer.