by Andy Hayler

A data quality conundrum

Opinion
Jan 27, 2011

The data quality market is a remarkably fragmented one – my company tracks over 60 vendors offering data quality solutions in one form or another.

Most of the data quality industry has concentrated on one specific type of data: names and addresses. It is easy to see why – every company has to deal with the names and addresses of customers and suppliers. Such data changes constantly as people move house, and customers often change their names when they marry.

In addition there are plenty of opportunities for confusion – will the call centre operator record my name as “A. Hayler”, “Andy Hayler”, “A.D. Hayler” or “Andrew Hayler”, even assuming they type it correctly? Large name and address files frequently have 20 per cent error rates. In one project I was involved with many years ago, a data clean-up exercise shrank a database of business customers from 20,000 records to 5,000 once all the duplicates and out-of-date entries were removed.

Because companies send out marketing literature and bills by mail, there is clearly a real cost if 20 per cent or more of your addresses are wrong, quite apart from any irritation that customers may feel when receiving multiple communications from different parts of the same company. Marketing costs may be unnecessarily high, and things get worse if the problem extends to the sending of actual deliveries and invoices. Consequently an industry has grown up to provide software that helps companies tackle this problem.

Data quality software typically uses a mix of algorithms to look for typing errors, while providing dictionaries of common variants on names: “Andy = Andrew”, “William = Bill = Wilhelm” for instance, in order to spot likely duplicates. In the early days such software was applied in batch to check files after the event, but these days it plugs into transactional applications to try to spot errors before they happen. For example, an account manager might try to enter a new client only for the software to point out that the account, perhaps under a slightly different name, may already exist at the same address.
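To make that concrete, here is a minimal sketch of the approach, assuming a toy variant dictionary and Python's standard difflib for string similarity; the field names, threshold and records are illustrative only, not any vendor's actual implementation.

```python
from difflib import SequenceMatcher

# Illustrative variant dictionary: map common short forms to a canonical name.
NAME_VARIANTS = {"andy": "andrew", "bill": "william", "wilhelm": "william"}

def normalise(name: str) -> str:
    """Lower-case the name, strip trailing full stops and expand known variants."""
    parts = [p.strip(".") for p in name.lower().split()]
    return " ".join(NAME_VARIANTS.get(p, p) for p in parts)

def likely_duplicate(record_a: dict, record_b: dict, threshold: float = 0.85) -> bool:
    """Flag two customer records as probable duplicates if the addresses match
    exactly and the normalised names are sufficiently similar."""
    if record_a["address"].lower() != record_b["address"].lower():
        return False
    similarity = SequenceMatcher(
        None, normalise(record_a["name"]), normalise(record_b["name"])
    ).ratio()
    return similarity >= threshold

# The account-manager scenario: a "new" client who already exists under a variant name.
existing = {"name": "Andrew Hayler", "address": "1 High Street"}
proposed = {"name": "Andy Hayler", "address": "1 High Street"}
print(likely_duplicate(existing, proposed))  # True – worth warning the operator
```

Real products layer on far more – phonetic matching, address standardisation, probabilistic scoring – but a variant dictionary combined with a similarity measure captures the basic idea described above.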

Such software has taken on greater importance as companies need to check customer credit ratings and law enforcement agencies try to track suspicious persons pre-emptively. Identity resolution software is widely deployed in such situations, which means that booking an airline ticket when your surname is Bin Laden is doubtless a tedious experience. By checking attributes such as a date of birth, an address or a social security number, the software can suggest duplicate identities that may be stored in different systems.
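As a rough illustration of that attribute checking, the sketch below scores a pair of records held in two different systems by the identifying fields on which they agree; the field names, weights and threshold are my own assumptions rather than how any particular identity resolution product works.

```python
# Weight exact agreement on each identifying attribute; strong identifiers
# such as a social security number count for more than a shared address.
FIELD_WEIGHTS = {
    "ssn": 0.6,
    "date_of_birth": 0.25,
    "address": 0.15,
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of the fields on which both records agree exactly."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b and a.strip().lower() == b.strip().lower():
            score += weight
    return score

# Two records from different systems that may describe the same person.
crm_record = {"ssn": "123-45-6789", "date_of_birth": "1965-04-12",
              "address": "1 High Street"}
billing_record = {"ssn": "123-45-6789", "date_of_birth": "1965-04-12",
                  "address": "99 New Road"}  # the customer has moved house

if match_score(crm_record, billing_record) >= 0.7:
    print("Probable duplicate identity across systems")
```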

In general the industry does a decent job of tackling name and address data, and there are many effective solutions on the market at assorted price points. Some offer enhanced features, such as whether a particular address lies in a flood plain or which voting constituency it falls within. In the case of businesses, the software can hook up to databases to tell you the number of employees at an address, the company's revenues and other snippets, as well as plotting the address on a map.

This is all fine, but there is a problem: data quality issues are not restricted to customer names and addresses. In a 2009 survey of 127 large companies conducted by my firm, we found that 76 per cent of companies find it “somewhat difficult” or worse to standardise their product data, and only nine per cent of respondents felt that data quality was “mostly about name and address”. Indeed, names and addresses ranked only third, behind product and financial data, in the priorities of the survey respondents. Yet only a tiny fraction of data quality software makes any serious attempt to deal with data beyond names and addresses.

This gap between what customers need and what the industry has delivered has arisen partly because tackling name and address is a relatively simple problem: such data is well structured, with few attributes, and there are plenty of published algorithms around that can be easily deployed to spot common typing errors. However, product data is a much more slippery thing. It often appears in unstructured files, frequently with hundreds of attributes. Our experience is that product files typically see error rates up to around 30 per cent, yet there are very few solutions that specialise in tackling product data – Silver Creek was bought by Oracle, leaving Datactics and Inquera as independent examples.

If vendors can produce data quality software that genuinely tackles other data domains, such as financial and product data, there ought to be a willing market out there, since at present such correction work is typically done by hand or outsourced to India (to be done by cheaper hands). It does seem curious to me that there is such a clear disconnect between what vendors provide and what customers appear to want and need.

Andy Hayler is founder of research company The Information Difference. Previously, he founded data management firm Kalido after commercialising an in-house project at Shell.