by Nancy Couture

How to implement a robust data quality solution

Opinion
Jan 12, 2016
Agile Development | Analytics | Data Warehousing

Data quality is becoming a key concern for companies that rely on data on a daily basis. Without a purposeful data quality program, information becomes inconsistent and unreliable.

This series of articles describes the foundational steps that enable agile data warehouse development; the articles published thus far cover the earlier steps.

The next step in setting yourself up for a best-in-class agile data warehouse environment is to develop a robust data quality solution.

According to TDWI, the cost of bad data is more than $600 billion annually in the U.S.  There are many negative consequences of low data quality, including:

  • Low customer satisfaction
  • Loss of customers
  • Misguided business decisions
  • Missed opportunities
  • Financial inaccuracies and mistakes
  • Legal and monetary penalties
  • Negative company image

All too often, companies invest in a data warehouse, but a proactive data quality solution is an afterthought. Developing a well-planned and scalable data quality capability as part of your foundational work can go a long way toward improving the quality of your data. Done well, it will also improve business stakeholders' confidence in your data.

First of all, let’s define data quality. Way back in 1996, when I was first developing data quality processes, it was simply defined as “fitness for use,” which is still an appropriate high-level definition. For data to be “fit for use,” an organization needs to define which aspects of quality matter most to it. Below is a quote from a former co-worker who has focused on all aspects of data quality throughout her career.

“Data and information quality thinkers have adopted the word dimension to identify those aspects of data that can be measured and through which its quality can be quantified. While different experts have proposed different sets of data quality dimensions … almost all include some version of accuracy and validity, completeness, consistency, and currency or timeliness among them.”

— Sebastian-Coleman, Laura [2013]. Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework

Rather than trying to focus on every dimension, start with the basics of completeness and timeliness, then move on to validity and consistency. These four dimensions can truly enhance the quality of enterprise data as well as stakeholders’ confidence in the data they consume.

Completeness is first and foremost. Stakeholders need to know that everything in the source is accounted for in the target. You can ensure completeness in a variety of ways; for example, a record-balancing capability records a count at the end of one flow and at the beginning of the next to confirm that all records are accounted for. The ultimate goal is to validate that every record and its corresponding information from a source is handled appropriately during processing. This source-to-target validation should be monitored and reported to the organization’s data consumers.
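Below is a minimal sketch of what such a record-balancing check might look like. The table names, in-memory SQLite connection, and result layout are illustrative assumptions, not a prescribed implementation:

```python
# Record-balancing sketch: compare row counts captured at the end of one flow
# (a staging table) and the beginning of the next (a warehouse table).
import sqlite3

def record_count(conn, table):
    """Return the row count for a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def check_completeness(conn, source_table, target_table):
    """Report whether source and target record counts balance."""
    source_count = record_count(conn, source_table)
    target_count = record_count(conn, target_table)
    return {
        "source_count": source_count,
        "target_count": target_count,
        "difference": source_count - target_count,
        "balanced": source_count == target_count,
    }

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_orders (id INTEGER)")
    conn.execute("CREATE TABLE dw_orders (id INTEGER)")
    conn.executemany("INSERT INTO stg_orders VALUES (?)", [(i,) for i in range(100)])
    conn.executemany("INSERT INTO dw_orders VALUES (?)", [(i,) for i in range(98)])
    print(check_completeness(conn, "stg_orders", "dw_orders"))
    # e.g. {'source_count': 100, 'target_count': 98, 'difference': 2, 'balanced': False}
```

The point of the sketch is that the counts are recorded and compared automatically at each hand-off, so any shortfall surfaces in the metrics rather than in a consumer's report.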

Timeliness should be a component of service-level agreements (SLAs), which identify criteria such as acceptable data latency, frequency of data updates, and data availability. Timeliness can then be measured against these defined SLAs and shared as part of the data quality metrics.
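As a rough illustration, a timeliness check could compare the most recent load timestamp against an SLA-defined maximum latency. The SLA values and field names below are assumptions made for the example:

```python
# Timeliness sketch: measure data latency against an SLA-defined maximum.
from datetime import datetime, timedelta, timezone

# SLA criteria agreed with stakeholders (illustrative values).
SLA = {
    "max_latency": timedelta(hours=4),   # data should be no more than 4 hours old
    "loads_per_day": 6,                  # expected refresh frequency
}

def check_timeliness(last_load_at, now=None):
    """Report the current data latency and whether the SLA was met."""
    now = now or datetime.now(timezone.utc)
    latency = now - last_load_at
    return {
        "latency_minutes": round(latency.total_seconds() / 60, 1),
        "sla_minutes": SLA["max_latency"].total_seconds() / 60,
        "sla_met": latency <= SLA["max_latency"],
    }

if __name__ == "__main__":
    last_load = datetime.now(timezone.utc) - timedelta(hours=5, minutes=30)
    print(check_timeliness(last_load))
    # e.g. {'latency_minutes': 330.0, 'sla_minutes': 240.0, 'sla_met': False}
```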

Validity is a key data quality measure that indicates the “correctness” of the actual data content; for example, confirming that all the characters in a telephone number field are digits, not alphabetic characters. This is the concept most data consumers think of when they envision data quality. Validity can be assessed through data profiling, data cleansing, and inline data quality checks that compare incoming values against expected values or a stated range of acceptability. Alerts can be set depending on the validity checks used, and the results should be measured and shared as part of the data quality metrics.
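A simple inline validity check might apply field-level rules, such as a digits-only telephone number or a value within an accepted range, to each incoming record and tally the results for the data quality metrics. The specific rules and sample data below are illustrative assumptions:

```python
# Inline validity sketch: field-level rules applied per record, with pass/fail tallies.
import re

VALIDITY_RULES = {
    "phone": lambda v: bool(re.fullmatch(r"\d{10}", v or "")),  # digits only, 10 characters
    "age":   lambda v: v is not None and 0 <= v <= 120,         # within the accepted range
}

def check_validity(records):
    """Apply each rule to each record and count passes and failures per field."""
    results = {field: {"passed": 0, "failed": 0} for field in VALIDITY_RULES}
    for record in records:
        for field, rule in VALIDITY_RULES.items():
            results[field]["passed" if rule(record.get(field)) else "failed"] += 1
    return results

if __name__ == "__main__":
    sample = [
        {"phone": "6125551234", "age": 42},
        {"phone": "612-555-99", "age": 42},   # fails: non-digit characters in phone
        {"phone": "6125550000", "age": 212},  # fails: age outside the accepted range
    ]
    print(check_validity(sample))
    # e.g. {'phone': {'passed': 2, 'failed': 1}, 'age': {'passed': 2, 'failed': 1}}
```

Failure counts like these are what feed the validity portion of the data quality metrics, and thresholds on them are a natural place to attach alerts.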

Consistency is crucial to continued consumer confidence. Once data quality metrics for completeness, timeliness, and validity are being monitored and reported to business stakeholders, consistency can be measured by assessing changes in those patterns over time. These results can be added to the data quality metrics reporting shared with stakeholders.
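One possible way to measure consistency is to compare the latest value of a metric, say the percentage of records passing validity checks, against its recent history and flag unusual shifts. The history window and tolerance below are illustrative assumptions:

```python
# Consistency sketch: flag a metric value that deviates sharply from its recent trend.
from statistics import mean, stdev

def check_consistency(metric_history, latest, tolerance=3.0):
    """Flag the latest value if it deviates from the historical mean
    by more than `tolerance` standard deviations."""
    if len(metric_history) < 2:
        return {"latest": latest, "consistent": True, "note": "not enough history"}
    avg, sd = mean(metric_history), stdev(metric_history)
    deviation = abs(latest - avg) / sd if sd else 0.0
    return {
        "latest": latest,
        "mean": round(avg, 2),
        "deviation_sigma": round(deviation, 2),
        "consistent": deviation <= tolerance,
    }

if __name__ == "__main__":
    # Percentage of records passing validity checks over the last ten loads.
    history = [99.1, 98.9, 99.3, 99.0, 99.2, 98.8, 99.1, 99.0, 99.2, 99.1]
    print(check_consistency(history, latest=95.4))
    # The sudden drop is flagged as inconsistent relative to the trend.
```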

Complete transparency of data quality metrics and reporting to your organization’s data consumers will lead to greater confidence in the quality of the underlying data.

Stakeholder confidence will continue to increase if you are able to proactively identify issues through active data quality monitoring before the data consumers find them. This is one of the greatest achievements of a robust data quality program.

The next article will cover the next step in building the foundational approach to agile data warehouse development: giving the development team the ability to self-manage its agile development approach, incorporating continuous improvement.