Welcome back to our series on managing the data landscape and making sure you get the most value out of your data. In our first article in the series, “5 Critical Success Factors to Turn Data Into Insight,” and the ones that follow it, we seek to define these five capabilities that play key roles in the success and repeatability of an actionable analytics program:
Business alignment: Determine context and value of using information.
Data understanding: Seek to better understand data assets and manage accordingly.
Data quality: Define accuracy for the purpose for which data is being used.
Data-centric processes: Increase understanding as new data is created, used, managed and measured as part of operational processes.
Data-centric resources: Embed data-oriented knowledge and skills throughout the staff.
Our topic for this installment is data quality, which we can simply define as data that is “fit for purpose.” Obviously, data used for analytics has to be accurate. But there is a lot more to data quality. Even if you know where to find the data and how to use it, data of insufficient quality may still fail to deliver the insight you are seeking.
Data quality is more than just accuracy; the lens directed toward data quality must also focus on how that data is used. Usage gives data many more quality dimensions, including timeliness and relevance.
Let’s look at key data quality dimensions:
Completeness. Are data values missing or unusable?
Timeliness. Is the data available for use in the time frame in which it is expected?
Conformity. Are there expectations that data values will have a specified format? Does the data meet that format?
Uniqueness. Are there multiple representations of the same data objects within a given data set?
Integrity. Which data elements are missing important relationship linkages? Can you trace the lineage of data?
Consistency. Is there different information about the same underlying data object in multiple environments?
Accuracy. Do data objects accurately represent the real-world business values expected?
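Several of these dimensions can be measured directly against a data set. Below is a minimal sketch in Python; the customer records, field names and date format are illustrative assumptions, not from the article, and each function returns a simple 0-to-1 score for one dimension:

```python
import re

# Hypothetical customer records; field names and values are illustrative.
records = [
    {"id": 1, "email": "ann@example.com", "signup": "2023-04-01"},
    {"id": 2, "email": "", "signup": "2023-13-07"},                 # missing email, invalid month
    {"id": 1, "email": "ann@example.com", "signup": "2023-04-01"},  # duplicate of id 1
]

# Expected format for the signup field: YYYY-MM-DD with valid month and day ranges.
DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def conformity(rows, field, pattern):
    """Share of rows whose value matches the expected format."""
    return sum(1 for r in rows if pattern.match(r.get(field, ""))) / len(rows)

def uniqueness(rows, key):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    return len({r[key] for r in rows}) / len(rows)

print(completeness(records, "email"))       # one row has an empty email
print(conformity(records, "signup", DATE_RE))  # month "13" fails the format
print(uniqueness(records, "id"))            # id 1 appears twice
```

Accuracy and consistency generally cannot be scored this mechanically; they require comparison against a trusted real-world reference or a second environment.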
Quality is in the eye of the business
Another important and often overlooked aspect of data quality is that it represents the intersection of business alignment and data understanding. An organization’s business goals should drive the definition and prioritization of data quality dimensions and their respective requirements. Data understanding will tell you whether the data meets those business requirements. If it doesn’t, it should also indicate whom to contact, such as the appropriate data steward, to determine the steps needed to improve the quality, based on your needs.
A recent line of thinking out of the big data movement holds that data quality is not as important in a high-volume environment; that is, massive volumes of data will average out data quality issues. This can be true. It can also be entirely untrue. You need to give deliberate thought and engineering to the role and effects of data quality in whatever type of analytics solution you are planning.
How then do we incorporate these aspects of data quality into achieving better insight from your data?
Determine what data you need and where to get it
Our previous article, “Mastering and Managing Data Understanding,” talked about understanding the data landscape and the nature of your data. The next logical step is to ask: What data elements do you need? Where do they come from, and are they available and suitable for the intended purpose?
If you are using data for reporting, BI or analytics, you need to first understand the presentation and manipulation of the data. Normally, you are not grabbing discrete data elements. You are grabbing data and then processing it through an algorithm or formula. You are presenting the results as an analysis, scorecard or report. So the metric, KPI or algorithm determines what data you need. Since our prior article also covered defining a data landscape and inventory, the next step is sourcing the data. Now you need to apply the data quality dimensions to make sure the desired source of your data is appropriate for its usage.
Review the purpose of the metric or report: When is it produced? What is done with the result? The answers will give you clues to which data quality dimensions are relevant to your efforts. An operational metric will depend on timeliness. An elaborate algorithm will require adequate coverage and historical depth, without excessive decay.
Align with business needs
Once you have an idea of use and source, double-check that you are very clear on the type of business use of the data. Again, an operational context may dictate a higher level of accuracy than a large-scale statistical exercise. Make sure that any users of the data and resulting calculations are very clear on how they will apply the results and what actions they will take. Even if you are looking for new insights from unasked questions, you still need to understand how the organization will react to whatever new insights are uncovered.
Build business case and metrics requirements
It really helps to present what value will be gained by ensuring the right level of data quality. Often when data quality issues are uncovered, areas of the company may be unwilling to invest in cleaning up the problems because they are not clear on the link between data quality and their business challenges and productivity issues. A business case goes a long way in avoiding this obstacle. You will almost certainly need to describe how quality impacts the overall ROI of the analytics solution; doing so also supports executive buy-in.
Identify or validate information availability
This seems simple, but very often a source is identified and data elements are targeted, only for the analyst to find out the data set is off limits or contains protected information. Understanding, in advance, the classification, sensitivity and availability of the data needed for a specific purpose will prevent counterproductive activities like trying to source a protected data set.
Profile the data
Once you know the elements and data quality dimensions, you need to examine the source data, or profile it, to understand where data issues are hiding. Focus data quality efforts on the prioritized data elements and dimensions; don’t profile everything in the sources. You will inevitably uncover additional issues that are out of scope, so be prepared to collect this information in an easily referenced repository to be addressed at a future time.
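Profiling a prioritized element typically means summarizing its null rate, distinct values and value patterns so issues stand out. Here is a small illustrative sketch in Python; the phone-number column and the character-class pattern trick are assumptions for the example, not a prescribed method:

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize one column: null rate, distinct count and value-pattern frequencies."""
    n = len(values)
    nulls = sum(1 for v in values if v in (None, ""))

    def pattern(v):
        # Generalize a value into a character-class shape, e.g. "314-555" -> "999-999",
        # so nonconforming formats surface as minority patterns.
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))

    patterns = Counter(pattern(v) for v in values if v not in (None, ""))
    return {
        "null_rate": nulls / n,
        "distinct": len(set(values) - {None, ""}),
        "top_patterns": patterns.most_common(3),
    }

# Hypothetical phone-number column with missing and nonconforming entries.
phones = ["314-555-0101", "314-555-0102", "3145550103", "", None]
print(profile_column(phones))
```

A 40 percent null rate or a stray "9999999999" pattern in the output is exactly the kind of finding to log in the issues repository, scoped against your prioritized dimensions.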
Communicate and remediate
After you have an idea of the health of the data (across the data quality dimensions you need), you must communicate the issues and determine the best way to fix the data. Communication and remediation are both important, and they go hand in hand. Data quality tools can help with both, but there are times you may need to consider manual adjustments. Remediation will, of course, improve quality. Communication improves visibility into the quality of the data that supports business decision-making.
Measure quality and align with DG
Sustaining the desired level of data quality is important. Once you determine what dimensions are to be used and set a target of quality that makes the data fit for purpose, consider ongoing profiling and presentation of data quality levels. We find that data quality scorecards that rate accuracy, timeliness, accessibility, etc. are effective. Typically, we profile the data and show how to categorize and measure quality aspects. We develop a scorecard for use as a point-in-time assessment, and also as an ongoing method of evaluating data quality.
Ultimately, all data quality processes should align with existing initiatives defined in your data governance (DG) program to ensure they are sustainable and consistent across the organization.
The significant takeaway for data quality is that you are not looking for “perfect data.” There is no economical way to achieve that, and data that is perfect for one use may be insufficient for another.
The key is to extract what the business needs out of a well-documented data landscape, determine the data quality requirements that will affect your success, then profile and correct the data so it is useful for its purpose.
John Ladley is president and chief delivery officer of First San Francisco Partners. An information technology thought leader with 30 years of experience in strategic planning, design and implementation of enterprise information management systems, John is proficient and knowledgeable, with capabilities that are balanced between devising business technology strategies and plans and finding practical solutions to business problems.
John is a recognized authority and speaker on enterprise information management, information architectures, data governance, mobile device management, data quality, business intelligence and analytics, data warehousing and knowledge management. He has extensive experience in the insurance, financial services, manufacturing, transportation and consumer products industries. His books are recognized as authoritative sources on the topics of information management and data governance. John is currently examining the role of chief data officer and assisting several organizations with data strategies.
The opinions expressed in this blog are those of John Ladley and do not necessarily represent those of IDG Communications Inc. or its parent, subsidiary or affiliated companies.