Higher Quality Data Strengthens the Business

BrandPost By Adam Bowen
Nov 16, 2017
Cloud Computing

A DataOps approach reduces friction and helps transformation teams deliver reliable data.

In Part 2 of this series, I look at the impact of DataOps on data defects. Read Part 1 here.

Decisions and actions based on incorrect information carry far greater destructive potential than decisions and actions that are merely delayed by slow data delivery. Scott Emigh, CTO of Microsoft’s U.S. Partner organization, recently said, “Your analytics are only as good as the quality of data on which you reason.”

I extrapolate this same logic to machine learning (ML), where technology is “doing the reasoning” for us. With the recent proliferation of ML, decisions are happening exponentially more often every second of every day, sometimes with good data, other times with defect-laden data.

This makes data quality more imperative than ever. Would you rather make one decision a day with a 10% data defect rate or make that decision one million times with a 0.1% defect rate? Likewise in DevTest, all of the automated testing in the world means nothing if the tests are undermined by unreliable data.
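To make that arithmetic concrete, here is a minimal sketch in Python (the decision volumes and defect rates are the illustrative figures from the question above, not measurements):

decisions_per_day_manual = 1            # one human decision a day...
defect_rate_manual = 0.10               # ...reasoning over data with a 10% defect rate

decisions_per_day_ml = 1_000_000        # one million automated decisions a day...
defect_rate_ml = 0.001                  # ...over data with a 0.1% defect rate

print(decisions_per_day_manual * defect_rate_manual)  # 0.1 defect-driven decisions per day
print(decisions_per_day_ml * defect_rate_ml)          # 1000.0 defect-driven decisions per day

Even at a 100x lower defect rate, the sheer volume of automated decisions produces ten thousand times more bad outcomes per day, which is exactly why ML raises the stakes for data quality.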

Scott Prugh, VP of Development at CSG International, has observed that automated tests running on undersized data sets – several hundred MB vs several hundred GB – are a major contributor to production failures.

Transformation Is Only as Good as the Data

By focusing on the DataOps areas of version control and transformation, companies can tackle the friction that manifests as data defects. First, where applicable, bringing data under version control allows data operators and consumers to start their work from a discrete point in an immutable data repository, which greatly improves the integrity of, and trust in, the data from the beginning.
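As a minimal sketch of what “a discrete point in an immutable data repository” can mean in practice, here is a hypothetical content-addressed snapshot store; commit_dataset, checkout_dataset, and the data_repo path are illustrative names of my own, not Delphix’s implementation:

import hashlib
import json
from pathlib import Path

REPO = Path("data_repo")  # hypothetical immutable repository root

def commit_dataset(rows):
    """Store a snapshot under its content hash; snapshots are written once, never rewritten."""
    payload = json.dumps(rows, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    REPO.mkdir(exist_ok=True)
    snapshot = REPO / digest
    if not snapshot.exists():  # immutability: never overwrite an existing snapshot
        snapshot.write_bytes(payload)
    return digest  # consumers pin this hash as their starting point

def checkout_dataset(digest):
    """Return exactly the snapshot that was committed, with no silent drift."""
    return json.loads((REPO / digest).read_bytes())

version = commit_dataset([{"id": 1, "email": "a@example.com"}])
assert checkout_dataset(version) == [{"id": 1, "email": "a@example.com"}]

Because every consumer starts from a pinned version rather than “whatever is in the share today,” integrity is established at the start of the pipeline instead of being argued about at the end.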

Second, most data needs to undergo some sort of transformation before it is ready for the data consumer, such as subsetting, synthesis, ETL, endian conversion, or relational-to-NoSQL conversion. Transforming data sets inadequately is bad for business, resulting in myriad defects such as duplicate data, missing data, datatype mismatches, missing corner cases, and improper data sequence.

The way to combat these errors is to continually refine and automate these transformations, and to subject them to their own quality tests before committing the transformed data back into version control.
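A sketch of such a quality gate, under assumed record shapes (the transformation and checks are illustrative; a real pipeline would test for whichever of the defect classes above bite it):

rows = [
    {"id": 1, "email": "a@example.com", "active": True},
    {"id": 2, "email": "b@example.com", "active": False},
    {"id": 3, "email": "c@example.com", "active": True},
]

def subset_active(rows):
    """Illustrative transformation: subset the data to active records."""
    return [r for r in rows if r["active"]]

def quality_gate(rows):
    """Check the transformed output for the defect classes named above."""
    ids = [r["id"] for r in rows]
    assert len(ids) == len(set(ids)), "duplicate data"
    assert all(r.get("email") for r in rows), "missing data"
    assert all(isinstance(r["id"], int) for r in rows), "datatype mismatch"
    assert ids == sorted(ids), "improper data sequence"

transformed = subset_active(rows)
quality_gate(transformed)  # only data that passes the gate goes back into version control
print("gate passed; safe to commit")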

I have visited many companies around the globe that have armies of people devoted specifically to data transformation activities, and the failures I’ve seen trace back to two particular problems.

First, without a DataOps-centric approach, businesses struggle to deliver data on time to where it’s needed. This leaves transformation teams repurposing old data sets again and again, which invariably causes serious quality issues. In these instances, the data operators are also data consumers, hindered by data stored in silos.

It doesn’t have to be this way. By adopting the self-service and automation capabilities discussed in the previous post, data transformation groups can obtain fresh full data sets as needed and provide higher-quality data for BI and analytics.

The second problem area is the dynamic nature of transformations. Data consumers are not always aware that the data they are leveraging today is different from yesterday’s. How many times have we each asked, “What changed?” only to be told, “Nothing”?
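Version control turns that exchange into a checkable question. A minimal sketch, assuming each day’s data set can be fingerprinted the same way (the fingerprint helper and sample records are hypothetical):

import hashlib
import json

def fingerprint(rows):
    """A stable hash of the data set, so 'What changed?' has a verifiable answer."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

yesterday = [{"id": 1, "country": "US"}]
today = [{"id": 1, "country": "USA"}]  # a quiet change in an upstream transformation

if fingerprint(today) != fingerprint(yesterday):
    print("Something did change; diff the two versions instead of trusting memory.")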

DataOps Improves Testing and Analysis

Committing the transformed data sets to version control gives data consumers a high degree of confidence in their activities and products. By tackling both issues at once, companies can relieve the underlying business constraint, rather than just shifting the pain.

The DataOps approach has enabled companies to increase the number of defects found in development while dramatically decreasing total defects overall. This has allowed them to achieve the fast-feedback, higher-quality DevTest loops promised by shift-left methodologies.

My friend and boss, Eric Schrock, covers this key performance indicator in more depth here.
