Growing numbers of organizations are augmenting their enterprise data warehouses (EDWs) with the open source Apache Hadoop platform, and gaining the associated benefits — from cost economies to a highly scalable repository for data in its many forms. Now the focus is shifting to the steps an organization can take to make Hadoop a true enterprise-class platform.
In recent years the Hadoop ecosystem has made enormous strides in the effort to make Hadoop enterprise-ready, yet we must remember that the platform is still maturing. Two key initiatives that will give Hadoop the strength of a true enterprise-class platform are a laser focus on data lineage and a laser focus on data quality.
Let’s start with lineage. To understand your data, you have to understand where it came from. Tracking that provenance is standard practice outside of Hadoop environments, but it often goes missing inside them. In a large enterprise, you can have many users feeding data into Hadoop, and oftentimes database administrators don’t have a view of where the data is coming from or where it is going. They might have five copies of the same data, leaving the IT shop in a bit of chaos.
So why is data lineage important? Because if you don’t know where the data is coming from, you can’t trust what it is telling you. For example, you wouldn’t want to build predictive models for sales in the U.S. based on data solely from Canada, because you’re talking about two different markets.
Lineage helps you understand what you have and where it came from, and it alerts you to duplicate copies of data coming into the system. These are the elements that build the integrity of an enterprise-class platform. Constructing the custodial chain of data is key to determining mastership, context, scope and suitability for analytics needs.
To fix the issues in data lineage, we must augment the ingestion layer in the Hadoop platform. This layer tracks where data came from and where it then went, and at the same time it allows the system to capture metadata — or data about data. This metadata makes it easier to index, search and manage data in all its forms.
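To make the idea concrete, here is a minimal sketch of what capturing metadata at the ingestion layer might look like. The function name, the fields recorded, and the in-memory catalog are all illustrative assumptions, not part of any real Hadoop API; in practice a tool such as Apache Atlas or a custom ingestion framework would play this role.

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest_record(payload: bytes, source: str, catalog: dict) -> dict:
    """Hypothetical ingestion step that captures lineage metadata.

    Computes a content checksum so duplicate copies of the same data
    can be flagged, and records where the payload came from and when.
    """
    checksum = hashlib.sha256(payload).hexdigest()
    metadata = {
        "source": source,                       # where the data came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "checksum": checksum,                   # used to index and dedupe
        "duplicate_of": catalog.get(checksum),  # earlier source, if any
    }
    catalog.setdefault(checksum, source)        # first-seen source wins
    return metadata

# Usage: two ingests of identical bytes; the second is flagged as a duplicate.
catalog = {}
m1 = ingest_record(b'{"sale": 100}', "us_sales_feed", catalog)
m2 = ingest_record(b'{"sale": 100}', "ca_sales_feed", catalog)
print(m1["duplicate_of"])  # None
print(m2["duplicate_of"])  # us_sales_feed
```

Because every record carries its source and checksum, the metadata itself becomes searchable, which is exactly what makes the data easier to index and manage downstream.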
Data quality is another issue lurking in Hadoop systems. When many people in the organization are putting data into Hadoop without any filtration or quality checks, you can be left with a GIGO problem — garbage in, garbage out. Bad data leads to bad decisions.
To avoid this trap and the costs that come with it, you must clean data as it enters the system. While there are many techniques for scrubbing data, the important point is that you must put mechanisms in place to identify and address anomalies and inconsistencies in the data. Without those mechanisms, you cannot trust the data.
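One such mechanism can be sketched as a simple quality gate applied to each incoming record. The rule format and field names below are illustrative assumptions; real deployments would typically use a validation framework or ingest-time checks in the pipeline itself.

```python
def validate_record(record: dict, rules: dict) -> list:
    """Hypothetical quality gate applied as data enters the system.

    `rules` maps a field name to an (expected_type, predicate) pair.
    Returns a list of human-readable problems; empty means the record
    passed every check.
    """
    problems = []
    for field, (ftype, pred) in rules.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"wrong type for {field}")
        elif not pred(record[field]):
            problems.append(f"anomalous value for {field}")
    return problems

# Example rules: country must be a known market, amount must be non-negative.
rules = {
    "country": (str, lambda c: c in {"US", "CA"}),
    "amount": (float, lambda a: a >= 0),
}
clean = validate_record({"country": "US", "amount": 19.99}, rules)
dirty = validate_record({"country": "XX", "amount": -5.0}, rules)
print(clean)  # []
print(dirty)  # two problems flagged
```

Records that fail the gate can be quarantined for review rather than silently landing in the cluster, which is what keeps garbage from reaching the analysts in the first place.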
These are just two of the mile markers we need to reach in the drive to make Hadoop an enterprise-ready platform. In subsequent posts I will look at some additional mile markers. For now, we must keep this thought in mind: We’re talking about a journey, not a quick trip. The goal is to make steady, incremental progress, and to get to our destination one mile marker at a time.
Mike King is an Enterprise Technologist specializing in Big Data at Dell Technologies.