Building a Global Meta-data Fabric to Accelerate Data Science


By Patricia Florissi, Ph.D.

In an earlier blog post, Distributed Analytics Meets Distributed Data, I wrote about the concept of a World Wide Herd (WWH), which creates a global network of Apache™ Hadoop® instances that function as a single virtual computing cluster. WWH orchestrates the execution of distributed and parallel computations on a global scale, even across multi-clouds, pushing analytics to where the data resides. This approach enables analysis of geographically dispersed data without requiring the data to be moved to a single location before it can be analyzed. Only the privacy-preserving results of the analysis are shared.

WWH is a tremendous concept, and a key part of future data strategies as we look to a world that is projected to have 200 billion connected devices by 2031. Data will increasingly be inherently distributed with limited data movement. WWH makes that data accessible for analysis, wherever the data happens to be.

That’s the big picture. But how do we really make data that is scattered around the world, in many different formats, locatable, accessible and usable for analysis by data scientists via a World Wide Herd? This is the topic for today’s post.

Let’s begin with a high-level architectural overview. At a simplified level, WWH has three tiers: a data fabric at the physical infrastructure level, a meta-data fabric in the middle, and an analytics fabric at the top level, where the data scientist works. For this “how-it-works” discussion, I will focus on the middle layer, the meta-data fabric.


Figure 1: The three layers in WWH

Meta-data, of course, is data about data. In the case of WWH, its meta-data fabric abstracts physical data resources, such as a file or a blob store, into meta-resources, which contain meta-data about the physical resources themselves. Two different actors contribute meta-data to meta-resources:

  • Data architects contribute meta-data describing the physical properties of the data, including its physical location, which helps the analytics fabric locate and address the data; and
  • Meta-data curators contribute meta-data describing the semantic properties of the data, including whether it stores genome data or financial data, and whether it contains Personally Identifiable Information (PII).

Figure 2: Actors for meta-data creation
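To make the two kinds of meta-data concrete, here is a minimal Python sketch of what a meta-resource might look like. The class and field names are hypothetical illustrations, not the actual WWH data model: the physical properties come from the data architect, the semantic properties from the meta-data curator.

```python
from dataclasses import dataclass, field

@dataclass
class MetaResource:
    """Meta-data about one physical data resource (a file, a blob store, ...)."""
    name: str
    # Physical properties, contributed by a data architect.
    location: str = ""      # e.g. an HDFS URI or blob-store endpoint
    data_format: str = ""   # e.g. "parquet", "csv"
    # Semantic properties, contributed by a meta-data curator.
    tags: set = field(default_factory=set)   # e.g. {"genome-data"}
    contains_pii: bool = False

# A hypothetical meta-resource for a patient data set.
records = MetaResource(
    name="cohort-2017",
    location="hdfs://zone-eu/data/cohort-2017.parquet",
    data_format="parquet",
    tags={"patient-data", "blood-pressure"},
    contains_pii=True,
)
```

The key design point is the separation of concerns: the analytics fabric consults the physical fields to address the data, while the semantic tags drive the selection of which data sources participate in a computation.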

The meta-data fabric itself is a collection of distributed runtime engines, referred to as catalog nodes. Each catalog node stores all the meta-data about the data in its local data zone, and knows about at least one other catalog node, also referred to as a next hop. This keeps the meta-data fabric fully connected and accessible from any node.


Figure 3: Meta-data fabric distributed runtime engine
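The "at least one next hop" rule is what makes the fabric fully connected. A minimal sketch (names and structure are illustrative, not the WWH implementation) shows how a simple breadth-first walk over next hops reaches every catalog node from any starting point:

```python
from collections import deque

class CatalogNode:
    """One catalog node: holds meta-data for its local data zone
    and knows at least one other catalog node (its next hops)."""
    def __init__(self, zone, meta_resources, next_hops=None):
        self.zone = zone
        self.meta_resources = meta_resources  # {source_name: set_of_tags}
        self.next_hops = next_hops or []      # other CatalogNode objects

def reachable_zones(start):
    """Walk the fabric hop by hop; every zone is reachable from any node."""
    seen, queue = {start.zone}, deque([start])
    while queue:
        node = queue.popleft()
        for hop in node.next_hops:
            if hop.zone not in seen:
                seen.add(hop.zone)
                queue.append(hop)
    return seen

# Two zones arranged in a ring: each node knows exactly one next hop.
us = CatalogNode("us-east", {"hospital-a": {"patient-data"}})
eu = CatalogNode("eu-west", {"hospital-b": {"patient-data"}}, [us])
us.next_hops.append(eu)
```

Starting from either node, `reachable_zones` visits both zones, which is exactly the property the fabric needs: no central registry, just enough local links that the whole graph stays connected.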

The meta-data fabric ties all the physical nodes together for the analytics fabric layer and enables the automation of key functions during code execution through the meta-data amalgamation process.

This architecture frees the data scientists from a great deal of the heavy lifting. The data scientist doesn’t need to know how to locate, address or access the data. WWH takes care of all of that coordination. The data scientist simply interacts with a virtual computing node, and the catalog tells this node how to address the data.

To make this story more tangible, consider, for example, a team of data scientists that wants to study the relationship between high blood pressure and heart disease among different associates, or groups of individuals who share common statistical factors, such as age and ethnicity. The data scientists start a WWH computation in a virtual computing node they have access to, passing as a parameter the name of a meta-resource that includes meta-data tags, such as “patient-data,” “heart-disease,” “age,” “ethnicity” and “blood-pressure.” This indicates that the computation should be performed only on data sources that contain information regarding patients with heart conditions, where the age and the ethnicity are known, and for which blood pressures are being measured. It is important to note that the data scientists are not concerned with the specific format of the data, the data store being used for the data itself, or the location and address of the individual data stores.
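The tag-based selection described above can be sketched in a few lines of Python. This is a hypothetical illustration of the matching rule, not the WWH API: a data source qualifies only if its curated tags cover every tag in the meta-resource the data scientists named.

```python
def matching_sources(catalog, required_tags):
    """Return the names of local data sources whose curated tags
    cover every tag in the requested meta-resource."""
    required = set(required_tags)
    return [name for name, tags in catalog.items() if required <= tags]

# A hypothetical local catalog: source name -> curated semantic tags.
catalog = {
    "hospital-a/records": {"patient-data", "heart-disease", "age",
                           "ethnicity", "blood-pressure"},
    "bank-b/ledger":      {"financial-data"},
}
query = {"patient-data", "heart-disease", "age", "ethnicity", "blood-pressure"}
print(matching_sources(catalog, query))   # only hospital-a/records matches
```

Note that the query mentions no file paths, formats, or addresses; those physical details stay behind the catalog, which is exactly what frees the data scientist from the heavy lifting.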

Once started, the WWH computation first connects to the local catalog node, passing the name of the meta-resource defined by the data scientist. The catalog node returns to the WWH virtual computing node two types of information:

  • The exact location of all files or data stores referenced in the meta-resource. WWH then uses this information to start a computation in the data zone to analyze these data sources; and

Figure 4: Catalog node informs virtual computing node on the location of local data sources

  • The addresses of the next hops where other catalog nodes reside.

Figure 5: Catalog node informs virtual computing node on the next hops

WWH then uses the next-hop information to start distributed computations at these locations, and the process repeats itself. Each new WWH computation that starts on these next hops connects to its local catalog node and retrieves the same two types of information: the local data stores related to the meta-resource, and the next hops, so that additional computations can be distributed there as well.


Figure 6: Next-hop virtual computing nodes access their local catalog nodes
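The recursive fan-out described above can be sketched as follows. This is a simplified, hypothetical model (the class, function, and zone names are illustrative): each stop runs a local analysis against its catalog, then forwards the computation to its next hops, with a visited set preventing any data zone from being computed twice even though the fabric contains cycles.

```python
class CatalogNode:
    def __init__(self, zone, sources, next_hops=None):
        self.zone = zone
        self.sources = sources            # {source_name: set_of_tags}
        self.next_hops = next_hops or []

def analyze_locally(node, required_tags):
    """Stand-in for a local computation: report which local sources match."""
    hits = [s for s, tags in node.sources.items() if required_tags <= tags]
    return (node.zone, hits)

def run_wwh(node, required_tags, visited=None):
    """Fan a WWH-style computation out over the catalog fabric,
    gathering only the privacy-preserving local results."""
    visited = visited if visited is not None else set()
    if node.zone in visited:
        return []                 # this data zone was already computed
    visited.add(node.zone)
    results = [analyze_locally(node, required_tags)]
    for hop in node.next_hops:    # repeat the process at every next hop
        results.extend(run_wwh(hop, required_tags, visited))
    return results

# Two zones in a ring; the visited set breaks the cycle.
eu = CatalogNode("eu", {"b": {"patient-data"}})
us = CatalogNode("us", {"a": {"patient-data", "age"}}, [eu])
eu.next_hops.append(us)
print(run_wwh(us, {"patient-data"}))  # each zone appears exactly once
```

Only the per-zone results travel back; the raw records behind each catalog never leave their data zone.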

It is important to note that, by using WWH, the data scientist does not know where the data stores are and never has access to the individual electronic medical records, if they exist. The data scientist has access only to the results of statistical calculations on the values collected, for the purpose of advancing the science.

Of course, we shouldn’t overlook the five-ton elephant in the room. That’s data privacy, which is always a concern, regardless of the industry. With WWH, security mechanisms are built in at many levels. At a foundational level, the data always stays where it is, and only the results are sent back to the data scientist, so no data is subject to being compromised as it moves over a network. As another security feature, the catalogs that the virtual computing nodes access ask questions to verify whether the user and the environment can be trusted. The catalogs also have built-in red-flag mechanisms that stop users from accessing samples that are so small that privacy protections might be breached.
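The small-sample red flag works much like a minimum-cohort-size rule: an aggregate is released only when enough individuals contribute to it that no single record can be inferred. Here is a minimal Python sketch of that idea; the threshold value and function names are hypothetical, not WWH's actual policy.

```python
MIN_COHORT_SIZE = 10   # hypothetical policy threshold

def release_statistic(values, min_size=MIN_COHORT_SIZE):
    """Release an aggregate (here, the mean) only when the sample is
    large enough that no individual can be singled out; otherwise
    red-flag the request instead of returning anything."""
    if len(values) < min_size:
        raise PermissionError(
            f"sample of {len(values)} is below the privacy threshold")
    return sum(values) / len(values)
```

A request over twenty blood-pressure readings would return their mean, while a request that would isolate only two patients is refused outright, so overly narrow queries can never be used to reverse-engineer individual records.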

Does all of this sound like a distant vision of the future? Absolutely not. While WWH is still in its early years, it is already a concept that organizations are putting to work. For some examples, see my December blog post Using a WWH to Advance Disease Discovery and Treatment and the aforementioned April post, Distributed Analytics Meets Distributed Data.

And, by all means, please keep your eye on this space in the months ahead, as we dive deeper into the workings of WWH and its benefits to data scientists.

Patricia Florissi, Ph.D., is vice president and global CTO for sales and a distinguished engineer for Dell EMC.

Copyright © 2017 IDG Communications, Inc.