Simplifying Data Analytics Pipelines using a Data Lake
For applications ranging from advanced driver assist systems to precision medicine and manufacturing, a simplified development environment enables more rapid prototyping, testing and launch capability for analytics and AI projects.
BrandPost Sponsored by Dell Technologies and Intel®
By Sai Kumar Devulapalli
As part of enterprise artificial intelligence (AI) initiatives, data engineering teams are applying a wide range of data analytics techniques, from streaming analytics to machine learning to deep learning. This diversity of techniques has led to a corresponding diversity of software platforms and tools. Most data engineering teams use data ingestion frameworks, such as Kafka; a mix of analytics and machine learning platforms and languages, such as Hadoop, Spark, Splunk, SAS Analytics, Python and R; and open-source deep learning frameworks, such as TensorFlow, Caffe and PyTorch.
Traditional data analytics pipelines
In traditional data analytics pipelines, data flows into enterprise environments from various internal and external sources and gets pre-processed and cleansed. Enterprises commonly use a “staging area” to store intermediate representations of pre-processed data. The data then propagates through several data stores to run diverse AI and data analytics sets. Visualization tools, such as Tableau or QlikView, correlate data from several sources to provide consolidated views to key stakeholders.
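The multi-hop pattern described above can be sketched in a few lines of Python. This is a toy illustration, not any particular product's pipeline: the record values, the "staging"/"warehouse" names and the serialization step are all invented to show how each hop duplicates the data.

```python
import csv
import io

# Toy sketch of a traditional multi-hop pipeline: raw records land in a
# staging area, get cleansed, then are copied again into each downstream
# store. Every hop makes another full copy of the data.

raw = [
    {"sensor": "a1", "temp": "21.4"},
    {"sensor": "a2", "temp": ""},        # missing reading, dropped later
    {"sensor": "a3", "temp": "19.8"},
]

def serialize(rows):
    """Write rows out as CSV text, as if landing in a data store."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sensor", "temp"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

staging = serialize(raw)                  # hop 1: ingest -> staging area
cleansed = [r for r in raw if r["temp"]]  # pre-processing / cleansing
warehouse = serialize(cleansed)           # hop 2: staging -> warehouse
ml_store = serialize(cleansed)            # hop 3: warehouse -> ML store

hops = 3
bytes_moved = len(staging) + len(warehouse) + len(ml_store)
print(f"{hops} copies of the data, {bytes_moved} bytes moved in total")
```

Each additional analytics tool in this design adds yet another copy, which is exactly the overhead the data lake approach below avoids.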
In addition, data engineering teams build and maintain sandboxes — analytics testing environments that are distinct from production environments — for use in both analytic model development and testing. In order to build and test models, input data for these sandboxes often needs to be synthesized to emulate real-world data, since production data is not always directly accessible. Unfortunately, the quality of analytic models built using synthetic data is only as good as the degree to which synthetic data emulates the real world.
Most data engineering teams will attest that building and managing data analytics pipelines and sandbox environments takes up a significant portion of their time. Platform architectures that simplify data analytics pipelines therefore make data engineering teams more productive. For example, let’s look at how consolidating data into a centralized repository can help reduce time to results.
Benefits of data lakes
Consolidating data across all the stages of the pipeline in one place, which we will refer to as the data lake, is an effective way of simplifying data analytics pipeline design and management in the following ways:
Eliminates data movement: When data gets ingested into the analytics environment, there is no separate staging area. Instead, the data lake is used to manage intermediate representations of ingested data. Once cleansed, the data stays in the data lake and is immediately available to a broad variety of data analytics tools, ideally as a network file system (NFS) or Server Message Block (SMB) mount point or as an object store with RESTful application programming interfaces (APIs). Visualization tools no longer need to connect to multiple data sources to provide consolidated views of the data.
Figure 1: Simplifying data analytics pipelines
Simplifies data governance: Consolidating all analytics data in one place – regardless of where it sits in the analytics pipeline – streamlines the management of data security, data resiliency, auditing, lineage and metadata. This lets the data engineering team focus on use cases and data analytics applications and leaves data governance tasks to the platform.
Accelerates analytics sandbox productivity: A data lake that gives AI sandboxes secure, read-only access to production data sets without compromising production system service-level agreements (SLAs) is a powerful asset. It allows data engineering teams to build and maintain analytic models using real-world production data rather than synthetic data, improving the quality and accuracy of these models and making model development and management more productive.
Increases flexibility of analytics/AI platforms: The analytics and AI platforms and tools used in a data pipeline are quite dynamic. New analytics and AI tools are introduced frequently, and old or unused ones retired. Consolidating the data in a data lake – as opposed to embedding it in each analytics/AI tool – enables data engineering teams to introduce new analytics and AI tools quickly, without having to move data around.
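Two of these benefits can be sketched concretely. In the minimal Python sketch below, a local temporary directory stands in for the lake (in practice it would be an NFS/SMB mount point or object store), and the file and column names are hypothetical. "Tool 1" reads the cleansed data in place with no copy, and a "sandbox" gets a read-only handle to the same production file.

```python
import csv
import io
import tempfile
from pathlib import Path

# A temporary directory stands in for the data lake mount point.
lake = Path(tempfile.mkdtemp()) / "lake"
lake.mkdir()
dataset = lake / "sensor_readings.csv"   # hypothetical cleansed data set
dataset.write_text("sensor,temp\na1,21.4\na3,19.8\n")

# "Tool 1": an aggregation job reads directly from the lake -- no copy.
with dataset.open(newline="") as f:
    temps = [float(row["temp"]) for row in csv.DictReader(f)]
mean_temp = sum(temps) / len(temps)

# "Tool 2": a sandbox gets a read-only handle to the same production file.
sandbox_view = dataset.open("r")         # opened read-only
try:
    sandbox_view.write("x")              # any write attempt fails
    writable = True
except io.UnsupportedOperation:
    writable = False
finally:
    sandbox_view.close()

print(f"mean temp {mean_temp:.2f}, sandbox writable: {writable}")
```

A new analytics tool is added the same way Tool 1 was: point it at the existing path, with no data migration step.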
With these benefits in mind, let’s examine a few real-world examples.
Advanced driver-assistance systems (ADAS)
In the race to achieve fully autonomous driving, auto makers are relying on massive amounts of input gathered from a range of sensors that generate video, radar, lidar, ultrasonic, GPS and on-board vehicle data. Autonomous driving systems makers then use this information to train their AI algorithms — a process that requires huge amounts of data.
Once automotive data is ingested from a disk load station, autonomous driving systems makers use multiple applications in the data analytic pipeline, all of which need access to this data. These include data enrichment and labeling software, software-in-the-loop (SiL) and hardware-in-the-loop (HiL) servers, deep learning toolsets and test frameworks, as shown in Figure 2.
Figure 2: Consolidating data pipelines in ADAS
Consolidating this complex data pipeline with a data lake makes it easier to manage ADAS projects with the benefits discussed in the previous section.
Precision medicine
Life sciences companies are relying on AI to help develop hyper-targeted treatments based on patients’ unique genetic traits. To achieve this, organizations must collect and process patient genomic data, cohort data, patient electronic medical records, data gathered from patient Internet of Things (IoT) devices (such as fitness watches and heart rate monitors), and large sets of reference genomic data.
Figure 3: Combining diverse data sets for accelerating precision medicine
Manufacturing
Modern manufacturers rely on data from hundreds to thousands of sensors spread across production lines to ensure that manufactured products, and the machines that build them, are up to standard. These sensors monitor vibration, moisture and humidity, temperature, metal purity and sound. They run 24/7 and, depending on the manufacturing environment, can produce petabytes of data.
Figure 4: Consolidating data pipelines in manufacturing
Multiple data analytics and AI applications analyze the data generated by these sensors and cross-correlate it with baseline data that indicates normal operation, helping manufacturers enable predictive maintenance, avoid machine downtime, diagnose defective components and optimize conditions on the manufacturing floor. These applications can include computer-aided engineering (CAE), test diagnostic software and hardware, and outlier-detection frameworks driven by deep learning tool sets. Once again, consolidating the data analytics pipeline in a data lake simplifies the design and management of these complex analytical systems.
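The cross-correlation step above can be sketched with a simple z-score check: compare live sensor readings against a baseline of normal operation and flag anything more than three standard deviations out. The vibration values below are invented for illustration; a production system would use far richer baselines and models.

```python
from statistics import mean, stdev

# Hypothetical baseline of normal operation (vibration, mm/s) and a
# window of live readings containing one obvious spike.
baseline = [0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50, 0.49]
live = [0.51, 0.50, 2.10, 0.49]

mu, sigma = mean(baseline), stdev(baseline)

def is_outlier(x, k=3.0):
    """Flag a reading whose z-score against the baseline exceeds k."""
    return abs(x - mu) / sigma > k

outliers = [x for x in live if is_outlier(x)]
print(f"baseline mean={mu:.3f}, flagged {len(outliers)} outlier(s): {outliers}")
```

With all sensor data in one lake, the same baseline tables are available to this kind of outlier check, to CAE tools and to deep learning frameworks without a separate copy for each.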
As illustrated by the examples above, simplifying data analytics pipelines can significantly improve the productivity of data engineering teams, making it easier to manage projects, and freeing up time to focus on use cases and data analytic applications. A consolidated data lake streamlines data analytics pipeline development and management, enabling data engineering teams to rapidly prototype, test and launch analytics and AI projects without having to deal with migrating, securing and managing large volumes of data.
By increasing efficiency and optimizing resources, a data lake can put organizations on the path to a more productive AI environment. Dell EMC can help accelerate this journey with Dell EMC Isilon scale-out network-attached storage, which simultaneously supports traditional file storage and data management for AI data pipelines, making data analytics and AI an integral part of the IT environment. As enterprise AI initiatives — such as autonomous driving, precision medicine and predictive maintenance — continue to produce massive amounts of diverse data, a consolidated data lake can help organizations stay ahead of the game.