As part of enterprise artificial intelligence (AI) initiatives, data engineering teams use a wide range of data analytics techniques, from streaming analytics to machine learning to deep learning. This diversity of techniques has led to a corresponding diversity of software platforms and tools. Most data engineering teams use data ingestion frameworks, such as Kafka; a combination of data analytics and machine learning tools, such as Hadoop, Splunk, SAS Analytics, Spark, Python and R; and open-source deep learning frameworks, such as TensorFlow, Caffe and PyTorch.

Traditional data analytics pipelines

In traditional data analytics pipelines, data flows into enterprise environments from various internal and external sources, where it is pre-processed and cleansed. Enterprises commonly use a “staging area” to store intermediate representations of the pre-processed data. The data then propagates through several data stores that feed diverse AI and data analytics workloads. Visualization tools, such as Tableau or QlikView, correlate data from several sources to provide consolidated views to key stakeholders.

In addition, data engineering teams build and maintain sandboxes (analytics testing environments that are distinct from production environments) for use in both analytic model development and testing. Because production data is not always directly accessible, input data for these sandboxes often needs to be synthesized to emulate real-world data. Unfortunately, the quality of analytic models built on synthetic data is only as good as the degree to which that data emulates the real world.

Most data engineering teams will attest that building and managing data analytics pipelines and sandbox environments takes up a significant portion of their time. Platform architectures that simplify data analytics pipelines therefore make data engineering teams more productive.
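The staging-and-copy pattern described above can be sketched minimally as follows. This is an illustration, not any particular product's pipeline: all paths, file names and the cleansing rule are hypothetical, and real pipelines would use dedicated ETL tooling rather than file copies.

```python
# Minimal sketch of a traditional staged pipeline: raw data is copied into
# a staging area, cleansed, then copied again into each downstream store.
# Every hop duplicates data -- the overhead a consolidated data lake removes.
import csv
import shutil
from pathlib import Path

RAW = Path("raw")            # landing zone for ingested data (hypothetical)
STAGING = Path("staging")    # intermediate, pre-processed copies
STORES = [Path("warehouse"), Path("ml_store")]  # downstream data stores

def ingest(src: Path) -> Path:
    """Copy a raw file into the staging area (first duplication)."""
    STAGING.mkdir(exist_ok=True)
    return Path(shutil.copy(src, STAGING / src.name))

def cleanse(staged: Path) -> Path:
    """Write a cleansed copy, keeping only rows with no missing fields."""
    cleansed = staged.with_suffix(".clean.csv")
    with staged.open(newline="") as fin, cleansed.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            if all(field.strip() for field in row):  # drop incomplete rows
                writer.writerow(row)
    return cleansed

def propagate(cleansed: Path) -> None:
    """Copy the cleansed data into every downstream store (more duplication)."""
    for store in STORES:
        store.mkdir(exist_ok=True)
        shutil.copy(cleansed, store / cleansed.name)
```

Each downstream tool then reads its own private copy; the consolidation argument in the next section is essentially that these hops can be replaced by a single shared location.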
For example, let’s look at how consolidating data into a centralized repository can help reduce time to results.

Benefits of data lakes

Consolidating data across all stages of the pipeline in one place, which we will refer to as the data lake, is an effective way of simplifying data analytics pipeline design and management in the following ways:

Eliminates data movement: When data is ingested into the analytics environment, there is no staging area. Instead, the data lake manages the intermediate representations of ingested data. Once cleansed, the data stays in the data lake and is immediately available to a broad variety of data analytics tools, ideally as a network file system (NFS) or Server Message Block (SMB) mount point, or as an object store with RESTful application programming interfaces (APIs). Visualization tools no longer need to connect to multiple data sources to provide consolidated views of the data.

Figure 1: Simplifying data analytics pipelines (Source: Dell EMC)

Simplifies data governance: Consolidating all the data subject to analytics in one place, regardless of where it is in the analytics pipeline, streamlines and simplifies the management of data security, data resiliency, audit, lineage and metadata. The data engineering team can focus on use cases and data analytics applications, leaving data governance to the platform.

Accelerates analytics sandbox productivity: A data lake that provides secure, read-only access to production data sets from AI sandboxes, without compromising production system service-level agreements (SLAs), is a powerful asset. It allows data engineering teams to build and maintain analytic models using real-world production data sets rather than synthetic data.
This improves the quality and accuracy of these models, making model development and management more productive.

Flexibility of analytics/AI platforms: The analytics and AI platforms and tools used in a data pipeline are quite dynamic. New analytics and AI tools are introduced, and old or unused ones retired, on a frequent basis. Consolidating the data in a data lake, as opposed to embedding it in the analytics/AI tools, enables data engineering teams to introduce new analytics and AI tools quickly, without having to move data around.

With these benefits in mind, let’s examine a few real-world examples.

Real-world applications

Advanced driver-assistance systems (ADAS)

In the race to achieve fully autonomous driving, automakers rely on massive amounts of input gathered from a range of sensors that generate video, radar, lidar, ultrasonic, GPS and on-board vehicle data. Autonomous driving system makers then use this information to train their AI algorithms, a process that requires huge amounts of data.

Once automotive data is ingested from a disk load station, autonomous driving system makers use multiple applications in the data analytics pipeline, all of which need access to this data. These include data enrichment and labeling software, software-in-the-loop (SiL) and hardware-in-the-loop (HiL) servers, deep learning toolsets and test frameworks, as shown in Figure 2.

Figure 2: Consolidating data pipelines in ADAS (Source: Dell EMC)

Consolidating this complex data pipeline with a data lake makes it easier to manage ADAS projects, with the benefits discussed in the previous section.

Precision medicine

Life sciences companies are relying on AI to help develop hyper-targeted treatments based on patients’ unique genetic traits.
To achieve this, organizations must collect and process patient genomic data, cohort data, patient electronic medical records, data gathered from patient Internet of Things (IoT) devices (such as fitness watches and heart rate monitors) and large sets of reference genomic data.

Figure 3: Combining diverse data sets to accelerate precision medicine (Source: Dell EMC)

Manufacturing

Modern manufacturers rely on data from hundreds to thousands of sensors spread across production lines to ensure that manufactured products, and the machines that build them, are up to standard. These sensors monitor vibration, moisture and humidity, temperature, metal purity and sound. They run 24/7 and can produce petabytes of data, depending on the manufacturing environment.

Figure 4: Consolidating data pipelines in manufacturing (Source: Dell EMC)

Multiple data analytics and AI applications analyze the data generated by these sensors and cross-correlate it with baseline data that characterizes normal operation. This helps manufacturers enable predictive maintenance, avoid machine downtime, diagnose defective components and optimize conditions on the manufacturing floor. These applications can include computer-aided engineering (CAE), test diagnostic software and hardware, and outlier detection frameworks driven by deep learning toolsets. Once again, consolidating the data analytics pipeline around a data lake simplifies the design and management of these complex analytical systems.

Conclusion

As the examples above illustrate, simplifying data analytics pipelines can significantly improve the productivity of data engineering teams, making it easier to manage projects and freeing up time to focus on use cases and data analytics applications.
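As a closing illustration, the sensor-versus-baseline comparison from the manufacturing example can be sketched in its simplest form. Production systems use far richer models (such as the deep-learning-driven outlier detection mentioned above); this minimal z-score check, with a hypothetical 3-sigma threshold and made-up readings, only shows the shape of the comparison.

```python
# Minimal sketch: flag sensor readings that deviate from baseline behavior.
# The z-score rule and 3-sigma threshold are illustrative assumptions, not
# the article's method; real deployments use learned models of "normal".
from statistics import mean, stdev

def find_outliers(baseline: list[float], readings: list[float],
                  threshold: float = 3.0) -> list[float]:
    """Return readings more than `threshold` standard deviations away
    from the baseline mean (the baseline describes normal operation)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [r for r in readings if abs(r - mu) > threshold * sigma]

# Hypothetical vibration data: six baseline samples from normal operation,
# then live readings in which one value clearly deviates.
vibration_baseline = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
suspect = find_outliers(vibration_baseline, [1.02, 0.98, 4.7])  # -> [4.7]
```

Whatever the actual model, the point stands: the baseline data and the live sensor streams must be accessible to the same applications, which is exactly what a consolidated data lake provides.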
A consolidated data lake streamlines data analytics pipeline development and management, enabling data engineering teams to rapidly prototype, test and launch analytics and AI projects without having to deal with migrating, securing and managing large volumes of data.

By increasing efficiency and optimizing resources, a data lake can put organizations on the path to a more productive AI environment. Dell EMC can help accelerate this journey with Dell EMC Isilon scale-out network-attached storage, which simultaneously supports traditional file storage and data management for AI data pipelines, making data analytics and AI an integral part of the IT environment. As enterprise artificial intelligence initiatives, such as autonomous driving, precision medicine and predictive maintenance, continue to produce massive and diverse data sets, a consolidated data lake can help organizations stay ahead of the game.

To learn more

For a deeper dive into building consolidated data lakes, explore Dell EMC’s Unstructured Data Analytics with Isilon.
For perspectives on the broader portfolio for AI, explore Dell Technologies AI Solutions and Dell EMC Ready Solutions for AI.

Sai Devulapalli is the global head of data analytics solutions and go-to-market in the Unstructured Data Solutions Unit at Dell EMC.