Like a river tumbling down a mountain, the process of data analytics never stands still. It’s always on the move, evolving with new technologies and strategies for extracting business value from ever-larger volumes of data. Today is no exception, as we move into the era of the consolidated data lake that separates compute and storage to enable analytics applications across the enterprise.
More on that point in a moment, but first let’s consider the backstory: the evolution of big data and data warehousing. When I started in this work 20 years ago, businesses were challenged to gain intelligence from ever-growing datasets in as close to real time as possible. These efforts led the way for a groundbreaking architecture called Massively Parallel Processing (MPP), which shards data across multiple commodity servers with Direct Attached Storage (DAS). This approach, which keeps data close to processing power, moved organizations beyond the limitations of symmetric multiprocessing (SMP), where data is stored in centralized arrays and accessed over networks.
While the MPP approach worked well for years, today we are running up against the limitations of the architecture and the challenges brought by explosive data growth. One problem is that teams running data analytics must continually copy and move data to keep it close to the compute that processes queries. This has led to many cases where the same data lives in different forms in different silos. While the datasets are supposedly the same, they are extracted, transformed and loaded (ETL) and maintained differently across applications, which means your analytical window into the data depends on which application you are using. So which copy is the ultimate source of truth? That can be hard to determine.
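To make that "different analytical windows" problem concrete, here is a minimal, hypothetical sketch: two applications apply their own ETL rules to the same raw records and arrive at different answers to the same business question. The dataset and cleaning rules are invented for illustration.

```python
# Hypothetical example: two applications keep their own ETL copies of the
# same raw sales records, each with its own cleaning rules.
raw_sales = [
    {"region": "east", "amount": 100},
    {"region": "EAST", "amount": 250},
    {"region": "east", "amount": -40},  # a refund record
]

# Application A normalizes region names but drops negative amounts.
app_a = [
    {"region": r["region"].lower(), "amount": r["amount"]}
    for r in raw_sales
    if r["amount"] > 0
]

# Application B keeps refunds but never normalized the region field,
# so "EAST" and "east" look like two different regions.
app_b = [r for r in raw_sales if r["region"] == "east"]

total_a = sum(r["amount"] for r in app_a)  # 350
total_b = sum(r["amount"] for r in app_b)  # 60

# The "same" dataset yields two different totals for east-region sales.
print(total_a, total_b)
```

Neither copy is wrong by its own rules, yet the two silos disagree, which is exactly why the "source of truth" question becomes hard to answer.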
The challenges with legacy approaches don’t stop there. From an operational point of view, data is backed up and secured in different ways. And then there is the problem of the cost to manage multiple copies of data, secure the data and maintain associated service level agreements. All of these challenges point to the need for a new approach to data storage and processing, one that separates storage and compute. This new way of doing business consolidates data into a centralized data lake that makes the data available as a shared resource accessed by many applications via a common file system.
This is the approach taken with the Dell EMC OneFS operating system. It enables organizations to consolidate diverse types of data into a scalable pool of storage with a global namespace. This includes unstructured and semi-structured data from IoT and edge devices, individual devices and social media channels, as well as neatly structured data from enterprise databases. It can also include data that is held in Hadoop environments and accessed via the Hadoop Distributed File System (HDFS), which many organizations now use for data analytics.
A consolidated data lake enables IT organizations to eliminate data silos, create a single accessible data source for all applications, and avoid the constant movement and replication of datasets among different silos. Simply load the data files into the data lake and point your applications to query it directly. Just like that, data scientists and other users across the enterprise have access to the same data to power their analytics queries, train machine learning models and neural networks, and, ultimately, develop artificial intelligence applications that turn all that data into business value. And all the while, there is just one version of the truth when it comes to the data, because it all lives in one place.
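The pattern itself is simple enough to sketch. In the toy example below, a local temporary directory stands in for the data lake (in practice this would be a shared OneFS/NFS mount or an HDFS path); the point is only that every "application" reads the one canonical copy instead of maintaining its own.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a consolidated data lake: in practice this would be a
# shared OneFS/NFS mount or an HDFS path, not a local temp directory.
lake = Path(tempfile.mkdtemp(prefix="datalake_"))

# Load the data into the lake once.
dataset = lake / "sales.json"
dataset.write_text(json.dumps([{"region": "east", "amount": 100},
                               {"region": "west", "amount": 250}]))

def bi_dashboard_total(path: Path) -> int:
    """One 'application': a reporting query over the shared dataset."""
    return sum(row["amount"] for row in json.loads(path.read_text()))

def ml_training_rows(path: Path) -> list:
    """Another 'application': a training job reading the same file."""
    return json.loads(path.read_text())

# Both applications point at the same canonical copy: one version of truth.
print(bi_dashboard_total(dataset))     # 350
print(len(ml_training_rows(dataset)))  # 2
```

Because both readers resolve to the same file, there is nothing to reconcile: fix the data once in the lake and every downstream consumer sees the correction.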
Even better, with a new architecture that separates storage from compute, organizations can control compute and storage ratios to achieve better cost effectiveness, with the ability to leverage best-in-class elastic cloud storage. They can also maintain open data formats without the lock-in associated with proprietary data management solutions.
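A back-of-the-envelope calculation shows where the ratio control pays off. The unit costs below are invented purely for illustration (real numbers depend entirely on your hardware and vendor); the structural point is that a coupled architecture forces you to buy compute every time you need storage.

```python
# Illustrative arithmetic only: the unit costs are made up for this sketch.
COMPUTE_NODE_COST = 10_000  # hypothetical cost per compute node
STORAGE_TB_COST = 100       # hypothetical cost per TB of shared storage

def coupled_cost(nodes: int, tb_per_node: int) -> int:
    """Coupled MPP/DAS: adding storage means adding whole nodes,
    paying for compute whether you need it or not."""
    return nodes * (COMPUTE_NODE_COST + tb_per_node * STORAGE_TB_COST)

def decoupled_cost(compute_nodes: int, storage_tb: int) -> int:
    """Separated compute and storage: scale each dimension independently."""
    return compute_nodes * COMPUTE_NODE_COST + storage_tb * STORAGE_TB_COST

# Workload: 500 TB of data, but only 5 nodes' worth of query compute.
# Coupled: 50 nodes at 10 TB each just to hold the data.
print(coupled_cost(50, 10))    # 550000
print(decoupled_cost(5, 500))  # 100000
```

Under these assumed prices, the storage-heavy workload costs more than five times as much in the coupled model, because 45 of the 50 nodes exist only to hold disks.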
Our approaches to data analytics have been evolving for decades. Today, this evolution continues as we move into the era of the consolidated data lake that enables the separation of storage and compute. Lots of great benefits fall out of this new architecture. Organizations can eliminate data silos and all the problems that come with them while making data readily available to users across the enterprise.
At a higher level, a centralized storage environment gives users a 360-degree view of the data held across the enterprise, a view that spans business units and operations. They are no longer bound by where the compute sits or the hurdles they need to jump through to access different datasets. With this new view, people can ask bigger questions and gain deeper insights into the business.
To learn more
For a closer look at the capabilities that enable the consolidated data lake, visit Dell EMC Isilon OneFS Operating System.
Learn from and engage with leading players in the industry who are pushing the boundaries in AI and data analytics. Join us at AI & Data Analytics Re-Imagined.