Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.

Traditional data warehouses, for example, support datasets from multiple sources but require a consistent data structure. They’re comparatively expensive and can’t handle big data analytics. However, they do offer effective data management, organization, and integrity capabilities. As a result, users can easily find what they need, and organizations avoid the operational and cost burdens of storing unneeded or duplicate data copies.

Newer data lakes are highly scalable and can ingest structured and semi-structured data along with unstructured data like text, images, video, and audio. They conveniently store data in a flat architecture that can be queried in aggregate and offer the speed and lower cost required for big data analytics. On the other hand, they don’t support transactions or enforce data quality. If those in charge of managing the data lake don’t create precise processes and metadata for organizing data, the lake can quickly devolve into what’s come to be known as a “data swamp”: a data lake in which users struggle to locate data.

If only there were a best-of-both-worlds compromise.

Warehouse, data lake convergence

Meet the data lakehouse. It’s a modern repository that stores all structured, semi-structured, and unstructured data as a data lake does. However, it also supports the quality, performance, security, and governance strengths of a data warehouse. As such, the lakehouse is emerging as the only data architecture that supports business intelligence (BI), SQL analytics, real-time data applications, data science, AI, and machine learning (ML) all in a single converged platform.
The open lakehouse architecture implements data structures and management features similar to those in a warehouse directly on top of low-cost cloud storage in open formats, providing:

- Support for diverse data types, ranging from unstructured to structured data, big data workloads, analytics, and AI
- Consistency as multiple parties concurrently read or write data
- BI support directly on source data, reducing staleness, latency, and the operational cost of keeping two copies of data in both a data lake and a warehouse
- Open storage formats with APIs for a variety of tools and engines, including ML and Python/R libraries, which can access data directly
- End-to-end streaming to enable real-time reporting and eliminate the need for separate systems dedicated to serving real-time data applications
- Schema enforcement and evolution
- Robust governance and auditing mechanisms
- Decoupled storage and compute resources, so that each can be scaled independently

Challenges of supporting multiple repository types

It’s common to compensate for the respective shortcomings of existing repositories by running multiple systems, for example, a data lake, several data warehouses, and other purpose-built systems. However, this approach frequently creates a few headaches. Most notably, data stored in one repository type is often excluded from analytics run on another, which leads to less complete results.

In addition, having multiple systems requires expensive and operationally burdensome processes to move data from lake to warehouse when required. To overcome the data lake’s quality issues, for example, many organizations use extract/transform/load (ETL) processes to copy a small subset of data from lake to warehouse for important decision support and BI applications. This dual-system architecture requires continuous engineering to move data between the two platforms through ETL. Each ETL step risks introducing failures or bugs that reduce data quality.
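To make the lake-to-warehouse ETL step described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the record fields (`order_id`, `amount`, `region`), the `orders` table, and the helper names `transform` and `run_etl` are all hypothetical, and an in-memory SQLite database merely stands in for a real warehouse. The point is that the warehouse side enforces a fixed schema, so every record coming out of the lake has to be parsed, validated, and either loaded or rejected.

```python
import json
import sqlite3

# Hypothetical raw records as they might sit in a data lake: semi-structured,
# with no enforced schema. All field names are illustrative assumptions.
LAKE_RECORDS = [
    '{"order_id": 1, "amount": "19.99", "region": "EMEA"}',
    '{"order_id": 2, "amount": "not-a-number", "region": "APAC"}',
    '{"order_id": 3, "region": "AMER"}',
    '{"order_id": 4, "amount": "5.00", "region": "EMEA", "note": "gift"}',
]


def transform(raw):
    """Parse one lake record into a warehouse row, or return None if unusable."""
    record = json.loads(raw)
    try:
        # Fields not listed here (such as "note") are silently dropped.
        return (
            int(record["order_id"]),
            float(record["amount"]),
            str(record["region"]),
        )
    except (KeyError, ValueError, TypeError):
        return None


def run_etl(raw_records, conn):
    """Copy valid records into the warehouse table; return (loaded, rejected)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "amount REAL NOT NULL, "
        "region TEXT NOT NULL)"
    )
    loaded = rejected = 0
    for raw in raw_records:
        row = transform(raw)
        if row is None:
            rejected += 1
            continue
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
        loaded += 1
    conn.commit()
    return loaded, rejected


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # SQLite stands in for a warehouse
    print(run_etl(LAKE_RECORDS, warehouse))
```

With the sample data above, two records load and two are rejected, and the extra `note` field on the fourth record never reaches the warehouse. Even in a toy version, this shows how an ETL step becomes a place where data can be lost or distorted, which is the maintenance burden the dual-system approach carries.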
Beyond that, leading ML systems, such as TensorFlow, PyTorch, and XGBoost, don’t work well on data warehouses. As a result, data stored in warehouses can’t be part of the multistructured, aggregate dataset that yields the most comprehensive results. Many of the recent advances in AI/ML have been in improving models for processing unstructured data, which warehouses can’t handle. Unlike BI, which extracts a small amount of data and for which warehouses are optimized, ML systems process huge datasets using complex, non-SQL code.

On the data lake side, lack of data consistency makes it almost impossible to mix appends and reads, and batch and streaming jobs. As a result, many of the hoped-for data lake business outcomes haven’t materialized.

Pulling it all together

Data lakehouses are enabled by a new, open system design with the data structures and data management features of a warehouse, but implemented directly on the modern, low-cost storage platforms used for data lakes. Merging them into a single system means that data teams can move faster, as they can get to data without accessing multiple systems. Data lakehouses also ensure that teams have the most complete and up-to-date data available for data science, AI/ML, and business analytics projects.

Learn more at https://delltechnologies.com/analytics.

***

Intel® Technologies Move Analytics Forward

Data analytics is the key to unlocking the most value you can extract from data across your organization. To create a productive, cost-effective analytics strategy that gets results, you need high-performance hardware that’s optimized to work with the software you use.

Modern data analytics spans a range of technologies, from dedicated analytics platforms and databases to deep learning and artificial intelligence (AI). Just starting out with analytics? Ready to evolve your analytics strategy or improve your data quality? There’s always room to grow, and Intel is ready to help.
With a deep ecosystem of analytics technologies and partners, Intel accelerates the efforts of data scientists, analysts, and developers in every industry. Find out more about Intel advanced analytics.