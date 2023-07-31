Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short as a foundation for big data analytics powered by AI.

Traditional data warehouses, for example, support datasets from multiple sources but require a consistent data structure. They’re comparatively expensive and can’t handle big data analytics. However, they do provide effective data management, organization, and integrity capabilities. As a result, users can easily find what they need, and organizations avoid the operational and cost burdens of storing unneeded or duplicate data copies.

Newer data lakes are highly scalable and can ingest structured and semi-structured data along with unstructured data like text, images, video, and audio. They conveniently store data in a flat architecture that can be queried in aggregate and offer the speed and lower cost required for big data analytics. On the other hand, they don’t support transactions or enforce data quality. If those in charge of managing the data lake don’t create precise processes and metadata for organizing data, the lake can quickly devolve into what’s come to be known as a “data swamp,” a data lake that makes it hard for users to locate data.

If only there were a best-of-both-worlds compromise.

Warehouse, data lake convergence

Meet the data lakehouse. It’s a modern repository that stores all structured, semi-structured, and unstructured data as a data lake does. However, it also supports the quality, performance, security, and governance strengths of a data warehouse.
As such, the lakehouse is emerging as the only data architecture that supports business intelligence (BI), SQL analytics, real-time data applications, data science, AI, and machine learning (ML) all in a single converged platform.

The open lakehouse architecture implements data structures and management features similar to those in a warehouse directly on top of low-cost cloud storage in open formats.

Challenges of supporting multiple repository types

It’s common to compensate for the respective shortcomings of existing repositories by running multiple systems, for example, a data lake, several data warehouses, and other purpose-built systems. However, this approach frequently creates a few headaches. Most notably, data stored in one repository type is often excluded from analytics run on another, which makes the results less complete.

In addition, having multiple systems requires expensive and operationally burdensome processes to move data from lake to warehouse when required. To overcome the data lake’s quality issues, for example, many organizations use extract/transform/load (ETL) processes to copy a small subset of data from lake to warehouse for important decision support and BI applications. This dual-system architecture requires continuous engineering to ETL data between the two platforms. Each ETL step risks introducing failures or bugs that reduce data quality.

Furthermore, leading ML systems, such as TensorFlow, PyTorch, and XGBoost, don’t work well on data warehouses. Data stored in warehouses, then, can’t be part of the multistructured, aggregate dataset that yields the most comprehensive results. Many of the recent advances in AI/ML have been in improving models for processing unstructured data, which warehouses can’t handle.
Unlike BI, which extracts a small amount of data and for which warehouses are optimized, ML systems process huge datasets using complex, non-SQL code.

On the data lake side, lack of data consistency makes it almost impossible to mix appends and reads, and batch and streaming jobs. As a result, many of the hoped-for business outcomes of data lakes haven’t materialized.

Pulling it all together

Data lakehouses are enabled by a new, open system design with the data structures and data management features of a warehouse, implemented directly on the modern, low-cost storage platforms used for data lakes. Merging the two into a single system means that data teams can move faster, because they can get to data without accessing multiple systems. Data lakehouses also ensure that teams have the most complete and up-to-date data available for data science, AI/ML, and business analytics projects.
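
To make the idea of warehouse-style management on top of flat, low-cost storage more concrete, here is a minimal sketch in Python. It is a toy illustration only, not how any particular lakehouse product works: the `TinyLakehouseTable` class, the `SCHEMA` dictionary, and the `_commit_log.json` file name are all hypothetical, a local directory stands in for cloud object storage, and a single writer is assumed. The sketch shows two of the features data lakes lack: a schema check that rejects bad data on write, and a commit log that makes each append visible to readers all at once or not at all.

```python
import json
import os
import tempfile
import uuid

# Hypothetical schema for an illustrative table: column name -> required Python type.
SCHEMA = {"user_id": int, "action": str}


class TinyLakehouseTable:
    """Toy table: data files in a flat directory plus a commit log of visible files."""

    def __init__(self, root):
        self.root = root
        self.log_path = os.path.join(root, "_commit_log.json")
        os.makedirs(root, exist_ok=True)

    def _committed_files(self):
        # Readers trust only the commit log, never a raw directory listing.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as f:
            return json.load(f)

    def append(self, rows):
        # Quality check: reject the whole batch if any row breaks the schema.
        for row in rows:
            if set(row) != set(SCHEMA):
                raise ValueError(f"unexpected columns in {row!r}")
            for column, expected in SCHEMA.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"{column!r} must be {expected.__name__}")

        # Write the data file first; it stays invisible until the log lists it.
        data_name = f"part-{uuid.uuid4().hex}.jsonl"
        with open(os.path.join(self.root, data_name), "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

        # Commit by swapping in a new log file; os.replace is atomic on POSIX.
        # Assumes a single writer: concurrent writers could overwrite each other.
        new_log = self._committed_files() + [data_name]
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(new_log, f)
        os.replace(tmp_path, self.log_path)

    def read(self):
        # Return rows from committed files only, in commit order.
        rows = []
        for name in self._committed_files():
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.loads(line) for line in f)
        return rows
```

In this sketch, a call such as `table.append([{"user_id": "2", "action": "login"}])` raises `ValueError` because `user_id` is not an integer, and nothing from that batch ever becomes readable. Because readers consult only the commit log, a read that runs while an append is in progress sees either the table before the append or the table after it, which is the kind of consistency that lets appends and reads be mixed.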