For enterprises looking to wrest the most value from their data, especially in real time, the "data lakehouse" concept is starting to catch on.

The idea behind the data lakehouse is to merge the best of what data lakes and data warehouses have to offer, says Gartner analyst Adam Ronthal.

Data warehouses, for their part, enable companies to store large amounts of structured data with well-defined schemas. They are designed to support a large number of simultaneous queries and to deliver the results quickly to many simultaneous users.

Data lakes, on the other hand, enable companies to collect raw, unstructured data in many formats for data analysts to hunt through. These vast pools of data have grown in prominence of late thanks to the flexibility they give enterprises to store vast streams of data without first having to define a purpose for doing so.

The market for these two types of big data repositories is "converging in the middle, at the lakehouse concept," Ronthal says, with established data warehouse vendors adding the ability to manage unstructured data, and data lake vendors adding structure to their offerings.

For example, on AWS, enterprises can now pair Amazon Redshift, a data warehouse, with Amazon Redshift Spectrum, which enables Redshift to reach into data lakes built on Amazon S3. Meanwhile, data warehouse vendor Snowflake can now support unstructured data through external tables, Ronthal says.

When companies have separate lakes and warehouses, and data needs to move from one to the other, the transfer introduces latency and costs time and money, Ronthal adds.
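The convergence Ronthal describes comes down to when a schema is applied: a warehouse enforces structure at write time, while a lake stores raw payloads and applies structure only at read time. A minimal Python sketch of the two ingestion styles, using hypothetical event records and field names for illustration:

```python
import json

# Hypothetical schema a warehouse would enforce at write time.
SCHEMA = {"event": str, "user_id": int}

def warehouse_ingest(table, record):
    """Schema-on-write: reject records that don't match the schema."""
    if set(record) != set(SCHEMA) or not all(
        isinstance(record[k], t) for k, t in SCHEMA.items()
    ):
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)

def lake_ingest(store, raw_bytes):
    """Schema-on-read: store the raw payload as-is; no schema required."""
    store.append(raw_bytes)

def lake_query(store, field):
    """Structure is applied only when the data is read."""
    for raw in store:
        record = json.loads(raw)
        if field in record:  # tolerate records that lack the field
            yield record[field]

table = []
warehouse_ingest(table, {"event": "open", "user_id": 1})  # conforming record accepted

lake = []
lake_ingest(lake, b'{"event": "open", "user_id": 1}')
lake_ingest(lake, b'{"event": "close", "platform": "pc"}')  # new shape, still accepted
print(list(lake_query(lake, "event")))  # ['open', 'close']
```

A lakehouse aims to offer both behaviors over one copy of the data, rather than forcing a copy from the lake-style store into the warehouse-style store before querying.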
Combining the two in one platform reduces effort and data movement, thereby accelerating the pace of uncovering data insights.

And, depending on the platform, a data lakehouse can also offer other features, such as support for data streaming, machine learning, and collaboration, giving enterprises additional tools for making the most of their data.

Here is a look at the benefits of data lakehouses and how several leading organizations are making good on their promise as part of their analytics strategies.

Enhancing the video game experience

Sega Europe's use of data repositories in support of its video games has evolved considerably in the past several years.

In 2016, the company began using the Amazon Redshift data warehouse to collect event data from its Football Manager video game. At first this event data consisted simply of players opening and closing games. The company had two staff members looking into this data, which streamed into Redshift at a rate of ten events per second.

"But there was so much more data we could be collecting," says Felix Baker, the company's head of data services. "Like what teams people were managing, or how much money they were spending."

By 2017, Sega Europe was collecting 800 events per second, with five staff working on the platform. By 2020, the company's system was capturing 7,000 events per second from a portfolio of 30 Sega games, with 25 staff involved.

At that point, the system was starting to hit its limits, Baker says.
Because of the data structures required for inclusion in the data warehouse, data arrived in batches, and analyzing it took half an hour to an hour, he says.

"We wanted to analyze the data in real time," he adds, but this functionality wasn't available in Redshift at the time.

After running proofs of concept on three platforms (Redshift, Snowflake, and Databricks), Sega Europe settled on Databricks, one of the pioneers of the data lakehouse industry.

"Databricks offered an out-of-the-box managed services solution that did what we needed without us having to develop anything," he says. That included not just real-time streaming but machine learning and collaborative workspaces.

In addition, the data lakehouse architecture enabled Sega Europe to ingest unstructured data as well, such as social media feeds.

"With Redshift, we had to concentrate on schema design," Baker says. "Every table had to have a set structure before we could start ingesting data. That made it clunky in many ways. With the data lakehouse, it's been easier."

Sega Europe's Databricks platform went into production in the summer of 2020. Two or three consultants from Databricks worked alongside six or seven people from Sega Europe to get the streaming solution up and running, matching what the company previously had in place with Redshift. The new lakehouse is built in three layers, the base layer of which is a single large table into which everything gets dumped.

"If developers create new events, they don't have to tell us to expect new fields; they can literally send us everything," Baker says.
"And we can then build jobs on top of that layer and stream out the data we acquired."

The transition to Databricks, which is built on top of Apache Spark, was smooth for Sega Europe, thanks to prior experience with the open-source engine for large-scale data processing.

"Within our team, we had quite a bit of expertise already with Apache Spark," Baker says. "That meant that we could set up streams very quickly based on the skills we already had."

Today, the company processes 25,000 events per second, with more than 30 data staffers and 100 game titles in the system. Instead of taking 30 minutes to an hour to process, the data is ready within a minute.

"The volume of data collected has grown exponentially," Baker says. In fact, after the pandemic hit, usage of some games doubled.

The new platform has also opened up new possibilities. For example, Sega Europe's partnership with Twitch, a streaming platform where people watch other people play video games, has been enhanced to include a data stream for its Humankind game, so that viewers can get a player's history, including the levels they completed, the battles they won, and the civilizations they conquered.

"The overlay on Twitch is updating as they play the game," Baker says. "That is a use case that we wouldn't have been able to achieve before Databricks."

The company has also begun leveraging the lakehouse's machine learning capabilities. For example, Sega Europe data scientists have designed models to figure out why players stop playing games and to make suggestions for how to increase retention.

"The speed at which these models can be built has been amazing, really," Baker says.
"They're just cranking out these models, it seems, every couple of weeks."

The business benefits of data lakehouses

The flexibility and catch-all nature of data lakehouses are fast proving attractive to organizations looking to capitalize on their data assets, especially as part of digital initiatives that hinge on quick access to a wide array of data.

"The primary value driver is the cost efficiencies enabled by providing a source for all of an organization's structured and unstructured data," says Steven Karan, vice president and head of insights and data at consulting company Capgemini Canada, which has helped implement data lakehouses at leading organizations in financial services, telecom, and retail.

Moreover, data lakehouses store data in such a way that it is readily available to a wide array of technologies, from traditional business intelligence and reporting systems to machine learning and artificial intelligence, Karan adds. "Other benefits include reduced data redundancy, simplified IT operations, a simplified data schema to manage, and easier-to-enable data governance."

One particularly valuable use case for data lakehouses is helping companies get value from data previously trapped in legacy or siloed systems. For example, one Capgemini enterprise customer, which had grown through acquisitions over a decade, couldn't access valuable data related to resellers of its products.

"By migrating the siloed data from legacy data warehouses into a centralized data lakehouse, the client was able to understand at an enterprise level which of their reseller partners were most effective, and how changes such as referral programs and structures drove revenue," he says.

Putting data into a single data lakehouse makes it easier to manage, says Meera Viswanathan, senior product manager at Fivetran, a data pipeline company.
Companies that have traditionally used both data lakes and data warehouses often have separate teams to manage them, making it confusing for the business units that need to consume the data, she says.

In addition to Databricks, Amazon Redshift Spectrum, and Snowflake, other vendors in the data lakehouse space include Microsoft, with its Azure Synapse platform, and Google, with BigLake on Google Cloud Platform, as well as data lakehouse platform Starburst.

Accelerating data processing for better health outcomes

One company capitalizing on these and other benefits of data lakehouses is life sciences analytics and services company IQVIA.

Before the pandemic, pharmaceutical companies running drug trials used to send employees to hospitals and other sites to collect data about things such as adverse effects, says Wendy Morahan, senior director of clinical data analytics at IQVIA. "That is how they make sure the patient is safe."

Once the pandemic hit and sites were locked down, however, pharmaceutical companies had to scramble to figure out how to get the data they needed, and to get it in a way that was compliant with regulations and fast enough to let them spot potential problems as quickly as possible.

Moreover, with the rise of wearable devices in healthcare, "you're now collecting hundreds of thousands of data points," Morahan adds.

IQVIA has been building technology to collect and monitor such data for the past 20 years, says her colleague Suhas Joshi, also a senior director of clinical data analytics at the company. About four years ago, the company began using data lakehouses for this purpose, including Databricks and the data lakehouse functionality now available in Snowflake.

"With Snowflake and Databricks you have the ability to store the raw data, in any format," Joshi says. "We get a lot of images and audio. We get all this data and use it for monitoring.
In the past, it would have involved manual steps, going to different systems. It would have taken time and effort. Today, we're able to do it all in one single platform."

The data collection process is also faster, he says. In the past, the company would have to write code to acquire data. Now, the data can even be analyzed without having to be processed first to fit a database format.

Take the example of a patient in a drug trial who gets a lab result that shows she's pregnant, but the pregnancy form wasn't filled out properly, and the drug is harmful during pregnancy. Or a patient who has an adverse event and needs blood pressure medication, but the medication was not prescribed. Not catching these problems quickly can have drastic consequences. "You might be risking a patient's safety," says Joshi.
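A monitoring rule of the kind Joshi describes, catching a positive lab result that contradicts an incomplete case report form, amounts to a cross-check over raw records that all live on one platform. A minimal Python sketch of such a rule; the record shapes, field names, and check itself are hypothetical illustrations, not IQVIA's implementation:

```python
# Hypothetical raw records as they might land in a lakehouse, unprocessed.
lab_results = [
    {"patient_id": "P-001", "test": "pregnancy", "result": "positive"},
    {"patient_id": "P-002", "test": "pregnancy", "result": "negative"},
]
case_report_forms = [
    {"patient_id": "P-001", "form": "pregnancy", "completed": False},
    {"patient_id": "P-002", "form": "pregnancy", "completed": True},
]

def flag_discrepancies(labs, forms):
    """Flag patients with a positive pregnancy test but no completed form."""
    completed = {
        f["patient_id"]
        for f in forms
        if f["form"] == "pregnancy" and f["completed"]
    }
    return [
        lab["patient_id"]
        for lab in labs
        if lab["test"] == "pregnancy"
        and lab["result"] == "positive"
        and lab["patient_id"] not in completed
    ]

print(flag_discrepancies(lab_results, case_report_forms))  # ['P-001']
```

Because both record sets sit in the same store, a check like this can run continuously as data arrives, rather than waiting for a batch export from one system to another.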