by Philip Kushmaro

5 reasons why data lakes are vital for startup analytics

Opinion
Oct 23, 2018
Analytics, Big Data, Digital Transformation

Compared to more mature companies, early-stage startups have drastically different analytics needs. Data lake infrastructure can make things easier on them.


You might not yet be familiar with the buzzword “data lake,” but if you’re at an early-stage startup, you probably soon will be.

Whereas data warehouses and data marts tend to force companies into narrow data paradigms and silos, data lakes emphasize a more holistic and expansive view of analytics. Data lakes deliver a more adaptive approach towards analyzing data, and stress the value of all information, instead of pre-screened bits and pieces.

The controversy in the big data industry surrounding data lakes tends to focus on their perceived drawbacks. Critics say they are too unstructured, too expansive, and too difficult to manage. Regardless, data lakes have key features that make them uniquely valuable, and despite their relative newness, they can be especially useful for startups.

That’s because a startup that discards much of the data it collects can end up with a narrower understanding of its market and miss key trends. The following five reasons show why startups should avoid locking themselves into rigid data management practices, and why data lakes represent a vital component of a startup’s analytics paradigm.

They scale as startups grow

Startups may start off with fewer data streams and smaller needs, but that quickly changes when they begin to grow. Data warehouses are highly structured and require extensive maintenance and constant monitoring by dedicated data engineers and architects. This work includes building the proper schemas for analysis, making changes to analytics models, and even building the right structures to store scrubbed data.

Companies like Meta Networks, for instance, which offers Network-as-a-Service tools for businesses, collect millions of data points per second, a figure that grows rapidly as new clients are onboarded. By building data lakes with Upsolver, which can rest on more easily scalable systems such as Amazon S3 object storage, the company has been able to collect all the data it needs without having to pre-build schemas and warehouse structures.
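To make "collecting data without pre-building schemas" concrete, here is a minimal sketch of the idea, often called schema-on-read. It is not Upsolver's or Meta Networks' implementation: a local temporary directory stands in for object storage, and the function names, the `dt=` partition layout, and the sample events are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def land_raw_event(lake_root: Path, source: str, day: str, event: dict) -> Path:
    """Append one event, unmodified, to a source/day partition as a JSON line."""
    partition = lake_root / source / f"dt={day}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return path


def read_events(lake_root: Path, source: str):
    """Yield every raw event for a source; any structure is applied by the caller."""
    for path in sorted((lake_root / source).glob("dt=*/events.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)


# A temporary directory stands in for the lake's storage layer (an assumption).
lake = Path(tempfile.mkdtemp())

# Events with different shapes land side by side; no schema was declared first.
land_raw_event(lake, "network", "2018-10-22", {"client": "a", "bytes": 1200})
land_raw_event(lake, "network", "2018-10-23", {"client": "b", "bytes": 800, "region": "eu"})
land_raw_event(lake, "network", "2018-10-23", {"client": "a", "latency_ms": 35})

# The "schema" is decided only at read time, by the question being asked.
events = list(read_events(lake, "network"))
total_bytes = sum(e.get("bytes", 0) for e in events)
```

The write path never inspects the event's fields, so a new client or a new field requires no change to storage. The cost moves to the reader, who has to tolerate missing fields, as the `e.get("bytes", 0)` call does here.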

They eliminate data silos

At a young company, quickly sharing data and performing a variety of cross-sectional analyses can supply insights and new, unexpected paths forward. However, many early-stage startups make the mistake of creating data silos for the sake of convenience. Once information is heavily partitioned, it becomes harder to communicate and transfer data.

On an enterprise level, PwC implemented a data lake system at UC Irvine Medical Center that significantly improved operations. Perhaps even more so than startups, medical organizations are prone to data silos, but PwC showed that a data lake can provide a more agile approach. The hospital has been able to provide better analytics, broader studies, and faster communication thanks to data that is not forced into schemas that partition it.
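As a rough illustration of the kind of cross-sectional analysis that silos make difficult, the sketch below combines records from two teams that sit in one shared store. The datasets, field names, and the "revenue at risk" measure are hypothetical and are not drawn from the PwC or UC Irvine work.

```python
from collections import defaultdict

# Hypothetical records from two teams, kept in one shared store instead of separate silos.
billing = [
    {"customer": "acme", "mrr": 500},
    {"customer": "globex", "mrr": 1200},
    {"customer": "initech", "mrr": 300},
]
support_tickets = [
    {"customer": "acme", "severity": "high"},
    {"customer": "acme", "severity": "low"},
    {"customer": "initech", "severity": "high"},
]

# Count open tickets per customer from the support data.
tickets_per_customer = defaultdict(int)
for ticket in support_tickets:
    tickets_per_customer[ticket["customer"]] += 1

# Cross-team question: how much monthly revenue comes from customers with open tickets?
revenue_at_risk = sum(
    row["mrr"] for row in billing if tickets_per_customer.get(row["customer"], 0) > 0
)
```

Neither team's data alone can answer the question; it only becomes a few lines of work once both sets of records are reachable from the same place.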

They reduce time wasted sorting and querying

Regardless of the data structure a startup chooses, it will have to dedicate some resources to managing and optimizing that structure. Usually, this means spending hours setting up dashboards, analytics algorithms, and data schemas, and then managing all of them on a consistent basis. That requires having someone on staff who is, if not fully dedicated to the task, constantly taking time away from other work to handle data warehousing.

Data lakes, which store raw data streams without imposing structure up front, require significantly less of this effort. Instead of demanding a full-time team member, which most startups simply cannot afford, data lakes let any team member perform their own analysis on an ad hoc basis without a complex scrubbing and structuring process beforehand. Most importantly, this approach can also reduce query times significantly.

They encompass all data

The point of big data is to have as much information as possible to parse and process, but most data warehouses operate counter to that paradigm. Data warehouses often filter out significant chunks of data that don’t fit predetermined structures, removing scores of data points that could contain key insights when viewed in a different light. One of the biggest sources of value data lakes provide is that their massive repositories of data come from various sources and offer unique ways to combine them. This context-free model is extraordinarily valuable when performing predictive analytics or simply hunting for interesting trends.
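The difference can be sketched in a few lines. In this simplified, hypothetical example, `WAREHOUSE_COLUMNS` plays the role of a predetermined warehouse structure and a plain list plays the role of the lake; real warehouse loading is considerably more involved, so treat this as an illustration of the trade-off and not as how any particular product behaves.

```python
from collections import Counter

# Hypothetical predetermined structure: only these fields are kept.
WAREHOUSE_COLUMNS = ("user_id", "amount")

raw_events = [
    {"user_id": 1, "amount": 20.0, "referrer": "newsletter"},
    {"user_id": 2, "amount": 35.0, "referrer": "search"},
    {"user_id": 3, "referrer": "newsletter"},  # no amount: does not fit the structure
]


def load_into_warehouse(events):
    """Keep only events that have every expected column, and only those columns."""
    rows = []
    for event in events:
        if all(column in event for column in WAREHOUSE_COLUMNS):
            rows.append({column: event[column] for column in WAREHOUSE_COLUMNS})
    return rows


warehouse_rows = load_into_warehouse(raw_events)  # filtered and trimmed
lake_records = list(raw_events)                   # everything, as it arrived

# A question nobody planned for: which referrer shows up most often?
referrers = Counter(e["referrer"] for e in lake_records if "referrer" in e)
top_referrer = referrers.most_common(1)[0]
```

The warehouse copy has lost both the third event and the `referrer` field, so the new question can only be answered from the raw records.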

EMC, one of the more popular data lake vendors, has seen its solution implemented successfully at healthcare organizations to improve predictive care and trend discovery. It is successful because it allows a much broader cross-section of data to be studied in different configurations. Unlike data warehouses, which force predetermined analytics models onto data, a full set of raw data empowers startups to perform their own analysis based on needs instead of technology.

They let startups get creative with analysis

Most importantly, perhaps, data lakes don’t lock companies into specific paradigms for analytics and insights. Data warehouses often have essential uses, but their applications are narrower due to their rigid structures. Because warehouses require careful planning of data flows and structures, startups must decide exactly how the data will be used even before they see it.

For a company that is still learning to understand its data and channels, building restrictive habits can ultimately prove detrimental to analyzing the bigger picture. Data lakes, on the other hand, offer the ability to set aside preconceptions about data, along with the opportunity to explore information in unique ways.

Lakes for the win

For startups, which often pride themselves on disruption and innovation, a holistic view of data and the ability to perform ad hoc analysis based on needs instead of restrictions is a crucial advantage.

Your startup simply can’t accurately predict a specific, finite list of metrics, information sources and use cases that will be most important over the life cycle of the organization. By favoring a data lake infrastructure, your company and its stakeholders can revisit these decisions and unlock new layers of value for years to come.