by Mike Lamble

Modern Enterprise Data Warehouses: What’s Under the Hood?

Mar 09, 2015 | 5 mins
Analytics, Big Data, Data Warehousing

We’re entering into the Modern Enterprise Data Warehouse (EDW) age where the scope of an analytics-driven business is wider. Are you ready?

All across the country, companies seem to be collapsing limited-purpose data warehouses into a single integrated repository. In every industry, companies are moving up this path with a strikingly similar stride. From my view, here’s why so much of this is happening at once: to reduce the time-to-answer, reduce the cost-per-answer, achieve limitless scalability, integrate internal and external data, and deliver data as a service. Not bad stuff. We’re entering an age of the Modern Enterprise Data Warehouse (EDW) where the scope of analytics is wider and IT is delivering at enterprise scale, albeit by leveraging non-IT horsepower.

This modern era overlays decades of data management “growing-up” onto fit-for-purpose scalable technologies and more tightly couples enterprise information with data democratization, advanced analytics, and BI dashboards. Looking under the hood, these modern data warehouses share some common traits including:

Data Lakes are replacing ETL hubs and landing areas because enterprise-wide, Hadoop-based data lakes support better, cheaper, and faster schema-less landing and pre-processing of data at an atomic level. This ecosystem’s limitlessness makes it ideal for structured internal data aligned with semi-structured and unstructured Big Data from IoT and digital sources. So that the lake does not become a swamp, organizations are implementing data-factory-ish frameworks in an effort to deliver the data security, manageability, reliability, and cost-performance expected of these enterprise-class systems. Further, an emergent class of data-smart power users is fishing directly from the data lake, making an IT-enabled end-run around the EDW. This is good for everyone – the hands-on power users as well as the EDW users – but most importantly it’s good for the business.

Self-Service Analytics are being deployed in lieu of IT-supplied dashboards. Today’s data visualization and discovery tools empower businesses to roll their own. Their storyboarding capability, rapid-response development environments, and in-memory processing shorten the time-to-answer and lower the cost-per-answer.

Data Governance processes are part of the foundation. Having become data smart, companies now realize that the policies, processes, and standards around data are just as important as the tools. For example, since “customer churn” can be measured a dozen ways, a single version of the truth requires coming to terms on the meaning of metrics, data definitions, and system sources of truth.
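To make the “customer churn” point concrete, here is a minimal sketch of two plausible churn definitions applied to the same data. The `Customer` fields, the function names, and the sample figures are hypothetical and chosen only for illustration; they are not drawn from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    active_at_start: bool    # was a customer on the first day of the period
    cancelled: bool          # cancelled at some point during the period
    monthly_revenue: float   # revenue at the start of the period


def logo_churn(customers):
    """Count-based churn: share of starting customers who cancelled."""
    base = [c for c in customers if c.active_at_start]
    return sum(c.cancelled for c in base) / len(base)


def revenue_churn(customers):
    """Revenue-based churn: share of starting revenue lost to cancellations."""
    base = [c for c in customers if c.active_at_start]
    lost = sum(c.monthly_revenue for c in base if c.cancelled)
    return lost / sum(c.monthly_revenue for c in base)


# Hypothetical sample: one small customer out of four cancels.
customers = [
    Customer(1, True, True, 10.0),
    Customer(2, True, False, 10.0),
    Customer(3, True, False, 30.0),
    Customer(4, True, False, 50.0),
]

print(logo_churn(customers))     # 0.25
print(revenue_churn(customers))  # 0.1
```

Both numbers are a defensible “churn rate” for the same period, which is why a governed definition has to settle which one a dashboard reports.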

Pooled Compute Infrastructure is becoming the new normal, while dedicated boxes are starting to seem more than a little antiquated. Pooled compute resources take a variety of forms such as grid computing clusters, high performance specialized MPP platforms (e.g., Teradata), virtualized private clouds, and public clouds. The advantage of pooled platforms is that enterprises reduce equipment costs by more effectively managing utilization, and new projects can be supported without sub-projects to add compute infrastructure.
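The utilization argument can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes three hypothetical workloads whose peaks fall at different hours; the workload names and core counts are invented for the example, and real capacity planning would also account for headroom and failover.

```python
# Hypothetical hourly CPU demand (in cores) for three workloads over six hours.
demand = {
    "nightly_etl":    [40, 40, 5, 5, 5, 5],
    "bi_dashboards":  [5, 5, 30, 30, 10, 5],
    "model_training": [5, 5, 5, 10, 35, 35],
}


def dedicated_capacity(demand):
    """Each workload gets its own box, sized to that workload's own peak."""
    return sum(max(series) for series in demand.values())


def pooled_capacity(demand):
    """One shared pool, sized to the peak of the combined hourly demand."""
    return max(sum(hour) for hour in zip(*demand.values()))


def utilization(demand, capacity):
    """Average fraction of provisioned core-hours that is actually used."""
    hours = len(next(iter(demand.values())))
    used = sum(sum(series) for series in demand.values())
    return used / (capacity * hours)


dedicated = dedicated_capacity(demand)
pooled = pooled_capacity(demand)

print(dedicated, round(utilization(demand, dedicated), 2))  # 105 0.44
print(pooled, round(utilization(demand, pooled), 2))        # 50 0.93
```

Under these assumed numbers the pool needs fewer than half the cores because the peaks do not coincide; if every workload peaked in the same hour, the two capacities would be equal.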

Data Ecosystems Leveraging “Best Fit” Data Platforms are the rule of thumb. To affordably scale with emerging data volumes and varieties, modern EDWs employ multiple fit-to-purpose database management systems (DBMS) rather than a single-repository, one-size-fits-all strategy. For enterprise-class demands that need to support hundreds of online users and ETL processes, row-oriented databases, particularly MPP databases that scale linearly, continue to be the preferred solution. Hadoop solutions are being used for landing and staging of large data volumes to achieve cost and scalability advantages. Column-store databases are a tool of choice for many applications that require ultra-fast response with minimal compute resources, such as BI dashboards and many “write once/read many” workloads. NoSQL tools are being used for big data applications that use flexible schemas and constant-time retrieval methods.
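The difference between row-oriented and column-store layouts can be sketched in a few lines. This is a toy in-memory illustration with hypothetical order data, not how any particular database engine is implemented; production column stores add compression, indexing, and vectorized execution on top of the basic idea.

```python
# Row-oriented layout: each record is stored together, which suits
# lookups and updates that touch a whole record at a time.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def get_order(rows, order_id):
    """Row-style access: return one complete record by key."""
    return next(row for row in rows if row["order_id"] == order_id)


def total_amount(columns):
    """Column-style access: the aggregate reads only the 'amount' column."""
    return sum(columns["amount"])


# Column-oriented layout: each attribute is stored together.
columns = to_columns(rows)

print(get_order(rows, 2))     # {'order_id': 2, 'region': 'west', 'amount': 80.0}
print(total_amount(columns))  # 250.0
```

A dashboard query that sums one measure over many records touches a single list in the columnar layout, while the row layout has to walk every full record, which is the intuition behind using column stores for read-heavy workloads.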

Enterprise IT and “Shadow IT” are on the same team. As EDW costs and project backlogs have grown over the years, so too has the workforce of data professionals who work in the business (i.e., outside IT) and are unencumbered by the data modelers and ETL job streams that support hundreds or even thousands of users. Closer to the business is where we find many of today’s data Navy SEALs. In the past, these groups lived off in their own data ecosystems; in fact, it was often one “data pond” for every SAS developer. In the modern EDW, however, these data Navy SEALs have access to sandboxes that are provisioned from the data lakes and made accessible through self-service BI tools as well as statistical modeling tools.

It’s an exciting time to be working in our business. Data supply and demand are skyrocketing at the same time that our tools and techniques are coming of age. Interestingly, big data is not a new term. It re-emerges every 10 years or so when escalating data volumes make conventional technical solutions obsolete. With this new paradigm for the Modern Enterprise Data Warehouse, I won’t be surprised to see the term big data go back into remission in the next year or two.