For decades, I have seen corporate data strategy swing back and forth like a digital pendulum. Centralize – decentralize. Consolidate – federate. Inmon – Kimball. ERP – EUC (end user computing/desktop). Master data management – analytical sandboxes. Data center – cloud. Each swing is a multi-year, multi-million-dollar migration, after which the limits of the new approach often drive the company to reverse direction again.
The forces driving the pendulum are two very real, and seemingly conflicting, business needs. On one hand, businesses need data agility to respond to rapidly evolving opportunities and threats. On the other, large enterprises need data at scale: secure, consistent, high-quality information delivered to automated systems and business teams. These needs have traditionally been framed as a choice: agility or scale, pick one. This is because the traditional tools and methodologies for delivering agility and scale are very different.
Agility was typically achieved through federated, self-service, end-user-driven data environments, such as an analytical sandbox. Departments had their own data assets and their own data and application development tools, and could quickly develop analytical models, iterate through ad hoc queries, and gain rapid insights. Business units loved the independence to pursue their own priorities without the overhead of corporate data standards and other bureaucracy. Until they needed to scale. Then they realized they were on a data island, and building a bridge to the mainland would be a long, expensive migration project.
Scale was the result of careful analysis, engineering, standardization, and conversion, such as an enterprise data warehouse. The resulting system could be automated, secure, and trusted, but it was hardly agile. These systems often took years to build, making them obsolete on day one. And new data requirements would take months to integrate correctly into an elaborate data supply chain involving many different products. Ironically, the latency in delivering data at scale undermined its very goal of security and consistency. Because it took so long to deliver, business units worked around these centralized systems, creating a complex patchwork of undocumented data flows and logic.
The dichotomy seems to persist in the era of big data. Gartner cites that 90% of Hadoop data lakes never make it to production scale because they were designed as analytical sandboxes for a small number of data scientists. Lakes that do make it into production often take an elite team of programmers years to build, and that small group of specialists remains a bottleneck to agility. Self-service data wrangling and preparation tools democratize data access but continue the spread of inconsistent, department-level logic.
However, I have found that big data technology coupled with a new paradigm, the data marketplace, makes the “Agility or Scale?” question a false choice. A successful marketplace has four pillars that provide agility and scale:
A consolidated catalog of all data
Agility requires one place for users to find and understand data, regardless of where it lives. By all, I mean data of all formats, quality levels, and stages of curation, from raw to ready. This type of catalog can be created quickly (in days) and supports all data consumers, from expert data scientists to general business analysts.
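To make the idea concrete, here is a minimal sketch of such a catalog: entries carry the data's location, format, and curation level as metadata, so users can search one place for data that may live anywhere. All class and field names here are hypothetical, not a reference to any specific product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One record in the consolidated catalog: data of any format or curation level."""
    name: str
    location: str   # where the data actually lives (lake, warehouse, SaaS app, ...)
    format: str     # e.g. "json", "parquet", "table"
    curation: str   # anywhere on the raw-to-ready spectrum
    tags: set = field(default_factory=set)

class Catalog:
    """A single place to find and understand data, regardless of where it lives."""
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry):
        self._entries.append(entry)

    def search(self, keyword: str):
        kw = keyword.lower()
        return [e for e in self._entries
                if kw in e.name.lower() or kw in {t.lower() for t in e.tags}]

# Raw and curated assets sit side by side in the same catalog.
catalog = Catalog()
catalog.register(CatalogEntry("sales_raw_events", "s3://lake/sales/", "json", "raw", {"sales"}))
catalog.register(CatalogEntry("sales_monthly", "warehouse.sales.monthly", "table", "ready", {"sales", "finance"}))
```

The point of the sketch is that discovery is decoupled from storage: a search for "sales" surfaces both the raw event feed in the lake and the curated warehouse table, and the curation field lets each consumer pick the level of readiness they need.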
On-demand access to business-ready data
The catalog is closely coupled with the data assets themselves, providing users with self-service data provisioning. The definition of business-ready differs for each user community; the marketplace enforces policies for appropriate access through the catalog.
Reuse & collaboration
The speed and simplicity of finding data in the catalog drive reuse of the best data sources. As users develop new data sets, the marketplace exposes them to interested user communities (such as a department or project team) for collaboration, improvement, and reuse. These can be curated to become enterprise assets and democratized through the marketplace.
Enterprise class data management
For the marketplace to scale, enterprise policies, such as data protection and appropriate use, need to be in place at all times. The catalog is central to this, using metadata to capture, enforce, and monitor data policies and usage from the time data enters the marketplace.
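The second and fourth pillars can be sketched together: the catalog's metadata carries an access policy for each data set, every provisioning request is checked against it, and every decision is logged for monitoring. Again, all names below are hypothetical, shown only to illustrate metadata-driven enforcement.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """Enterprise policy stored as catalog metadata for one data set."""
    allowed_communities: frozenset
    pii: bool = False  # flags data needing extra protection

class Marketplace:
    """Catalog-mediated provisioning: every request is checked and logged."""
    def __init__(self):
        self._policies = {}
        self.audit_log = []  # usage monitoring from the moment data enters

    def register(self, dataset: str, policy: Policy):
        self._policies[dataset] = policy

    def provision(self, dataset: str, user: str, community: str) -> bool:
        policy = self._policies.get(dataset)
        granted = policy is not None and community in policy.allowed_communities
        self.audit_log.append((user, dataset, granted))
        return granted

mp = Marketplace()
mp.register("customer_360", Policy(frozenset({"marketing", "analytics"}), pii=True))
```

Because access decisions flow through the catalog rather than through ad hoc copies, the same self-service request that gives the marketing community on-demand data also produces the audit trail the enterprise needs, which is what lets agility and scale coexist.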
Over the past three years I have seen companies achieve dramatic gains in agility and scale with a data marketplace. One transformed a data swamp into a consolidated, self-service platform that today provisions trusted, secure data in minutes. Another reduced the time for advanced business analytics from months to days through collaboration and reuse. Moreover, the marketplace has become a flywheel that continues to accelerate end-user agility as adoption scales across these companies.