Creating a data strategy a decade ago was relatively easy compared to today. Back then, database experts debated the capabilities and performance of relational databases from Oracle, Microsoft and IBM, or whether to use open source databases like MySQL and PostgreSQL. A minority of enterprises explored NoSQL databases, including document stores, key-value databases and columnar databases, using technologies such as MarkLogic, MongoDB and Apache Cassandra. Organizations moving lots of data between enterprise systems invested in ETL (Extract, Transform and Load) platforms, and a small minority invested in data quality or master data management solutions.
Flash forward to today and CIOs recognize that data and information are the oil of the 21st century. Diverse data management options, dependable dataops practices, proactive data governance, advanced analytics, citizen data science programs and maturing machine learning capabilities are all required to deliver competitive and differentiating business capabilities.
Closing the gap between data strategy and execution
I attended the Strata Data Conference in New York last week to see where the new opportunities, trends and challenges lie for CIOs creating and executing comprehensive data strategies.
Those challenges became abundantly clear right from the opening keynote where Cloudera’s CMO Mick Hollison cited recently published research conducted with Harvard Business Review. A key finding in the research is that “sixty-nine percent say their organizations need a comprehensive data strategy in order to meet its strategic goals over the next three years, yet only thirty-five percent say their organizations’ analytics and data management capabilities are on course to meet those goals.”
That’s a sizable gap that illustrates the growing business expectations around data and analytics and the underlying implementation complexities. CIOs looking to close these gaps should consider the following five technical capabilities in their data strategies highlighted at the Strata Data Conference.
1. Manage data platforms on multiple clouds
According to the same survey, fifty-one percent plan to leverage multiple clouds as part of their data strategy, and only twelve percent have more than seventy-five percent of their data on public clouds. The strategy of consolidating data into centralized data warehouses or data lakes appears to be dated, and the new reality is that CIOs have to be able to manage, integrate and share data stored in multiple public and private clouds.
The good news is that platforms such as Cloudera Data Platform, SAP Data Hub and InfoWorks DataFoundry are designed to help data organizations manage, integrate and govern access to data repositories stored in different big data engines and on different clouds.
I spoke with InfoWorks CEO Buno Pati about working with data in a multi-cloud environment. He told me, “Establishing a robust and agile foundation for enterprise data operations and orchestration is central to the success of any modern enterprise data strategy. These systems must empower enterprises to launch new analytic use cases rapidly, minimize dependence on highly-specialized talent and seamlessly traverse hybrid and multi-cloud environments with a variety of execution engines and storage systems, e.g. Hadoop, Spark and cloud infrastructure.”
2. Mature capabilities on several big data platforms
CIOs could probably use a pocket dictionary to help define all the big data platforms that are growing in popularity. While Hadoop was the early winner in big data platforms, enterprises are investing in a mix of them today including Apache Spark, Apache Hive, Snowflake, multiple databases supported on AWS, Azure and Google Cloud Platform, and many others.
Using multiple big data platforms creates significant challenges for CIOs because competition for people with data and analytics skills is intense, and managing numerous platforms adds operational and security complexities.
While many enterprises are likely to consolidate to fewer data platforms as part of their strategy, they also must consider services, tools, partnerships and training to provide better support across several data platforms.
3. Invest in a data catalog
Since large enterprises are unlikely to be able to centralize data in one data warehouse or data lake, establishing a data catalog becomes even more strategically important.
Data catalogs help end-users search, identify and learn more about data repositories that they can use for analytics, machine learning experiments and application development. They also provide a central point to govern access policies, publish the status of data sources and enable collaboration between end-users and subject matter experts.
Cloudera, SAP and InfoWorks all have data catalog capabilities as part of their offerings.
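To make the catalog responsibilities described in this section more concrete, here is a minimal sketch in Python of an in-memory catalog that supports keyword search, a published status for each source and a simple role-based access check. Every name in it (`CatalogEntry`, `DataCatalog`, the sample sources and roles) is hypothetical and chosen for illustration; it does not represent the API of any vendor product mentioned in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Metadata about one data source (illustrative fields only)."""
    name: str
    location: str                 # which cloud or engine hosts the data
    description: str
    status: str = "active"        # published status of the source
    allowed_roles: Set[str] = field(default_factory=set)


class DataCatalog:
    """A toy catalog: register sources, search them, check access."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Registering the same name again overwrites the earlier entry.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        # Case-insensitive match against the name and description.
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        ]

    def can_access(self, name: str, role: str) -> bool:
        # Unknown sources are denied rather than raising an error.
        entry = self._entries.get(name)
        return entry is not None and role in entry.allowed_roles


# Hypothetical sources spread across two clouds.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="sales_orders",
    location="aws/warehouse",
    description="Daily sales orders by region",
    allowed_roles={"analyst", "data_scientist"},
))
catalog.register(CatalogEntry(
    name="web_clickstream",
    location="gcp/lake",
    description="Raw clickstream events",
    status="deprecated",
    allowed_roles={"data_scientist"},
))

for hit in catalog.search("sales"):
    print(hit.name, hit.location, hit.status)
```

A production catalog would add lineage, ownership and collaboration features, but even this small model shows why a single index over sources in different clouds is useful when the data itself cannot be centralized.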
4. Select the right data integration platform for the job
A decade ago, the debate was whether to invest in an ETL platform and, if so, which one. The question today is broader and more strategic because data integration now covers a wider range of use cases beyond the batch processing that ETL platforms support. Today many organizations have:
- Data streaming requirements for IoT and other real-time data processing implemented with platforms such as Apache Kafka, Apache Spark and event-driven architectures such as VANTIQ.
- Document and other unstructured data processing requirements implemented in the MarkLogic Data Hub Platform, in search platforms such as Apache Lucene and Apache Solr, or in document stores such as MongoDB.
- Data prep needs for data scientists and business analysts serviced with tools such as Tableau Prep, Alteryx Designer and Trifacta Wrangler.
- API integration with SaaS platforms and enterprise data sources streamlined with platforms such as Boomi and MuleSoft.
- Requirements to improve data quality and create master data sources performed with platforms from Informatica, Talend, IBM, Reltio, Tamr and others.
Unfortunately, there isn’t a one-size-fits-all platform that can support all of these use cases. Besides, data integrations can be implemented more efficiently and supported more reliably by selecting the right tool for the job. That likely means that enterprises looking to support broad data integration needs are going to need to procure and mature capabilities with several data integration platforms.
5. Establish proactive data governance with every new capability
While CIOs, CISOs and CDOs would prefer to establish data governance upfront, before exposing new business capabilities, it’s an unrealistic strategy. Businesses that need analytics to enable data-driven decision making and other competitive benefits must move fast and may resist attempts to make governance a prerequisite.
That’s a tough pill to swallow for executives chartered with protecting the organization’s data assets, privacy policies and confidential information.
However, it is possible for CIOs and CDOs to institute data governance in parallel with exposing new tools, capabilities and data sources. It requires investing in talent to understand the data governance capabilities of the platforms receiving investment and establishing procedures for introducing and managing changes to data sources.
Without these disciplines, CIOs will be introducing data debt, similar to the technical debt organizations have accumulated over time.
The good news is that CIOs will see data governance capabilities in mature data platforms that are targeting enterprises. However, having the technical ability is just the start and CIOs are going to need technical talent, training programs and change management practices to get business teams to understand and comply with data governance.
Becoming data-driven requires an ongoing commitment to excellence
I’m not a big fan of the “data is the new oil” analogy, but let’s stick with it for a moment. Oil companies don’t just buy drills and magically have an end-to-end mechanism to find oil reserves efficiently and ship the oil to refineries. It just isn’t that simple, and neither is data management, analytics or machine learning.
However, it’s also not daunting, provided that organizations responsibly invest in platforms that meet their use cases, invest in talent and mature their practices on data integration, management and governance.