by Myles F. Suer

The dos and don’ts of data lakes

Opinion
Jul 05, 2017
Analytics, Business Intelligence, Data Science

What's needed to ensure CIOs establish valued data lakes...instead of unvalued data swamps.

In “3 keys to keep your data lake from becoming a data swamp,” Thor Olavsrud provides CDOs, CIOs, and other business intelligence leaders with relevant guidance on preventing the so-called ‘data swamp’. His advice includes:

  1. Collect less data to start with
  2. Adopt a machine learning strategy
  3. Determine the business issue you’re trying to address

These are great suggestions — in particular, having a business mission for your data lake.

Why does this matter?

Clearly, delivering a data lake rather than a data swamp is important because insight-driven companies make more money and develop more sustainable barriers to entry. Forrester Research estimates that “Insights-driven businesses will steal $1.2 trillion a year by 2020” (“Insights-Driven Business”, Forrester Research, July 27, 2016).

Insight-driven firms accomplish this by experimenting and continuously learning. These firms are adding data lakes, often at the same time as they put CDOs in place. They understand in particular that advanced data capabilities are needed to implement a successful digital transformation (“Data Centric Businesses need a Data Centric Leader”, Forrester Research, April 26, 2016). Successful CDOs clearly need to lean on their CIOs to succeed at creating a valued data lake.

To further understand how to avoid a data swamp, I interviewed 12 leading-edge CIOs; specifically, I asked them for recommendations regarding the dos and don’ts of data lakes.

CIOs see real opportunity to create new business value from data lakes and self-service business intelligence. They believe these trends matter because they increase the availability and transparency of data and enable business users to get answers without involving their IT colleagues. Most CIOs connect self-service business intelligence to the data lake, which means effective data lakes are not just about data storage, but also about citizen and professional data scientists’ self-exploration of the information contained within them.

Some CIOs even see self-service business intelligence options broadening the community served by IT:

“Self Service BI allows our broader community to be engaged in relevant and up-to-date analytics-based decision making,” says the CIO at Binghamton University.

While some see the valuable impact of taking this step when data lakes are used for everyday decisions, Joanna Young, former CIO of Michigan State University, suggests that CIOs shouldn’t “use new tools to pave old reporting cow paths.” Several CIOs, however, suggest self-service business intelligence and data lakes offer better business and IT alignment all by themselves.

Taking these comments on board, I would like to suggest five things that will keep you out of the murk of a data swamp:

1. Make it purposeful

CIOs suggest that they have learned from the first wave of business intelligence that value is generated only when the right questions are asked. They are candid that even though the tools make data highly available, asking the right questions is still a challenging process for most organizations. Some worry that if data lakes don’t move from the experimentation phase to generating business value, CEOs and CFOs will start complaining and heads may roll. For this reason, it came as no surprise when one CIO said that a data lake with no business goals or purposes is just taking up space.

CIOs say that, even though it is not easy, IT shouldn’t always say yes to a data lake project. This sentiment was also found in The Big Data Payoff (Capgemini IDG, 2016), where interviews with 210 business executives showed that those who excel with big data use it to achieve strategic business objectives.

This echoes Tom Davenport, who said, “Even the most analytically oriented company needs to target its analytical efforts where they will do the most good, because resources, especially talent, are always constrained.”

CIOs need to help make sure their business customers start with an end in mind and be clear that a “data first, questions later” approach won’t work. Fixing things can simply start with CIOs asking their business counterparts and internal IT proponents what problems they are trying to solve. Countering several industry slogans, CIOs say that while it is about the data, it’s also really about the intended purposes, and translating data into answers can be even more challenging than many vendors suggest. CIOs know from firsthand experience that understanding what data is telling you is crucial and claim that this won’t happen by magic.

2. Start simple

One CIO said that the notion of a data lake can feel daunting if you have trouble identifying data definitions in huge systems. CIOs feel that a big bang approach is a loser. David Chou, CIO and Chief Digital Officer at Children’s Mercy Hospital, feels there is a need to stop trying to solve “world data hunger” with data lakes because doing something about it involves people, processes, governance, and prioritization.

Chou wisely says that one organization’s pilot could be another’s phase one production rollout. CIOs suggest that projects should be based on your organization’s size. They say CIOs should find a problem and focus only on the source data that could relate to solving that problem. This should be about IT and the business learning together, piloting and starting small to get big. Or, put differently: go slow to get fast.

CIOs say that new tools need to be used to answer new questions or to enable better answers to existing questions. Interestingly, some CIOs questioned whether their IT organization should deliver these new approaches or whether it is better to deliver all of this through a public cloud vendor.

3. Govern the data going in

CIOs feel that it is critical that there is transparency in how data is used and combined. This includes proper design and planning, and identifying system ‘sources of truth’ that allow citizen and professional data scientists to access extracted data. Without this, collected data is just a bunch of bits from other systems taking up storage space. Chou puts it this way: “It all comes down to the governance model.”

4. Fix data problems

CIOs stress the need for data hygiene. They suggest that there are all sorts of quality, governance, and accuracy issues and that, in reality, a data swamp is just a data lake filled with dirty data as a result of a poor data curation process. With appropriate data curation, which requires both IT maturity and data governance, data swamps shouldn’t happen.

CIOs claim that data swamps can also be avoided with proper analysis and that to get value out of the data lake and big data, you need to continue to do “data management 101.” Several CIOs suggest master data management and stewardship are required, with both IT and the business understanding that real-world data is ‘dirty’ at the start and that openness about this is the beginning of the ‘cleansing’ process. CIOs assert that everyone should be aware that it takes a lot of work to get proper results.
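To make “data management 101” a little more concrete, here is a minimal sketch in Python of a hygiene check that could run on a batch of records before it lands in the lake. It is an illustration only, not something the CIOs I interviewed described: the field names, the sample records, and the `profile_records` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical required fields for an incoming customer feed.
REQUIRED_FIELDS = ("customer_id", "email", "country")


def profile_records(records, required=REQUIRED_FIELDS, key="customer_id"):
    """Count missing required values and find duplicate keys in a batch."""
    missing = Counter()
    seen = Counter()
    for rec in records:
        for field in required:
            value = rec.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
        if rec.get(key) not in (None, ""):
            seen[rec[key]] += 1
    duplicates = sorted(k for k, n in seen.items() if n > 1)
    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicate_keys": duplicates,
    }


# Made-up sample batch: one blank email, one missing country, one repeated key.
sample = [
    {"customer_id": "C1", "email": "a@example.com", "country": "US"},
    {"customer_id": "C2", "email": "", "country": "DE"},
    {"customer_id": "C1", "email": "a@example.com", "country": None},
]
report = profile_records(sample)
print(report)
```

A report like this does not clean anything by itself. Its value is the openness the CIOs describe: IT and the business can both see how dirty a source is before deciding whether it belongs in the lake.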

5. Manage data access and security

CIOs worry about the ‘putting all your eggs in one basket’ effect. They stress the importance of establishing data security and privacy from the start of a data lake project. This is an important point because most CIOs see most big data and data lake projects as still largely experimental. This should come as no surprise considering that only 27 percent of business executives say their big data projects have achieved profitability (The Big Data Payoff, Capgemini IDG, 2016).

Regardless, data lakes have already become targets for hackers and for improper internal access. This means data security governance needs to be put in place sooner rather than later. CIOs suggest a big challenge with data lakes is finding the right tools to protect sensitive data while maintaining the access needed to become insights-driven.
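One common way to balance protection with access is to replace sensitive values with keyed hashes before the data is exposed to analysts. The sketch below, using only Python’s standard `hmac` and `hashlib` modules, is a minimal illustration under assumptions of my own: the list of sensitive fields, the `pseudonymize` helper, and the placeholder key are hypothetical, and this is not a technique the CIOs specifically named.

```python
import hashlib
import hmac

# Hypothetical list of sensitive columns; a real list would come from governance policy.
SENSITIVE_FIELDS = {"email", "ssn"}


def pseudonymize(record, secret_key, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the record with sensitive values replaced by keyed hashes."""
    masked = {}
    for field, value in record.items():
        if field in sensitive and value is not None:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value
    return masked


key = b"demo-key-only"  # placeholder; a real key belongs in a secrets manager
row = {"customer_id": "C1", "email": "a@example.com", "country": "US"}
masked_row = pseudonymize(row, key)
print(masked_row)
```

Because the same input and key always produce the same token, analysts can still join and count on the masked column without seeing the underlying value. Whether that is sufficient protection depends on the data and the regulations that apply to it.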

Take these 5 steps to avoid the data swamp

More and more organizations want to become insights-driven to stay relevant. A data lake can be an element of this, but it takes real discipline to succeed. I strongly agree with Olavsrud’s suggestions, but I contend that more steps are needed to avoid a data swamp because, clearly, creating a data lake should not add business risk.