1. The concept is still quite new. The term data lake, credited to Pentaho CTO James Dixon, has been bandied about for several years. But the idea of data lakes as corporate resources is still in its infancy, according to IDC analyst Ashish Nadkarni. A data lake is a massive, relatively cheap storage repository, such as Hadoop, that can hold all types of data until it is needed for business analytics or data mining. A data lake holds data in its rawest form, unprocessed and ungoverned.
2. You can’t buy a ready-to-use data lake. Vendors are marketing data lakes as a panacea for big-data projects, but that’s a fallacy, according to Gartner. “Like data warehouses, data lakes are a concept, not a technology,” says Gartner analyst Nick Heudecker. “You can use several technologies to build a data lake. At its core, a data lake is a data storage strategy.”
3. Lakes have big appetites for data. Data lakes are designed for data ingestion, the process of gathering, importing and processing data for storage or later use. “Where the storage cost model of a data warehouse may not lend itself to wholesale data ingestion, a data lake does,” Heudecker says. “Also, a data lake doesn’t require the users to create a schema before data is available for use. Data can simply be ingested and the schema created and applied when the data is read.”
4. You must involve multiple facets of the business. Data lakes are resources for the entire organization, not just IT. Therefore, all interested parties should be involved in planning data lake projects. “It is central to the firm’s big-data architecture, and therefore cannot be implemented in isolation,” Nadkarni says. In addition to IT managers, a data lake project should involve business leaders and users. Storage experts also need to play a key role. “At the end of the day,” Nadkarni says, “it is a storage platform, and therefore [companies] should involve the storage team in its design and implementation.”
5. The biggest benefits don’t come from technology. The business value of a data lake has very little to do with the underlying technologies chosen, Heudecker says. “Instead, the business value is derived from the data-science skills you can apply to the lake,” he says. “Data lakes aren’t a replacement for existing analytical platforms or infrastructure. Instead, they complement existing efforts and support the discovery of new questions.” Once those questions are discovered, he says, you then “optimize” for the answers. “Optimizing may mean moving out of the lake and into data marts or data warehouses,” Heudecker says.
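To make the third point concrete, here is a minimal Python sketch of what Heudecker describes: data is ingested without a schema, and the schema is created and applied when the data is read. It is an illustration only, not how any particular product works. The names `RAW_LAKE`, `ingest`, `read_with_schema` and `SALES_SCHEMA`, along with the sample records, are hypothetical, and a plain in-memory list stands in for a real storage layer such as Hadoop.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
# Nothing forces them to share the same fields or types.
RAW_LAKE = [
    '{"user": "ana", "amount": "19.90", "ts": "2015-03-01"}',
    '{"user": "bo", "amount": 5, "extra": {"coupon": "X1"}}',
    '{"user": "cy"}',
]


def ingest(store, record_text):
    """Append a record to the store untouched: no validation, no reshaping."""
    store.append(record_text)


def read_with_schema(store, schema):
    """Apply a schema at read time.

    `schema` maps a field name to a function that casts the raw value.
    Fields missing from a record come back as None; fields not named
    in the schema are ignored for this read but stay in the store.
    """
    rows = []
    for text in store:
        raw = json.loads(text)
        row = {}
        for field, cast in schema.items():
            value = raw.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# One possible schema for one possible question (assumed for illustration).
SALES_SCHEMA = {"user": str, "amount": float}
sales_rows = read_with_schema(RAW_LAKE, SALES_SCHEMA)
```

A different question could define a different schema over the same raw records, which is the flexibility the quote points to. The trade-off is that malformed or missing values only surface when someone reads the data.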