Hadoop-based data lakes can be game changers, but too many are underperforming. Here's a checklist for making your data lake a wild success.

Hadoop-based data lakes can be game changers: better, cheaper, and faster integrated enterprise information. Knowledge workers can access data directly, project cycles are measured in days rather than months, and business users can leverage a shared data source rather than creating stand-alone sandboxes or warehouses.

Unfortunately, more than a few data lake projects are off track. Data is going in, but it's not coming out, at least not at the pace envisioned. What's the chokepoint? It tends to be some combination of lack of manageability, data quality and security concerns, performance unpredictability, and a shortage of skilled data engineers.

What distinguishes data lakes that are "enterprise class," i.e., the ones that are built to last and attract hundreds of users and uses? First let's look at the features that are table stakes, i.e., what makes a data lake a data lake. Then we'll describe the capabilities that make a first-class data lake, one that is built to last.

Table stakes

Hadoop – the open source software framework for distributed storage and distributed processing of very large data sets on computer clusters. The base Apache Hadoop framework contains the libraries and utilities needed by other Hadoop modules; HDFS, a distributed file system that stores data on commodity machines; YARN, a resource-management platform for managing compute; and an implementation of the MapReduce programming model for large-scale data processing.

Commodity Compute Clusters – whether on premises or in the cloud, Hadoop runs on low-cost commodity servers that rack, stack, and virtualize. Scaling is easy and inexpensive. The economics of open source massively parallel software combined with low-cost hardware deliver on the promise of intelligent applications on truly big data.

All Data / Raw Data – the data lake design philosophy is to land and store all data from source systems in raw format: structured enterprise data from operational systems, semi-structured machine-generated and web log data, social media data, et al.

Schema-less Writes – this point in particular is a breakthrough. Whereas traditional data warehouses are throttled by the time and complexity of data modeling, data lakes land data in source format. Instead of weeks (or worse), data can be gathered and offered up in short order. Schemas are applied on read, pushing the analytic and modeling work to analysts.

Open Source Tools – (e.g., Spark, Pig, Hive, Python, Sqoop, Flume, MapReduce, R, Kafka, Impala, YARN, Kite, and many more) the evolving toolkit of programming, querying, and scripting languages and frameworks for ingesting and integrating data, building analytic apps, and accessing data.

Enterprise class

If the table stakes listed above define a data landing area, the following differentiate a data lake that is extensible, manageable, and industrial strength:

Defined Data and Refined Data – where data lakes contain raw data, advanced lakes contain Defined and Refined data as well. Defined Data has a schema, and that schema is registered in Hadoop's HCatalog. Since most data comes from source systems with structured schemas, it is eminently practical to leverage them. Refined Data, a step up the value chain, is data that has been altered and augmented to add intelligence and value through joins, aggregations, cleansing, counts, transformations, de-duplication, et al.
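To make the Defined and Refined steps concrete, here is a minimal sketch using PySpark on a Hive-enabled cluster; tables saved this way are registered in the Hive metastore, the same metadata layer HCatalog exposes. The paths, schema, database names, and table names are all hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = (SparkSession.builder
         .appName("defined-refined-sketch")
         .enableHiveSupport()   # register table schemas in the Hive metastore
         .getOrCreate())

# Hypothetical databases for the Defined and Refined zones.
spark.sql("CREATE DATABASE IF NOT EXISTS defined")
spark.sql("CREATE DATABASE IF NOT EXISTS refined")

# Defined: apply an explicit schema to raw landed data at read time...
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])
defined = spark.read.schema(order_schema).json("/lake/raw/orders/")

# ...and register that schema so other users and tools can discover it.
defined.write.mode("overwrite").saveAsTable("defined.orders")

# Refined: add value with de-duplication, cleansing, and aggregation.
refined = (spark.table("defined.orders")
           .dropDuplicates(["order_id"])      # de-duplicate on the key
           .filter(F.col("amount") > 0)       # cleanse bad records
           .groupBy("customer_id")            # aggregate per customer
           .agg(F.sum("amount").alias("lifetime_value"),
                F.count("order_id").alias("order_count")))
refined.write.mode("overwrite").saveAsTable("refined.customer_value")

The design point of the sketch is that the raw files stay untouched in the landing zone; the Defined and Refined tables are derived layers that downstream users can find through the shared metastore.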
Metadata Management – perhaps the biggest complaint about data lakes is that they become unmanageable due to a lack of metadata. This includes technical metadata about the structure, format, and validation rules for data (e.g., schemas registered in HCatalog), business metadata about the business rules and meanings of data, and operational metadata about jobs and record counts.

Lineage and Audit Trail – a data lake needs an audit trail of data and processes to show how data flows from its source to its destination and the various changes that occur as it moves and gets transformed. The audit trail can be achieved by collecting logs from across the platform.

Data Profiling – determining data quality and content is central to the analytic process. Profiling in the era of big data often requires parsing raw data to get to the numbers and values. Volumes can also be so high that profiling jobs tallying descriptive statistics run for hours, so optimization and approximation techniques must be used, as in the sketch below.
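As a hedged illustration of those approximation techniques, this PySpark sketch profiles the hypothetical defined.orders table from the earlier example, using HyperLogLog-based approximate distinct counts and approximate percentiles instead of exact, full-shuffle computations.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("profiling-sketch")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("defined.orders")   # hypothetical table from the Defined zone

profile = df.agg(
    F.count(F.lit(1)).alias("row_count"),
    # Exact distinct counts require a full shuffle; the HyperLogLog-based
    # approximation trades a bounded relative error (here 5%) for speed.
    F.approx_count_distinct("customer_id", rsd=0.05).alias("approx_customers"),
    F.mean("amount").alias("mean_amount"),
    # percentile_approx computes quantiles from a sketch, avoiding a full sort.
    F.expr("percentile_approx(amount, 0.5)").alias("median_amount"),
    # Null rates surface data-quality problems early.
    F.mean(F.col("amount").isNull().cast("int")).alias("amount_null_rate"),
)
profile.show()

A single pass like this yields row counts, central tendency, and null rates cheaply; the approximate functions keep runtimes tolerable even at volumes where exact profiling would run for hours.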
Operations Control and SLA Management – users in an enterprise-class environment require performance predictability. To achieve this, data lakes need industrial-strength tools for operations management and control: the ability to manage performance in a multi-tenant environment, to allocate departmental charge-backs based on usage, and to provide visibility into clusters with health checks. Historical views and metrics should show what happened when and quickly surface unusual system performance. High availability across components and built-in backup and disaster recovery mean you can run even your most critical workloads with confidence.

Security Enforcement – data security is a must-have for an enterprise data lake. Hadoop poses unique security challenges, including its replication of data across multiple nodes, and data lakes concentrate vast and often sensitive data in a relatively open environment. Standard IT security controls (e.g., securing the computing environment and log-based network activity monitoring) are starting points, but they are not enough to protect an organization from data-centric cyber-attacks. Data-centric security calls for de-identifying the data itself: encrypting or masking at the row, field, and cell level, and transforming sensitive data elements into de-identified equivalents that can still be used by applications and analytic engines.

Software Tools – if you're serious about creating an enterprise-class data lake, you'll probably consider software tools that help get the job done. Vendors are aggressively bringing to market solutions that accelerate the journey. This matters because hand coding to meet the requirements above is impractical and inefficient, and even if a firm has the talent for it, most lack the top vendors' vision of operational excellence.

One of the leading Hadoop companies uses the term "Enterprise Data Hub" rather than "data lake." I hope that moniker takes hold. The Holy Grail of enterprise information management is a "singular version of the truth," which eluded legacy EDWs whose rigidity, delays, and high costs forced business units to roll their own. I'm optimistic that enterprise-class data lakes, the ones that are built to last, are a strong step toward both that singular truth and data democratization.