As companies store ever more data and look to turn it into actionable insights, big data is making a big splash. Open source technology is at the core of most big data initiatives, but projects are proliferating so quickly that it can be hard to keep track of them all. Here are 15 key open source big data technologies to keep an eye on.
Originally developed by Matei Zaharia in the AMPLab at UC Berkeley, Apache Spark is an open source Hadoop processing engine that is an alternative to Hadoop MapReduce. Spark uses in-memory primitives that can improve performance by up to 100X over MapReduce for certain applications. It is well suited to machine learning algorithms and interactive analytics. Spark consists of multiple components: Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the MLlib machine learning library and GraphX. Spark is a top-level Apache project.
Written primarily in the Clojure programming language, Apache Storm is another distributed computation framework that serves as an alternative to MapReduce, geared to real-time processing of streaming data. It is well suited to real-time data integration and applications involving streaming analytics and event log monitoring. It was originally created by Nathan Marz and his team at BackType, before BackType was acquired by Twitter and Storm was released as open source. Storm applications are designed as a “topology” that acts as a data transformation pipeline. Storm is a top-level Apache project.
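The idea of a topology as a data transformation pipeline can be sketched in a few lines of plain Python. This is an illustration only and does not use Storm's API: the function names (log_source, parse, only_errors, run_topology) and the sample log lines are hypothetical, and everything runs in one process rather than across a cluster. In Storm's own terminology, the source step corresponds to a "spout" and the transformation steps to "bolts".

```python
def log_source(lines):
    """Stands in for a stream source: yields one raw event at a time."""
    for line in lines:
        yield line


def parse(events):
    """First transformation step: split each raw line into (level, message)."""
    for event in events:
        level, _, message = event.partition(" ")
        yield level, message


def only_errors(events):
    """Second transformation step: keep the messages of ERROR events only."""
    for level, message in events:
        if level == "ERROR":
            yield message


def run_topology(lines):
    """Wire the steps together so each event flows through them in order."""
    return list(only_errors(parse(log_source(lines))))


# Hypothetical sample events for an event log monitoring scenario.
sample = ["INFO started", "ERROR disk full", "INFO heartbeat", "ERROR timeout"]
errors = run_topology(sample)
print(errors)
```

Because each step is a generator, events are processed one at a time as they arrive instead of being collected into a batch first, which is the main contrast with a MapReduce-style job.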
Apache Ranger is a framework for enabling, monitoring and managing comprehensive data security across the Hadoop platform. Based on technology from big data security specialist XA Secure, Apache Ranger was made an Apache Incubator project after Hadoop distribution vendor Hortonworks acquired that company. Ranger offers a centralized security framework to manage fine-grained access control over Hadoop and related components (like Apache Hive, HBase, etc.). It can also enable audit tracking and policy analytics.
Apache Knox Gateway
Apache Knox Gateway is a REST API gateway that provides a single secure access point for all REST interactions with Hadoop clusters. In that way, it helps with the control, integration, monitoring and automation of an enterprise's critical administrative and analytical needs. It also complements Kerberos-secured Hadoop clusters. Knox is an Apache Incubator project.
Apache Kafka, originally developed by LinkedIn, is an open source fault-tolerant publish-subscribe message broker written in Scala. Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data. Its ability to broker massive message streams for low-latency analysis (like messaging geospatial data from a fleet of long-haul trucks or sensor data from heating and cooling equipment) makes it useful for Internet of Things applications. Kafka is a top-level Apache project.
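The publish-subscribe pattern behind a message broker can be illustrated with a small in-memory sketch. This is not Kafka's API: the Broker class, the "truck-positions" topic name and the message fields are all hypothetical, and the sketch pushes messages to callbacks in a single process. Kafka itself persists messages to a distributed log that consumers read from, which is where its fault tolerance comes from.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: routes each message to a topic's subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to receive every message published to topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to all subscribers; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = Broker()
received = []  # one subscriber simply records every position update
alerts = []    # another subscriber only keeps updates above a speed limit


def speeding_alert(message):
    if message["speed_kmh"] > 100:
        alerts.append(message)


broker.subscribe("truck-positions", received.append)
broker.subscribe("truck-positions", speeding_alert)

# Hypothetical geospatial updates from a fleet of trucks.
broker.publish("truck-positions", {"truck": "T1", "lat": 41.9, "lon": -87.6, "speed_kmh": 88})
broker.publish("truck-positions", {"truck": "T2", "lat": 39.7, "lon": -105.0, "speed_kmh": 112})

print(len(received), "updates,", len(alerts), "alert(s)")
```

The publisher never needs to know who is listening, so new consumers (a dashboard, an analytics job) can be added without touching the code that produces the data.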
Born from a National Security Agency (NSA) project, Apache NiFi is a top-level Apache project for orchestrating data flows from disparate data sources. It aggregates data from sensors, machines, geolocation devices, clickstream files and social feeds via a secure, lightweight agent. It also mediates secure point-to-point and bidirectional data flows and allows the parsing, filtering, joining, transforming, forking or cloning of data streams. NiFi is designed to integrate with Kafka, with the two serving as building blocks of real-time predictive analytics applications that leverage the Internet of Things.
Apache Hadoop is an open source software framework for data-intensive distributed applications originally created by Doug Cutting to support his work on Nutch, an open source Web search engine. To meet Nutch’s multimachine processing requirements, Cutting implemented a MapReduce facility and a distributed file system that together became Hadoop. He named it after his son’s toy elephant. Through MapReduce, Hadoop distributes Big Data in pieces over a series of nodes running on commodity hardware. Hadoop is now among the most popular technologies for storing the structured, semi-structured and unstructured data that comprise Big Data. Hadoop is available under the Apache License 2.0.
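The MapReduce model can be sketched with a single-process word count in Python. This is a minimal illustration of the map, shuffle and reduce steps, not Hadoop's API: the function names and the sample text are hypothetical, and each string in chunks merely stands in for a piece of data that Hadoop would store and process on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, so on a cluster this step
    # could run in parallel on whichever node holds that chunk.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


chunks = ["big data big splash", "open source big data"]
counts = word_count(chunks)
print(counts)
```

The map step needs nothing but its own chunk, which is what lets the work be spread over many commodity machines; only the grouped results have to be brought together for the reduce step.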
R is an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is rapidly becoming the go-to tool for statistical analysis of very large data sets. It has been commercialized by a company called Revolution Analytics, which is pursuing a services and support model inspired by Red Hat’s support for Linux. R is available under the GNU General Public License.
An open source software abstraction layer for Hadoop, Cascading allows users to create and execute data processing workflows on Hadoop clusters using any JVM-based language. It is intended to hide the underlying complexity of MapReduce jobs. Cascading was designed by Chris Wensel as an alternative API to MapReduce. It is often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, Web content mining and ETL applications. Commercial support for Cascading is offered by Concurrent, a company founded by Wensel after he developed Cascading. Enterprises that use Cascading include Twitter and Etsy. Cascading is available under the Apache License.
Scribe is a server developed by Facebook and released in 2008. It is intended for aggregating log data streamed in real time from a large number of servers. Facebook designed it to meet its own scaling challenges, and it now uses Scribe to handle tens of billions of messages a day. It is available under the Apache License 2.0.
Developed by Shay Banon and based upon Apache Lucene, ElasticSearch is a distributed, RESTful open source search server. It’s a scalable solution that supports near real-time search and multitenancy without a special configuration. It has been adopted by a number of companies, including StumbleUpon and Mozilla. ElasticSearch is available under the Apache License 2.0.
Written in Java and modeled after Google’s BigTable, Apache HBase is an open source, non-relational, columnar distributed database designed to run on top of the Hadoop Distributed File System (HDFS). It provides fault-tolerant storage and quick access to large quantities of sparse data. HBase is one of a multitude of NoSQL data stores that have become available in the past several years. In 2010, Facebook adopted HBase to serve its messaging platform. It is available under the Apache License 2.0.
Another NoSQL data store, Apache Cassandra is an open source distributed database management system developed by Facebook to power its Inbox Search feature. Facebook abandoned Cassandra in favor of HBase in 2010, but Cassandra is still used by a number of companies, including Netflix, which uses Cassandra as the back-end database for its streaming services. Cassandra is available under the Apache License 2.0.
Created by the founders of DoubleClick, MongoDB is another popular open source NoSQL data store. It stores data in JSON-like documents with dynamic schemas, using a binary format called BSON (for Binary JSON). MongoDB has been adopted by a number of large enterprises, including MTV Networks, craigslist, Disney Interactive Media Group, The New York Times and Etsy. It is available under the GNU Affero General Public License, with language drivers available under an Apache License. The company 10gen offers commercial MongoDB licenses.
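What "JSON-like documents with dynamic schemas" means in practice can be shown with plain Python dictionaries. This sketch does not use MongoDB or its drivers: the collection list, the find helper and the sample documents are hypothetical stand-ins, and the data is printed as ordinary JSON rather than encoded as BSON.

```python
import json

# Two documents in the same collection with different fields:
# nothing forces them to share a fixed set of columns.
collection = [
    {"_id": 1, "title": "Intro to Spark", "tags": ["spark", "hadoop"]},
    {"_id": 2, "title": "Kafka at scale", "author": {"name": "A. Writer"}, "views": 1200},
]


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc for doc in docs
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(collection, title="Kafka at scale")
print(json.dumps(matches, indent=2))

# The union of field names across documents shows the schema is per-document.
fields = sorted({key for doc in collection for key in doc})
print(fields)
```

Nested values such as the author sub-document and the tags list live inside the document itself, where a relational design would typically split them into separate tables.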