The Best Open Source Big Data Tools

Top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem


Although Hadoop is more popular than ever, MapReduce seems to be running out of friends. Everyone wants the answer faster, faster, now, and often in response to SQL queries. This year's Bossies in big data track important new developments in the Hadoop stack, underscore a maturing NoSQL space, and highlight a number of useful tools for data wrangling, data analysis, and machine learning.

IPython

Two important elements of scientific work, data science included, are sharing and validating conclusions. You don't want to be the first cold fusion data scientist, after all. IPython's Notebooks provide an environment where researchers can document and automate the data analysis workflow. The notebook is a single location researchers can use to share code, documentation, ideas, and data visualizations, and access them from a browser-based environment.

It's difficult to overstate the importance of these features in modern data science. Building and managing data products has become a complicated business requiring input from multiple parts of the organization, such as operations and monitoring, as well as a way to share analytics knowledge. To be sure, IPython is much more than IPython Notebooks. It includes multiple language support in data pipelines, parallel computing capabilities, and much more.

-- Steven Nunez

Pandas

Pandas is a Python DSL (domain-specific language) for the manipulation of tabular data. It has roots in the hedge fund industry and emphasizes high performance and ease of use.

Although a relative newcomer to the scene, Pandas has a large and growing community. In many respects it is similar to the R language and can be used for many of the same data wrangling tasks, though it lacks R's comprehensive and well-tested set of libraries. On the positive side, Pandas development (which began in 2008) has taken the best ideas of R and several of its packages and incorporated them into the base Pandas package. For data science shops using Python, Pandas is the best Swiss Army knife out there.
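As a small taste of the DSL -- vectorized column arithmetic and an R-style split-apply-combine over a hypothetical trades table (the column names here are invented for illustration; only a pandas installation is assumed):

```python
import pandas as pd

# Hypothetical trade data to illustrate the tabular DSL
trades = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "GOOG", "GOOG", "MSFT"],
    "qty":    [100, 50, 10, 20, 200],
    "price":  [95.0, 96.0, 540.0, 542.0, 41.0],
})

# Vectorized arithmetic creates a new column in one expression
trades["notional"] = trades["qty"] * trades["price"]

# Split-apply-combine, much like R's plyr or aggregate()
by_symbol = trades.groupby("symbol")["notional"].sum()

print(by_symbol["AAPL"])  # 100*95 + 50*96 = 14300.0
```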

-- Steven Nunez

RCloud

RCloud, from AT&T Labs, was created to address the need for a collaborative data analysis environment for R. Similar to IPython, RCloud allows researchers to analyze large data sets and share their results across an organization. For example, one group of data scientists might use RCloud to document the data workflow for semantic analysis of Web documents. This notebook could be annotated and reused by the machine learning group in the company.

Conceptually, RCloud is similar to an R CRAN (Comprehensive R Archive Network) package, but augmented by wiki-like collaboration features. Notebooks and code are stored in GitHub. Although a relative newcomer with little documentation outside of AT&T, RCloud shows a lot of promise.

-- Steven Nunez

R Project

A specialized computer language for statistical analysis, R continues to evolve to meet new challenges. Since displacing Lisp-Stat in the early 2000s, R has been the de facto statistical processing language, with thousands of high-quality algorithms readily available from the Comprehensive R Archive Network (CRAN); a large, vibrant community; and a healthy ecosystem of supporting tools and IDEs. The 3.0 release of R removes the memory limitations that previously plagued the language: 64-bit builds can now allocate as much RAM as the host operating system will allow.

Traditionally R has focused on problems that fit in local RAM, utilizing multiple cores, but with the rise of big data, several options have emerged for processing large-scale data sets. These include packages that can be installed into a standard R environment as well as integrations with big data systems like Hadoop and Spark (such as RHive and SparkR).

-- Steven Nunez

Hadoop

No technology in recent memory has made as big or as quick an impact as Hadoop. Hadoop encompasses many topics: HDFS, YARN, MapReduce, HBase, Hive, Pig, and a growing ecosystem of tools and engines. In the last year Hadoop has gone from being that thing everyone is talking about to being that thing even fairly conservative companies are deploying. Whether you're trying to spark information sharing in your organization, mine data for new uses, replace expensive data warehousing technology, or offload ETL processing, Hadoop is the platform of technologies you should be looking at today.

-- Andrew C. Oliver

Hive

Hive enables interactive queries over petabytes of data using familiar SQL semantics. Version 0.13 realized the three-part community vision that focused on speed, scale, and SQL compliance.

Hive preserves existing investments in SQL tools, skills, and processes and allows them to be applied to large-scale data sets. Tez integration, vector-based execution, and a cost-based optimizer significantly improve interactive query performance for users, while also lowering resource requirements. Hive 0.13 is the de facto SQL interface to big data, with a large and growing community behind it, including Microsoft, which contributed several key pieces of SQL Server technology.

-- Steven Nunez

Hivemall

Hivemall is a scalable machine learning library from the National Institute of Advanced Industrial Science and Technology in Japan. It provides machine learning algorithms and feature-engineering functions. Hive's SQL semantics are great for slicing and dicing large data sets before they are fed into Hivemall's machine learning algorithms.

Hivemall scales well -- in terms of both the number of features and the number of instances -- and it outperforms most other Hadoop-based machine learning frameworks. Hivemall is a useful addition to Hive for any data scientist familiar with SQL. Installation is simply a one-line Hadoop command that adds the JAR to the Hive classpath.

-- Steven Nunez

Cloudera Impala

There's no shortage of real-time SQL engines for Hadoop: Hive on Tez (Stinger), Facebook's Presto, Shark on Spark. It makes sense. Businesses have long relied on SQL skill sets, and they have BI demands for speed that batch processing can't satisfy.

Cloudera Impala still leads the pack for interactive SQL queries on Hadoop. Its speed, maturity, and overall ecosystem (which includes Sentry for role-based authorization) bring robust SQL performance to Hadoop's scalability, while tapping HDFS and HBase directly to sidestep costly ETL. And the momentum hasn't slowed, with Amazon endorsing Impala by adding it to its Elastic MapReduce service this year.

Impala still outpaces Hive on overall performance and on real-time queries. Although efforts to modernize Hive are under way, Impala remains the strongest choice for real-time data analysis with SQL applications built on Hadoop.

-- James R. Borck

MongoDB

At this point, the reasons for adopting MongoDB are plain and simple. If you're looking for a cross-platform, document-oriented database that will make data integration faster and easier, you've come to the right spot. Also, it's open source. With the enterprise-oriented improvements in the recent 2.6 release and document-level locking coming soon, you can expect even broader and faster adoption of MongoDB. With big clients like eBay, Craigslist, and the New York Times, this is no longer "just" some database for hipster hackers and startups.

-- Andrew C. Oliver

Cassandra

Apache Cassandra is a NoSQL database that offers a compelling mix of simplicity, functionality, and good design. Cassandra implements a Bigtable-like data model, using a storage system inspired by Amazon's Dynamo for data partitioning. The result is an easy-to-configure system that linearly scales from small clusters to multiple data centers, while efficiently distributing data across nodes built from commodity hardware.

Cassandra's "ring" topology is flexible in the way it places data in the cluster, with several useful strategies available out of the box. With few moving parts (a single JVM per node), native file system for storage, linear scalability, and high availability, Cassandra is a top contender for a NoSQL solution.
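The partitioning idea behind the ring can be sketched in a few lines of Python. This is a conceptual illustration of Dynamo-style consistent hashing, not Cassandra's actual token or replication machinery, and the node names are invented:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy hash ring: each node owns the arc of hash space up to its token."""
    def __init__(self, nodes):
        # Derive each node's token by hashing its name (real clusters
        # configure tokens explicitly or use many virtual nodes per host)
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, row_key):
        # Walk clockwise to the first node token at or past the key's hash,
        # wrapping around to the first node if we fall off the end
        keys = [t for t, _ in self.tokens]
        i = bisect_right(keys, self._hash(row_key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic owner for this row key
```

Because placement depends only on the hash, any client can compute a key's owner without a central coordinator, which is what makes the linear scaling possible.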

-- Steven Nunez

Neo4j

Neo4j is a NoSQL database with an almost organic feel. It's a graph database, so it doesn't see the world as documents or as rows and columns, but as connected objects -- nodes connected by relationships, in Neo4j's language. Nodes can represent people or restaurants or cities or stellar objects. Relationships can be "knows" or "has eaten at" or "lives in" or "is X light years from."

The community edition gives you everything you need to run an embedded version or a client-server system. Drivers exist for nearly every language you can think of, but if you want to manipulate the database at a higher level, you can use Cypher, Neo4j's equivalent to SQL. The Neo4j website's interactive documentation is some of the best we've seen. Even if the concepts are new, you'll have no problem getting up to speed with this exceptional graph database.
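The node-and-relationship model is easy to mock up. The sketch below is plain Python, not Neo4j's API, with invented node and relationship names; in Cypher the traversal would look something like `MATCH (p:Person)-[:HAS_EATEN_AT]->(r) RETURN r.name`:

```python
# Toy property graph: nodes carry labels and properties, edges carry a type.
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Restaurant", "name": "Luigi's"},
    3: {"label": "City", "name": "Boston"},
}
relationships = [
    (1, "HAS_EATEN_AT", 2),
    (1, "LIVES_IN", 3),
    (2, "LOCATED_IN", 3),
]

def neighbors(node_id, rel_type):
    """Follow outgoing relationships of one type -- a one-hop traversal."""
    return [nodes[dst]["name"]
            for src, rel, dst in relationships
            if src == node_id and rel == rel_type]

print(neighbors(1, "HAS_EATEN_AT"))  # ["Luigi's"]
```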

-- Rick Grehan

Pentaho

This granddaddy of open source BI covers everything from data mining to data visualization with a feature-rich set of enterprise-grade tools, including a J2EE BI server, data transformation facilities, and a graphical report designer. Yet Pentaho adroitly walks the line between comprehensive and overwhelming. It meets the big data processing and predictive analytics needs of large organizations, yet smoothly handles SMB reporting tasks.

Pentaho's free community edition lacks the more advanced features such as mobile support, interactive dashboards, and data visualizations. To realize the full potential of Pentaho, you'll need to upgrade to the Professional or Enterprise edition. But with Kettle for ETL, the Mondrian OLAP engine for analysis, and Weka-based data mining and machine learning, Pentaho connects all of the dots between data integration, business analytics, and meaningful decisions.

-- James R. Borck

Talend Open Studio for Big Data

Analyzing large volumes of data requires moving large volumes of data. Talend Open Studio for Big Data brings all of the Apache integration tools, a centralized repository, and ETL components (extract, transform, load) to big jobs, along with connectors capable of assimilating most modern "big data" ecosystems and tools, including Hadoop, HBase, Hive, Sqoop, Pig, and NoSQL databases such as Cassandra and MongoDB.

Using Talend's visual mapping tools, you can graphically construct large-scale data transformations, replete with flow testing and debugging, and directly leverage the power of a Hadoop cluster to complete the job. You'll need to pay for features like dashboards and metadata management. But when it comes to cutting through the complexities of big data integration -- particularly syncing on-premises systems with cloud-based data sources and analytics -- Talend Open Studio may be the sharpest tool in the shed.

-- James R. Borck

Spark

Developed in the AMPLab at U.C. Berkeley and built on top of HDFS, Spark gets the answer a hundred times faster than MapReduce for particular applications. And with more than 150 individual developers representing 30-plus companies contributing to the code, Spark is well on its way to being a dominant part of Hadoop.

Although Spark has a "competitor" in Storm, the two are optimized for different use cases. For a great many types of problems, in-memory parallel processing with a DAG (directed acyclic graph) execution engine such as Spark is more appropriate than MapReduce. For any of us who've watched a job go through stage 8, got red-faced, and started yelling, "Just execute it at the same time!" Spark will be a godsend.
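The contrast with MapReduce is easy to see in miniature: transformations only build up a plan, and nothing runs until an action forces a single fused pass over the data instead of materializing results between stages. The sketch below is plain Python illustrating that idea, not Spark's API (though PySpark's RDD interface looks broadly similar):

```python
class LazyDataset:
    """Toy RDD-like wrapper: transformations record a plan; nothing
    executes until an action (collect) forces a single pass."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, f):
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self.data, self.ops + (("filter", pred),))

    def collect(self):
        out = []
        for item in self.data:          # one fused pass over the input
            keep = True
            for kind, f in self.ops:
                if kind == "map":
                    item = f(item)
                elif kind == "filter" and not f(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

result = (LazyDataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```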

-- Andrew C. Oliver

Storm

Storm is a distributed, real-time computation system that makes unbounded streams of data easy to process. Companies like WebMD, The Weather Channel, and Groupon are already taking advantage of this scalable, fault-tolerant, easy-to-use technology. Most of Hadoop is measured in minutes; Storm is about milliseconds. You can do low-latency streaming one object at a time or in so-called microbatches. The questions remain: When will we see 1.0, and when will Storm be done "incubating"?

-- Andrew C. Oliver

Kafka

This open source tool was originally developed by LinkedIn and used by Web companies to expedite messages from Web applications to their designated data systems. Kafka is able to handle hundreds of megabytes of reads and writes per second from thousands of clients. It also boasts a contemporary cluster-centric design that offers durability, elasticity, and transparency. Kafka is frequently paired with Storm for a streaming pub-sub type of model -- kind of like JMS, only actually distributed and scalable.
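At Kafka's core is an append-only log that each subscriber reads at its own pace by tracking an offset, which is what lets many consumers share one stream. The toy in-memory model below illustrates only that idea (a real broker adds partitions, replication, and disk persistence; the names here are invented):

```python
class Topic:
    """Toy Kafka-style topic: an append-only log plus per-consumer offsets."""
    def __init__(self):
        self.log = []           # messages live here and are never removed
        self.offsets = {}       # consumer name -> next index to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, consumer):
        """Return this consumer's unread messages and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch

clicks = Topic()
clicks.publish({"user": 1, "page": "/home"})
clicks.publish({"user": 2, "page": "/cart"})

print(clicks.poll("storm"))      # both messages
clicks.publish({"user": 1, "page": "/checkout"})
print(clicks.poll("storm"))      # only the new one
print(clicks.poll("warehouse"))  # a new consumer replays from the start
```

Because consumption is just an offset into a durable log, slow consumers never block fast ones, and a new system can replay history from the beginning.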

-- Andrew C. Oliver

Tez

Tez provides a distributed execution framework for data applications. Built on top of YARN, another Apache project that enables applications to share cluster resources, Tez frees application developers from error-prone chores like scheduling, resource management, and fault tolerance in a distributed cluster. This means that higher-level applications like Cascading, Pig, and Hive can focus on their domain-specific problems, while Tez performs the heavy lifting within the cluster.

Tez lets developers focus less on managing the cluster and more on their business logic, so they can write applications that perform better, are more reliable, and make more efficient use of important cluster resources like network I/O, CPU, and RAM.

-- Steven Nunez

Mahout

Mahout is a computation-engine-independent collection of algorithms for classification, clustering, and collaborative filtering, along with the underlying math required to implement them. It focuses on highly scalable machine learning.

Currently at release 0.9, with the 1.0 release around the corner, the project is moving from Hadoop's MapReduce execution paradigm to a domain-specific language for linear algebra that is optimized for execution on Spark clusters. MapReduce contributions are no longer accepted. Given the iterative nature of machine learning, this makes good sense and shows that the community isn't afraid to do what's right when technology changes.
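To make collaborative filtering concrete, here is the idea in miniature -- cosine similarity over a toy ratings matrix in plain Python, not anything Mahout-specific, with invented users and items:

```python
from math import sqrt

# User -> {item: rating}; a tiny stand-in for a real ratings matrix
ratings = {
    "ann": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob": {"book_a": 4, "book_b": 2, "book_c": 5},
    "eve": {"book_a": 1, "book_b": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (sqrt(sum(u[i] ** 2 for i in shared)) *
                  sqrt(sum(v[i] ** 2 for i in shared)))

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("eve"))  # ['book_c'] -- the one item eve hasn't rated
```

At scale the same computation becomes a series of matrix operations over millions of users, which is exactly why Mahout is retargeting a linear algebra DSL at Spark.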

-- Steven Nunez

KNIME

Build predictive analytics applications without coding? Yes, you can, thanks to this handy toolkit from Swiss analytics vendor KNIME. KNIME's Eclipse-based visual workbench lets you drag and configure nodes on a canvas, then wire them together into workflows. KNIME supplies thousands of nodes to handle everything from data source connectivity (databases, Hive, Excel, flat files) to SQL queries, ETL transforms, data mining (supporting Weka and R), and visualization.

This year brought welcome connectors to Twitter and Google Analytics, as well as support for OpenStreetMap visualizations. The KNIME engine is open source with no crippling restrictions on data volume, memory size, or processing cores. The company also offers affordable commercial tiers that add collaboration, authentication, and performance enhancements for cluster environments.

-- James R. Borck

Cascading

The learning curve for writing Hadoop applications can be steep. Cascading is an SDK that brings a functional programming paradigm to Hadoop data workflows. With the 3.0 release, Cascading provides an easier way for application developers to take advantage of next-generation Hadoop features like YARN and Tez.

The SDK provides a rich set of commonly used ETL patterns that abstract away much of the complexity of Hadoop, increasing the robustness of applications and making it simpler for Java developers to utilize their skills in a Hadoop environment. Connectors for common third-party applications are available, enabling Cascading applications to tap into databases, ERP, and other enterprise data sources.

-- Steven Nunez