11 Open Source Tools to Make the Most of Machine Learning

Tap the predictive power of machine learning with these diverse, easy-to-implement libraries and frameworks

Spam filtering, face recognition, recommendation engines -- when you have a large data set on which you’d like to perform predictive analysis or pattern recognition, machine learning is the way to go. This science, in which computers are trained to learn from, analyze, and act on data without being explicitly programmed, has surged in interest of late outside of its original cloister of academic and high-end programming circles.

This rise in popularity is due not only to hardware growing cheaper and more powerful, but also to the proliferation of free software that makes machine learning easier to implement both on single machines and at scale. The diversity of machine learning libraries means there’s likely to be an option available regardless of what language or environment you prefer.

These 11 machine learning tools provide functionality for individual apps or whole frameworks, such as Hadoop. Some are more polyglot than others: scikit-learn, for instance, is exclusively for Python, while Shogun sports interfaces to many languages, from general-purpose to domain-specific.

Scikit-learn

Python has become a go-to programming language for math, science, and statistics due to its ease of adoption and the breadth of libraries available for nearly any application. Scikit-learn leverages this breadth by building on top of several existing Python packages -- NumPy, SciPy, and matplotlib -- for math and science work. The resulting libraries can be used either for interactive “workbench” applications or be embedded into other software and reused. The kit is available under a BSD license, so it’s fully open and reusable.
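
To give a feel for the "workbench" style described above, here is a minimal sketch of training and scoring a classifier with scikit-learn. The choice of the bundled iris data set, the logistic regression model, and the helper name train_iris_classifier are illustrative assumptions, not something the project prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_iris_classifier(seed=0):
    """Fit a classifier on the bundled iris data and return it with its test accuracy.

    The function name and the model choice are hypothetical, for illustration only.
    """
    # Load features (X) and class labels (y) as NumPy arrays.
    X, y = load_iris(return_X_y=True)

    # Hold out a quarter of the samples so the score reflects unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    # Every scikit-learn estimator follows the same fit / predict / score pattern.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    model, accuracy = train_iris_classifier()
    print(f"held-out accuracy: {accuracy:.2f}")
```

Because estimators share the same fit and predict methods, swapping in a different model is usually a one-line change, which is part of what makes the library easy to embed in other software.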

Project: scikit-learn
GitHub: https://github.com/scikit-learn/scikit-learn

Shogun

Among the oldest, most venerable of machine learning libraries, Shogun was created in 1999 and written in C++, but isn’t limited to working in C++. Thanks to the SWIG library, Shogun can be used transparently in such languages and environments as Java, Python, C#, Ruby, R, Lua, Octave, and Matlab.

Though venerable, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, although it professes to be faster and easier to work with (by way of a more integral API set) than competing libraries.

Project: Shogun
GitHub: https://github.com/shogun-toolbox/shogun

Accord Framework/AForge.net

Accord, a machine learning and signal processing framework for .Net, is an extension of a previous project in the same vein, AForge.net. “Signal processing,” by the way, refers here to a range of machine learning algorithms for images and audio, such as for seamlessly stitching together images or performing face detection. A set of algorithms for vision processing is included; it operates on image streams (such as video) and can be used to implement such functions as the tracking of moving objects. Accord also includes libraries that provide a more conventional gamut of machine learning functions, from neural networks to decision-tree systems.

Project: Accord Framework/AForge.net
GitHub: https://github.com/accord-net/framework/

Mahout

The Mahout framework has long been tied to Hadoop, but many of the algorithms under its umbrella can also run as-is outside Hadoop. They're useful for stand-alone applications that might eventually be migrated into Hadoop or for Hadoop projects that could be spun off into their own stand-alone applications.

One downside of Mahout: Few of its algorithms currently support the high-performance Spark framework for Hadoop, and instead use the legacy (and increasingly obsolete) MapReduce framework. The project no longer accepts MapReduce-based algorithms, but those looking for a more performant and future-proof library will want to look into MLlib instead.

Project: Mahout

MLlib

Apache’s own machine learning library for Spark and Hadoop, MLlib boasts a gamut of common algorithms and useful data types, designed to run at speed and scale. As you’d expect with any Hadoop project, Java is the primary language for working in MLlib, but Python users can connect MLlib with the NumPy library (also used in scikit-learn), and Scala users can write code against MLlib. If setting up a Hadoop cluster is impractical, MLlib can be deployed on top of Spark without Hadoop -- and in EC2 or on Mesos.

Another project, MLbase, builds on top of MLlib to make it easier to derive results. Rather than write code, users make queries by way of a declarative language à la SQL.

Project: MLlib

H2O

The algorithms in 0xdata’s H2O are geared for business processes -- fraud or trend predictions, for instance -- rather than, say, image analysis. H2O can interact in a stand-alone fashion with HDFS stores, on top of YARN, in MapReduce, or directly in an Amazon EC2 instance. Hadoop mavens can use Java to interact with H2O, but the framework also provides bindings for Python, R, and Scala, providing cross-interaction with all the libraries available on those platforms as well.

Project: H2O
GitHub: https://github.com/0xdata/h2o

Cloudera Oryx

Yet another machine learning project designed for Hadoop, Oryx comes courtesy of the creators of the Cloudera Hadoop distribution. The name on the label isn’t the only detail that sets Oryx apart: Per Cloudera’s emphasis on analyzing live streaming data by way of the Spark project, Oryx is designed to allow machine learning models to be deployed on real-time streamed data, enabling projects like real-time spam filters or recommendation engines.

An all-new version of the project, tentatively titled Oryx 2, is in the works. It uses Apache projects like Spark and Kafka for better performance, and its components are built along more loosely coupled lines for further future-proofing.

Project: Cloudera Oryx
GitHub: https://github.com/cloudera/oryx

GoLearn

Google’s Go language has been in the wild for only five years, but has started to enjoy wider use, due to a growing collection of libraries. GoLearn was created to address the lack of an all-in-one machine learning library for Go; the goal is “simplicity paired with customizability,” according to developer Stephen Whitworth. The simplicity comes from the way data is loaded and handled in the library, since it’s patterned after SciPy and R. The customizability lies in both the library’s open source nature (it’s MIT-licensed) and in how some of the data structures can be easily extended in an application. Whitworth has also created a Go wrapper for the Vowpal Wabbit library, one of the libraries found in the Shogun toolbox.

Project: GoLearn
GitHub: https://github.com/sjwhitworth/golearn

Weka

Weka, a product of the University of Waikato, New Zealand, collects a set of Java machine learning algorithms engineered specifically for data mining. This GNU GPLv3-licensed collection has a package system to extend its functionality, with both official and unofficial packages available. Weka even comes with a book to explain both the software and the techniques used, so those looking to get a leg up on both the concepts and the software may want to start there.

While Weka isn’t aimed specifically at Hadoop users, it can be used with Hadoop thanks to a set of wrappers produced for the most recent versions of Weka. Note that it doesn’t yet support Spark, only MapReduce. Clojure users can also leverage Weka, thanks to the Clj-ml library.

Project: Weka

CUDA-Convnet

By now most everyone knows how GPUs can crunch certain problems faster than CPUs. But applications don’t automatically take advantage of GPU acceleration; they have to be specifically written to do so. CUDA-Convnet is a machine learning library for neural-network applications, written in C++ to exploit Nvidia’s CUDA GPU processing technology (CUDA boards of at least the Fermi generation are required). For those using Python rather than C++, the resulting neural nets can be saved as Python pickled objects and thus accessed from Python.
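
The pickling round trip mentioned above can be sketched with Python's standard pickle module. This is only an illustration of the mechanism: the dictionary layout, the file name, and the helper names save_checkpoint and load_checkpoint are hypothetical stand-ins, not CUDA-Convnet's actual checkpoint format or API.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Serialize a Python object (here, a stand-in for a trained net) to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_checkpoint(path):
    """Restore a pickled object. Only unpickle files from sources you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Hypothetical layout: a real checkpoint would hold whatever the library writes.
    state = {
        "epoch": 3,
        "layers": [{"name": "conv1", "weights": [[0.1, -0.2], [0.3, 0.4]]}],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "net.pkl"
        save_checkpoint(path, state)
        restored = load_checkpoint(path)
    print(restored["layers"][0]["name"], restored["epoch"])
```

Once loaded, the restored object is ordinary Python data, so it can be inspected or passed to other Python libraries without touching the C++ side.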

Note that the original version of the project is no longer being developed, but has since been reworked into a successor, CUDA-Convnet2, with support for multiple GPUs and Kepler-generation GPUs. A similar project, Vulpes, has been written in F# and works with the .Net framework generally.

Project: CUDA-Convnet

ConvNetJS

As the name implies, ConvNetJS provides neural network machine learning libraries for use in JavaScript, facilitating use of the browser as a data workbench. An NPM version is also available for those using Node.js, and the library is designed to make proper use of JavaScript’s asynchronicity -- for example, training operations can be given a callback to execute once they complete. Plenty of demo examples are included, too.

Project: ConvNetJS
GitHub: https://github.com/karpathy/convnetjs