The words “machine learning” have been imbued with an almost magical aura. Ordinary people don’t teach machines to learn. That’s for highly specialized alchemists, like data scientists, who transform data into gold in research divisions and labs with little explanation beyond simply saying, “Science.”
Of course, it may be a little-known fact, but over the years machine learning tools have evolved to the point where almost anyone with a bit of pluck and drive can push a button and start a machine on the path to learning something valuable. It’s not exactly a snap, but the hard work of corralling the data and turning it into actionable insights has been automated enough that smart, motivated people can do it themselves.
This slow renaissance has been driven by the reality that many non-programmers in the business world are already pretty savvy with data. Spreadsheets loaded with numbers are the lingua franca of decision makers at all levels of business, and machine learning algorithms likewise want their data in tables with cleanly defined rows and columns. To dispel a little magic, the new tools for machine learning are essentially just another collection of strategies and options for turning tabular data into useful answers.
The strength of the tools is in their ability to handle the grungy work of collecting data, adding structure and consistency where possible, and then starting the calculation. They simplify the data gathering process and the grind of keeping the information in rows and columns.
The tools, alas, are not yet smart enough to do all of this learning for you. You do have to ask the right questions and look in the right places. But the tools accelerate the search for answers so you can cover more ground, look behind more doors, and poke around in more crevices.
AutoML: Democratizing machine learning
Lately, a new buzzword, “AutoML,” has started to appear to denote that a machine learning algorithm comes with an additional meta-layer of automation. The standard algorithms have always been designed to churn through data and find patterns and rules on their own, but the traditional algorithms came with many options and parameters. Data scientists often spent 80 to 99 percent of their time fiddling with these dials until they found the most predictive rules.
AutoML automates this stage by trying a bunch of options, testing them and then trying some more. Instead of running the machine learning algorithm once, it runs it N times, makes some adjustments, runs it N times again, often repeating until your budget in time, money or patience is exhausted.
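That search loop is easier to picture in miniature. The sketch below is a toy version in plain Python, not any particular AutoML product: the invented dataset has one numeric feature, the only “dial” is a classification threshold, and each round runs a batch of random candidates, keeps the best, and then searches more narrowly around it.

```python
import random

random.seed(0)

# Toy dataset: one numeric feature and a binary label.
# Points above 5.0 are labeled 1, the rest 0.
data = [(x / 10.0, 1 if x > 50 else 0) for x in range(100)]

def accuracy(threshold):
    """Score one candidate 'model': predict 1 when the feature exceeds threshold."""
    correct = sum(1 for x, y in data if (1 if x > threshold else 0) == y)
    return correct / len(data)

# AutoML-style search: run N candidates, keep the best, then search
# more narrowly around it -- repeat until the round budget is spent.
best_t, best_acc = 0.0, accuracy(0.0)
width = 10.0                 # how far from the current best we sample
for _ in range(5):           # "budget": five rounds of N trials each
    for _ in range(20):      # N trials per round
        t = best_t + random.uniform(-width, width)
        acc = accuracy(t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    width /= 2               # the adjustment: tighten the search

print(f"best threshold {best_t:.2f} with accuracy {best_acc:.0%}")
```

Real AutoML systems sweep dozens of algorithms and parameters at once, but the control loop — try, score, adjust, repeat until the budget runs out — is the same shape.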
AutoML tools are natural fits for cloud computing, because in the cloud they can spin up enough machines to run in parallel and then return them to the pool when you’re done. You pay only for the peak computational time.
In general, AutoML algorithms are good options for people beginning to explore machine learning on their own. The automation simplifies the job by handling some of the basic work of setting parameters and choosing options before testing the results for you. As users become more sophisticated and begin to understand the results, they can take on more of these jobs and set the values themselves.
The newest systems also make it easier to learn how machines can learn. If classical programming turns rules and data into answers, machine learning works backwards, turning answers and data into rules — rules that might teach you what is going on in the depths of your business. The developers of these simplified tools are also creating interfaces that explain the rules that the algorithm discovered and, more importantly, how to duplicate the results. They want to open up the black box to promote understanding.
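That reversal is easy to demonstrate at toy scale. In this sketch (the “rush order” scenario and its data are invented for illustration), a one-rule learner is handed answers plus data and gives back a human-readable rule:

```python
# "Answers and data in, rules out": given labeled examples, recover the
# rule that separates them, instead of hand-coding the rule up front.
examples = [
    (2, "rush"), (4, "rush"), (6, "rush"),
    (30, "normal"), (48, "normal"), (72, "normal"),
]

# Try every midpoint between sorted feature values and keep the cutoff
# that classifies the most examples correctly -- a one-rule learner.
values = sorted(x for x, _ in examples)
best_cut, best_correct = None, -1
for lo, hi in zip(values, values[1:]):
    cut = (lo + hi) / 2
    correct = sum(1 for x, label in examples
                  if ("rush" if x < cut else "normal") == label)
    if correct > best_correct:
        best_cut, best_correct = cut, correct

print(f'learned rule: if hours_to_deadline < {best_cut} then "rush"')
# prints: learned rule: if hours_to_deadline < 18.0 then "rush"
```

The output is exactly the kind of artifact the “explainable” interfaces surface: not just a prediction, but the discovered rule itself.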
All of these features are opening up the world of machine learning to the people who work with numbers, spreadsheets and data by eliminating the need to be great at programming and data science. The following six options simplify using machine learning algorithms to find answers in the sea of numbers reaching your desk.
Splunk

The original version of Splunk began as a tool for searching (or “spelunking”) through the voluminous log files created by modern web applications. It has since grown to analyze all forms of data, especially time series and other data produced in sequence. The tool presents the results in a dashboard with sophisticated visualization routines.
The newest versions include apps that integrate the data sources with machine learning tools like TensorFlow and some of the best Python open source tools. They offer quick solutions for detecting outliers, flagging anomalies and generating predictions for future values. They are optimized to search for the proverbial needles in very large datasets.
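Splunk’s apps are product features, but the underlying idea of flagging outliers can be sketched generically. This toy example — plain Python over invented response-time numbers, not Splunk’s implementation — flags any point more than two standard deviations from the mean:

```python
import statistics

# A toy response-time series with one obvious spike at index 7.
series = [102, 98, 101, 99, 103, 97, 100, 410, 102, 99]

mean = statistics.mean(series)
stdev = statistics.stdev(series)

# Flag any point more than two standard deviations from the mean --
# the classic z-score test behind many "detect outliers" buttons.
anomalies = [(i, v) for i, v in enumerate(series)
             if abs(v - mean) > 2 * stdev]

print(anomalies)
# prints: [(7, 410)]
```

Production tools add refinements — rolling windows, robust statistics, seasonal baselines — but the needle-in-a-haystack test starts here.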
DataRobot

Inside DataRobot’s stack is a collection of some of the best open source machine learning libraries written in R, Python and several other languages. You’ll deal only with a web interface that displays flowchart-like tools for setting up a pipeline. DataRobot connects to all of the major data sources, including local databases, cloud datastores and downloaded files or spreadsheets. The pipeline you build can clean the data, fill in missing values and then generate models that will flag outliers and predict future values.
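As a rough illustration of what such a pipeline does under the hood — not DataRobot’s actual API — this sketch imputes missing values with the mean and then fits a least-squares line to forecast the next value, using invented monthly sales figures:

```python
# Toy pipeline: impute missing values, fit a model, predict the future.
# Monthly sales with two gaps (None).
sales = [100, 104, None, 112, 116, None, 124, 128]

# Step 1: impute -- replace each gap with the mean of the known values.
known = [v for v in sales if v is not None]
mean = sum(known) / len(known)
filled = [v if v is not None else mean for v in sales]

# Step 2: model -- ordinary least-squares line through (index, value).
n = len(filled)
xs = list(range(n))
x_bar = sum(xs) / n
y_bar = sum(filled) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, filled))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Step 3: predict the next period.
forecast = intercept + slope * n
print(round(forecast, 1))
# prints: 130.1
```

Each stage — cleaning, modeling, predicting — maps onto one box in the flowchart-style pipelines these tools let you drag together.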
DataRobot can also attempt to offer “human-friendly explanations” about just why certain predictions were made, a useful feature for understanding how the AI may be working.
It can be deployed in a mixture of cloud and on-premises solutions. Cloud implementations can deliver maximum parallelism and throughput through the shared resources, while local installations offer more privacy and control.
H2O

H2O likes to use the words “driverless AI” to describe its automated stack for exploring various machine learning solutions. It ties together data sources (databases, Hadoop, Spark, and so on) and feeds them into a variety of algorithms with a wide range of parameters. You control the amount of time and compute resources devoted to the problem, and it tests various combinations of parameters until the budget is exhausted. The results can be explored and audited through a dashboard or Jupyter notebooks.
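The budget-driven loop can be sketched without H2O itself. In this toy version — plain Python, with three deliberately trivial forecasters standing in for real algorithms — candidates are tried until a wall-clock budget runs out, then ranked on a leaderboard, loosely mirroring the process described above:

```python
import time

# Candidate "models": simple forecasters for a numeric series. A real
# AutoML run would sweep full algorithms; the control loop is the same.
history, actual_next = [10, 12, 11, 13, 14, 13, 15], 16

candidates = {
    "mean":  lambda xs: sum(xs) / len(xs),
    "last":  lambda xs: xs[-1],
    "drift": lambda xs: xs[-1] + (xs[-1] - xs[0]) / (len(xs) - 1),
}

budget_seconds = 1.0
deadline = time.monotonic() + budget_seconds
leaderboard = []
for name, model in candidates.items():
    if time.monotonic() > deadline:   # stop when the budget is spent
        break
    error = abs(model(history) - actual_next)
    leaderboard.append((error, name))

leaderboard.sort()                    # best (lowest error) first
print("winner:", leaderboard[0][1])
```

The leaderboard-of-models view is the same one H2O’s dashboard and notebooks expose for auditing a finished run.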
H2O’s core machine learning algorithms and integration with tools such as Spark are open source, but the so-called “driverless” option is one of the proprietary wrappers sold to enterprise customers along with support.
RapidMiner

The core of the RapidMiner ecosystem is a studio for creating data analytics from visual icons. A bit of dragging and dropping produces a pipeline that will clean up your data and then run it through a wide range of statistical algorithms. If you want to use machine learning instead of more traditional data science, the Auto Model will choose from a number of classification algorithms and search through various parameters until the best fit is found. The tool’s goal is to produce hundreds of models and then identify the best one.
Once the models are created, the tool can deploy them while also testing their success rate and explaining how the model makes its decisions. The sensitivity to different data fields can be tested and tweaked with the visual workflow editor.
Recent enhancements include better text analytics, a greater variety of charts for building visual dashboards and more sophisticated algorithms for analyzing time series data.
BigML

The BigML dashboard offers all of the basic data science tools for identifying correlations that can form the foundation for more complex work with machine learning. Its Deepnets option, for instance, offers a sophisticated mechanism for testing and optimizing more elaborate neural networks. The quality of the model can be compared to other algorithms with a standardized comparison framework that helps you choose between classic data science and more sophisticated machine learning.
BigML’s dashboard runs in your browser, and its analysis runs either in the BigML cloud or in an installation in your server room. The prices for the cloud version are set low to encourage early experimentation; there’s even a free layer. The cost is mostly determined by a limit on the size of your data set and the amount of computational resources you can invoke. The free tier will analyze up to 16MB of data using no more than two processes running in parallel. The smaller paid accounts are priced very reasonably with monthly bills as small as $30 but the costs rise as your resource needs increase.
RStudio

R is not an easy language for non-programmers to use, but it remains one of the most essential tools for sophisticated statistical analysis because it is so popular with hard-core data scientists. RStudio offers a set of menus and point-and-click options that make it a bit easier to interact with the R layer running deep inside.
Sophisticated managers who are comfortable with spreadsheets can use the simplest options to run basic analyses and even some complex ones. It’s still a bit more painful than it needs to be, and some parts are going to confuse average users, but it’s right on the edge of being open and accessible to everyone willing to invest some time. There will still be some friction, but it can be worth it for anyone who wants to explore cutting-edge tools.