6 tools that make data science easier

New tools are bundling data cleanup, drag-and-drop programming, and the cloud to help anyone comfortable with a spreadsheet leverage the power of data science.


Data science may never be easy, but it’s getting easier to dive in. Buzzwords like “machine learning,” “regression,” and “dimensionality reduction” are just as challenging to understand as ever, but the widespread desire to reap the benefits of these techniques has produced several good tools that build assembly lines for data, ready to pump out the answers we seek.

The secret is similar to what revolutionized manufacturing. Just as standardized parts helped launch the industrial revolution, data scientists at various tool vendors have produced collections of very powerful and very adaptable analytical routines. They’ve standardized the interfaces, making it much simpler to build your custom pipeline out of these interchangeable data science tools.

Data scientists used to wring their hands because 80 percent of the work was preparing data for analysis, crafting custom routines in Python, Java, or their favorite language just so the sophisticated statistical tools in R or SAS could do their job. The marketplace is now filling with sophisticated tools that bundle several hundred well-engineered routines into a package that does much of the repetitive and unpleasant data cleanup and standardization for you.
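To make that 80 percent concrete, here is a minimal sketch of the kind of hand-written cleanup routine these newer tools aim to replace, written in Python with pandas. The customers.csv file and its column names are hypothetical, chosen only to illustrate the chores involved.

```python
# A minimal sketch of hand-rolled data prep, assuming a hypothetical
# customers.csv with "signup_date", "region" and "revenue" columns.
import pandas as pd

df = pd.read_csv("customers.csv")

# Normalize column names and the usual formatting problems.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["region"] = df["region"].str.strip().str.title()

# Drop duplicates and fill gaps before any statistics are run.
df = df.drop_duplicates()
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

df.to_csv("customers_clean.csv", index=False)  # ready for R, SAS, etc.
```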

These new tools open the opportunity for anyone who’s comfortable working with a spreadsheet. They won’t make all prep work disappear, but they’ll make it easier. There’s less need to fuss over data formats because the tools are smart enough to do the right thing. You can often just open the file and start learning.

The tools also unlock much of the cost-saving power of the cloud. In the past, data scientists needed powerful computers to crunch big data sets. Now we can rent even bigger, faster machines in the cloud by the second, increasing processing speed while saving money by returning the hardware to the pool when the monthly reports are done.

The tools are a boon for both hardcore data scientists and data analysts who just need to train an algorithm to predict next year’s trends. Both groups can enjoy the pleasure of using sophisticated tools that do the right thing with data. The standardization, though, opens up the potential for entirely new groups to dive into data science. Now you don’t need to master R syntax or Python programming to begin.

Of course, we still need to think deeply about statistics and machine learning. These tools can’t answer strategic questions about when it’s better to use a neural network or a clustering algorithm, but they can make it simple to pull in all of your data and try both very quickly. Just as standardization removed the need for long apprenticeships and skilled craftsmen, making it simpler for everyone to participate in the industrial revolution, these data tools are unleashing the potential for more and more people in a business to turn to sophisticated data analysis for guidance.

Here is a look at six tools helping to democratize data science today.

Alteryx

The core of the Alteryx platform is its Designer tool, a visual programming IDE that allows users to drag and drop icons instead of typing out a text program. Alteryx targets its platform at both data scientists and “citizen users,” which is a nice way of saying people who don’t want to mess with the details of cleaning up data and modifying it for analysis. The platform tries to “flip the 80/20 data prep rule” by simplifying preparation using its visual programming model. There’s a good chance you can drag an icon into the right place in the data pipeline and it will apply many of the standard tasks such as grouping by a customer number or joining two files.
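For readers who think in code, a rough sketch of what those two icons stand in for, grouping by a customer number and joining two files, might look like this in Python with pandas. The file and column names are hypothetical; in Designer, each step is a single icon dropped onto the canvas.

```python
import pandas as pd

# Hypothetical inputs: one file of orders, one of customer details.
orders = pd.read_csv("orders.csv")        # columns: customer_id, amount, ...
customers = pd.read_csv("customers.csv")  # columns: customer_id, region, ...

# "Group by a customer number": total spend per customer.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# "Join two files": attach customer details to those totals.
report = totals.merge(customers, on="customer_id", how="left")
```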

Alteryx also offers a number of predefined predictive models for analyzing data and drawing inferences. These look like icons for data processing, but they’re really R or Python programs, and Alteryx is saving you the trouble of dealing with their complexity and text-based coding. In Designer, the data flows along lines between icons, and you don’t need to worry about commas or square brackets or other sources of coding grief.
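As a loose illustration of the text-based code such an icon hides, a comparable hand-written model in Python might look like the scikit-learn sketch below. The column names and the choice of logistic regression are assumptions for the example, not Alteryx’s own implementation.

```python
# Sketch of the kind of Python a predictive-model icon wraps.
# Columns and model choice are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")
X = df[["revenue", "tenure_months"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```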

The Alteryx platform is moving toward a more server-driven model in which the code you build lives on a server that’s ready to scale to larger data sets. If your data needs to be enhanced, Alteryx has licensed commercial data sets from companies such as Dun & Bradstreet or DigitalGlobe to help fill out your tables.

When you’re done designing the model on your personal PC, Alteryx offers the infrastructure for publishing the model to a central server and then distributing the graphical summaries to everyone in the business. The Promote tool handles the job of distributing everyday production data to the right people in the enterprise so they can use the results of the predictive modeling.

The list price for the Designer tool is $5,195 per user per year, but extras like data sets with demographic or spatial data can add $33,800. The central server starts at $58,500 and extra features are available for collaboration and connecting.

Domino

Domino also begins with the Lab, a visual integrated development environment (IDE) for constructing models by threading together icons and pipelines. The difference is that Domino is open to other tools as well: all of the major and not-so-major web-based IDEs are supported because the system is designed to accommodate them. Most users may stick with Jupyter or RStudio, but other tools such as Apache Zeppelin or SAS’s offerings are well supported.

Most of Domino is devoted to the art of maintaining all the infrastructure you need to turn data into models. Domino’s back end carefully tracks various versions of the data as well as all of your revisions and experiments along the path. All these are relentlessly saved and linked to the results to ensure that your results can be re-run and reproduced. Storing an accurate rendition of the query is emphasized so that others can discover and reuse the work later.
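Domino does this bookkeeping automatically, but a tiny sketch of the underlying idea, recording enough about a run to reproduce it later, could look like this in Python. The file layout and fields here are invented for illustration and are not Domino’s format.

```python
# Sketch of experiment bookkeeping: capture the inputs, parameters and
# results of a run so it can be re-run later. (Illustrative only;
# Domino's own tracking is richer and automatic.)
import hashlib, json, datetime

def log_run(data_path, params, metrics, log_file="runs.jsonl"):
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    record = {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "data_file": data_path,
        "data_sha256": data_hash,   # ties results to an exact data version
        "params": params,
        "metrics": metrics,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage after a training run:
log_run("customers_clean.csv", {"model": "logistic", "C": 1.0},
        {"accuracy": 0.87})
```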

Domino is less a single platform than a fancy web-based operating system for a cloud network. The platform’s openness depends on a relatively standard mechanism for storing data in files and keeping revisions consistent. Luckily, disk storage is cheaper than ever.

One of Domino’s major selling points is its cloud integration. Your experiments will run on a pool of powerful machines shared with others. The underlying architecture is completely containerized and built around Docker, so you can deploy your own code to the stack if you want. You configure the optimal size for your job and the hardware is borrowed from the pool, a good fit for data science work, which tends to be intermittent, dispatched in clumps when the code is ready and processed in batches when the weekly, monthly or quarterly data arrives.

Domino is priced “as an annual subscription that depends on where Domino is running (our hosted infrastructure, your private cloud, or on premise).” The cloud option will charge you based on the resources consumed.  

RapidMiner  

RapidMiner is one of the more highly automated tools for turning data into actionable models. Its IDE allows users to build a visual description of the data transformations as a collection of icons connected by lines. The most useful part may be the AutoModel feature, which assembles many of these icons for you based on your data and goals. When it’s done, you can open up the model and tweak the individual parts.

There’s a large collection of extensions that can help handle many of the more exotic challenges, such as making sense of unstructured text scraped off websites. There’s also a wide array of tools for working with time series data, such as for reconstructing missing data elements and forming (and testing) predictions for the future.
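Done by hand, the kind of time series chores those tools cover, filling in missing values and projecting a few periods forward, might look roughly like this in Python with pandas. The monthly sales figures and the naive forecasting rule are made up purely for illustration.

```python
import pandas as pd

# Hypothetical monthly sales series with a few gaps.
sales = pd.Series(
    [120, 132, None, 151, None, 170],
    index=pd.date_range("2019-01-01", periods=6, freq="MS"),
)

# Reconstruct missing data points by interpolating between neighbors.
sales = sales.interpolate()

# A naive forecast: extend the series by its recent average monthly change.
step = sales.diff().tail(3).mean()
future_index = pd.date_range(sales.index[-1] + pd.offsets.MonthBegin(),
                             periods=3, freq="MS")
forecast = pd.Series([sales.iloc[-1] + step * i for i in range(1, 4)],
                     index=future_index)
```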

If your data set is too large for a desktop, RapidMiner has you covered. Those who have an easily parallelized solution can use RapidMiner’s integrated version of Hadoop and Hive called “Radoop.” There’s also a server-based solution that will provision cloud machines from AWS, Azure or your own on-premises server farm. The server-based ecosystem nurtures collaboration with a centralized repository for data and analyses that can be scheduled to deliver reports and insights in production.

The pricing models for the desktop and server products are separate. The desktop tool has a free community edition that’s missing two of the most attractive features: TurboPrep for cleaning data and AutoModel for generating results. Paid plans start at $2,500 per user per year for a “small” version that’s limited to 100,000 rows of data. Larger data sets and the ability to deploy more processors cost more. Installing your own version of the server tool on premises begins at $15,000, but you can also buy time on RapidMiner’s cloud version starting at $6.75 per hour.

Knime

Knime (pronounced with a silent K) is an open source data analysis platform with a visual IDE for linking together various data processing and analysis routines. The core software is distributed for free but commercial versions of some plugins and extensions are available and the fees support the main development. A server version that runs in the cloud or on your own machines is also available.

The foundation of the software is written in Java, so many of Knime’s integrations depend on the Java ecosystem. Users will notice that the Knime IDE is built on top of Eclipse, which will make it more familiar to Java developers. The platform can work with data in all of the major databases (MySQL, PostgreSQL) and cloud services (Amazon Athena, Redshift), as well as any other source with a JDBC-compliant connector. Knime offers particularly tight integration with “in-database processing,” which can speed up your job. It also integrates with the next generation of distributed data tools such as Apache Spark.
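The appeal of in-database processing is easiest to see in code: push the aggregation to the database and ship back only the summary, rather than pulling every row to the client. The sketch below uses Python’s built-in sqlite3 as a stand-in for any JDBC-style connection, with hypothetical table and column names.

```python
# Sketch of why in-database processing helps: aggregate in SQL instead of
# pulling every row to the client. Table and column names are hypothetical.
import sqlite3  # stand-in for any JDBC/ODBC-style connection

conn = sqlite3.connect("sales.db")

# Slow pattern: fetch all rows, then aggregate on the client.
rows = conn.execute("SELECT customer_id, amount FROM orders").fetchall()
totals_client = {}
for customer_id, amount in rows:
    totals_client[customer_id] = totals_client.get(customer_id, 0) + amount

# In-database pattern: the database does the grouping and ships only results.
totals_db = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
).fetchall()
```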

A robust open source community supports a fair number of extensions and workflows that can be used, revised and customized, with most of the code hosted on GitHub or Bitbucket. There’s also a large collection of commercial extensions with integrated support.

Companies that rely heavily on Google web applications may also like the deeper integration. Knime can read and write data in Google Sheets, a potentially effective way to bring data analytics to an office that uses Google’s spreadsheets frequently.

The enterprise server product comes in three sizes that include extra features. The smallest size begins at $8,500 per year for five users and eight cores and is aimed more at analytics teams. The larger sizes allow you to distribute the results to others inside your organization.

Talend

Talend offers a collection of apps that work on desktops, in a local data center or in the cloud. The company’s multi-layered tools collect data from various warehouses and databases before transforming it for analysis. Pipeline Designer, for instance, offers a visual design tool for pulling data from various sources and then analyzing it with standard tools or Python extensions.

An open source version is available for free in several packages such as the Open Studio for Data Quality and the Stitch Data Loader. The cloud version begins at $1,170 per user per month with discounts for annual commitments and larger teams. The price is computed per person and generally not based on the consumption of computing resources. Pricing for the Data Fabric is done by quote.

Looker

Looker takes aim at the confusion caused by too many versions of data from too many sources. Its products create one solid source of accurate, version-controlled data that can be manipulated and charted by any user downstream. Everyone from business users to backend developers can create their own dashboards filled with data and charts configured to their personal tastes.

The platform is built around many of the standards dominating the open source world. Data and code evolve under the control of Git. Dashboard visualizations come from D3. Data is gathered from SQL databases using LookML, a custom query language similar to a regular imperative programming language.

Google recently announced that it will be acquiring Looker and integrating it into Google Cloud. How that acquisition will affect the platform remains to be seen. Prices are available by request.

Others making data more accessible

The above tools aren’t the only ones changing how we work with data. Other tools and platforms are integrating similar ideas. The major cloud companies all offer tools for analyzing data in their storage systems. Azure Databricks, for instance, offers a flexible user interface for configuring Apache Spark, while Azure Data Factory offers a visual tool for extracting, transforming and loading all of the data.

Some tools focus more on machine learning and other forms of artificial intelligence. Amazon’s SageMaker simplifies the job of building, training and then deploying a machine learning process, offering more than 100 algorithms and models in an open marketplace. H2O.ai offers what it calls “driverless AI,” an open source platform built with Apache Spark to simplify model creation and analysis.

They are all converging on a set of tools that accelerate our ability to explore our data and make more sense of what all of the numbers mean.

Copyright © 2019 IDG Communications, Inc.
