The math of data science is complex and powerful, a daunting hurdle for anyone who wants to unlock the insights it can offer. The unavoidable housekeeping and basic maintenance that go along with it, though, have never been easier. New tools and better support software are revolutionizing the discipline by delivering assembly lines for data that are ready to pump out the answers we seek.
Just as standardized parts helped launch the industrial revolution, data tools vendors have produced a collection of powerful, adaptive analytical routines, and they’ve standardized the interfaces, making it easier to build custom pipelines out of these interchangeable tools.
Data scientists used to wring their hands preparing data for analysis by crafting custom routines in Python, Java or their favorite language so that sophisticated statistical tools in R or SAS could do their job. The marketplace now offers tools that bundle together several hundred well-engineered routines into a package that does much of the repetitive and unpleasant data cleanup and standardization for you.
These new tools open the opportunity for anyone who’s comfortable working with a spreadsheet. They won’t make all prep work disappear, but they’ll make it easier. There’s less need to fuss over data formats because the tools are smart enough to do the right thing. You can often just open the file and start learning.
The tools also unlock the cost-saving power of the cloud. Data scientists no longer need powerful computers to crunch big data sets. Instead we can rent even bigger, faster machines in the cloud by the second, increasing processing speed while saving money by returning the hardware to the pool when the reports are done.
The tools are a boon for both hardcore data scientists and data analysts who just need to train an algorithm to predict next year’s trends. Both groups can enjoy the pleasure of using sophisticated tools that do the right thing with data. The standardization, though, opens up the potential for entirely new groups to dive into data science. Now you don’t need to master R syntax or Python programming to begin.
Of course, you still need to think deeply about statistics and machine learning. These tools can’t answer strategic questions about when it’s better to use a neural network or a clustering algorithm, but they can make it simple to pull in all of your data and try both very quickly. Just as standardization removed the need for long apprenticeships and sophisticated craftsmen to participate in the industrial revolution, these data tools are unleashing the potential for users throughout your organization to turn to sophisticated data analysis for guidance.
Here is a look at nine tools helping to democratize data science today.
The core of the Alteryx platform is its Designer tool, a visual programming IDE that allows users to drag and drop icons instead of typing out a text program. Alteryx is targeted at both data scientists and “citizen users,” aka those who don’t want to mess with the details of cleaning and modifying data for analysis. The tool acts as a shell for popular open source tools, and does so in eight major human languages for companies with a global reach.
The platform tries to “flip the 80/20 data prep rule” by simplifying preparation using its visual programming model. There’s a good chance you can drag an icon into the right place in the data pipeline and it will apply many of the standard tasks such as grouping by a customer number or joining two files.
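To make those two standard tasks concrete, here is a minimal sketch in plain Python of what such icons stand for: joining two tables on a customer number, then grouping the result. The sample records, the field names and the helpers join_on and total_by are invented for illustration; none of this is Alteryx code.

```python
from collections import defaultdict

# Hypothetical sample data standing in for two input files.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 75.5},
    {"customer_id": 1, "amount": 30.0},
]


def join_on(left, right, key):
    """Inner join, assuming each key value appears at most once in `left`."""
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]


def total_by(rows, key, value):
    """Group rows by `key` and sum the `value` field within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


joined = join_on(customers, orders, "customer_id")
totals = total_by(joined, "name", "amount")
print(totals)  # {'Acme': 150.0, 'Globex': 75.5}
```

A visual tool hides exactly this kind of bookkeeping behind a single icon, which is where the time savings in data prep come from.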
Alteryx offers predefined predictive models for analyzing data and drawing inferences. These look like icons for data processing, but they’re really R or Python programs that Alteryx has saved you the trouble of coding. In Designer, data flows along lines between icons and you don’t need to worry about commas or square brackets or other sources of coding grief.
Alteryx is moving toward a server-driven model in which the code you build lives on a server that’s ready to scale to larger data sets. Alteryx has also licensed commercial data sets from the 2010 US Census, Experian, Dun & Bradstreet and DigitalGlobe for your use. Its visualization routines are integrated with cartographic data from TomTom so you can turn your data into rich maps for location-based analysis.
After you’ve designed your model on your personal PC, Alteryx offers the infrastructure for publishing the model to a central server and distributing graphical summaries to everyone in the business. The Promote tool is responsible for distributing everyday production data to the right people in the enterprise so they can use the results from the predictive modeling.
The list price for Designer is $5,195 per user per year, but extras like data sets with demographic or spatial data can add $33,800. The central server starts at $78,975 and extra features are available for collaboration and connecting.
Domino’s tool for data analysis is Workbench, a visual IDE for threading together models using icons and pipelines. What sets Domino apart is that it is also open to other tools, among them most web-based IDEs, such as Jupyter, RStudio, Apache Zeppelin, Microsoft’s VS Code and SAS’s various tools.
Domino is mostly devoted to maintaining the infrastructure you need to turn data into models. Its back end carefully tracks versions of the data as well as your revisions and experiments along the way. All are saved and linked to the results to ensure your results can be re-run and reproduced. Storing an accurate rendition of the query is emphasized so that others can discover and reuse the work later.
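The idea of linking every result to the exact data and settings that produced it can be sketched in a few lines of Python. This is only an illustration of the principle, not Domino’s implementation: the fingerprint and record_run helpers, the in-memory run_log and the placeholder result values are all assumptions made for the example.

```python
import hashlib
import json


def fingerprint(data_bytes, params):
    """Return a short, stable ID for one combination of data and settings."""
    digest = hashlib.sha256()
    digest.update(data_bytes)
    # sort_keys makes the JSON text identical regardless of dict ordering.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


# Hypothetical in-memory log; a real platform would persist this.
run_log = {}


def record_run(data_bytes, params, result):
    """Store a result under the fingerprint of the inputs that produced it."""
    run_id = fingerprint(data_bytes, params)
    run_log[run_id] = {"params": params, "result": result}
    return run_id


data = b"customer_id,amount\n1,120.0\n2,75.5\n"
# The result values below are placeholders, not real measurements.
first = record_run(data, {"model": "linear", "alpha": 0.1}, {"rmse": 4.2})
second = record_run(data, {"alpha": 0.1, "model": "linear"}, {"rmse": 4.2})
changed = record_run(data + b"3,30.0\n", {"model": "linear", "alpha": 0.1}, {"rmse": 3.9})
print(first == second, first == changed)  # True False
```

Identical inputs map to the same ID, so a colleague can look up an earlier run instead of repeating it, while any change to the data yields a new entry.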
Domino is less a single platform than a fancy web-based operating system for a cloud network. The platform’s openness depends on a relatively standard mechanism for storing data in files and keeping revisions consistent. Luckily disk storage is cheaper than ever.
One of Domino’s major selling points is its cloud integration. Your experiments run on a pool of powerful machines shared with others. The underlying architecture is containerized using Docker and Kubernetes if you want to deploy your own code to the stack. Configure the optimal size for your job and the hardware is borrowed from the pool. That suits data science work, which is often intermittent and processed in batches when the weekly, monthly or quarterly data is ready.
Domino is priced “as an annual subscription that depends on where Domino is running (our hosted infrastructure, your private cloud, or on premise).” The cloud option will charge you based on the resources consumed.
RapidMiner is one of the more highly automated tools for turning data into actionable models. Its IDE enables users to build a visual description of the data transformations as a collection of icons connected by lines. The company uses sophisticated automation to encourage collaboration between technical (programmers, scientists) and non-technical users.
The most useful part may be AutoModel, which assembles many of these icons for you based on your data and goals. When it’s done, you can open the model and tweak the individual parts. RapidMiner Go was built specifically for non-technical users to start exploring data sets with or without the assistance of data scientists. The latest version also integrates JupyterHub so Python users can build notebooks around data questions.
A large collection of extensions can help tackle many of the more exotic challenges, such as making sense of unstructured text scraped off websites. There’s also a wide array of tools for working with time series data, such as for reconstructing missing data elements and forming (and testing) predictions for the future.
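Reconstructing missing elements in a time series often comes down to interpolation. The sketch below shows the simplest variant, linear interpolation between known neighbors, in plain Python. The fill_gaps function and the sample readings are hypothetical; this is not RapidMiner’s own routine, which may use other methods.

```python
def fill_gaps(values):
    """Fill None entries by linear interpolation between known neighbors.

    Gaps at the very start or end have only one neighbor, so they stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical weekly readings with three missing entries.
readings = [10.0, None, None, 16.0, 18.0, None]
print(fill_gaps(readings))  # [10.0, 12.0, 14.0, 16.0, 18.0, None]
```

The trailing gap is left alone on purpose: filling it would be a prediction rather than a reconstruction, and that calls for a forecasting model that can be tested.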
If your data set is larger, RapidMiner has you covered, including an integrated version of Hadoop and Hive called “Radoop.” There’s also a server-based solution that will provision cloud machines from AWS, Azure or your on-premises server farm. The server-based ecosystem nurtures collaboration with a centralized repository for data and analyses that can be scheduled to deliver reports and insights in production.
Each product is priced separately. The desktop edition, RapidMiner Studio, comes in a free community version that is limited to 10,000 rows of data and one processor and is missing two of the most attractive features: TurboPrep for cleaning data and AutoModel for generating results. Paid pricing for RapidMiner Studio starts at $7,500 per user per year for a version limited to 100,000 rows of data. Larger data sets and the ability to deploy more processors cost more. Installing your own version of the server tool on premises begins at $54,000, but you can also buy time on RapidMiner’s cloud versions running on both Azure and AWS that bundle the cost of the software with the machine. The lowest priced machines on AWS start at $11.76 per hour.
Knime (with a silent K) is an open source data analysis platform with a visual IDE for linking data processing and analysis routines. The core software is free but commercial versions of some plugins and extensions are available for fees that support core development. A server version that runs in the cloud or on your own machines is also available.
Knime’s foundation is written in Java, so many of Knime’s integrations depend on the Java ecosystem. The Knime IDE is built on Eclipse, which makes it more familiar to Java developers. The platform can work with data in all of the major databases (MySQL, PostgreSQL) and cloud services (Amazon Athena, Redshift) and any other data store with a JDBC-compliant connector. Knime offers tight integration with “in database processing,” which can speed up your job. It also integrates with next-generation distributed data tools such as Apache Spark.
Knime Hub is a clearinghouse for data sets and analytic routines that was introduced in 2019. A robust open source community supports a fair number of extensions and workflows that can be used, revised and customized, with most of the code hosted on GitHub or Bitbucket. There’s also a large collection of commercial extensions with integrated support.
Knime can also read data from and write data to Google Sheets, a potentially effective way to bring data analytics to an office that uses Google’s spreadsheets frequently.
The Knime Analytics Platform is open source and available for free. The enterprise server comes in three sizes that include extra features. The smallest is available on AWS and Azure priced by the hour (a small server on Azure, for instance, starts at $1.16 per hour). The midsize, intended for installation, runs $29,000 per year and is aimed at analytics teams. Larger servers allow you to distribute results to others inside your organization.
Talend calls its product line a “data fabric,” a metaphor for how it weaves together threads of information. This collection of apps works on desktops, in a local data center or in the cloud, collecting and storing data in a common format and then analyzing and distributing it throughout the enterprise.
The company’s multi-layered tools collect data from various warehouses and databases before transforming it for analysis. Pipeline Designer, for instance, offers a visual design tool for pulling data from various sources and then analyzing it with standard tools or Python extensions.
An open source version is available for free in several packages, such as the Open Studio for Data Quality and the Stitch Data Loader. The cloud version begins at $1,170 per user per month with discounts for annual commitments and larger teams. The price is computed per person and generally not based on the consumption of computing resources. Pricing for the Data Fabric is done by quote.
Looker takes aim at the confusion caused by multiple versions of data from multiple sources. Its products create one source of accurate, version-controlled data that can be manipulated and charted by any user downstream. Everyone from business users to backend developers can create their own dashboards filled with data and charts configured to their personal tastes.
The platform is built around many of the standards dominating the open source world. Data and code evolve under the control of Git. Dashboard visualizations come from D3. Data is gathered from SQL databases using LookML, a custom query language similar to a regular imperative programming language.
Google recently completed its acquisition of Looker and is integrating it into Google Cloud. While the integration with BigQuery is highlighted, the product’s management continues to emphasize that it will be able to fetch data from other clouds (Azure, AWS) and other databases (Oracle, Microsoft SQL Server). Prices are not generally listed but are available by request.
Oracle’s 2018 acquisition of DataScience.com added a strong collection of analytic tools to the company’s core database tools. The integration is now complete in the form of Oracle Cloud Data Science Platform, which includes a collection of powerful tools (TensorFlow, Jupyter, etc.), enhancements to the core database (Oracle Autonomous Database) and an option to use Oracle’s cloud for analysis.
The collection of tools is mainly open source, and the dominant language is Python, available through Jupyter notebooks running in JupyterLab environments. Machine learning options such as TensorFlow, Dask, Keras, XGBoost and scikit-learn are integrated with an automation tool to run through multiple approaches, and the work is spread out around the cloud using Hadoop and Spark.
Oracle’s aim is to empower teams by handling the infrastructure chores. Data is stored in the Infrastructure Data Catalog where teams can control access to it. Spinning up an instance to handle computation in Oracle’s cloud is largely automated so teams can start and stop jobs quickly without working with DevOps.
The tools are incorporated into the cloud and billed according to usage. Beginners can start with the “Always Free” tier, which includes two autonomous databases, two compute machines and 100GB of storage. After that, costs vary according to how many compute jobs you fire up. Oracle estimates that a basic server with a GPU for accelerating machine learning will start at $30 per full day.
MathWorks was once known mainly by engineers and scientists for producing Matlab and Simulink. Now that data scientists are bringing these techniques to larger audiences, the tools are gathering attention in new shops.
The core of the system is Matlab, a tool that began life juggling large matrices for linear algebra problems. The system still supports this mission but now offers a collection of machine learning and AI algorithms that can be focused on other data such as text analysis. Matlab also offers optimization algorithms for finding the best solution given a set of constraints, as well as dozens of toolboxes designed to handle common problems in areas as diverse as risk management, autonomous driving and signal processing.
Free trials are available for 30 days. After that a full, perpetual license runs $2,150 per individual, but there are several groups such as students and academic institutions eligible for large discounts. You can also buy a shorter year-long license at a discount. Options are also available for pools of shared licenses.
The heart of the Databricks system is a data lake that fills up with the information that will be transformed into collaborative notebooks shared by data scientists and those in the enterprise who rely on their insights. Notebooks support multiple languages (R, Python, Java) and enable multiple users to revise and extend them at the same time while storing versions with Git. The tool provides a unified path for iterative exploration of data models built with machine learning algorithms.
At the core of the system are major open source projects, ranging from the data storage layer (Delta Lake) and the main computational platform (Apache Spark) to the algorithms (TensorFlow, MLflow). The computation resources are drawn from Azure or AWS.
Pricing is billed by the second for cloud machines booted with the Databricks image. The current charges on AWS machines add between 7 and 65 cents per hour depending on the computational power of the machine. More expensive tiers come with extra features such as role-based access control and HIPAA compliance.
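Per-second billing is easy to reason about with a little arithmetic. The sketch below estimates the platform surcharge alone, using the 7 and 65 cent hourly figures quoted above. The platform_surcharge function and the 90-minute job are made up for the example, and the cost of the underlying AWS machine is not included.

```python
def platform_surcharge(seconds, cents_per_hour):
    """Per-second billing: pay only for the seconds used; returns dollars."""
    return seconds * cents_per_hour / 3600 / 100


job_seconds = 90 * 60  # hypothetical 90-minute job
low = platform_surcharge(job_seconds, 7)
high = platform_surcharge(job_seconds, 65)
print(f"${low:.3f} to ${high:.3f}")
```

For a job of that length the surcharge lands between roughly a dime and a dollar, which is why shutting machines down once the reports are done matters more than the hourly rate itself.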
Others making data more accessible
Other tools and platforms are integrating similar ideas. Major cloud companies such as Google and Microsoft offer tools for analyzing data in their clouds. Azure Data Factory, for instance, offers a visual tool for extracting, transforming and loading data. Companies such as Tibco and SAS that once offered report generating tools under the umbrella of “business intelligence” are offering more sophisticated analysis that might properly be called “data science.”
Some tools focus more on machine learning and other forms of artificial intelligence. Amazon’s SageMaker simplifies the job of building, training and deploying a machine learning process, offering hundreds of algorithms and models in an open marketplace. H2O.ai offers what it calls “driverless AI,” an open source platform built with Apache Spark to simplify model creation and analysis.
They are all converging on a set of tools that accelerate our ability to explore our data and make more sense of what all of the numbers mean.