by Thor Olavsrud

Data preparation tools: Your analytics strategy’s secret weapon

Feature
Nov 05, 2019
Analytics | Big Data | Data Management

Data preparation is frequently cited as the leading roadblock to leveraging data within an organization. Choosing the right tool for your organization can help you break through.

Credit: chombosan / Getty Images

To reap the benefits of data analytics, you first have to get data preparation right. For many organizations, this is a significant bottleneck, with up to 70 percent of their time focused on data preparation tasks, according to recent research from Gartner.

“Finding, accessing, cleaning, transforming, and sharing the data, with the right people and in a timely manner, continues to be one of the most time-consuming roadblocks in data management and analytics,” says Ehtisham Zaidi, senior director analyst of Gartner’s Data and Analytics team and lead author of Gartner’s Market Guide for Data Preparation Tools.

For organizations seeking to transform their business with analytics, the chief problem is less about mastering AI and more about mastering the data pipeline, says Jonathan Martin, chief marketing officer of Hitachi Vantara.

“The data preparation piece is the piece that is most challenging,” he says. “How do I identify where all this data is? Am I able to build a portfolio? Am I able to engineer the pipelines to connect all those data sources together in an automated and managed and governed way to allow us to get that data to the right place, the right person, the right machine in the right time frame?”

Following is an in-depth look at why data preparation remains a significant analytics challenge, how data prep tools have evolved to address these issues, and what to look for when choosing data preparation tools for your business.

Data preparation challenges

Multiple factors contribute to the challenge of data preparation.

First, the number and complexity of data sources and data types needed to support analytics initiatives are increasing exponentially. Accessing these data sources across a distributed data ecosystem, both internal and external to the organization, requires significant time, resources, skills, and tools.

“It’s the complexity of data environments in this day and age,” says Stewart Bond, research director of the Data Integration and Integrity Software service at IDC. “There’s multiple different data types: There’s transactional data, master data, social media data, structured data, unstructured data, log file data, graph data. There’s all different kinds of data that is out there and there’s all different kinds of technologies that these data are being stored in.”

Second, the volume of requests for self-service data access and integration is leaving IT teams overwhelmed, a sign that the centralized IT approach to data integration no longer works, Zaidi says.

“IT needs to provision data access and integration through tools that are easy for the business users to use and understand, and this is where demand for data preparation is further escalating,” he says.

Third, data requirements keep changing, as business analysts, citizen integrators, line of business users, data engineers, and data scientists all have different data demands for their projects.

“This makes preparing data once and making it available to different personas/consumers for their ever-changing demands virtually impossible,” Zaidi says.

Next-gen data prep tools

As data preparation tools have matured, the pain points have largely shifted, he adds. The pain used to lie in which data sources to connect and which data to prepare; these days organizations are focusing on data governance, lineage, traceability, and quality. They’re also faced with ensuring that the right people with the necessary skills get access to the right data using data preparation tools.

Bond sums this up as an issue of “data intelligence” — the metadata about the data.

“It’s the intelligence to know where the data is, what the data means, who’s using it, who can get access to it, why we have the data, how long we need to keep the data, and how people are using it,” he says.

Thankfully, the data preparation tools market is evolving to include new features to address these issues. Previous-generation tools were limited to supporting simple data transformation requirements for last-mile data preparation tasks needed by business users. Next-generation tools now incorporate capabilities for sharing findings and prepared models with IT teams for operationalization, as well as data management features such as data cataloging, which enables users to view and search for connected data assets.

“Some tools also now come embedded with advanced data quality features which were missing in previous-generation tools,” Zaidi says. “These include profiling, tagging, annotating, deduplication, fuzzy logic matching, linking, and merging capabilities. These features make it easier for IT and data management teams to improve the quality and ensure governance and compliance for widespread adoption and usage of prepared data models.”
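
To make the deduplication and fuzzy-matching capabilities Zaidi mentions concrete, here is a minimal, standard-library-only sketch of the idea, not any vendor's implementation: the similarity threshold and the example records are illustrative assumptions.

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two values as duplicates if their normalized strings are similar enough."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe(records: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first occurrence from each cluster of fuzzy duplicates."""
    kept: list[str] = []
    for rec in records:
        if not any(is_fuzzy_match(rec, k, threshold) for k in kept):
            kept.append(rec)
    return kept

customers = ["Acme Corp.", "ACME Corp", "Globex Inc", "Acme Corporation"]
print(dedupe(customers))  # "ACME Corp" is dropped as a near-duplicate of "Acme Corp."
```

Note how the threshold drives behavior: at 0.85, "Acme Corporation" survives as a distinct record, which is exactly the kind of judgment call commercial tools tune with ML rather than a fixed cutoff.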

Here, machine learning (ML) is key. ML-based capabilities can not only automate the matching, joining, profiling, tagging, and annotating of data prior to preparation; some tools can also highlight sensitive attributes, flag anomalies and outliers, and integrate with metadata management and governance tools to prevent sensitive data from being exposed.
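
The flavor of these checks can be approximated even with simple statistics and pattern matching. A hedged, standard-library sketch follows; the z-score threshold and the email-only PII pattern are illustrative assumptions, far simpler than what production tools ship.

```python
import re
from statistics import mean, stdev

def flag_outliers(values: list[float], z_threshold: float = 3.0) -> list[float]:
    """Flag values more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > z_threshold]

# One toy PII pattern; real tools combine many patterns with learned classifiers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def looks_sensitive(column_values: list[str], min_ratio: float = 0.5) -> bool:
    """Tag a column as sensitive if most of its values match a PII pattern."""
    hits = sum(bool(EMAIL.fullmatch(str(v))) for v in column_values)
    return hits / len(column_values) >= min_ratio
```

A tool would run checks like these during profiling and surface the results as annotations, so a governance layer can restrict access to columns tagged as sensitive before anyone prepares or shares the data.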

“These machine learning augmented data preparation tools allow users of varying levels of skill to adopt data preparation and yet ensure governance and compliance,” Zaidi explains.

What to look for in a data preparation tool

As organizations evaluate modern data preparation tools, Zaidi says they should look for key capabilities:

  • Data ingestion and profiling. Look for a visual environment that enables users to interactively ingest, search, sample, and prepare data assets.
  • Data cataloging and basic metadata management. Tools should allow you to create and search metadata.
  • Data modeling and transformation. Tools should support data mashup and blending, data cleansing, filtering, and user-defined calculations, groups, and hierarchies.
  • Data security. Tools should include security features such as data masking, platform authentication, and security filtering at the user/group/role level.
  • Basic data quality and governance support. Data preparation tools should integrate with tools supporting data governance/stewardship and capabilities for data quality, user permissions, and data lineage.
  • Data enrichment. Tools should support basic data enrichment capabilities, including entity extraction and capturing of attributes from the integrated data.
  • User collaboration and operationalization. The tools should facilitate the sharing of queries and datasets, including publishing, sharing, and promoting models with governance features such as dataset user ratings or official watermarking.
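
The data masking capability in the list above can be illustrated with a short sketch. This is a toy pseudonymization approach, not any product's implementation; the salt value and field names are assumptions for the example.

```python
import hashlib

def mask_value(value: str, salt: str = "per-dataset-salt") -> str:
    """Replace a sensitive value with a stable pseudonym: the same input always
    yields the same token, so joins and group-bys still work on the masked column."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "tok_" + digest[:12]

def mask_column(rows: list[dict], field: str) -> list[dict]:
    """Mask one field across a list of dict records, leaving other fields intact."""
    return [{**row, field: mask_value(row[field])} for row in rows]

rows = mask_column([{"name": "Ann", "email": "ann@example.com"}], "email")
```

Keeping the masking deterministic preserves analytic utility, while the salt prevents trivial reversal by hashing guessed values, which is why per-dataset or per-tenant salts matter in practice.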

Additionally, Zaidi highlights the following differentiating capabilities to look for:

  • Data source access/connectivity. Tools should feature APIs and standards-based connectivity, including native access to cloud application and data sources, such as popular database PaaS and cloud data warehouses, on-premises data sources, relational and unstructured data, and non-relational databases.
  • Machine learning. Tools should support the use of machine learning and AI to improve or even automate the data preparation process.
  • Hybrid and multi-cloud deployment options. Data preparation tools need to support deployment in the cloud, on-premises, or in a hybrid integration platform setting.
  • Domain- or vertical-specific offerings or templates. Tools should provide packaged templates or offerings for domain- or vertical-specific data and models that can accelerate the time to data preparation.

Ultimately, Zaidi says one of the first things you must consider is whether your organization will go with a standalone data preparation tool or with a vendor that embeds data preparation into its broader analytics/BI, data science, or data integration tools. Consider standalone tools if you have a general-purpose use case that depends on integration of data for a range of analytics/BI and data science tools. On the other hand, if you need data preparation only within the context of a particular platform or ecosystem, it may make more sense to go with the embedded data preparation capability of those tools.

Data preparation market overview

Gartner breaks data preparation tools vendors into four categories, each of which is in flux as data preparation capabilities are being embedded across all data management and analytics tools.

Standalone data preparation tools. Vendors in this space focus on enabling tighter integration with downstream processes, such as API access and support for multiple analytics/BI, data science, and data integration tools. Tools in this space include offerings from vendors such as Altair, Datameer, Lore IO, Modak Analytics, Paxata and Trifacta.

Data integration tools. Vendors in this category have historically focused on data integration and management. This includes offerings from vendors such as Cambridge Semantics, Denodo, Infogix, Informatica, SAP, SAS, Talend, and TMMData.

Modern analytics and BI platforms. These vendors focus on data preparation as part of an end-to-end analytics workflow. Because data preparation is critical to modern analytics and BI, all vendors in the space are embedding data preparation capabilities, Zaidi says. Vendors in this category include Alteryx, Tableau, Cambridge Semantics, Infogix, Microsoft, MicroStrategy, Oracle, Qlik, SAP, SAS, TIBCO Software, and TMMData.

Data science and machine learning platforms. Gartner says these vendors provide data preparation capabilities as part of an end-to-end data science and ML process. Representative vendors include Alteryx, Cambridge Semantics, Dataiku, IBM, Infogix, Rapid Insight, SAP, and SAS.

In addition to the above four broad categories, Gartner sees new categories emerging with data preparation capabilities, including the following platforms and representative vendors:

  • Data management/data lake enablement platforms: Informatica, Talend, Unifi, and Zaloni
  • Data engineering platforms: Infoworks
  • Data quality tools: Experian
  • Data integration specialists: Alooma, Nexla, StreamSets, and Striim

6 key data preparation tools

The following six data preparation tools provide a more granular picture of what is available today.

Alteryx Designer

This standalone data preparation tool is also a part of the Alteryx Analytics and Data Science platform, meaning it is also embedded as a capability within a broader modern analytics and BI platform, and as a capability within a broader data science and machine learning platform. It offers a drag-and-drop workflow for profiling, preparing and blending data without SQL code. It is licensed on an annual subscription basis and priced per named user.

Cambridge Semantics Anzo

Anzo is Cambridge Semantics’ end-to-end data discovery and integration platform, and so crosses all four of Gartner’s categories. Anzo applies a semantic, graph-based data fabric layer over existing data infrastructure to map enterprise data, expose connections between datasets, enable visual exploration and discovery, and blend multiple datasets. Anzo is offered via subscription, with pricing based on the number of cores and number of users.

Datameer Enterprise

Datameer Enterprise is a data preparation and data engineering platform squarely in Gartner’s standalone category. It focuses on bringing together raw, disparate data sources to create a single data store using a wizard-led integration process. Datameer offers a spreadsheet-like interface for point-and-click blending and visual exploration capabilities. Customers are charged based on compute power or data volume. Cloud customers are charged hourly or via an annual license.

Infogix Data3Sixty Analyze

Infogix’s Data3Sixty Analyze is a web-based solution born from Infogix’s acquisition of Lavastorm. Like Anzo, it crosses all four of Gartner’s categories. Data3Sixty uses roles to define users: Designers can create and edit data flows, explorers can only execute data flows, and schedulers can create and modify schedules for automated processing. Infogix sells Data3Sixty as both a subscription-based desktop product and a server-based product offered on both a perpetual and subscription basis.

Talend Data Preparation

Talend offers three data preparation tools: Talend Data Preparation (an open source desktop version), Talend Data Preparation Cloud (a commercial version offered as part of the Talend Cloud platform), and another version of Talend Data Preparation (a commercial version that is part of the on-premises Talend Data Fabric offering). Talend Data Preparation is a standalone tool, while Talend Cloud and Talend Data Fabric are examples of data preparation integrated as a capability within a broader data integration/data management tool. Talend uses machine learning algorithms for standardization, cleansing, pattern recognition, and reconciliation. The open source version is free. The commercial versions follow a subscription model based on named user licenses.

Trifacta Wrangler

Trifacta Wrangler is a standalone data preparation platform that comes in various editions supporting cloud and on-premises computing environments. It offers embedded ML capabilities for recommending data with which to connect, inferring data structure and schema, recommending joins, defining user access, and automating visualizations for exploration/data quality. Trifacta Wrangler is offered in a free version, Wrangler Pro (with a charge based on compute capacity and number of users), Wrangler Enterprise (offered as both an on-premises version and a cloud version charged by scale of compute/processing and the number of users), and Google Cloud Dataprep by Trifacta (charged by compute consumption).