Data mining definition
Data mining, sometimes used synonymously with “knowledge discovery,” is the process of sifting large volumes of data for correlations, patterns, and trends. It is a subset of data science that uses statistical and mathematical techniques along with machine learning and database systems. The Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) defines it as the science of extracting useful knowledge from the huge repositories of digital data created by computing technologies.
The idea of extracting patterns from data is not new, but the modern concept of data mining began taking shape in the 1980s and 1990s with the use of database management and machine learning techniques to augment manual processes.
Data mining vs. data analytics
The terms data analytics and data mining are often conflated, but data analytics can be understood as a subset of data mining.
Data mining focuses on cleaning raw data, finding patterns, creating models, and then testing those models, according to analytics vendor Tableau. Data analytics, on the other hand, is the part of data mining focused on extracting insights from data. Its aim is to apply statistical analysis and technologies on data to find trends and solve problems.
The business value of data mining
Data mining is used at companies across a broad swathe of industries to sift through their data to understand trends and make better business decisions. Media and telecom companies mine their customer data to better understand customer behavior. Insurance companies use data mining to price their products more effectively and to create new products. Educators are mining data to discover patterns in student performance and identify problem areas where students might need special attention. Retailers are using data mining to better understand their customers and create highly targeted campaigns.
Data mining techniques
Data mining uses an array of tools and techniques. According to data integration and integrity specialist Talend, the most commonly used functions include:
- Data cleansing and preparation. Before data can be analyzed and processed, errors must be identified and removed and missing data must be identified and accounted for.
- Artificial intelligence. Data mining frequently leverages AI for tasks associated with planning, learning, reasoning, and problem solving.
- Association rule learning. Also known as market basket analysis, these tools are used to search for relationships among variables in a dataset. A retailer might use them to determine which products are typically purchased together.
- Clustering. This technique is used to partition a dataset into meaningful subclasses to understand the structure of the data.
- Data analytics. Data analytics is the process of extracting insight from data.
- Data warehousing. A data warehouse is a collection of business data. It’s the foundation of most data mining.
- Machine learning. Machine learning helps automate the process of finding patterns in your data.
- Regression. This technique is used with a particular data set to predict values such as sales, temperatures, or stock prices.
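As an illustration of the association rule learning (market basket analysis) technique described above, the sketch below computes support and confidence for item pairs in a toy set of transactions. The baskets, item names, and the `min_support` threshold are all invented for the example; real tools apply the same idea at far larger scale with algorithms such as Apriori.

```python
from itertools import combinations

# Toy transaction log: each set is one shopping basket (hypothetical data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of baskets containing every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def rules(transactions, min_support=0.4):
    """Yield (antecedent, consequent, support, confidence) for frequent pairs."""
    items = set().union(*transactions)
    for a, b in combinations(sorted(items), 2):
        s = support({a, b}, transactions)
        if s >= min_support:
            # confidence(a -> b) = support({a, b}) / support({a})
            yield a, b, s, s / support({a}, transactions)

for a, b, s, conf in rules(transactions):
    print(f"{a} -> {b}: support={s:.2f}, confidence={conf:.2f}")
```

On this toy data, only the bread → butter rule clears the support threshold, which is the kind of relationship a retailer would use to inform product placement or promotions.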
Data mining process
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a six-step process model that was published in 1999 to standardize data mining processes across industries. The six phases under CRISP-DM are: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Business understanding
This phase is about understanding the objectives, requirements, and scope of the project. It consists of four tasks: determining business objectives by understanding what the business stakeholders want to accomplish; assessing the situation to determine resource availability, project requirements, risks, and contingencies; determining what success looks like from a technical perspective; and defining detailed plans for each project phase, along with selecting technologies and tools.
Data understanding
The next phase involves identifying, collecting, and analyzing the data sets necessary to accomplish project goals. It also comprises four tasks: collecting initial data, describing the data, exploring the data, and verifying data quality.
Data preparation
This is often the biggest part of any project, and it consists of five tasks: selecting the data sets and documenting the reason for inclusion/exclusion, cleaning the data, constructing data by deriving new attributes from the existing data, integrating data from multiple sources, and formatting the data.
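A miniature sketch of the data preparation tasks above: cleaning errors, handling missing values, and constructing a derived attribute. The records, field names, and cleaning rules here are hypothetical examples, not part of the CRISP-DM standard itself.

```python
# Hypothetical raw customer records with a missing value and an error.
raw_records = [
    {"customer": "A101", "spend": "250.0", "visits": "5"},
    {"customer": "A102", "spend": "",      "visits": "2"},  # missing spend
    {"customer": "A103", "spend": "-40",   "visits": "1"},  # error: negative
    {"customer": "A104", "spend": "90.0",  "visits": "3"},
]

def prepare(records):
    """Clean records and construct a derived attribute (spend per visit)."""
    cleaned = []
    for r in records:
        if r["spend"] == "":       # identify missing data -> drop the row
            continue
        spend = float(r["spend"])
        if spend < 0:              # identify and remove errors
            continue
        visits = int(r["visits"])
        cleaned.append({
            "customer": r["customer"],
            "spend": spend,                                  # formatted as numeric
            "visits": visits,
            "spend_per_visit": round(spend / visits, 2),     # derived attribute
        })
    return cleaned

print(prepare(raw_records))
```

A real project would make deliberate choices here (imputing rather than dropping missing values, for instance) and document them, as the selection task in this phase requires.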
Modeling
Building models from data has four tasks: selecting modeling techniques, generating test designs, building models, and assessing models.
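The four modeling tasks can be sketched in miniature. Below, a one-variable least-squares model stands in for whatever technique a real project would select; the test design is a simple holdout split, and the model is assessed by mean absolute error on the held-out data. The numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical data: monthly ad spend (x) vs. sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]

# Generate a test design: hold out the last third for assessment.
x_train, y_train = x[:4], y[:4]
x_test, y_test = x[4:], y[4:]

a, b = fit_line(x_train, y_train)                # build the model
mae = sum(abs((a * xi + b) - yi)                 # assess on held-out data
          for xi, yi in zip(x_test, y_test)) / len(x_test)
print(f"slope={a:.2f} intercept={b:.2f} test MAE={mae:.3f}")
```

Keeping assessment data separate from training data, as the holdout split does here, is what lets the later evaluation phase judge whether a model generalizes rather than merely memorizes.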
Evaluation
While the modeling phase includes technical model assessment, this phase is about determining which model best meets business needs. It involves three tasks: evaluating results, reviewing the process, and determining next steps.
Deployment
The final phase is about putting the model to work. It includes four tasks: developing and documenting a plan for deploying the model, developing a monitoring and maintenance plan, producing a final report, and reviewing the project.
In 2015, IBM published an extension to CRISP-DM called the Analytics Solutions Unified Method for Data Mining (ASUM-DM). It takes CRISP-DM as a baseline but builds out the deployment phase to include collaboration, version control, security, and compliance.
Data mining software and tools
Companies use a variety of data mining software and tools to support their efforts. Some of the more popular software and tools include:
- H2O. This open source machine learning platform can be integrated through an API and uses distributed in-memory computing for analyzing massive datasets.
- IBM SPSS Modeler. IBM’s visual data science and machine learning solution can be used for data preparation, discovery, predictive analytics, model management, and deployment.
- Knime. Open source platform Knime is aimed at data analytics, reporting, and integration.
- Oracle Data Mining (ODM). ODM is part of Oracle Database Enterprise Edition, offering data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics.
- Orange Data Mining. Orange is an open source data visualization, machine learning, and data mining toolkit.
- R. This open source programming language and free software environment is widely used by data miners. R also has commercial support and extensions, notably from Revolution Analytics. Microsoft acquired Revolution Analytics in 2015 and has integrated R with its SQL Server offerings, Power BI, Azure SQL Managed Instance, Azure Cortana Intelligence, Microsoft ML Server, and Visual Studio 2017. Oracle, IBM, and Tibco also support R in their offerings.
- RapidMiner. Geared for teams, the RapidMiner data science platform supports data prep, machine learning, and predictive model deployment.
- SAS Enterprise Miner. SAS Enterprise Miner is aimed at creating predictive and descriptive models on large volumes of data from sources across the organization.
- Sisense. Sisense’s BI stack covers everything from the database through ETL and analytics to visualization.
Data mining jobs
Data mining is most often conducted by data scientists or data analysts. Here are some of the most popular job titles related to data mining and the average salary for each position, according to data from PayScale:
- Business intelligence analyst: $52K-$90K
- Business intelligence architect: $72K-$140K
- Business intelligence developer: $62K-$109K
- Data analyst: $43K-$90K
- Data engineer: $44K-$141K
- Data scientist: $66K-$130K
- Senior data analyst: $63K-$108K
- Statistician: $44K-$159K