Buried Treasure

Some dos and don'ts of data mining

Merely storing information in a data warehouse does a company little good, no matter how neatly the data is stacked and organized. Getting information out of the warehouse is what allows organizations to reap the benefits of data warehousing, and data mining is one of the best ways to extract meaningful trends and patterns from a vast pile of data.

Although data mining is still in its infancy, companies in a wide range of industries -- including retail, finance, medicine, manufacturing, transportation and aerospace -- are already using data mining tools and techniques to take advantage of historical data gathered internally or acquired from other organizations. By using pattern recognition technologies and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed (see boxed examples). Companies also can use mining techniques to visualize the data, or present it in an easily digestible format, as well as to check for holes in the underlying data store.

The demands of data mining can easily overwhelm today's technology and products. Conventional mainframes are seldom able to run brute-force multiple queries on large data sets, and although the memory and processing power of many PCs and workstations are increasing, the available software is not always up to the task. Most analysis routines can handle only small samples of data at a time, making wide-ranging analysis difficult and time-consuming.

A methodical approach to data mining increases the chances of overcoming those barriers. Here are the main steps of Gartner Group Inc.'s data mining methodology:

1. Database selection and preparation. To mine data effectively, the warehouse must be set up properly. The first step is to identify the databases and factors to be explored. If possible, a live data dictionary should be created from which required records can be retrieved into the flat files needed by most analysis routines. This step is very complex: The databases of interest may be maintained by multiple departments, on various hardware platforms and operating systems, or in separate locations.
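
As a rough illustration of retrieving records into a flat file, the Python sketch below pulls rows from a database and writes them out as comma-separated text. Everything specific in it is an assumption made for the example: the in-memory SQLite database, the `sales` table, its columns and the `export_flat_file` helper are hypothetical stand-ins, not part of Gartner Group's methodology.

```python
import csv
import io
import sqlite3


def export_flat_file(conn, query, out):
    """Write the rows returned by `query` to `out` as CSV, header row first."""
    cursor = conn.execute(query)
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical demo data standing in for one departmental database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "east", 120.0), (2, "west", 75.5), (3, "east", 310.25)],
)

buffer = io.StringIO()  # a real job would open a file on disk instead
rows_written = export_flat_file(
    conn,
    "SELECT customer_id, region, amount FROM sales ORDER BY customer_id",
    buffer,
)
flat_text = buffer.getvalue()
```

In practice the hard part is the one the sketch leaves out: reconciling field names, formats and keys across departments and platforms before the rows ever reach the flat file.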

Data preparation involves filling in missing values and correcting errors. The referential integrity controls of modern relational databases have improved data quality, but legacy databases may be incomplete or full of errors. Interpolating missing data can be dangerous, particularly when dealing with small samples.
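
One simple way to fill in missing values is to substitute the mean of the observed ones. The Python sketch below does that but refuses to proceed when too few values were actually observed, which is one way to guard against the small-sample danger. The `fill_missing_with_mean` helper, the `min_observed` threshold of 5 and the sample ages are all assumptions chosen for illustration.

```python
from statistics import mean


def fill_missing_with_mean(values, min_observed=5):
    """Replace None entries with the mean of the observed values.

    Raises ValueError when fewer than `min_observed` values are present,
    because a mean taken from a tiny sample is an unreliable stand-in.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < min_observed:
        raise ValueError(
            f"only {len(observed)} observed values; refusing to impute"
        )
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    return filled, len(values) - len(observed)


# Hypothetical customer ages with two gaps.
ages = [34, None, 41, 29, None, 38, 45]
filled_ages, n_imputed = fill_missing_with_mean(ages)
```

Returning the number of imputed values alongside the data lets an analyst see how much of the sample is estimated rather than recorded.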

2. Clustering and feature analysis. The large database groups defined during the preparation phase are divided further using clustering techniques. That is followed by a more detailed feature analysis to find the factors that most obviously contribute to the formation of the clusters and to determine which factors are involved in attaining particular business goals. Clustering and feature analysis can pare down the problem scope in terms of the number of factors or records to examine.
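
The article does not name a particular clustering technique. As one possible sketch, the Python below uses plain k-means and then ranks factors by how far apart the cluster centers sit on each one, a crude stand-in for feature analysis. The records, the factor names and the assumption that both factors are already scaled to the range 0 to 1 are hypothetical.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # leave an empty cluster's centroid where it was
                centroids[k] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


def feature_separation(centroids):
    """Per factor, the gap between the largest and smallest centroid value."""
    return [max(col) - min(col) for col in zip(*centroids)]


# Hypothetical records: (monthly_spend, visits_per_month), both scaled to 0..1.
feature_names = ["monthly_spend", "visits_per_month"]
points = [
    (0.10, 0.50), (0.15, 0.45), (0.20, 0.55),
    (0.80, 0.50), (0.85, 0.55), (0.90, 0.45),
]
labels, centroids = kmeans(points, [points[0], points[3]])
separation = feature_separation(centroids)
ranked = sorted(zip(feature_names, separation), key=lambda pair: pair[1], reverse=True)
```

Here the clusters differ almost entirely on the first factor, so the second could be dropped from later analysis, which is the sense in which clustering and feature analysis pare down the problem.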

3. Tool selection. Many data mining tools are available, but most are incomplete and may have to be combined with techniques or systems already developed within an enterprise. Before acquiring a tool or a technology, Gartner Group recommends conducting a thorough analysis. Important questions that must be answered include the following:

  • How many examples can be handled at once? Even tools advertised as capable of handling large databases can't deal with amounts of data that exceed the computer's processing power and memory; that is especially true for PC-based products. That limitation is particularly important when analysts need to evaluate many factors because the number of examples the product can process may be insufficient.
  • How much preprocessing is required? Some high-level methods may process only discrete value ranges, which requires data preprocessing and normalization. In addition, data formats are an issue as some products process numeric input exclusively. Complete tools should come with database access, translation and preprocessing capabilities.
  • Can the user express and test top-down hypotheses, or is the process only bottom-up? Users should be able to substantiate hypotheses by testing them with specific facts or records (top-down analysis). The system also should be able to build hypotheses from individual facts while allowing users to modify the facts to perform what-if investigations (bottom-up analysis).
  • Does the system generate rules, models, decision trees or numbers? If an explanation of the results is the main goal, rules and induction techniques are appropriate. If finding the best combination of factors is the goal and a literal explanation is not required, neural networks and fractal techniques should be used.
  • How easily can a data model be updated when new information is available? Discovery tools should be able to process all data, no matter how often the data changes or how fast it enters the system. Systems that require long training or processing times may be inappropriate for organizations with rapidly changing data.
  • How much effort and expertise are required to use the technology? Products range from highly automated tools that require few user adjustments to manual systems that demand considerable technical knowledge. The choice should be based on the skill levels of prospective users and developers as well as the complexity of the problem. In most cases, consulting expertise will be required.
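
To make the preprocessing question concrete, the Python sketch below shows two common steps a tool may expect to be done beforehand: scaling numeric values to a common range and converting them into discrete value ranges. The income figures, band edges and band labels are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
from bisect import bisect_right


def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges, labels):
    """Map each value to a labeled range.

    `edges` must be sorted; a value equal to an edge falls in the higher range.
    `labels` needs exactly one more entry than `edges`.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly len(edges) + 1 labels")
    return [labels[bisect_right(edges, v)] for v in values]


# Hypothetical annual incomes and band boundaries.
incomes = [18000, 42000, 75000, 120000]
normalized = min_max_normalize(incomes)
bands = discretize(incomes, edges=[30000, 80000], labels=["low", "mid", "high"])
```

A tool that ships with steps like these built in spares the analyst from writing and maintaining them separately for every data source.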

4. Hypothesis testing and knowledge discovery. This step is most often associated with the term "data mining." During this process, hypotheses are formed and tested, new relationships are discovered and what-if analyses may be performed. Many issues come into play, such as sample size, processing time, complexity of the data and degree of confidence. The output of the data mining process depends on the product and technology used and often is in the form of rules, correlations, prediction models, relationship graphs or decision trees.
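
A top-down test of a single hypothesis can be sketched in a few lines of Python. The example checks an if-then rule against records and reports two measures borrowed from association-rule mining, which the article itself does not name: support (the share of all records that satisfy both parts of the rule) and confidence (the share of records matching the "if" part that also match the "then" part). The records and the rule are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule 'if antecedent then consequent'."""
    total = len(records)
    matches_if = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_if if consequent(r)]
    support = len(matches_both) / total if total else 0.0
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


# Hypothetical customer records.
records = [
    {"region": "east", "premium": True},
    {"region": "east", "premium": True},
    {"region": "east", "premium": True},
    {"region": "east", "premium": False},
    {"region": "west", "premium": True},
    {"region": "west", "premium": False},
    {"region": "west", "premium": False},
    {"region": "west", "premium": False},
]

# Hypothesis: customers in the east region buy the premium product.
support, confidence = rule_stats(
    records,
    antecedent=lambda r: r["region"] == "east",
    consequent=lambda r: r["premium"],
)
```

With only eight records, figures like these say little about the wider customer base, which is why sample size and degree of confidence belong in the same discussion.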

5. Knowledge application. In most cases, tested rules created from the discovery process can be added directly to either procedural code or -- if there are many rules and updates are likely -- into a knowledge-based system. Prediction models can often be integrated directly into application code, a particularly easy process with products that output their models in common languages such as C.
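
The two options can be contrasted in a short sketch, written here in Python rather than C. In the first function the rules are wired into procedural code. In the second they live in a table that can be extended without touching the logic that evaluates them, a very rough stand-in for a knowledge-based system. The rule names, thresholds and actions are hypothetical.

```python
def actions_hardcoded(record):
    """Rules embedded directly in procedural code; changing one means editing code."""
    actions = []
    if record["region"] == "east" and record["annual_spend"] > 1000:
        actions.append("offer_premium_catalog")
    if record["months_since_purchase"] >= 12:
        actions.append("send_reactivation_mailer")
    return actions


# The same rules held as data: (name, condition, action).
RULES = [
    (
        "high_value_east",
        lambda r: r["region"] == "east" and r["annual_spend"] > 1000,
        "offer_premium_catalog",
    ),
    (
        "lapsed_customer",
        lambda r: r["months_since_purchase"] >= 12,
        "send_reactivation_mailer",
    ),
]


def actions_from_rules(record, rules=RULES):
    """Evaluate every rule in the table and collect the actions that fire."""
    return [action for _name, condition, action in rules if condition(record)]


# Hypothetical customer record.
customer = {"region": "east", "annual_spend": 1500, "months_since_purchase": 14}
fired = actions_from_rules(customer)
```

For a handful of stable rules the hard-coded version is simpler. The table pays off once rules are numerous or change often, since each update is a new entry rather than a code change.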

Data mining requires substantial human effort and interaction. The mechanisms for steering the process are still relatively new and are inadequate for dealing with the myriad factors and interactions in large data stores. Shrink-wrapped packages may promise enticing ease of use, but in many cases both technology-specific expertise and relevant domain knowledge are necessary.

Regardless of the technology underlying the data mining process, the value of discovered data -- especially in retail marketing and finance -- is time-sensitive. The first enterprises to exploit the data will have the upper hand in serving and attracting customers.

Copyright © 1996 IDG Communications, Inc.
