Categorization Software Improves Search Capabilities
Even companies that can afford manual tagging have reasons to look at autocategorization. Chat Joglekar, business development manager at USAToday.com, says that the major benefit of autocategorization for his company is consistency. USAToday.com had long used editors to do manual categorizing, or had avoided categorization altogether. But as the sheer mass of online material and the total number of editors kept growing and changing, the slight idiosyncrasies in how each of them categorized information steadily degraded the search function’s performance. Now the online newspaper takes advantage of a product called Concept Server from Applied Semantics. While machines may have their peculiarities, at least their biases are consistent over both time and scale of operation.
Raymond Karrenbauer, CTO of ING Americas’ Technology Management Office, reports a fourth payoff: Automatic categorization and taxonomy make it easier for a company to add uncategorized or weakly categorized material, such as e-mail messages or ING’s more than 40,000 different formats of unstructured data, to its searchable data space. He adds that categorization improves the work of internal users by allowing customer service reps, for instance, to find what they need faster.
Several trends have combined to make those new services possible. First, two relevant "natural language recognition" technologies have matured almost simultaneously. One maps the frequencies of words in a document and their positions relative to each other to generate a document profile. The software then compares that profile with the profiles of previously categorized reference documents, those of other new documents or both. The first comparison sorts new documents into established categories; the second recognizes new topical "clusters" that probably should be explicit categories. For instance, if two documents have China and ceramics within 10 words of each other, the odds that they should be in the same category go up. Autonomy’s product relies on that approach.
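To make the statistical approach concrete, here is a minimal sketch in Python. It is not Autonomy's algorithm, whose internals the article does not describe. It assumes a document profile is simply a count of single words plus word pairs that occur within 10 words of each other, and that profiles are compared by cosine similarity. The function names and the sample reference texts are invented for illustration.

```python
import math
import re
from collections import Counter

WINDOW = 10  # assumption: mirrors the "within 10 words" example in the text


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def profile(text, window=WINDOW):
    """Count single words plus unordered word pairs seen within `window` words."""
    words = tokenize(text)
    features = Counter(words)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                features[tuple(sorted((word, other)))] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two profiles (0.0 when either is empty)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(text, references):
    """Return the category whose reference text has the most similar profile."""
    doc = profile(text)
    return max(references, key=lambda name: cosine(doc, profile(references[name])))


# Hypothetical, previously categorized reference documents.
REFERENCES = {
    "pottery": "Ming dynasty China produced fine ceramics and porcelain vases",
    "trade": "China exports grew as trade tariffs fell and shipping costs dropped",
}

print(categorize("Collectors prize antique ceramics from China", REFERENCES))
```

Because the new document shares the nearby pair of China and ceramics with the pottery reference, not just the single word China, it scores closer to that category. Comparing new documents with one another instead of with references would be the starting point for the clustering case.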
The second technique (the one used by Applied Semantics, Inquira and others) relies on semantics. Given a document, such a program first filters out the important words, then looks up their synonyms, meanings and their thematic relationships (for example, the term chair would be linked to furniture and rocking). Finally it counts the number of these relationships to decide which words are most likely to reflect the document’s major and minor themes. Theoretically such a system can figure out whether an article on chips belongs under food, gambling, computers or horses, even if none of those specific terms appears in the document.
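The semantic approach can be sketched in a few lines of Python as well. This is only an illustration under stated assumptions, not the method of Applied Semantics or Inquira: the tiny lexicon below is hand-written and hypothetical, whereas a real product would draw on a large curated ontology, and the theme "vote" is a plain count of word-to-theme links.

```python
from collections import Counter

# Hypothetical lexicon linking words to the themes they can relate to.
# Words absent from the lexicon are treated as unimportant and ignored.
RELATED_THEMES = {
    "chips": {"food", "gambling", "computers"},
    "silicon": {"computers"},
    "processor": {"computers"},
    "wafer": {"computers", "food"},
    "salsa": {"food"},
    "poker": {"gambling"},
    "dealer": {"gambling"},
}


def rank_themes(text):
    """Count word-to-theme links and return themes from most to least supported."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    votes = Counter()
    for word in words:
        for theme in RELATED_THEMES.get(word, ()):
            votes[theme] += 1
    return votes.most_common()


article = "The new chips use a smaller silicon wafer and a faster processor."
print(rank_themes(article))
# [('computers', 4), ('food', 2), ('gambling', 1)]
```

The word computers never appears in the sample sentence, yet it emerges as the major theme because the surrounding words all link to it. The lower-ranked themes correspond to what the article calls minor themes.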
Perhaps the best news for vendors designing autocategorization products, however, has nothing to do with research breakthroughs. Today, more and more information travels with a lengthening entourage of data about itself (such as e-mail headers or meta-tags in webpages). Autocategorization software can recognize and leverage that data for its own ends. For example, iPhrase Technologies specializes in finding and harvesting, or "spidering," categorization information across many data types. "Three to four years ago, we had to code up explicit structure with every deployment," says Senior Product Manager Roy Rodenstein. "But today our clients have much richer data."
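As a rough illustration of harvesting the data that travels with a document, the Python sketch below pulls meta-tags from a webpage and headers from an e-mail message using only the standard library. It is not iPhrase's spidering technology. The function names, the list of headers collected and the sample inputs are assumptions made for the example.

```python
from email import message_from_string
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name, content = attributes.get("name"), attributes.get("content")
            if name and content:
                self.meta[name.lower()] = content


def html_metadata(html):
    """Return a dict of meta-tag names to their content strings."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


def email_metadata(raw_message, wanted=("From", "To", "Subject", "Date")):
    """Return the requested headers that are present in a raw e-mail message."""
    message = message_from_string(raw_message)
    return {header: message[header] for header in wanted if message[header] is not None}


page = '<html><head><meta name="keywords" content="ceramics, China"></head></html>'
mail = "From: rep@example.com\nSubject: Quarterly claims report\n\nBody text"

print(html_metadata(page))
print(email_metadata(mail))
```

Fields gathered this way could then be fed to a categorizer as extra evidence alongside the document text.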



