by Fred Hapgood

Categorization Software Improves Search Capabilities

May 01, 2003 (9 mins)
Enterprise Applications

More and more, the problems that earn CIOs their paychecks revolve around making it easier for users to explore huge volumes of data. That means helping users find known objects in huge search spaces, assemble top-down overviews that summarize the important points of a topic, and decide what they really want when their initial search ideas are confused, misguided or ambiguous.

At one time, researchers speculated that solving such search problems might require artificial intelligence: systems that simulated human thought and could behave like skilled reference librarians. But there is an easier solution: ordering data into categories and subcategories and then having users interact with that structure before looking at the raw results. Consider a hungry New Yorker looking for a place to eat. A search under “New York AND restaurant” that returned only a list of actual eateries would produce a list too long to scan. On the other hand, if the results came packaged in an easy-to-scan collection of restaurant types (Italian, French, Asian and, if necessary, subtypes under that: Korean, Japanese, Vietnamese and so on), the whole set of New York restaurants suddenly becomes navigable.

Categorization also helps with other issues. It solves the overview problem by formatting different categories (restaurant types, locations, price ranges, ratings) side by side, presenting the searcher with a multifaceted, top-down perspective. The same formatting trick helps searchers who don’t quite know what they want by letting them examine query results from several angles at once, interactively.

Category trees are not new. Until recently, however, IT applications required paid humans to think up the category names, define their relationships and write the rules that channeled data into the proper boxes. As a result, the technique was limited to fields with big budgets, such as financial analysis or defense. During the past few years, however, several developments have made it much easier to automate or at least semiautomate categorization, sparking a small revolution in the sophistication of enterprise-level search engines and the number and kinds of users a system can help.

These systems, however, are not exactly plug and play (at least today) and may require significant time to establish the rules that ultimately create the final categories. But with proper investment, autocategorization tools can deliver significant benefits.

Parsing Parts

In 2000, components distributor Arrow Electronics built and started to sell subscriptions to Ubiquidata, a components database made up of information about more than 23 million items, each with as many as 50 related data elements. The company initially marketed the product to purchasing and material planning professionals within original equipment manufacturers (OEMs). For clients such as those, searching the huge data set was no problem, since they usually knew exactly what they were after, often right down to the manufacturer’s part number.

Arrow, however, wanted to bring the service to another group: design engineers. Unlike line managers, designers seldom know what they are looking for ahead of time. They start with a wish list of properties for the perfect part, filter out candidates that come close but not close enough, and then find the best compromise by carefully comparing the remaining parts and fine-tuning their design. The very last thing they learn is the part number. Customers such as those require a very different set of searching tools.

In response, Arrow struck up a partnership with Endeca Technologies, a startup search vendor that specializes in “query discovery” software: searches that use the experience of navigating around, through and over complex category landscapes to help searchers figure out what they want.

After a development phase of about six months, the search application was ready for the design engineers. Today a user searching the Arrow database can organize results by several interacting categories. For instance, suppose she is looking at the power, size and price categories, and she clicks on a specific range of power (say, 10 to 20 watts). The listings in the size and price categories then automatically change to present just the sizing and pricing of the parts in the desired power range.
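The interaction described above can be sketched in a few lines of Python. This is only an illustration of the general idea of interacting categories, not Endeca's or Arrow's actual implementation; the parts, field names and price bands below are all invented for the example.

```python
from collections import Counter

# Hypothetical parts catalog: every id, field name and value is made up.
PARTS = [
    {"id": "P1", "power_w": 5,  "size": "small",  "price": "under $1"},
    {"id": "P2", "power_w": 12, "size": "small",  "price": "$1 to $5"},
    {"id": "P3", "power_w": 15, "size": "medium", "price": "$1 to $5"},
    {"id": "P4", "power_w": 18, "size": "medium", "price": "over $5"},
    {"id": "P5", "power_w": 40, "size": "large",  "price": "over $5"},
]


def facet_counts(parts, facets):
    """Count how many parts fall under each value of each listed category."""
    return {facet: Counter(part[facet] for part in parts) for facet in facets}


def select_power_range(parts, low, high):
    """Keep only the parts rated between low and high watts, inclusive."""
    return [part for part in parts if low <= part["power_w"] <= high]


# The listings the user sees before clicking anything.
before = facet_counts(PARTS, ["size", "price"])

# The user clicks the 10-to-20-watt range; the other categories are recounted
# over the narrowed set, so they show only sizes and prices still available.
narrowed = select_power_range(PARTS, 10, 20)
after = facet_counts(narrowed, ["size", "price"])

print("before:", before)
print("after: ", after)
```

In this toy catalog the "large" size and the "under $1" price band disappear from the listings once the power range is chosen, because no remaining part has those values.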

The new service started in June 2002, and its success has allowed Arrow to change Ubiquidata’s licensing model from seats to sites, says Chris Henry, Arrow’s vice president and global information business unit general manager. In other words, the database is now easy enough to use that enterprises prefer to let anyone in the company (not just specific individuals) log on and poke around.

The Politics of Searching

Automatic categorization can do more than just expand markets. “It’s difficult for anyone who hasn’t lived through it to appreciate how political categorization management is,” observes Scott Lundstrom, CIO of AMR Research. “We had a category nomination process. We had a category retirement process. They all required long meetings.” Maintaining and supervising that process consumed a full-time IT position.

Then AMR moved to an autocategorizing product from Autonomy, and things changed for the better. “Today we’re increasingly relying on the software to do category recommendations,” Lundstrom says. “Everybody can see that it recognizes more relationships and that it isn’t biased.” And Lundstrom got his developer back, which made the CIO happiest of all.

U.S. Robotics (USR) is hoping to extract efficiencies from a different source. “We make a low-margin product,” says IT Director Steve Kossel. “One call to our support desk wipes out our profit on that sale.” Surveys show that 90 percent of users calling technical support had visited the USR website before calling. While the jury is still out on USR’s experiment with autocategorization (using tools from iPhrase Technologies), Kossel believes that the products will improve the precision and responsiveness of support on the USR website sufficiently to cut the number of support calls by a third, saving the company more than $135,000 a month.

Even companies that can afford manual tagging have reasons to look at autocategorization. Chat Joglekar, business development manager at an online newspaper, says that the major benefit of autocategorization for his company is consistency. The newspaper had long used editors to do manual categorizing, or had avoided categorization altogether. But as the sheer mass of online material and the total number of editors kept growing and changing, the slight idiosyncrasies in how each of them categorized information steadily degraded the search function’s performance. Now the online newspaper takes advantage of a product called Concept Server from Applied Semantics. While machines may have their peculiarities, at least their biases are consistent over both time and scale of operation.

Raymond Karrenbauer, CTO of ING Americas’ Technology Management Office, reports a fourth payoff: Automatic categorization and taxonomy make it easier for a company to add uncategorized or weakly categorized material, such as e-mail messages or ING’s more than 40,000 different formats of unstructured data, to its searchable data space. He adds that categorization improves the work of internal users, allowing customer service reps, for instance, to find what they need faster.

Several trends have combined to make those new services possible. First, two relevant “natural language recognition” technologies have matured almost simultaneously. One maps the frequencies of words in a document and their positions relative to each other to generate a document profile. The software then compares that profile with the profiles of previously categorized reference documents, those of other new documents or both. The first comparison sorts new documents into established categories; the second recognizes new topical “clusters” that probably should be explicit categories. For instance, if two documents have China and ceramics within 10 words of each other, the odds that they should be in the same category go up. Autonomy’s product relies on that approach.
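A rough sense of how such profiling works can be had from the sketch below. It is a simplified stand-in, not Autonomy's method: it assumes plain word counts compared by cosine similarity, plus a separate check for two words falling within 10 words of each other. The reference categories and sample text are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def profile(text):
    """Build a word-frequency profile of a document."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two word-frequency profiles (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def within(text, first, second, window=10):
    """True if the two words occur within `window` words of each other."""
    words = tokenize(text)
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= window for i in pos_first for j in pos_second)


# Hypothetical, previously categorized reference documents.
REFERENCE = {
    "ceramics": "porcelain vase glaze kiln china ceramics pottery glaze",
    "finance": "bond yield stock market interest rate bond fund",
}


def categorize(text):
    """Return the best-matching reference category and all similarity scores."""
    doc = profile(text)
    scores = {name: cosine(doc, profile(ref)) for name, ref in REFERENCE.items()}
    return max(scores, key=scores.get), scores


new_doc = "The auction featured ceramics from China, including a glazed porcelain vase."
best, scores = categorize(new_doc)
print(best, scores)
print(within(new_doc, "china", "ceramics"))
```

Comparing a new document against other new documents instead of against the reference set, with the same similarity function, is one simple way to surface the clusters mentioned above.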

The second technique (the one used by Applied Semantics, Inquira and others) relies on semantics. Given a document, such a program first filters out the important words, then looks up their synonyms, meanings and their thematic relationships (for example, the term chair would be linked to furniture and rocking). Finally it counts the number of these relationships to decide which words are most likely to reflect the document’s major and minor themes. Theoretically such a system can figure out whether an article on chips belongs under food, gambling, computers or horses, even if none of those specific terms appears in the document.
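The chips example can be mimicked with a toy version of the idea. The sketch below assumes a tiny hand-built lexicon that links words to themes and simply counts the links; it is not how Applied Semantics or Inquira build their products, and every lexicon entry is invented for illustration.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon: each word maps to the themes it relates to.
# The ambiguous word "chips" is linked to all four candidate themes.
RELATED_THEMES = {
    "chips": {"food", "gambling", "computers", "horses"},
    "salsa": {"food"},
    "snack": {"food"},
    "wafer": {"food", "computers"},
    "poker": {"gambling"},
    "casino": {"gambling"},
    "silicon": {"computers"},
    "processor": {"computers"},
    "stable": {"horses"},
    "pasture": {"horses"},
}


def theme_counts(text):
    """Count thematic links for every lexicon word found in the text.

    Words missing from the lexicon are treated as unimportant and ignored.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for theme in RELATED_THEMES.get(word, ()):
            counts[theme] += 1
    return counts


def major_theme(text):
    """Return the theme with the most links, or None if nothing matched."""
    counts = theme_counts(text)
    return counts.most_common(1)[0][0] if counts else None


doc = "The new chips use a smaller silicon wafer, so each processor runs cooler."
print(theme_counts(doc))
print(major_theme(doc))
```

The sample sentence never uses the word "computers", yet that theme collects the most links because silicon, wafer and processor all point to it, while "chips" alone votes equally for every theme.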

Perhaps the best news for vendors designing autocategorization products, however, has nothing to do with research breakthroughs. Today, more and more information travels with a lengthening entourage of data about itself (such as e-mail headers or meta-tags in webpages). Autocategorization software can recognize and leverage that data for its own ends. For example, iPhrase Technologies specializes in finding and harvesting, or “spidering,” categorization information across many data types. “Three to four years ago, we had to code up explicit structure with every deployment,” says Senior Product Manager Roy Rodenstein. “But today our clients have much richer data.”
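As a hedged illustration of what "data about itself" looks like to software, the sketch below pulls a few fields from an e-mail header and from the meta-tags of a webpage using only Python's standard library. It does not represent iPhrase's spidering technology; the sample message, page and chosen fields are assumptions made for the example.

```python
from email import message_from_string
from html.parser import HTMLParser


def email_metadata(raw):
    """Read a few header fields from a raw e-mail message."""
    msg = message_from_string(raw)
    return {"from": msg["From"], "subject": msg["Subject"], "date": msg["Date"]}


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags as a page is parsed."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def html_metadata(page):
    """Return the meta-tag data found in an HTML page."""
    collector = MetaTagCollector()
    collector.feed(page)
    return collector.meta


# Invented sample inputs.
RAW_EMAIL = (
    "From: support@example.com\n"
    "Subject: Modem driver update\n"
    "Date: Thu, 01 May 2003 09:00:00 -0400\n"
    "\n"
    "Body text."
)
PAGE = (
    '<html><head><meta name="keywords" content="modem, driver">'
    '<meta name="description" content="Driver downloads"></head></html>'
)

print(email_metadata(RAW_EMAIL))
print(html_metadata(PAGE))
```

Fields harvested this way could then serve as ready-made categories (sender, topic keywords, date) without anyone having to code that structure by hand.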

All those trends have made autocategorization, and therefore smarter searching tools, a bright spot in today’s IT scene. Many companies have entered the sector. Some, such as Endeca or Mercado Software, specialize in the display and management of the categories that users see and interact with. Others, such as Applied Semantics, Autonomy and GammaSite, focus on the back end: looking at input documents and creating the meta-data the display tools need in order to work. Another set of companies, including longtime search player Verity, does both.

There’s no sign that advances in categorization and search technology will slow down anytime soon either. If searching is the foundation of all our relations with online data, and categorization is the foundation of intelligent searching, then it seems likely that CIOs are going to be boosting the IQ of their searching tools for some time to come. Smart searching might very well become as important to the face of an enterprise as smart salespeople.