by Fred Hapgood

Data Trends: Petabyte and Beyond

News

Oct 15, 2002

Suppose you came to work one day, took off your jacket, loosened your tie, sat down and found a letter on your desk from corporate counsel advising you that a court has just ruled that corporations are now responsible for retaining all business-related phone conversations for one calendar year.

Under this letter is a memo from the PR department (labeled URGENT) advising you that marketing’s plan to use face recognition technology to identify prime customers entering the store may be getting some negative reviews in the press, with phrases like “invasion of privacy” and “Big Brother” being tossed around.

And under that is a directive from management asking for a technical analysis of a market simulator that may be able to predict, to within 5 percent accuracy, how single parents between the ages of 29 and 33, living in Illinois suburbs with incomes in the third decile, will respond to a new product.

Who are you? You’re a petabyte CIO, the person responsible for developing and maintaining the science-fiction-like applications that will be running on tomorrow’s immense storage capacities. Currently, petabyte responsibilities are mostly restricted to the IT departments of universities, research organizations, and microbiology and genetics labs. But the law of technological adoption (“If you build it, they will use it”) says that sooner or later most CIOs will be crossing that line, discovering a new world of applications, responsibilities, costs and problems. So it’s time to start thinking about…

Petabyte Power

Petabyte levels of storage will make possible three new categories of applications. One group depends on retaining and processing vast amounts of visual data, especially data from video cams. Imagine cameras trained on the sales floor, recording the minute-by-minute flow of customer traffic. Imagine that data feeding an application that analyzes the relative effectiveness of a given product placement or the impact of a markdown. Marketing might be interested in learning how the proportion of couples to singles entering the store changes during the course of a two-week promotional campaign, or at what time of day the number of women shopping with kids rises and at what time it falls. With petabyte levels of storage, HR will be able to crunch a few months’ worth of video camera data to flag personnel responsible for traffic bottlenecks on the sales floor.

A second category of petabyte potential is in supporting the transition to device networks. If the first generation of networks connected people to data and to each other, the second will do the same with both physical (counters, meters, cameras, motors, switches, telephones, digital printers) and virtual devices (applications and program objects). The great virtue of device networks is that they allow any interested constituency to have remote access to any link in the production cycle. CNN, for example, is digitizing and networking all its production equipment so that pagers, cell phones, PDAs, desktops and websites will all have equal, continuous and simultaneous access to programming.

In many industries, machines already keep maintenance informed about their operating condition, allowing them to be repaired just before they are about to fail. But if all the machines in a production line could be fully networked, management would be able to switch an entire production process to a single desired configuration (change the car’s bench seats, for example, to buckets, or its color from bottle-green to battleship gray) from half a world away. Once manufacturers can do that reliably, the dream of on-demand manufacturing will truly have been achieved.

However, properly managing the thousands of sensor-actuated loops that form device networks requires retaining the history of their states, often in their raw, unsummarized form, for months and possibly years. To do that, you need to be able to store data in petabytes.
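In the simplest terms, that retention requirement amounts to an append-only log of raw device states. The sketch below is purely illustrative; the device IDs, field names and SQLite store are assumptions, not anything a vendor mentioned in this article uses.

```python
import sqlite3
import time

# Hypothetical append-only log of raw, unsummarized sensor states.
# Device IDs and field names are illustrative only.
conn = sqlite3.connect("device_states.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sensor_state (
        device_id   TEXT NOT NULL,     -- e.g. 'line3/motor17'
        recorded_at REAL NOT NULL,     -- Unix timestamp of the reading
        state       TEXT NOT NULL      -- raw reading, stored verbatim
    )
""")

def record_state(device_id: str, state: str) -> None:
    """Append one raw reading; nothing is aggregated or discarded."""
    conn.execute(
        "INSERT INTO sensor_state (device_id, recorded_at, state) VALUES (?, ?, ?)",
        (device_id, time.time(), state),
    )
    conn.commit()

# Example: log a motor's condition exactly as reported, for later replay.
record_state("line3/motor17", "temp_c=71.4;rpm=1180;status=ok")
```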

Finally, petabyte levels of storage would allow simulations and predictive models of enormous complexity. For instance, retail managers currently worry about how they can persuade a casual visitor to make a purchase. While this is an important problem, even more critical is how to turn a casual buyer into a loyal, recurring one. Over the long term, this second kind of conversion can deliver even more value to the enterprise.

But brand loyalty doesn’t happen overnight, points out Richard Winter, president of the Winter Corp., a Waltham, Mass.-based consultancy specializing in the architecture of very large databases. “[Transforming] someone into a repeat customer means presenting her with just the right information or opportunity at the right time,” he says. “Knowing what to present can mean retaining huge amounts of information on that customer (what they’ve looked at, checked prices on, asked about, what they’ve not looked at) over long periods. Often the relationship needs to be followed from the point the customer first appears. Right now, that’s impossible because raw, unsummarized clickstream and transaction data is generally discarded after 30 to 60 days.”

It’s discarded because heretofore it has been impossible to store. Petabyte levels of storage will change that.
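What “keeping the raw clickstream” might look like, in the simplest possible terms: append every event as it happens and never roll it into a summary. The event fields and file layout below are illustrative assumptions, not a description of any particular retailer’s system.

```python
import json
import time
from pathlib import Path

# Hypothetical raw clickstream archive: one newline-delimited JSON file per day,
# kept indefinitely instead of being summarized and discarded after 30-60 days.
ARCHIVE_DIR = Path("clickstream_archive")
ARCHIVE_DIR.mkdir(exist_ok=True)

def archive_event(customer_id: str, event_type: str, detail: dict) -> None:
    """Append one raw event (page view, price check, question) to today's file."""
    day = time.strftime("%Y-%m-%d")
    record = {
        "ts": time.time(),
        "customer_id": customer_id,
        "event": event_type,   # e.g. 'viewed', 'price_checked', 'asked_about'
        "detail": detail,
    }
    with open(ARCHIVE_DIR / f"{day}.ndjson", "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: retain exactly what a visitor looked at, with no summarization.
archive_event("cust-0001", "price_checked", {"sku": "SKU-1234", "price": 19.99})
```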

Petabyte Problems

It’s easy to think of reasons why management might want to ask the CIO to lead the enterprise through the petabyte door. The next issue is finding ways to do that without getting fired.

If you assume storage-related costs (especially the time penalties) scale linearly, then the headaches that come along with petabyte management will dwarf the headaches associated with a terabyte of data as an eight-story building towers over an inch-high matchbox.
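A back-of-the-envelope check of that analogy (the figures below are rough assumptions, not from the article) shows the two ratios land in the same neighborhood, about a thousand to one:

```python
# A petabyte is about a thousand terabytes, and an eight-story building is
# about a thousand times taller than an inch-high matchbox.
PETABYTE_IN_TERABYTES = 1024            # 2**50 bytes / 2**40 bytes
building_in_inches = 8 * 10 * 12        # assume ~10 ft per story, 12 in per ft
matchbox_in_inches = 1

print(PETABYTE_IN_TERABYTES)                      # 1024
print(building_in_inches / matchbox_in_inches)    # 960.0 -- roughly the same ratio
```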

Does that make you cringe? Wait. It gets worse.

Searches conducted on large volumes of data naturally generate more errors. At some point, the number of errors so overwhelms the user’s ability to cope that the system essentially becomes useless. The only solution is to rewrite the search programs so that they make fewer errors, and no IT development task is harder to do predictably than boosting the IQ of computer programs. Finally, according to John Parkinson, CTO of Cap Gemini Ernst & Young for the Americas region, even the costs of core overhead tasks (such as buffer management) typically grow faster than linearly.

One school of thought is that the transition to petabyte levels is just not worth it.

Faisal Shah, cofounder and CTO of Chicago-based systems integrator Knightsbridge Solutions, says that data quality naturally drifts down as more space opens up in the corporate attic, in part because you are now saving things you used to throw away. Shah believes that companies will be better off spending their now-restricted IT dollars on trying to extract more intelligence from current data stores rather than piling up haystacks with fewer and fewer needles hidden in them.

Petabyte Solutions

Other observers are betting that new technologies will be able to keep those penalties under control. As with many IT problems, the solutions being explored fall along a spectrum from centralized to distributed.

Ron Davis, senior IT architect of Equifax, the Atlanta-based consumer data company, is working with a centralized management solution from Corworks. Equifax’s business is to buy raw data from state agencies or directory companies and turn it into information products. Equifax wants to keep control of the data it buys for as long as possible, since it never knows what a new product design might call for, or when. While the data could, in theory, be left with its suppliers, Davis’s experience is that retention policies and practices vary too widely across Equifax’s 14,000 data sources to make such dependence practical. He believes that, at least over the short run, companies near the end of the value chain will have to take on the responsibility of archiving raw data. Shouldering this responsibility has put Equifax on the road to becoming a petabyte company, and it has forced Davis to search for an architecture competent to deal with the petabyte problems of cost, error and time.

Corworks’ basic idea is to beat the time penalties inherent in handling large volumes of data by loading the data into electronic memory. This seems counterintuitive, rather like making a quart easier to drink by squeezing it into a pint, but the feat is done by stripping out the structural data (converting everything into flat files, for example), compressing the result, and then relying on fast processors to decompress and restore the data structures only as needed. In other words, just-in-time logic.
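In rough code, that just-in-time idea looks something like the sketch below. It assumes nothing about Corworks’ actual product; ordinary JSON flattening and zlib compression stand in purely for illustration.

```python
import json
import zlib

# Toy sketch of the "just-in-time" idea: structured rows are flattened to a
# plain byte string, held compressed in memory, and re-inflated only when a
# query touches them. All data here is made up.
rows = [
    {"customer_id": i, "region": "IL", "spend": round(i * 1.7, 2)}
    for i in range(100_000)
]

# Flatten the structure (here: newline-delimited JSON) and compress it.
flat = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
compressed = zlib.compress(flat, 6)
print(f"{len(flat)} bytes flat -> {len(compressed)} bytes compressed")

def query_high_spenders(blob: bytes, threshold: float):
    """Decompress and restore structure only at query time."""
    for line in zlib.decompress(blob).decode("utf-8").splitlines():
        row = json.loads(line)
        if row["spend"] > threshold:
            yield row

print(sum(1 for _ in query_high_spenders(compressed, 100_000.0)))
```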

“I have a 67 billion row table,” Davis says, “and I can do a sort across six months of that table in three seconds.” Backing up and restoring became easier for the same reasons.

A second approach to leveraging the speed of electronic memory is to build algorithms that can grade data by importance. The most critical pieces get loaded into memory, while the rest goes to disk-based systems where lower performance levels (and therefore lower operating costs) are tolerable. Dave Harley, chief designer of London-based BT Group, is experimenting with this approach using software from Princeton Softech. While the application has so far been used only in the internal system that supports IT asset management and fault tracking for BT’s employees, the results have been good enough that Harley expects to see this so-called active archiving adopted throughout the company. “The key factor is keeping the most critical database as small as possible,” he says. “It’s quite a new idea.”

StorageNetworks of Waltham, Mass., is also using this approach to manage the 1.5 petabytes acquired through its storage services arm. CEO Peter Bell says that 70 percent of the data stored on an average system has not been looked at in the previous 90 days. If you make the reasonable assumption that the number of recent accesses is a dependable guide to relative enterprise criticality, then loading just the most used data into memory can go a long way toward delivering acceptable performance where it is needed. Bell adds that the critical issue in managing petabyte-scale volumes of data is developing data classification systems that balance power against the risk of excessive single points of failure. (If a computer managing a petabyte goes bad, the damage it can cause is breathtaking.)
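A minimal sketch of the recency-based tiering described in the last two paragraphs, assuming a simple 90-day window; it is not Princeton Softech’s or StorageNetworks’ actual method.

```python
import time

# Records touched within the window stay in the "hot" (in-memory) tier;
# everything else is a candidate for the cheaper disk-based archive.
HOT_WINDOW_DAYS = 90

def split_by_recency(records, now=None):
    """records: iterable of (key, last_access_unix_ts). Returns (hot, cold)."""
    now = now or time.time()
    cutoff = now - HOT_WINDOW_DAYS * 24 * 3600
    hot, cold = [], []
    for key, last_access in records:
        (hot if last_access >= cutoff else cold).append(key)
    return hot, cold

# Example: two recently used records stay hot; the stale one gets archived.
sample = [("cust-1", time.time() - 5 * 24 * 3600),
          ("cust-2", time.time() - 40 * 24 * 3600),
          ("cust-3", time.time() - 400 * 24 * 3600)]
hot, cold = split_by_recency(sample)
print(hot)   # ['cust-1', 'cust-2']
print(cold)  # ['cust-3']
```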

On the other hand, Len Cavers, director of technical development for Experian, an Equifax competitor, believes that in the long run centralized solutions will not scale adequately. He argues that as backbone bandwidth increases and data standards get defined and distributed, companies such as his will find it increasingly practical to “leave the data” higher and higher up the value chain. In that world, the networks would carry not raw data (which wouldn’t move) but queries and intelligent indexes, so that querying systems know which sources to connect to. Experian is now engaged in an active development program with its partners on how to use XML and Web services to frame and respond to queries and to generate indexes.
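In outline, that distributed approach might look like the sketch below, in which only the query travels and an index points it at whichever upstream system holds the raw data. The index entries and URLs are invented for illustration and do not describe Experian’s design.

```python
from urllib.parse import urlencode

# Hypothetical "intelligent index": the raw data never moves; the index maps
# query subjects to the upstream source that holds the data, and only the
# query itself crosses the network.
SOURCE_INDEX = {
    # subject of the query   -> upstream system holding the raw data
    "credit_history":           "https://partner-a.example.com/query",
    "address_history":          "https://partner-b.example.com/query",
}

def route_query(subject: str, params: dict) -> str:
    """Build the query URL for whichever partner holds the raw data."""
    base = SOURCE_INDEX[subject]           # the index lookup
    return f"{base}?{urlencode(params)}"   # only the query travels

print(route_query("credit_history", {"customer_ref": "ABC123"}))
# -> https://partner-a.example.com/query?customer_ref=ABC123
```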

Cavers believes that petabyte-level data stores will force IT people to minimize the number of mass copy operations as much as possible. “This is a paradigm shift in the way people think about computing,” he says.

The Petabyte Paradigm

Gerry Higgins, senior vice president of Verizon information processing services in New York City, points out that maintaining a petabyte of data raises distribution management issues in hardware as well as software. In the petabyte world, data is usually spread over thousands of disks. “Vendors always want to talk to me about how great their mean-time-between-failure numbers are. I tell them not to bother. All I’m interested in is what happens when there is a failure,” Higgins says. “When you deal with so many disks, some are always crashing. I tell them that when you’re a petabyte guy like me, you have to expect failures.”

Many observers think the transition to petabyte levels is going to introduce changes even more sweeping than those associated with previous leaps in storage. “Traditionally vendors have built standalone data mining engines and moved the data into them,” says Winter. “But are you going to be able to move a petabyte around like that?” Winter foresees radical changes in engine architecture, probably involving breakthroughs in the engineering of parallelization.

“The whole notion of storage takes on a new meaning,” says Scot Klimke, vice president and CIO for Network Appliance, a storage services vendor in Sunnyvale, Calif. “It starts to be defined less as simple retention and more as the struggle for information quality.”

Perhaps the worst such issue is consistency. A petabyte of data is so big, and the quality of the information it contains is perforce so low, that it is bound to contain and create inconsistent information, which means that any petabyte-level system has to include ways of detecting and resolving data conflicts.
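A minimal illustration of what “detecting data conflicts” means at the record level; the fields and values below are hypothetical.

```python
from collections import defaultdict

# Group records by entity and flag any attribute that has been stored with
# more than one value -- the raw material for a conflict-resolution step.
def find_conflicts(records):
    """records: iterable of (entity_id, field, value). Returns conflicting entries."""
    seen = defaultdict(set)
    for entity_id, field, value in records:
        seen[(entity_id, field)].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}

sample = [
    ("cust-42", "zip_code", "60601"),
    ("cust-42", "zip_code", "60614"),   # inconsistent copy of the same fact
    ("cust-42", "last_name", "Smith"),
]
print(find_conflicts(sample))
# -> {('cust-42', 'zip_code'): {'60601', '60614'}}
```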

Another issue is aging: Information quality tends to degrade with age, but present systems are poorly equipped to track the age of material, especially material within a file. “I have five priorities for this fiscal year,” Klimke says. “Two involve data quality.”

Klimke argues that as the petabyte revolution picks up steam, the struggle to measure and manage data quality will increasingly define the CIO’s job. While he might or might not be right about this specific point, it’s clear that anyone exploring the petabyte world should bring a good map, watch out for booby traps and carry a rabbit’s foot for luck.