Strategies for Pruning Data in the Cloud
In terrestrial systems, you don't think about disk space. In the clouds, you have to; if you don't, it will cost you.
Wed, November 09, 2011
CIO — Year after year, the cost of disk space has plummeted. Since you can pick up a terabyte for $50, it's often seemed a false economy to be careful with storage.
But in the clouds, the rules are different. If you've got too much low-value data or too many copies of files, it can cost you in two ways. First are the monthly storage charges, and second is the inevitable performance hit when it comes to searches, views, reports, and dashboard updates. In the clouds, it really pays to prune your data set.
The first order of business is assessing the problem: is it documents, or table data? These typically have different storage limits, and the strategies and tools used for pruning are quite different.
Documents typically serve as attachments to records (such as a PDF of a signed contract, pinned to the relevant opportunity), so users may not be able to find them easily. Consequently, the same document may have been attached to three or four different records. You also need to look for cases where people have attached every version of a rapidly changing document. The first thing to do is export an inventory of every document in the system (including the record IDs they are attached to, plus their last update date) and look for possible duplicates using spreadsheet filters. There are duplicate file detection tools that can do a much better job (by inspecting the contents of the files), but I don't know of any that work directly in cloud applications. Unless you are willing to download all the file contents onto your own servers for that deep analysis, you're going to have to live with metadata analysis to identify which files to prune. Since optical storage is cheap, you might as well archive all the files you delete from the cloud, in case somebody complains later on.
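If spreadsheet filters get unwieldy, the same metadata analysis can be scripted. Here is a minimal Python sketch, assuming the exported inventory is a CSV with the hypothetical columns file_name, size_bytes, record_id, and last_modified. Your system's export will use its own column names, and it may not report file size at all. Matching on name plus size only flags possible duplicates; it does not prove the contents are identical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical inventory export; the column names are assumptions,
# not the format of any particular cloud application.
SAMPLE = """file_name,size_bytes,record_id,last_modified
Contract_Acme.pdf,48211,OPP-001,2011-03-02
contract_acme.pdf,48211,OPP-007,2011-03-02
Proposal_v1.doc,10240,OPP-002,2011-05-10
Proposal_v2.doc,11300,OPP-002,2011-05-12
"""


def find_possible_duplicates(rows):
    """Group rows by (lowercased file name, size in bytes) and
    return only the groups that contain more than one attachment."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["file_name"].strip().lower(), int(row["size_bytes"]))
        groups[key].append(row)
    return {key: docs for key, docs in groups.items() if len(docs) > 1}


if __name__ == "__main__":
    inventory = list(csv.DictReader(io.StringIO(SAMPLE)))
    for (name, size), docs in find_possible_duplicates(inventory).items():
        record_ids = ", ".join(doc["record_id"] for doc in docs)
        print(f"{name} ({size} bytes) is attached to: {record_ids}")
```

Note that the two proposal versions are not flagged, because their names and sizes differ. Catching every version of a rapidly changing document would need a looser match (for example, stripping version suffixes from the name), and the results would still need a human review before anything is deleted.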
Table data is a very different story, with many system-specific tricks and techniques for different kinds of clouds. That said, here's the general workflow:
• Identify which of your cloud systems really have a storage problem. Some systems (e.g., accounting) really can't be pruned very much because they need to be auditable and must hold all the details over long periods. Other systems (e.g., marketing automation or log analytics) rapidly collect enormous amounts of detail that can really slow the system down.
• Identify which tables are consuming more than 20 percent of your total storage. Focus there.
• For each table, understand the value of the individual records. Some tables (particularly accounts or contracts) are almost inviolate because of what they represent and the impact of record removal (particularly when these tables are integrated with outside systems). Other tables, such as "anonymous leads" in a marketing automation system, can be pruned with abandon.
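The 20 percent screen in the workflow above is simple arithmetic once you have per-table storage figures. Here is a rough Python sketch, assuming you have copied those figures from your vendor's storage usage report into a dictionary. The table names and sizes below are invented purely for illustration.

```python
def tables_over_threshold(table_bytes, threshold=0.20):
    """Return (table name, share of total storage) pairs for tables whose
    share exceeds the threshold, largest share first."""
    total = sum(table_bytes.values())
    if total == 0:
        return []
    shares = [(name, size / total) for name, size in table_bytes.items()]
    heavy = [(name, share) for name, share in shares if share > threshold]
    return sorted(heavy, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up storage figures (in megabytes), not real measurements.
    usage_mb = {
        "anonymous_leads": 600,
        "email_events": 250,
        "accounts": 100,
        "contracts": 50,
    }
    for name, share in tables_over_threshold(usage_mb):
        print(f"{name}: {share:.0%} of total storage")
```

The output is only a list of where to look first. Whether a heavy table can actually be pruned still depends on the value of its records, as described above.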