Year after year, the cost of disk space has plummeted. Since you can pick up a terabyte for $50, it has often seemed a false economy to be careful with storage. But in the clouds, the rules are different. If you've got too much low-value data or too many copies of files, it can cost you in two ways: first, in the monthly storage charges; and second, in the inevitable performance hit on searches, views, reports, and dashboard updates. In the clouds, it really pays to prune your data set.

The first order of business is assessing the problem: is it documents, or table data? These typically have different storage limits, and the strategies and tools used for pruning are quite different.

Documents typically serve as attachments to records (such as a PDF of a signed contract, pinned to the relevant opportunity), so users may not be able to find them easily. Consequently, the same document may have been attached to three or four different records. You also need to look for cases where people have attached every version of a rapidly changing document. The first thing to do is export an inventory of every document in the system (including the record IDs they are attached to, plus their last update date) and look for possible duplicates using spreadsheet filters. Duplicate-file detection tools can do a much better job (by inspecting the contents of the files), but I don't know of any such tools that work directly in cloud applications. Unless you are willing to download all the file contents onto your own servers for that deeper analysis, you're going to have to live with metadata analysis to identify which files to prune. Since optical storage is cheap, you might as well archive all the files you delete from the cloud, in case somebody complains later on.

Table data is a very different story, with many system-specific tricks and techniques for different kinds of clouds.
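Before turning to table data, here is a rough sketch of the metadata-only duplicate scan described above. This is a minimal illustration, not a product: the column names (doc_name, doc_size, record_id) are assumptions about what your document-inventory export contains, and matching on name plus size can only flag candidates, not prove the contents are identical.

```python
import csv
from collections import defaultdict

def find_likely_duplicates(inventory_csv):
    """Group exported document rows by (lowercased name, size).

    Any group attached to more than one record is a pruning candidate.
    This is metadata-only analysis -- it never reads file contents, so
    a human should review each group before anything is deleted.
    Column names below are assumptions about the export format.
    """
    groups = defaultdict(list)
    with open(inventory_csv, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["doc_name"].strip().lower(), row["doc_size"])
            groups[key].append(row["record_id"])
    # Keep only groups with more than one attachment.
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Feeding the result back into a spreadsheet (or a review queue) keeps a person in the loop, which matters since metadata matches can be false positives.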
That said, here's the general workflow:

• Identify which of your cloud systems really have a storage problem. Some systems (e.g., accounting) really can't be pruned very much because they need to be auditable and must hold all the details over long periods. Other systems (e.g., marketing automation or log analytics) rapidly collect enormous amounts of detail that can really slow the system down.

• Identify which tables are consuming more than 20 percent of your total storage. Focus there.

• For each table, understand the value of the individual records. Some tables (particularly accounts or contracts) are almost inviolate because of what they represent and the impact of record removal (particularly when these tables are integrated with outside systems). Other tables, such as "anonymous leads" in a marketing automation system, can be pruned with abandon.

• Before you go any further, do a complete backup of all your cloud's data onto either disk or optical media. I cannot say it any more clearly: this is NOT optional.

• For tables that can be freely pruned, look at the "signal-to-noise ratio." Is there some time horizon beyond which the information doesn't matter at all? For example, in a marketing automation or web monitoring cloud, do we really care about anonymous visitors who haven't returned in six months? Is it OK to remove all leads with a score of less than zero? Make sure you get buy-in from all the affected user groups first, but signal-to-noise-based pruning can get rid of millions of records in a hurry.

• Some tables have decent signal-to-noise ratios, but the amount of detail stored just isn't worth it over time. For example, many marketing automation and e-mail blasting systems use the activity table to record important e-mail and Web interactions. These activity tables can represent half of the system's storage.
But how much will it matter a year from now whether a person watched video A today versus video B yesterday? Use this litmus test: if a particular detail will not actually change anyone's decision or behavior, it's not "information" any longer. For these situations, we recommend a compression approach: keep the information, but remove most of the details after six months or so. The compressed histories are typically stored in custom tables, represented by tallies, token strings, or even bitmaps with tiny storage requirements. This strategy will require some careful thinking, user input, and custom code development, but it can provide continuous pruning based on information value.

• Some tables (particularly leads and contacts) can collect duplicates in a hurry, particularly if your firm has process problems in lead collection and handling. If your cloud system has deduping tools (from the main vendor or third parties), buy a good one and really learn it. The best tools have fuzzy-logic algorithms that let you find and merge duplicates without moving the data out of the cloud. The merging process preserves as much of the data as possible, but if you have a lot of data collisions (e.g., two different mobile phone numbers for the same person), you may need to create shadow fields and pre-populate them with the divergent data prior to the merge. For a number of reasons, data merging must be done in phases: it takes a lot of CPU time, as well as your think time, to get rid of 100,000 dupes. Do not rush it, as there is no undo for a merge.

Most of the above is a one-time fix rather than a process change. If you aren't willing to invest in enhancing your data management processes, you may need to revisit these issues on a quarterly basis. Pretty much forever.
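To make the compression idea concrete, here is a minimal sketch of the kind of custom code it implies: raw activity rows older than a cutoff are rolled up into per-contact tallies, so the information (who did how much of what) survives while the bulky per-event detail is discarded. The record layout (contact_id, activity_type, when) and the six-month cutoff are illustrative assumptions, not any vendor's actual schema.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def compress_activities(activities, cutoff_days=180, now=None):
    """Split activity rows into (recent rows to keep, tallies to store).

    Rows newer than the cutoff keep their full detail; older rows are
    reduced to a Counter of activity types per contact, e.g.
    {'C001': Counter({'email_open': 12, 'web_visit': 5})}.
    Field names here are hypothetical -- adapt to your system's schema.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=cutoff_days)
    keep, tallies = [], defaultdict(Counter)
    for act in activities:
        if act["when"] >= cutoff:
            keep.append(act)  # recent detail still has decision value
        else:
            tallies[act["contact_id"]][act["activity_type"]] += 1
    return keep, dict(tallies)
```

Run on a schedule, a routine like this gives you the "continuous pruning based on information value" described above; the tallies would be written to a small custom table before the old activity rows are deleted.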
David Taber is the author of the new Prentice Hall book, "Salesforce.com Secrets of Success," and is the CEO of SalesLogistix, a certified Salesforce.com consultancy focused on business process improvement through use of CRM systems. SalesLogistix clients are in North America, Europe, Israel, and India, and David has over 25 years of experience in high tech, including 10 years at the VP level or above.

Follow everything from CIO.com on Twitter @CIOonline.