How to Discard Data: Solving the Hidden Challenge of Large-scale Data Deletion

BrandPost By Ellen Friedman
Apr 01, 2021
Data Architecture, IT Leadership

When you work with large-scale data, it’s important to “know when to hold ’em, know when to fold ’em.”

Most modern businesses rely on large-scale data, so it’s natural to focus on the best ways to quickly ingest and store huge amounts of data. But people often overlook how to delete data in very large-scale systems. It may sound simple, but it becomes a non-trivial task when data sets range from tens of terabytes to petabytes or even exabytes.

Deleting large-scale data is a widespread challenge, and doing it improperly can have significant consequences. For example, think about the ramifications of not following auditing requirements concerning data deletions mandated in the GDPR or California’s equivalent, CCPA. Even when data is deleted for more mundane reasons, such as freeing up resources, the process can have a negative impact on critical operations if a system is not built with specific mechanisms to deal with deletion at scale.

Why data deletion at scale can be challenging

It’s important to have efficient and reliable ways to handle large-scale data deletion. Before we look at how best to handle it, let’s first consider what can make it so difficult.

One of the most common problems is not recognizing the significance of this issue and thus not taking it into account when planning for data logistics. You don’t just type rm -rf! As with many aspects of scalability, the sheer amount of data can make otherwise simple tasks much more complicated, more time-consuming, or more costly. For data deletion at large scale, consider these issues:

  • It may take a long time for large data sets to be deleted, so queries may see partial data

  • Other systems could have I/O performance impaired while you wait for data to be deleted

  • If a failure occurs during the deletion process, you may think data is entirely deleted when it is not. This might happen because the process stopped mid-way through deletion. If you track manually, you may miss that fact because you stopped watching the slow process.

  • Large-scale data generally comes in many files. Deletion can be expensive simply because hundreds of millions or even billions of files must be removed.

Any of these issues becomes more troublesome if it is not taken into account when developing a comprehensive data strategy.
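To make the failure-tracking issue concrete, here is a minimal Python sketch of a deletion job that records its progress and only reports success after re-checking what is left on disk. It is an illustration on a local filesystem, not a description of any particular product. The function name `tracked_delete`, the JSON manifest format, and the `limit` parameter (used here only to simulate an interrupted run) are all assumptions made for this example.

```python
import json
import os
from pathlib import Path
from typing import Optional


def tracked_delete(root: Path, manifest: Path, limit: Optional[int] = None) -> dict:
    """Delete files under `root`, persisting progress in `manifest`.

    `limit` caps the deletions in this call; here it only simulates an
    interrupted run. Assumption: the manifest lives outside `root`.
    """
    if manifest.exists():
        # Resume: an earlier run already listed what has to go.
        state = json.loads(manifest.read_text())
    else:
        pending = sorted(str(p) for p in root.rglob("*") if p.is_file())
        state = {"total": len(pending), "pending": pending, "complete": False}

    budget = len(state["pending"]) if limit is None else limit
    while state["pending"] and budget > 0:
        path = state["pending"].pop()
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already removed by a run that stopped before recording it
        budget -= 1

    # Only claim completion after re-checking what is actually on disk.
    leftovers = [p for p in root.rglob("*") if p.is_file()]
    state["complete"] = not state["pending"] and not leftovers
    manifest.write_text(json.dumps(state))
    return state
```

Calling `tracked_delete` again with the same manifest picks up where the previous call stopped, so an operator does not have to watch a slow job to know whether it finished. At real scale the manifest would be written in batches during the run rather than once at the end.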

What’s the best way to approach this problem? One idea is to leverage the cost effectiveness of the modern distributed data infrastructure and avoid the hassle of deleting large datasets by saving everything. Here’s why that’s not a good solution.

If data deletion is difficult, why do it?

Modern scalable systems do make it feasible to keep large amounts of data. Therefore, given the potential value to be mined from data, an organization may have good reasons to store even raw data over time. For some projects, large-scale data sets reveal patterns or insights that cannot be accurately obtained on smaller data sets. Value can be a matter of finding the golden nuggets hidden in large amounts of data. 

The scalability of modern systems also offers the advantage of flexibility; you don’t have to know at the time of data ingestion all the ways you’ll want to use that data. Raw data may be processed for one particular purpose, yet it might retain different value for other reasons.

Given the potential value of data and the hassle of deleting it, is it practical to save it all forever? And is saving it all desirable?

The answer is generally no. There comes a point where the declining incremental value of old data is not worth the constant marginal cost or hassle of storing it. 
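As a rough illustration of that trade-off, the sketch below assumes a deliberately simple model that is not from any real cost study: the value of a data set halves every `half_life` months, while storage cost per month stays constant. The function name and all numbers are hypothetical.

```python
import math


def break_even_age(initial_value: float, half_life: float, cost: float) -> float:
    """Age (in the same unit as `half_life`) at which value per period
    drops to the constant cost per period.

    Assumed model: value(t) = initial_value * 0.5 ** (t / half_life).
    """
    if initial_value <= cost:
        return 0.0  # never worth more than it costs to keep
    return half_life * math.log2(initial_value / cost)


# Hypothetical numbers: worth 80 per month when new, value halves every
# 6 months, storage costs 10 per month.
age = break_even_age(initial_value=80.0, half_life=6.0, cost=10.0)
print(f"Keeping the data stops paying off after about {age:.0f} months")
```

Real value curves are rarely this tidy, but even a crude estimate like this gives a retention policy a starting point.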

Furthermore, compliance with regulatory requirements may come into play. And interestingly, compliance can go either way. In some cases, regulatory considerations require you to persist data for a specific time period, perhaps to provide an auditable record of financial transactions or communications. But compliance can also force data deletions due to regulatory issues such as data privacy. In this case, you not only need to delete data, but also be confident that it has been deleted beyond hope of recovery.
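The two-sided nature of compliance can be sketched as a small policy check. This is only an illustration: the class `RetentionPolicy`, the function `retention_action`, and the durations used are hypothetical and are not taken from any regulation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    keep_at_least: Optional[timedelta] = None  # e.g. an audit-record rule
    delete_within: Optional[timedelta] = None  # e.g. a privacy-driven limit


def retention_action(created: date, today: date, policy: RetentionPolicy) -> str:
    """Classify a data set as 'must_keep', 'must_delete', or 'optional'."""
    age = today - created
    # In this sketch a minimum-retention rule wins if both rules apply.
    if policy.keep_at_least is not None and age < policy.keep_at_least:
        return "must_keep"
    if policy.delete_within is not None and age >= policy.delete_within:
        return "must_delete"
    return "optional"


# Hypothetical durations, chosen only for the example.
financial = RetentionPolicy(keep_at_least=timedelta(days=7 * 365))
personal = RetentionPolicy(delete_within=timedelta(days=30))

print(retention_action(date(2020, 1, 1), date(2021, 4, 1), financial))
print(retention_action(date(2021, 1, 1), date(2021, 4, 1), personal))
```

Data that falls in the "optional" band is where the cost-versus-value judgment applies. The other two outcomes leave no choice.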

Clearly many enterprises need a way to delete large amounts of data, at the appropriate time and without the process creating additional problems.

Shape of the solution

To deal with the deluge of data modern enterprises face, it’s important to have a system that allows data tiering. Less frequently accessed data should be persisted in various ways that optimize resource use and keep costs down. But there comes a time when data needs to go away entirely. It’s important to architect your data structure up-front to ensure easy deletions in the future, rather than confronting the problem when SLA requirements are urgent.

Doing this requires a data infrastructure specifically designed and engineered to handle large-scale data deletion efficiently. Even though data deletion might be initiated manually, an effective data infrastructure should remove the deleted data from view instantly and commit to completing the process automatically, behind the scenes, and in such a way that other processes are not adversely affected. Perhaps most importantly, it should be a requirement that your data infrastructure tracks the process so you can have confidence the data has been deleted completely. And if the data platform is also engineered with mechanisms to massively speed up bulk data deletion, so much the better. 
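The “remove from view instantly, finish in the background” pattern can be sketched on an ordinary local filesystem. This is a minimal illustration under stated assumptions, not how any specific data platform implements it. The names `hide_for_deletion` and `purge`, the trash directory, and the throttling parameters are hypothetical, and each data set is assumed to be a directory on the same filesystem as the trash location.

```python
import os
import shutil
import time
import uuid
from pathlib import Path


def hide_for_deletion(dataset: Path, trash: Path) -> Path:
    """Remove `dataset` from view at once by renaming it into `trash`."""
    trash.mkdir(parents=True, exist_ok=True)
    target = trash / f"{dataset.name}.{uuid.uuid4().hex}"
    os.rename(dataset, target)  # a single rename, not a per-file operation
    return target


def purge(trash: Path, batch_size: int = 1000, pause_s: float = 0.0) -> int:
    """Physically delete hidden data in throttled batches.

    Returns the number of files removed.
    """
    removed = 0
    for entry in list(trash.iterdir()):
        for path in [p for p in entry.rglob("*") if p.is_file()]:
            path.unlink()
            removed += 1
            if removed % batch_size == 0:
                time.sleep(pause_s)  # leave I/O headroom for other workloads
        shutil.rmtree(entry)  # clear the now-empty directory tree
    return removed
```

Users stop seeing the data as soon as `hide_for_deletion` returns, while `purge` can run later from a scheduler. The returned count gives a simple record of how much was actually removed.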

Data deletion with HPE Ezmeral Data Fabric

The challenges of data deletion in large-scale systems were taken into account in the original design and engineering of the HPE Ezmeral Data Fabric, a highly scalable, software-defined and hardware-agnostic data infrastructure. As a unifying data layer that spans an enterprise, including edge and cloud deployments, the Ezmeral data fabric can handle huge amounts of data — up to exabyte scale. Not only does HPE Ezmeral Data Fabric provide several mechanisms for efficient data tiering, it also carries out data deletion efficiently and reliably.

One way this can be done is through a management construct known as a data fabric volume. Files, tables, and event streams (all built into the data fabric) are stored and managed via data volumes. Data tiering policies, for example, are set at the volume level. 

Data deletion can be done at the individual file level. But when millions or billions of files are involved, it’s much more efficient to delete them in bulk by simply deleting the entire volume that contains them. This is not only a convenience for the user—having this option can speed up the process.
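Bulk deletion by container works best when data is grouped so that everything in one container expires together. The sketch below assumes a hypothetical layout with one container per month, named `YYYY-MM`, and only decides which containers are due for removal. It does not call any data fabric API, and the function name `expired_partitions` is an assumption for this example.

```python
from datetime import date
from typing import Iterable, List


def expired_partitions(names: Iterable[str], today: date, keep_months: int) -> List[str]:
    """Return container names (format YYYY-MM) older than `keep_months`."""
    # Count months on a single axis so year boundaries need no special case.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in names:
        year, month = (int(part) for part in name.split("-"))
        if year * 12 + (month - 1) < cutoff:
            expired.append(name)
    return sorted(expired)


# Hypothetical container names.
names = ["2020-01", "2020-12", "2021-03", "2021-04"]
print(expired_partitions(names, today=date(2021, 4, 1), keep_months=3))
```

With this kind of layout, expiring a month of data means removing one container by name instead of scanning for millions of individual files.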

Furthermore, HPE’s Ezmeral data fabric is designed to manage deletion behind the scenes. From the user’s perspective, data deletion is immediate. Internally, the data fabric automatically tracks the deletion process to completion in an audit-tracking system so IT can be confident the data is truly deleted and all space is reclaimed for re-use. 

When it comes to large-scale data, HPE Ezmeral Data Fabric not only makes it easier to hold ’em but also, when you’re ready, easier to fold ’em!

Find out more about scale-efficient systems

Read about the impact of flexible data access in the new customer case study from New Work SE, “Accelerating Data Insight for a Better Work Life”

Download the free O’Reilly ebook AI and Analytics at Scale: Lessons from Real-World Production Systems by Ted Dunning and Ellen Friedman © January 2021

Watch the short video interview by Ronald van Loon with Ellen Friedman and Ted Dunning, “How to Solve the Siloed Data Challenge”


About Ellen Friedman

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.