When you work with large-scale data, it's important to "...know when to hold 'em, know when to fold 'em...".

Most modern businesses rely on large-scale data, so it's natural to focus on the best ways to quickly ingest and store huge amounts of data. But people often overlook how to delete data in very large-scale systems. It may sound simple, but it becomes a non-trivial task when data sets are in the tens of terabytes to petabyte or even exabyte scale.

The challenge of deleting large-scale data is a widespread issue that can have significant consequences if not done properly. For example, think about the ramifications of not following auditing requirements concerning data deletions mandated by the GDPR or California's equivalent, the CCPA. Even when data is deleted for more mundane reasons, such as freeing up resources, the process can have a negative impact on critical operations if a system is not built with specific mechanisms to deal with deletion at scale.

Why data deletion at scale can be challenging

It's important to have efficient and reliable ways to handle large-scale data deletion. Before we look at how best to handle it, let's first consider what can make it so difficult.

One of the most common problems is not recognizing the significance of this issue and thus not taking it into account when planning for data logistics. You don't just type rm -rf! As with many aspects of scalability, the sheer amount of data can make otherwise simple tasks much more complicated, more time-consuming, or more costly. For data deletion at large scale, consider these issues:

- It may take a long time for large data sets to be deleted, so queries may see partial data.
- Other systems could have I/O performance impaired while you wait for data to be deleted.
- If a failure occurs during the deletion process, you may think data is entirely deleted when it is not. This can happen because the process stopped mid-way through deletion; if you track it manually, you may miss that fact because you stopped watching the slow process.
- Large-scale data generally comes in many files, so deletion can be expensive simply because hundreds of millions or even billions of files must be removed.

Any of these issues becomes more troublesome if it is not taken into account when developing a comprehensive data strategy.

What's the best way to approach this problem? One idea is to leverage the cost effectiveness of modern distributed data infrastructure and avoid the hassle of deleting large data sets by saving everything. Here's why that's not a good solution.

If data deletion is difficult, why do it?

Modern scalable systems do make it feasible to keep large amounts of data. Given the potential value to be mined from data, an organization may have good reasons to store even raw data over time. For some projects, large-scale data sets reveal patterns or insights that cannot be accurately obtained from smaller data sets. Value can be a matter of finding the golden nuggets hidden in large amounts of data.
The scalability of modern systems also offers the advantage of flexibility: you don't have to know at the time of data ingestion all the ways you'll want to use that data. Raw data may be processed for one particular purpose, yet it might retain different value for other reasons.

Given the potential value of data and the hassle of deleting it, is it practical to save it all forever? And is saving it all desirable?

The answer is generally no. There comes a point where the declining incremental value of old data is not worth the constant marginal cost and hassle of storing it.

Furthermore, compliance with regulatory requirements may come into play, and interestingly, compliance can go either way. In some cases, regulatory considerations require you to persist data for a specific time period, perhaps to provide an auditable record of financial transactions or communications. But compliance can also force data deletions due to regulatory issues such as data privacy. In that case, you not only need to delete data, but also need to be confident that it has been deleted beyond hope of recovery.

Clearly many enterprises need a way to delete large amounts of data at the appropriate time and without the process creating additional problems.

Shape of the solution

To deal with the deluge of data modern enterprises face, it's important to have a system that allows data tiering. Less frequently accessed data should be persisted in ways that optimize resource use and keep costs down. But there comes a time when data needs to go away entirely. It's important to architect your data infrastructure up front to ensure easy deletions in the future, rather than confronting the problem only when SLA requirements become urgent.

Doing this requires a data infrastructure specifically designed and engineered to handle large-scale data deletion efficiently. Even though data deletion might be initiated manually, an effective data infrastructure should remove the deleted data from view instantly and commit to completing the process automatically, behind the scenes, and in such a way that other processes are not adversely affected. Perhaps most importantly, your data infrastructure should track the process so you can have confidence the data has been deleted completely. And if the data platform is also engineered with mechanisms to massively speed up bulk data deletion, so much the better.

Data deletion with HPE Ezmeral Data Fabric

The challenges of data deletion in large-scale systems were taken into account in the original design and engineering of HPE Ezmeral Data Fabric, a highly scalable, software-defined, hardware-agnostic data infrastructure. As a unifying data layer that spans an enterprise, including edge and cloud deployments, the data fabric can handle huge amounts of data, up to exabyte scale. Not only does HPE Ezmeral Data Fabric provide several mechanisms for efficient data tiering, it also carries out data deletion efficiently and reliably.

One way this is done is through a management construct known as a data fabric volume. Files, tables, and event streams (all built into the data fabric) are stored and managed via volumes. Data tiering policies, for example, are set at the volume level.

Data deletion can be done at the individual file level. But when millions or billions of files are involved, it's much more efficient to delete them in bulk by simply deleting the entire volume that contains them. This is not only a convenience for the user; having this option can speed up the process considerably.
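To make the contrast concrete, here is a minimal sketch of what volume-level bulk deletion can look like from the command line. It assumes a data fabric cluster administered with the maprcli tool; the volume name, mount path, and cluster name shown are hypothetical and used purely for illustration.

```bash
# Hypothetical setup: a volume created for a bounded data set,
# e.g. one quarter of sensor data (names are made up for illustration):
#   maprcli volume create -name sensors.2021q1 -path /data/sensors/2021q1

# Bulk deletion: remove the entire volume in a single administrative step.
# The data disappears from view right away, and the platform reclaims the
# underlying storage behind the scenes.
maprcli volume remove -name sensors.2021q1

# Contrast with file-by-file deletion over an NFS mount of the cluster,
# which would have to walk and unlink potentially billions of files:
#   rm -rf /mapr/my.cluster.com/data/sensors/2021q1
```

The end state is the same either way, but removing the volume avoids touching every file individually, which is what makes deletion practical at this scale.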
Furthermore, HPE Ezmeral Data Fabric is designed to manage deletion behind the scenes. From the user's perspective, data deletion is immediate. Internally, the data fabric automatically tracks the deletion process to completion in an audit-tracking system, so IT can be confident the data is truly deleted and all space is reclaimed for re-use.

When it comes to large-scale data, HPE Ezmeral Data Fabric not only makes it easier to hold 'em, but also, when you're ready, easier to fold 'em!

Find out more about scale-efficient systems

Read about the impact of flexible data access in the new customer case study from New Work SE, "Accelerating Data Insight for a Better Work Life".

Download the free O'Reilly ebook AI and Analytics at Scale: Lessons from Real-World Production Systems by Ted Dunning and Ellen Friedman (© January 2021).

Watch the short video interview by Ronald van Loon with Ellen Friedman and Ted Dunning, "How to Solve the Siloed Data Challenge".

____________________________________

About Ellen Friedman

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked at MapR Technologies for seven years. She has been a committer for the Apache Drill and Apache Mahout open source projects and is a co-author of multiple books published by O'Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.