Dealing with scale is a multi-dimensional challenge and, if you’re not careful, it may feel like you’re battling a multi-headed monster. Just when you think you’ve solved scale by finding a way to store large amounts of data, another problem pops up. Solving the issues created by large-scale systems is like facing the mythical Hydra – cut off one of its heads, and two more grow back, ready to consume you.

How do you survive? You’ll find a hint in the myth: to defeat the monster, you must deal with all of its heads, not just one. You cannot win by solving a single challenge of scale; you need to be aware of the different aspects of the problem, some of which may be hidden.

You should be able to meet your SLAs, in an affordable way, at your current scale and as your system grows. For this to be practical, you must do it without having to scale up your IT team to match the growing quantity of data. How can you do this? Start with the idea that “big” isn’t just about the number of bytes.

There’s more to scale than the number of bytes

Scale includes large amounts of data, but it also involves other things, such as data diversity. Traditionally, people mainly had files and databases. Now data includes images, media, events, and much more. Data diversity is just one issue you face in large-scale projects: either you conquer the various aspects of scale, or the problems that ensue will be large. Here are four key dimensions of scale, each of which must be mastered.

Conquering the amount of data

How big is big data? People work with amounts of data today that just a few years ago would have been considered unusual. Consider data big when its sheer size becomes a technical challenge. People who work with less than 50 terabytes may not feel the challenge of scale, but the businesses the Ezmeral team at HPE routinely works with have data ranging from 50 terabytes to 500 petabytes and beyond, so their data infrastructure needs to handle scale easily.

One way to conquer large data size is to not make it bigger than it needs to be. Limitations imposed by some data technologies lead people to make unnecessary copies of huge data sets. This happens in part because their infrastructure lacks the open, flexible data access methods needed to accommodate different analytics and machine learning tools and languages. Data also gets copied unnecessarily to deal with “noisy neighbors” – competing applications that may create hot spots or congestion. Data infrastructure should instead have platform-level capabilities that automatically make local copies of small subsets of data to avoid hot spots, and it needs fully distributed metadata to avoid congestion when multiple applications access large data sets.

Finally, when you have a lot of data, you need a convenient way to find it. Data infrastructure with familiar file and directory access and a global namespace can improve the efficiency of your team because references to data can remain stable.
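As a concrete illustration of why stable references matter, here is a minimal sketch. It assumes the data fabric is mounted through an NFS or POSIX client under a path such as /mapr/<cluster> (a MapR/Ezmeral convention); the cluster name, directory, and events.csv file below are hypothetical stand-ins.

```python
# Two different analytics tools read the same data set through one stable,
# global-namespace path: no per-tool copies of the data are needed.

import pandas as pd
import pyarrow.csv as pacsv

# Hypothetical global-namespace path on a fabric mounted at /mapr
DATA = "/mapr/prod-cluster/analytics/events.csv"

# A data scientist explores the data with pandas...
df = pd.read_csv(DATA)
print(df.describe())

# ...while an ML pipeline reads the very same file with Apache Arrow.
table = pacsv.read_csv(DATA)
print(table.schema)
```

Because both tools resolve the same path, there is no second copy to create, secure, or keep in sync, and the reference stays valid even as the data behind it grows.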
The lesson is this: Conquer data size while meeting SLAs with the help of your data infrastructure. It should provide platform-level capabilities for secure, affordable large-scale data storage plus flexible data access and management. That lets you meet current and future challenges of scale; businesses shouldn’t have to re-architect their systems as data grows.

Don’t be slain by the number of objects

Another challenge of scale is a large number of objects. Working with hundreds of millions or even billions of small files can swamp your data infrastructure unless it is designed to handle scale in the number of objects as well as in the amount of data. Data infrastructure really matters; we see customers who routinely work with trillions of files. While most businesses don’t have to deal with that scale, common use cases involve tens of millions to billions of objects.

Businesses using IoT-based sensor data and metrics-oriented service companies tend to have large numbers of files. Consumer websites are a typical example: they often need to display multiple images of the items they sell. While the offerings themselves are not usually in the millions, there are many versions of each item or service. An online retail catalogue may show multiple colors, sizes, and views for each of many products, all stored as images. Hundreds of images per product, multiplied across the catalogue, add up to a lot of small images.

The point is, don’t assume you’re safe because you’ve conquered data quantity. Choose a system that also handles a large number of objects.

Tackle the number of applications running simultaneously

As businesses take advantage of multi-tenancy, they naturally expand the number of applications they run. Ideally, these applications should be able to run on the same cluster. I’ve seen a large financial customer, for instance, that started with one or two applications running on a new large-scale data set. Given the success of those initial applications, the customer soon added hundreds more on the same cluster.

In contrast, if your infrastructure doesn’t make it feasible for different applications and groups to share data (thanks to open APIs and ways to avoid interference), you may end up with a sprawling proliferation of machines and unnecessary data copies. Both put an added burden on IT.

Containerized applications, orchestrated via the open source Kubernetes framework, also help address the challenge of running many applications on the same cluster. Your data infrastructure should work in concert with Kubernetes to provide a way to persist data from stateful applications that are running in containers.
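To make that interplay concrete, here is a minimal sketch that uses the official Kubernetes Python client to request persistent storage for a stateful application. The storage class name hpe-ezmeral-csi is a hypothetical placeholder; substitute whatever class your CSI driver actually registers.

```python
# Request a persistent volume claim that a stateful containerized
# application can mount, so its data outlives any individual container.

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="model-checkpoints"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],        # many pods can share the volume
        storage_class_name="hpe-ezmeral-csi",  # hypothetical class name
        resources=client.V1ResourceRequirements(
            requests={"storage": "100Gi"}
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

Once the claim is bound, pods mount it like any other volume, and the application’s state survives container restarts and rescheduling.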
Take the battle to geo-distributed locations

Another challenge of scale is the number of geo-distributed locations that serve as data sources. Keep in mind, geo-distributed data is not just for large industrial use cases. The edge comes into play in a variety of more common situations, including retail and finance, where transactions happen at many locations.

Can you deal with this challenge? A data infrastructure that stretches from core to edge can handle geo-distributed data sources effectively by managing data motion at the platform level. This capability is important for capturing data at many sources and moving it as needed back to core data centers, whether on premises or in the cloud. Your infrastructure should also let you move applications to the edge, where partial processing, analytics, or updated models need to run.

Your multi-faceted defense

The best way to conquer a multi-headed creature is to keep all the heads in view and wield a multi-faceted weapon that can counter them all at once. When the creature is scale, an excellent defense is HPE Ezmeral Data Fabric, a data infrastructure engineered to handle all of these aspects of large scale at the same time. The data fabric (formerly the MapR Data Platform) is part of the HPE Ezmeral Software Portfolio, and it also provides the core data infrastructure of the HPE Ezmeral Container Platform.

To find out more about how these technologies can help you fight your own Hydra, explore these resources:

- Read the blog post “If HPE Ezmeral Data Fabric is the answer, what is the question?”
- Read the blog post “HPE Ezmeral Data Fabric: A sneak peek at what’s coming in 6.2”
- Explore the HPE Ezmeral Data Fabric platform page in the HPE Developer Community

____________________________________

About Ellen Friedman

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Before her current role at HPE, she worked for seven years at MapR Technologies, and she is a committer for the Apache Drill and Apache Mahout open source projects. She is also a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.