Dealing with scale is a multi-dimensional challenge and, if you're not careful, it may feel like you're battling a multi-headed monster.
Just when you think you've solved scale by having a way to store large amounts of data, another problem pops up. Solving the issues created by large-scale systems is like facing the mythical Hydra: when you cut off one of its heads, two more grow back, ready to consume you. How do you survive?
You'll find a hint in the myth: to defeat the monster, you have to deal with all of its heads, not just one. You cannot win by solving a single challenge of scale; you need to be aware of the different aspects of the problem, some of which may be hidden.
You should be able to meet your SLAs affordably, at your current scale and as your system grows. For this to be practical, you must do it without scaling up your IT team to match the growing quantity of data.
How can you do this? Start with the idea that "big" isn't just about the number of bytes.
There's more to scale than the number of bytes
Scale includes large amounts of data, but it also involves other things, such as data diversity. Traditionally, people mainly had files and databases. Now data includes images, media, events, and much more. Data diversity is just one of the issues you face with large-scale projects. Either you conquer the various aspects of scale, or the problems that ensue will be large.
Here are four key dimensions of scale, each of which must be conquered.
Conquering the amount of data
How big is big data? People work with amounts of data today that just a few years ago would have been considered unusual. Consider data big when its size itself becomes a technical challenge. People who work with less than 50 terabytes may not feel the challenge of scale, but the businesses the Ezmeral team at HPE routinely works with have data ranging from 50 terabytes to 500 petabytes and beyond, so their data infrastructure needs to handle scale easily.
One way to conquer large data size is not to make data bigger than it needs to be. Limitations imposed by some data technologies lead people to make unnecessary copies of huge data sets. This happens in part because their infrastructure lacks the open and flexible data access methods needed to accommodate different analytics or machine learning tools and languages.
Data also gets copied unnecessarily to deal with "noisy neighbors," competing applications that may create hot spots or congestion. Data infrastructure should have platform-level capabilities that automatically make local copies of small subsets of data to avoid hot spots. In addition, your data infrastructure needs fully distributed metadata to avoid congestion when multiple applications access large data sets.
Finally, when you have a lot of data, you need a convenient way to find it. Data infrastructure with familiar file and directory access and a global namespace can improve the efficiency of your team because references to data remain stable.
The lesson is this: conquer data size while meeting SLAs with the help of your data infrastructure. It should provide platform-level capabilities for secure, affordable large-scale data storage plus flexible data access and management. This lets you meet current and future challenges of scale.
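To make the point about stable data references concrete, here is a minimal sketch in Python, assuming the data fabric (or any shared file system with a global namespace) is mounted at a stable POSIX path; the cluster and project names below are hypothetical.

```python
import os

# Hypothetical mount point: with a global namespace, this same path is valid for
# every team, tool, and cluster node, so references to data stay stable over time.
DATA_ROOT = "/mapr/my.cluster.com/projects/customer-360"

# Walk the project directory with ordinary file and directory access,
# counting files and total bytes.
total_files = 0
total_bytes = 0
for dirpath, _dirnames, filenames in os.walk(DATA_ROOT):
    for name in filenames:
        total_files += 1
        total_bytes += os.path.getsize(os.path.join(dirpath, name))

print(f"{total_files:,} files, {total_bytes / 1e12:.2f} TB under {DATA_ROOT}")
```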
Businesses shouldn't have to re-architect their systems as data grows.
Don't be slain by the number of objects
Another challenge of scale is a large number of objects. Working with hundreds of millions or even billions of small files can swamp your data infrastructure unless it is designed to handle scale in terms of the number of objects as well as the amount of data. Data infrastructure really matters here; we see customers who routinely work with trillions of files.
While most businesses don't have to deal with that scale, common use cases involve tens of millions to billions of objects. Businesses using IoT-based sensor data and metrics-oriented service companies tend to have large numbers of files.
Consumer websites are a typical example: they often need to display multiple images of the items they sell. While offerings are not usually in the millions, there are many versions of each item or service. An online retail catalogue may show multiple colors, sizes, and views for each of many products, all stored as images. Hundreds of images per product across millions of products add up to a lot of small images; 200 images for each of 5 million products, for example, is a billion objects.
The point is, don't assume you're safe just because you've conquered data quantity. Choose a system that also handles a large number of objects.
Tackle the number of applications running simultaneously
As businesses take advantage of multi-tenancy, they naturally expand the number of applications they run. Ideally, these applications should be able to run on the same cluster. I've seen a large financial customer, for instance, start with one or two applications running on a new large-scale data set. Given the success of these initial applications, this customer soon added hundreds more on the same cluster.
In contrast, if your infrastructure doesn't make it feasible for different applications and groups to share data, through open APIs and ways to avoid interference, you may end up with a sprawling proliferation of machines and unnecessary data copies. Both put an added burden on IT.
Containerized applications, orchestrated via the open source Kubernetes framework, also help address the challenge of running many applications on the same cluster. Your data infrastructure should work in concert with Kubernetes to provide a way to persist data from stateful applications running in containers (a minimal sketch of this pattern appears below).
Take the battle to geo-distributed locations
Another challenge of scale is the number of geo-distributed locations that serve as data sources. Keep in mind, geo-distributed data is not just for large industrial use cases. The edge comes into play in a variety of more common situations, including retail and financial services, with transactions happening at many locations. Can you deal with this challenge?
A data infrastructure that stretches from core to edge can deal effectively with geo-distributed data sources by handling data motion efficiently at the platform level. This capability is important for capturing data at many sources and moving it as necessary back to core data centers, whether on premises or in the cloud. Your infrastructure should also let you move applications to the edge, where partial processing, analytics, or updated models need to run.
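To illustrate the earlier point about persisting data from stateful containers, here is a minimal sketch of the generic Kubernetes pattern, using the official Kubernetes Python client: the application requests storage through a PersistentVolumeClaim, which a pod then mounts as a volume. The storage class name and size shown are assumptions, not HPE-specific settings.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
config.load_kube_config()
core = client.CoreV1Api()

# Request persistent storage for a stateful application. The storage class name is a
# placeholder; in practice you would use whichever class your data platform provides.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="app-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="example-storage-class",  # assumption: replace with your class
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)

# A pod or StatefulSet then mounts this claim by name ("app-data") in its volumes
# section, so the application's state survives container restarts and rescheduling.
```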
Your multi-faceted defense
The best way to conquer a multi-headed creature is to keep all the heads in view and have a multi-faceted weapon to counter them all at once. When the creature is scale, an excellent defense is HPE Ezmeral Data Fabric, a data infrastructure engineered to handle all of these aspects of large scale at the same time. The data fabric (formerly the MapR Data Platform) is part of the HPE Ezmeral Software Portfolio. HPE Ezmeral Data Fabric also provides the core data infrastructure of the HPE Ezmeral Container Platform.
To find out more about how these technologies can help you fight your own Hydra, explore these resources:
Read the blog post "If HPE Ezmeral Data Fabric is the answer, what is the question?"
Read the blog post "HPE Ezmeral Data Fabric: A sneak peek at what's coming in 6.2"
Explore the HPE Ezmeral Data Fabric platform page in the HPE Developer Community
____________________________________
About Ellen Friedman

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked at MapR Technologies for seven years and was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O'Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.