Unlocking the secret to scale-efficient systems: A clue from skyscrapers

Scale, done well, isn’t just bigger than what came before. It’s something different altogether. 

Take the Chrysler Building in New York City, for instance. It’s an early skyscraper and a classic of Art Deco design. At 77 stories, it may seem like just a taller version of a 3-story building, but it’s not. Completed in 1930, it took advantage of a technical revolution that made the skyscraper era possible: innovations in the infrastructure on which tall buildings could be built.

Traditional construction methods would not work at a much taller scale. To support the greater height with load-bearing walls, those walls would have to be so thick that there would be almost no interior space.

Instead, the new tall buildings featured walls that were non-load-bearing – so-called curtain walls. These skyscrapers used a steel framework as the fundamental infrastructure to support the weight of the building. Other revolutionary changes in infrastructure included passenger elevators and new methods of fireproofing.

So what do skyscraper technical advances have to do with building modern large-scale data and computational systems? The key lesson is not to rely on just doing a bigger version of what you’ve done before. Instead, it’s important to take full advantage of modern technological advances in data infrastructure and orchestration of computation in order to build a truly scale-efficient system.

What is “scale efficient”?

Building and maintaining large-scale systems that function well in production is challenging, but people often make it harder, and more expensive, than it needs to be. In contrast, people working with a scale-efficient system find it relatively straightforward, even reasonably easy, to adapt to changes as systems grow.

Being scale-efficient is not just a question of whether or not you can get an analytics or AI system working at large scale at all. It’s a matter of working with a system optimized for scale and changes in scale, a system that is cost-effective and does not overly burden IT in meeting current and future SLAs. Scale-efficient also means having the flexibility to adapt to change without having to completely re-architect your system. 

There are several fundamental requirements for scale-efficiency. One is to have a way to efficiently orchestrate computation across different locations and in a variety of easily specified environments. Containerization of applications and use of an efficient orchestration framework such as Kubernetes can help.
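As a minimal sketch of what "easily specified environments" can look like in practice, here is a hypothetical Kubernetes Deployment for a containerized analytics service. The application name, container image, replica count, and resource figures are all illustrative assumptions, not taken from any real system:

```yaml
# Hypothetical Kubernetes Deployment for a containerized analytics service.
# Image name, replica count, and resource figures are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
spec:
  replicas: 3                # scale up or down by changing one number
  selector:
    matchLabels:
      app: analytics-worker
  template:
    metadata:
      labels:
        app: analytics-worker
    spec:
      containers:
      - name: worker
        image: example.com/analytics-worker:1.0   # hypothetical image
        resources:
          requests:
            cpu: "500m"
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi
```

The point is the declarative style: scaling this workload becomes a one-line change (or a `kubectl scale` command), and the same specification can be applied on-premises, at edge locations, or in the cloud.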

Other requirements have to do with data, and these are more often overlooked. It’s essential to develop a comprehensive data strategy and to use modern data infrastructure specifically engineered for scale-efficiency across different locations on-premises, at edge locations, and in cloud deployments. 

Self-test: Is your system scale-efficient? 

How can you tell if the system you’ve built is scale-efficient? There’s no single test to measure scale-efficiency, but there are a number of tell-tale signs of systems that are unnecessarily cumbersome, expensive, or lack the flexibility needed to adapt to changes. In the recently published short book AI and Analytics at Scale: Lessons from Real-World Production Systems, my co-author Ted Dunning and I describe many indicators that a system is not scale-efficient, forcing you to accept trade-offs that are avoidable. The following list includes some of the most important indicators. Do you see your own system in any of these?

  • You cannot substantially scale up (or down) without having to re-architect your system

  • You cannot easily add new types of applications or make use of new tools without having to re-architect your system

  • It’s necessary to increase IT staff to accommodate short-term or seasonal changes in the scale of system workloads or amount of data

  • Legacy applications cannot run alongside modern big-data applications, sharing the same data infrastructure

  • Data motion between edge locations and core data centers or between on-premises locations and cloud must be programmed at the application level instead of being configured and handled natively by the data infrastructure

  • AI and analytics projects interfere with each other and need to run on separate systems (clusters) in order to function reliably and with good performance

If one or more of these indicators sound familiar, you should consider making changes to your overall data strategy and data infrastructure. Here are some examples of the difference a scale-efficient approach can make. 

Example #1: Simplify workflow and architecture at scale

One common impact of scale-efficient systems is that typical workflows can often be simplified. Compared with systems assembled over time from a collection of point solutions for data infrastructure, a scale-efficient system often requires fewer steps to achieve the same goals and fewer separate physical components on which the workflow runs. The following pair of figures illustrates the simplification that can result from running workloads on a modern data infrastructure specifically designed to meet the requirements of scale efficiency.

Here is a typical large-scale workflow on a non-scale-efficient system built over time from point solutions:

[Chart 1: A typical large-scale workflow on a non-scale-efficient system built over time from point solutions]

Contrast that complex set-up with the simplification you see in the same workflow shown below as it would run on a modern, scale-efficient data infrastructure (in this case, the HPE Ezmeral Data Fabric):

[Chart 2: The same workflow running on a modern, scale-efficient data infrastructure]

(More details about this radical simplification in workflow and architecture can be found in this blog post I’ve written on the topic.)

But as important as simplification is, there are other significant differences in the way scale-efficient systems work. The following two examples provide a taste.

Example #2: Self-healing data infrastructure for business continuity and smooth transitions 

Reliable self-healing is a key capability for scale-efficient data infrastructure. A large-scale data platform should seamlessly and automatically provide continuity of data access by applications even when a disk or machine fails. In addition, the platform should restore the system’s resiliency by automatically re-replicating data from the surviving copies. All of this should happen in the background, so that to the human user or workload the hardware failure is invisible.

Self-healing doesn't just provide safety from data loss and ensure business continuity. It also lets you add hardware without any interruption, which means you can scale the system up or try out new applications easily and safely.
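The re-replication idea behind self-healing can be sketched as a toy simulation. This is an illustration of the general principle only, not the actual implementation of HPE Ezmeral Data Fabric or any other platform; the class, node names, and replication factor are all made up for the example:

```python
import random

REPLICATION_FACTOR = 3

class Cluster:
    """Toy model of a data platform that re-replicates blocks when a node fails."""

    def __init__(self, node_names):
        self.nodes = {name: set() for name in node_names}  # node -> block ids

    def write(self, block_id):
        # Place copies of the block on REPLICATION_FACTOR distinct nodes.
        targets = random.sample(list(self.nodes), REPLICATION_FACTOR)
        for node in targets:
            self.nodes[node].add(block_id)

    def replicas(self, block_id):
        # Which nodes currently hold a copy of this block?
        return [n for n, blocks in self.nodes.items() if block_id in blocks]

    def fail_node(self, name):
        # Simulate a hardware failure, then "heal" in the background:
        # copy each under-replicated block to a surviving node.
        lost = self.nodes.pop(name)
        for block_id in lost:
            survivors = self.replicas(block_id)
            spares = [n for n in self.nodes if n not in survivors]
            if survivors and spares and len(survivors) < REPLICATION_FACTOR:
                self.nodes[spares[0]].add(block_id)  # restore resiliency

cluster = Cluster(["n1", "n2", "n3", "n4", "n5"])
cluster.write("block-42")
cluster.fail_node(cluster.replicas("block-42")[0])
print(len(cluster.replicas("block-42")))  # still 3 copies after failure + healing
```

The essential property is that the caller never does anything special: the failure and the repair both happen inside the platform, and applications continue to see a fully replicated block.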

Example #3: Accommodation of mission creep: meet SLAs without having to re-architect

Meeting your current scale needs is a good thing, but it’s not sufficient. It’s almost inevitable that there will be mission creep: new types of applications and new data sources will be introduced, the number of users will change, and your system may need to grow to new locations. The combination of a comprehensive data strategy together with efficient orchestration of containerized computation and efficient orchestration of data storage and management make it possible to accommodate mission creep over time without having to re-architect your system. 

Scale-efficient is not just an aspirational concept

Scale-efficiency as described in this article is not just aspirational. Over the past several years, I’ve observed many examples of enterprises taking advantage of systems built for scale-efficiency, with very impressive results. 

If you’d like to find out more about what people have done with scale-efficient systems, read the accounts of almost twenty real-world use cases by downloading a free copy of AI and Analytics at Scale: Lessons from Real-World Production Systems.

____________________________________

About Ellen Friedman

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.

Copyright © 2021 IDG Communications, Inc.