Learning Lessons from Extreme Cases

BrandPost By Ted Dunning
Oct 16, 2020
IT Leadership

Credit: shutterstock

What is it that makes stories of mountaineers, Arctic explorers, and test pilots so captivating? Most of us are never going to climb the Eiger or fly a new airplane to the edge of space. And we’re probably not going to be the next Amelia Earhart or Neil Armstrong. So why are these stories so popular?

In a similar vein, why do so many people in IT fixate on how Google or Facebook manage their systems? Again, we will never have even a tiny fraction as much data, so why are we interested?

The fact is, we can learn many valuable lessons on the edge of the impossible – lessons we apply in less extreme situations. We may not have hundreds of petabytes of data with strict time limits for processing. And at this moment, we may not have trillions of files spread across hundreds of data centers large and small.

Even so, we can learn valuable lessons by examining systems running at extreme scale and speed. Learning from the experience of others (both successes and failures) can help us make our own systems work better. These lessons can also help us anticipate solutions to difficulties we are likely to face as our processing needs inevitably increase.

The stories below describe extreme situations with practical lessons for building systems that handle big (but not crazy big) amounts of data. These stories have helped me build similarly big systems and might help you.

Lesson #1: Data from the Field

I have worked with several prominent car makers on their autonomous vehicle development efforts. For all these manufacturers, the key requirement is acquiring vast amounts of data from real cars driving in realistic conditions. This data needs to be processed, selected, and redacted in the field and then transferred to the core systems for analysis and machine learning.

The common factors include the tremendous scale in terms of raw size, number of files, and data rates involved. Also, these systems have typically grown by a factor of ten while in production use, from tens of petabytes to hundreds. Some are even verging on exabyte scale.

In terms of implementation, all these projects have opted to use a unified data fabric that extends from the field into the core data centers.

Delegating so much responsibility to the data fabric was a bold move for these automakers. Yet it has paid off in dramatically lower system complexity, both for those who maintain the infrastructure and for application developers and data scientists.

The lesson here is clear. Standard computational and data fabrics that uniformly extend into the field (where the data starts) massively simplify the logistics of dealing with data. This simplification applies even if the quantity of data is decreased by a factor of 100 or 1,000. Letting the platform deal with what runs where and which data moves when saves time and effort that can instead be applied to solving the real problems.
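As a rough sketch of what this can look like for application code (the paths and file layout below are hypothetical, and I assume the fabric is exposed as an ordinary POSIX mount at the edge), the field-side job simply writes selected data to what looks like a local directory and leaves transfer to the core entirely to the fabric:

import json
import shutil
from pathlib import Path

# Hypothetical paths: the data fabric exposes the same namespace at the edge
# and in the core, so the application only ever writes to a local-looking path
# and leaves replication to the fabric.
RAW_DIR = Path("/mnt/fabric/vehicle-042/raw")
SELECTED_DIR = Path("/mnt/fabric/vehicle-042/selected")

def select_and_stage(drive_log: Path) -> None:
    """Keep only segments flagged as interesting; the fabric handles transfer."""
    meta = json.loads(drive_log.with_suffix(".json").read_text())
    if meta.get("interesting", False):  # e.g., a rare driving event
        SELECTED_DIR.mkdir(parents=True, exist_ok=True)
        shutil.copy2(drive_log, SELECTED_DIR / drive_log.name)

for log_file in RAW_DIR.glob("*.bin"):
    select_and_stage(log_file)

Nothing in this code knows about data centers, replication schedules, or bandwidth; that is precisely the simplification.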

Lesson #2: Ploughing the Field

At Google, in the early days of building large systems, it became clear that people came to expect perfection if a system stayed up too long. No matter what promises anyone made about 99% uptime, if the service was accidentally perfect for more than a few months, users presumed it would continue in its perfect ways. Users even based their own guarantees on this mistaken perfection. Services that aren’t supposed to be perfect should have imperfection forced upon them just to make sure that unreasonable expectations don’t build up.
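For a sense of how much imperfection a typical promise actually allows, the arithmetic is simple (a minimal worked example, not tied to any particular service):

# Downtime permitted by an uptime promise over a year and over a month.
def allowed_downtime_hours(uptime_fraction: float, period_hours: float) -> float:
    return (1.0 - uptime_fraction) * period_hours

print(allowed_downtime_hours(0.99, 365 * 24))  # ~87.6 hours per year
print(allowed_downtime_hours(0.99, 30 * 24))   # ~7.2 hours per month

A 99% promise permits several days of downtime per year, yet users of an accidentally perfect service come to expect none at all.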

This idea that accidental perfection is actually bad has led to the practice known as ploughing the field. The idea is that all processes should be restarted at intervals and hardware should be rebooted periodically. This practice has surprising benefits for security. Advanced persistent threats (APTs) cannot go dormant for very long; they must repeatedly re-breach machines as those machines restart, which makes these attacks much easier to spot.
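A minimal sketch of scheduled ploughing, assuming the services run as Docker containers (the service names and intervals below are hypothetical), is just a loop that restarts one service at a time on a rolling schedule:

import subprocess
import time

# Hypothetical container names; assumes the services run under Docker.
SERVICES = ["ingest-worker", "feature-extractor", "report-api"]
STAGGER_SECONDS = 10 * 60      # pause between restarts so capacity never hits zero
CYCLE_SECONDS = 24 * 60 * 60   # plough the whole field once a day

def plough_once() -> None:
    """Restart each service in turn, one at a time."""
    for name in SERVICES:
        subprocess.run(["docker", "restart", name], check=True)
        time.sleep(STAGGER_SECONDS)

if __name__ == "__main__":
    while True:
        plough_once()
        time.sleep(CYCLE_SECONDS)

In practice this would likely be driven by the orchestration platform rather than a standalone script, but the principle is the same: restarts are routine, not exceptional.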

The lesson here is to deliver what you promise, but make sure you don’t deliver too much more than that. Ploughing the field regularly is a good thing.

This technique sounds seriously disruptive, but I have seen large organizations plough the field continuously without service disruption. They achieve this by aggressive containerization supported by a ubiquitous data fabric. Even where legacy systems have not been converted to be cloud native (that is, as services implemented by clouds of containers), running legacy systems in containers with persistent data in a data fabric is fairly easy, requires very little change, and gets much of the benefit. Commonly, very little disruption occurs as containers are stopped and restarted on different hardware because all persisted data is still available.
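For example, a minimal sketch of running a legacy service this way (the service name, image, and mount point are hypothetical; it assumes the data fabric is mounted on the host as an ordinary POSIX path) looks like this:

import subprocess

# Hypothetical names and paths: a legacy service runs unmodified in a container,
# with its persistent state kept on a POSIX mount backed by the data fabric.
subprocess.run(
    [
        "docker", "run", "-d",
        "--name", "legacy-billing",
        "-v", "/mnt/fabric/billing/state:/var/lib/billing",
        "legacy-billing:1.0",
    ],
    check=True,
)

Because the persistent state lives in the fabric rather than on the host, essentially the same command can bring the service back up on different hardware after it has been stopped.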

Lesson #3: Allow for Explosive Growth, but Don’t Be Silly

I’ve built several systems in startups that experienced fairly explosive growth. One consistent pattern I have seen is that techniques that are fine at one level of scale can prove woefully inadequate at 100x that scale. Unfortunately, the table stakes for data size in all kinds of fields increase that much or more every few years. Legacy systems and architectures that were fine a decade ago now represent a substantial and insidious technical debt.

Conversely, I’ve seen startups fail because they chose complex (but fashionable) systems that couldn’t reasonably scale down.

Larger, more stable companies fall into this trap as well. Prototype projects either have to bear the burden of full-scale operation from the start (and are thus too expensive to even try, given the risk of failure), or they are implemented using techniques that cannot be moved into production without a complete architectural overhaul. The most common outcome is a near-complete lack of high-risk experimentation and a consequent inability to adapt.

The simple answer here is to use a foundation that allows simple experiments to co-exist interchangeably with production-scale implementations. This lets toy prototypes merge with production systems easily and quickly and then, if they work, be upgraded seamlessly. Practically speaking, the key aspect of this type of foundation is a universally accessible data fabric that can transparently scale up or down. Basically, use the coattails and sunk costs of large (successful) applications to prototype new projects. Since these prototypes already work against live data, they can be upgraded to production quality bit by bit as they prove themselves valuable. The data fabric underneath can scale gracefully as this happens.
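As a sketch of what such a coattails prototype can look like (the mount point and file layout are hypothetical), the prototype reads a small sample of the same live data the production jobs use, straight from the shared fabric, with nothing to copy, provision, or tear down:

import random
from pathlib import Path

# Hypothetical shared namespace: production jobs write here, prototypes read here.
PRODUCTION_EVENTS = Path("/mnt/fabric/prod/events")

def prototype_metric(sample_size: int = 100) -> float:
    """Toy prototype: average record count over a random sample of live event files."""
    files = list(PRODUCTION_EVENTS.glob("*.jsonl"))
    sample = random.sample(files, min(sample_size, len(files)))
    counts = [sum(1 for _ in f.open()) for f in sample]
    return sum(counts) / len(counts) if counts else 0.0

if __name__ == "__main__":
    print(f"average records per event file (sampled): {prototype_metric():.1f}")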

Done well, this coattails approach often means prototypes can be developed and proven in just days or weeks with no incremental workload imposed on the infrastructure teams and no incremental system cost. This means lots of highly speculative concepts can be tested and successful ones can be subsequently productized relatively easily.

How Can I Put These Lessons into Practice?

Although I have described extreme cases here, we can learn some core lessons from them. They all relied on foundational technologies that helped them succeed: notably, a solid and scalable data fabric that supports multi-tenancy, together with a robust container platform.

The outcome? The teams I describe in the stories above are succeeding because they are confronting extreme situations and putting in place solutions that effectively address them. And although the rest of us probably won’t face the same level of challenge, we can benefit from using the same types of solutions.

To learn more, visit the HPE Ezmeral or the HPE Ezmeral Data Fabric web pages. For a more in-depth technical view, check out the HPE Ezmeral Data Fabric developer page.

____________________________________

About Ted Dunning

Ted Dunning is chief technology officer for Data Fabric at Hewlett Packard Enterprise. He has a Ph.D. in computer science and is the author of more than 10 books on data science. He has over 25 patents in advanced computing and plays the mandolin and guitar, both poorly.