Designing to Fail: A Paradigm Shift for CIOs

BrandPost By David Torgerson
Aug 10, 2021

Any piece of an IT system can fail at any point; better to plan for this ahead of time with a design-to-fail approach.

Failure is inevitable. IT teams run into trouble when they assume that services must be up 100 percent of the time. This is where designing to fail, or what InfoQ refers to as designing for resilience, plays a critical role.

Designing for failure means that your team has automated processes in place for when your system fails, and as much control as possible over how that failure occurs. A system designed for failure is more capable of self-healing, restarting and maintaining service when the worst happens.

By shifting our focus from designing systems that constantly achieve high uptime to designing systems that fail in a predictable way, we can ensure quicker recovery and minimal downtime.
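As a rough illustration of what failing in a predictable way can look like in code, the sketch below restarts a failing task a bounded number of times, waits longer between each attempt, and then gives up with a clear error. The function name `run_with_restarts` and its parameters are hypothetical, chosen only for this example; a real system would more likely rely on its orchestrator or process supervisor for this.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_restarts(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, restarting it after a failure up to `max_attempts` times.

    The wait doubles after each failed attempt (exponential backoff), so the
    failure path is bounded and predictable instead of retrying forever.
    `sleep` is injectable so the behavior can be checked without waiting.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # illustrative: real code should catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface one clear error that keeps the original cause.
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

The point of the design is that the caller always sees one of two outcomes, a result or a single well-defined error, within a known number of attempts.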

A design-to-fail blueprint

Organizations need more than just a disaster recovery plan. Systems that are designed to fail should recover on their own as much as possible, rather than depending on the intimate domain knowledge that executing a disaster recovery plan usually requires. Furthermore, these plans often don’t account for the fact that applications built on top of the cloud must tolerate the individual failure of any element of a service or application: hardware failures, OS or system failures, internet failures, BGP or peering issues, and other problems that may be outside of your control.

An easy way to get started is to hold a mock post-mortem, as if you had just experienced a massive failure event. Which systems were impacted? How do these systems restart? What dependencies must be accounted for? Who was notified? What time did the event occur? What was the impact on users? This requires a mindset shift and an ability to visualize your real-time and future states. Let’s look at four important elements of the “design to fail” approach to your systems.

1.   Visualize your systems

IT teams shouldn’t just know how their systems look when uptime is 100 percent. They should also anticipate changes to the cloud environment brought about by downtime incidents. Visualization helps teams see real-time and future states with the appropriate context needed to plan for failure.

2.   Understand your dependencies

During an incident, dependencies can also change. If essential tools experience downtime, IT teams must have a plan for how to move forward with minimal issues until those services are back online.

These teams need to understand the types of data that persist in the system, where that data persists, what the replication schemes are, what data durability requirements apply, and so on. By using visualization and documentation to map which dependencies apply, IT leaders can determine how the organization or team will respond if the system’s dependencies begin to fail. They can then more easily build in redundancy among dependent components so that no single point of failure can weaken or collapse the system.
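Once dependencies are documented, even a very small script can flag obvious single points of failure. The sketch below is a minimal example under stated assumptions: the inventory format (a replica count per component and a list of dependencies per service) and the component names are invented for illustration, and "fewer than two replicas" is used as a deliberately crude stand-in for "not redundant".

```python
from typing import Dict, List


def single_points_of_failure(
    replicas: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> List[str]:
    """Return every component that something depends on but that has
    fewer than two replicas. Components missing from `replicas` are
    treated as having zero known replicas, so they are flagged too."""
    needed = {dep for deps in depends_on.values() for dep in deps}
    return sorted(name for name in needed if replicas.get(name, 0) < 2)


# Hypothetical inventory, for illustration only.
replicas = {"web": 3, "db": 1, "cache": 2}
depends_on = {"web": ["db", "cache"], "worker": ["db", "queue"]}

print(single_points_of_failure(replicas, depends_on))  # ['db', 'queue']
```

Here `db` is flagged because it has a single replica, and `queue` is flagged because nobody recorded how many replicas it has, which is itself a gap worth closing.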

3.   Bring all stakeholders up to speed

Designing to fail requires a variety of stakeholders in the planning process, including IT leadership, cloud architects, application DevOps teams, and others. In addition, business leaders without a technical background are often looped in when large-scale failure incidents occur. CIOs and IT leaders need to determine who should be involved in the failure planning process and then ensure that these stakeholders have input, access and alignment.

Without effective collaboration, IT leaders run the risk of their teams struggling to catch up during an incident. Stakeholders who are poorly informed can’t fully participate during an event or in the planning stages. Visuals, like incident management process flows, are a great way for leaders to communicate the potential and actual implications of a downtime incident to broader internal and external audiences. With those visuals in hand, the IT and infrastructure teams can role-play incident response and plan various scenarios while bringing all stakeholders up to the same level of understanding.

4.   Consider low-risk resiliency strategies

Using multi-region solutions is an important strategy for building enterprise-scale resiliency. One way to achieve a multi-region setup is to use multiple cloud providers. AWS, Microsoft Azure, and Google Cloud all offer multi-cloud and hybrid service options. IT leaders who are seeking greater resiliency for their organizations would be wise to consider new models and opportunities from these public cloud providers.

Designing for failure means striking the right balance that gives the organization control while also preparing for what very well could go wrong. 

The cloud is designed to fail

Although our applications aren’t usually designed to fail, the cloud is. Ensuring high levels of cloud uptime requires safeguards, such as the ability to route traffic to different geographic regions. When failure does occur, the last thing a CIO or an organization wants is for failure to unfold without a guiding plan.
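The idea of routing traffic to a different geographic region can be sketched in a few lines. This is an illustrative example only: the region names and the `is_healthy` callback are assumptions, and in practice this decision is usually made by DNS, a global load balancer, or a provider's traffic management service rather than by application code.

```python
from typing import Callable, List, Optional


def pick_region(
    preferred: List[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy region in preference order, or None if
    every region is down (the caller then needs a last-resort plan,
    such as serving a static maintenance page)."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None


# Hypothetical health data, for illustration only.
status = {"us-east": False, "eu-west": True, "ap-south": True}

print(pick_region(["us-east", "eu-west", "ap-south"], lambda r: status[r]))  # eu-west
```

Returning `None` instead of raising makes the "everything is down" case an explicit outcome that the caller has to handle, which is the same planning discipline the article argues for.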

Retooling failure into a controlled fall returns agency in otherwise troubled situations. This principle is similar to one found in Aikido, a Japanese self-defense martial art. Aikido teaches practitioners how to fall properly—because everyone falls (and sometimes you’re pushed). By falling in the right way, you can roll back onto your feet and minimize injury in the process.

Applications should fall in a similar way. Careful planning, an intimate knowledge of cloud architecture, and design that understands and quickly responds to failure can bring organizations back to their feet again quickly so they can meet uptime requirements and keep customers happy.

How organizations recover from an incident makes a difference in minimizing damage, and these recovery plans can become even more effective and efficient when we approach failure as inevitable and plan for it accordingly. While this requires a massive paradigm shift across the industry, CIOs and IT leaders need to spend the time and resources today to proactively design their systems to fail, allowing organizations to have more effective failure plans in place and achieve the high uptime that keeps us all moving forward.

Learn more about the importance of designing to fail from Lucidchart.