Current headlines from Ukraine have many companies concerned about the safety of employees or contractors residing there. Events like this highlight the importance of developing contingency plans based on events in the world that can impact businesses.
Business continuity is an essential part of the planning process for CIOs and CTOs. Black swan events can impact businesses in significant ways. Some of these events cannot be anticipated, but others can be planned for, and even expected, beforehand. Business continuity is about assessing the threat landscape and having plans in place. Doing so addresses foreseeable threats and builds operational resilience.
The threat landscape
A best practice for leadership teams is to constantly think about the threat landscape, identify potential problems, and prepare for them. Not doing so can result in significant financial impact on companies.
A non-exhaustive list of events that may need to be planned for includes:
- Geopolitical threats (e.g., the Russian invasion of Ukraine)
- Natural disasters (e.g., earthquakes)
- Directed threats (e.g., ransomware)
- Regulatory changes
Some of these threats require implementation and execution up front. Others require a plan that ensures the team knows the key objectives and the actions to take in the face of a threat. CIOs and CTOs need to constantly monitor the threat landscape and update their plans as necessary. External audits like SOC-2 certifications are good forcing functions because they bring an outside inspection of some of the threat surfaces.
Planning for geopolitical threats
At my company, Inflection, planning for possible business disruptions related to Ukraine started a year and a half ahead of the actual conflict. We formulated a set of principles and built out a plan based on those principles. In this case, the key principles we used were:
- Build a geo-diverse team. In addition to Ukraine, we built a substantial presence in the US and Brazil.
- Build work diversity. Rather than having complete functional silos in each region, we asked teams to collaborate across regions. There are downsides to this (additional communication, for example) but it was the right tradeoff for us.
- Prioritize employee and contractor safety. We knew that keeping people safe during a geopolitical event might carry additional costs, and we were OK with spending that money. Inflection offered three months of living expenses to team members in Ukraine to move to a different location, in addition to taking care of logistics like payroll.
- Emphasize written over verbal communication. As an example, every engineering decision of significance goes through a rigorous architecture decision process.
These proactive steps allowed us to prioritize employee safety while ensuring business continuity. In addition to these principles, there was a detailed plan for how we would cover for employees who were unavailable for extended periods of time.
Continuity planning in practice: a deep dive on software availability planning
An example of proactive planning is related to natural disasters. What is your organization’s plan if a disaster (e.g., an earthquake) were to strike the region in which your data center is located and cause a network partition? The example below will work through the thinking assuming you are using a public cloud vendor.
A starting point for planning availability is the promise you make to customers regarding uptime. A common SaaS uptime benchmark is 99.95% availability, which corresponds to roughly 4h 22m 58s of allowed unavailability per year. In planning this out, you need to think about:
- What is your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) when an incident does happen? An agreement on these metrics is required to make tradeoff decisions.
- Do you have maintenance windows? If so, subtract that from the unavailability budget. (You should also be asking yourself why you have a maintenance window.)
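- What is the underlying assurance from the platform you are on? Cloud vendors typically publish service-level agreements (SLAs) with uptime commitments for their services, and your own promise cannot realistically exceed what the underlying platform supports.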
- What should your plan be if an availability zone (a data center) loses availability?
- What should your plan be if a region (multiple availability zones) suffers an outage?
- What is your plan if the vendor (multiple regions) is unavailable?
There are different cost-complexity tradeoffs for these questions. Smaller companies may choose to avoid greater complexity, whereas that might not be an option for larger enterprises.
The goal of planning is to have a clear posture for each of these questions.
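The uptime arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a definitive method: the function name `downtime_budget` is hypothetical, and it assumes an average year of 365.2425 days, which is the assumption that reproduces the 4h 22m 58s figure. The two-hour maintenance window is an invented number used only to illustrate subtracting planned downtime from the budget.

```python
from datetime import timedelta

# Assumption: average Gregorian year length. A 365-day year gives a slightly
# smaller budget (about 4h 22m 48s).
DAYS_PER_YEAR = 365.2425


def downtime_budget(availability_pct: float,
                    maintenance: timedelta = timedelta(0)) -> timedelta:
    """Yearly downtime allowed by an availability target, minus planned maintenance."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    year = timedelta(days=DAYS_PER_YEAR)
    allowed = year * (1 - availability_pct / 100)
    return allowed - maintenance


# Budget for a 99.95% target with no maintenance window.
budget = downtime_budget(99.95)
print(budget)

# Same target with a hypothetical 2 hours of planned maintenance per year.
remaining = downtime_budget(99.95, maintenance=timedelta(hours=2))
print(remaining)
```

A negative result from the second call would mean the maintenance windows alone already exceed what the uptime promise allows.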
Should you support high availability via multiple availability zones? For most organizations, this is a simple decision: Supporting multiple availability zones in AWS adds relatively little expense and complexity.
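One way to see why multiple availability zones are worth it is a back-of-the-envelope calculation. The sketch below is illustrative only: `combined_availability` is a hypothetical helper, the 99% per-zone figure is invented, and the formula assumes that zone failures are independent, which is optimistic because real outages can be correlated.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    # The service is down only if every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones


# Hypothetical per-zone availability of 99%, deployed to 1, 2, and 3 zones.
for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```

Under these assumptions each added zone multiplies the chance of a total outage by the per-zone failure rate, which is why the first extra zone delivers most of the benefit.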
What should you do if there is a regional outage – a disaster recovery (DR) situation? Doing cross-regional synchronization is complex and expensive. Fewer organizations choose to do this. Instead, you could choose to back up your data to another region, and have your RTO/RPO reflect the fact that your tradeoff is longer recovery for a simpler architecture.
What if there is a complete outage for a cloud vendor? Doing cross-vendor deployments is extremely complex and expensive. In most cases, a backup of your data to a different cloud provider is sufficient. But if you are operating a large enterprise, you will probably want to run on multiple cloud vendors for both cost and scale reasons.
Taking all of this into account, a plan needs to be formulated and agreed upon by company executives. Communication plans need to be put in place when an event does occur (e.g., how will we inform customers?), and most importantly, the plans need to be tested. These plans will be meaningless unless they are practiced regularly.
At Inflection, we chose to make the following decisions:
- Support high availability by deploying to multiple availability zones. The loss of a single data center is imperceptible to customers.
- Synchronize data between multiple regions to support an RPO of less than 24 hours and an RTO of less than 72 hours for a regional disaster.
- Synchronize data to a secondary cloud vendor to ensure that in case of a cloud provider full outage, we can still recover.
- Finally, we practice database restoration annually, and test DR every quarter.
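Testing a plan means comparing what a drill actually measured against the agreed objectives. The following is a minimal sketch of such a check, assuming the RPO and RTO targets listed above (less than 24 hours and less than 72 hours). The names `RecoveryObjectives` and `check_drill` and the drill measurements are hypothetical and are not taken from Inflection's tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # data loss window must stay below this
    rto: timedelta  # time to restore service must stay below this


def check_drill(objectives: RecoveryObjectives,
                replica_age: timedelta,
                restore_time: timedelta) -> list[str]:
    """Return the objectives a DR drill failed to meet (empty list means pass)."""
    problems = []
    # replica_age: how stale the newest recoverable copy of the data was.
    if replica_age >= objectives.rpo:
        problems.append(f"RPO missed: data was {replica_age} old")
    # restore_time: how long the drill took to bring the service back.
    if restore_time >= objectives.rto:
        problems.append(f"RTO missed: restore took {restore_time}")
    return problems


regional = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=72))

# Hypothetical drill results, invented for illustration.
passing = check_drill(regional, timedelta(hours=6), timedelta(hours=30))
failing = check_drill(regional, timedelta(hours=25), timedelta(hours=30))
print(passing)
print(failing)
```

Recording the output of each quarterly drill makes it easy to see whether the agreed tradeoff still holds as data volumes grow.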
Planning for directed threats
Threats like ransomware have increased significantly in the past few years. These threats need to be met head on. At Inflection, we do so by:
- Getting SOC-2 certified and ensuring our processes compare with the best in the industry
- Ensuring that data at rest and in transit is always encrypted
- Engaging with bug bounty programs
- Having external agencies run penetration tests
- Ensuring that employee machines are encrypted and have proper software protection against malware, phishing, and other attacks
- Insuring ourselves
A useful exercise for leaders to consider is a “pre-mortem.” In thinking about business continuity, it is best to be proactive rather than reactive.
A pre-mortem is the opposite of a post-mortem (more details in my writeup on Root Cause Analysis). While a post-mortem allows us to analyze what went wrong – after it has already happened – a pre-mortem asks, “What could go wrong? How could we prevent that from happening?” Pre-mortems allow deeper planning of business continuity and a “don’t make me think” approach to reacting to actual incidents because they were already planned for.
Planning for business continuity is a requirement for executives. Companies that wait until disaster strikes will not be able to react quickly. Your executive team must agree on the principles and cost/complexity tradeoffs.