Let's face it: downtime is not only frequent, but expected. What is your company doing to ensure speedy recovery and restoration of service when the inevitable occurs?

The cost of downtime to business, company reputation, customer experience and trust has never been higher. Given the always-on, connected nature of software-driven businesses, customers and users have grown less forgiving and more fickle with their attention. An outage in a single service can affect all of its users. An outage in a multi-tenant platform has a multiplied impact, because it affects the users of every service provider running on the platform.

Balancing preparedness for a black swan event against minor downtime events

As enterprises design their disaster recovery solutions, it is easy to focus only on preventing the big disasters and outages. These are the "black swan" events that have an incredibly large, almost devastating impact on service availability. The impact can be wide-ranging: it can extend both the length of time the service is out of commission and the amount of data that is lost.

As big as these events are, the impact of minor but frequent downtime cannot be ignored. Enterprises need to pay attention to discovering, diagnosing and preventing the smaller outages that occur more often. These small downtimes add up over the course of a year and can completely topple service availability targets and goals.

There are several options available for disaster recovery, from on-premises solutions to cloud-based solutions that leverage the infrastructure and platform capabilities offered by major cloud providers such as AWS, GCP and Microsoft Azure.
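To make the point about small outages adding up more concrete, here is a rough worked example. The 99.9% availability target and the 15-minute weekly outage are hypothetical figures chosen for illustration; they do not come from any survey or real service.

```python
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_target: float) -> float:
    """Minutes of downtime per year that a target such as 0.999 allows."""
    return MINUTES_PER_YEAR * (1.0 - availability_target)


def achieved_availability(outage_minutes: Sequence[float]) -> float:
    """Fraction of the year the service was up, given each outage's length."""
    return 1.0 - sum(outage_minutes) / MINUTES_PER_YEAR


# Hypothetical year: one 15-minute outage every week, no major disaster.
weekly_outages = [15.0] * 52

target = 0.999                             # assumed "three nines" goal
budget = downtime_budget_minutes(target)   # about 525.6 minutes per year
used = sum(weekly_outages)                 # 780 minutes
target_met = achieved_availability(weekly_outages) >= target

print(f"budget: {budget:.1f} min, used: {used:.1f} min, target met: {target_met}")
```

Under these assumptions the service misses its target without a single large outage, which is the scenario the section warns about.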
Cost of small downtime events

The cost of such minor downtimes can easily add up. Frequent downtime increases the likelihood that a larger number of users are affected. It also increases the likelihood that the same user is hit repeatedly across outages. Such frequent downtime can erode trust in the service. Even if customers do not abandon the service immediately, the impact of repeated downtime can be felt at renewal time: the customer may decline to expand the engagement, or may decide not to renew at all. SaaS businesses that depend on monthly or annual recurring revenue are extremely susceptible to the impact of frequent, minor downtimes.

Key capabilities for developing resiliency

Enterprises looking to develop resiliency against both major and minor downtime events should focus on developing and maintaining the following capabilities:

Continuous Backups

All key systems that serve traffic should be continuously backed up. In addition to the services being designed in a RESTful manner, the data they generate, update and maintain should be continuously backed up to a local, centralized or cloud-based disaster recovery system. Backups should be as frequent as possible without degrading the quality and performance of the service. At the same time, backups should be both incremental and snapshot-based, to offer the flexibility to recover from downtime of any length or size. Backups should also be multi-level, to ensure that the backup system is not affected by the same outage that is affecting the primary system.

Continuous Monitoring

All key systems that serve traffic should also be continuously monitored. This is critical to ensure that outages are detected as soon as possible and that disaster recovery is put in motion immediately.
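As a minimal sketch of what detecting an outage as soon as possible could look like, the snippet below declares an outage after a number of consecutive failed health checks. The `OutageDetector` class, its `probe` and `on_outage` hooks, and the threshold of three failures are all hypothetical names and values chosen for illustration, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutageDetector:
    """Declares an outage after `threshold` consecutive failed health checks."""

    probe: Callable[[], bool]       # returns True when the service looks healthy
    on_outage: Callable[[], None]   # hook that would start disaster recovery
    threshold: int = 3              # assumed value; tune per service
    consecutive_failures: int = 0
    outage_declared: bool = False

    def check_once(self) -> bool:
        """Run one health check and return whether the service looked healthy."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            self.outage_declared = False
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.outage_declared:
            self.outage_declared = True
            self.on_outage()  # fire once per outage, not once per failed check
        return False


# Simulated probe results: healthy, four failures in a row, then recovery.
results = iter([True, False, False, False, False, True])
alerts = []
detector = OutageDetector(
    probe=lambda: next(results),
    on_outage=lambda: alerts.append("start recovery"),
)
for _ in range(6):
    detector.check_once()
```

Requiring several consecutive failures is one way to avoid triggering recovery on a single dropped request. In practice `check_once` would be called on a timer, and the threshold trades detection speed against false alarms.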
Similar to backups, monitoring needs to run on a system that is not affected by the same outage that has hit the primary service. In parallel, customer feedback channels also need to be monitored for outage reports. As soon as reports begin arriving or the monitoring system alerts on an outage, the outage should be confirmed and disaster recovery put in motion.

Failover

Once a disaster has been detected, reported and confirmed, a failover process should be initiated that spins up new servers able to continue serving traffic. These servers take on the roles of the servers affected by the downtime. The failover servers should be configured to access the backups that contain the state and information required to serve the traffic.

Failback

When the downtime is over and the underlying issues in the primary service environment have been diagnosed, fixed and confirmed fixed, a failback process should revert all services to the primary environment. Once the failback has been confirmed successful, the failover servers can be reclaimed and destroyed.

Conclusion

In a recent survey, only 37% of respondents reported meeting their service availability goals. It was also reported that 71% of respondents had experienced a downtime event in the last 12 months, with 41% having experienced one in the last 3 months. This shows that downtimes are not only frequent but also expected, and thus require careful planning and design, not only to mitigate them but also to ensure speedy recovery and restoration of service. Enterprises have several options at their disposal and should carefully evaluate and choose the solution that best fits their needs and provides the agility required to detect and recover from unexpected downtimes.