Outages at Delta and Southwest Airlines hold valuable management lessons

The importance of disaster recovery and business continuity plans

Delta’s and Southwest’s planes are flying again, but the system outages both suffered last summer were painful for them. What can your company learn from their unfortunate experiences?

Both airlines’ outages were massive, requiring days to fully recover. Both had to cancel thousands of flights and lost millions of dollars in revenue. Because airlines are highly dependent on IT, both companies undoubtedly had a disaster recovery plan (DRP) and a business continuity plan (BCP). Apparently, their plans were not up to the task. Are yours? Clearly, a robust DRP and BCP are critical for all enterprises. In addition, IT leadership, executive management and the board need to consider the following:

  • An outage generates additional costs. In addition to the lost revenue, Delta incurred other expenses associated with the outage. On-duty pilots and crews had to be paid, even if they didn’t fly. Ed Bastian, the CEO, said, “We’ve got Delta teams working around the clock to restore our system capability.” Clearly, those efforts consumed a great deal of time that took employees away from their normal responsibilities. And given the difficulty of the problem, many employees probably got paid overtime. On top of that, Delta sent travel vouchers to customers who were delayed more than three hours.
  • The damage is not just financial. The outage also harmed Delta’s reputation. Customers initially expressed their frustrations on social media. Conversations about Delta spiked to 43,000 that day, roughly 12 times the 3,600 or so on a normal day. A number of posts included pictures of long lines of waiting passengers. News articles kept the problem visible long after it had been solved.

Although there is no evidence of internal finger-pointing at either Southwest or Delta, the blame game is common in many enterprises. Highly visible problems embolden some people to use the crisis as a political weapon against their rivals.

  • Robust communications are critical. Delta’s communications efforts worked well. Starting at 5:05 a.m. on the day of the outage, Delta posted periodic status updates on its website. That afternoon, the CEO apologized in a video and offered to waive change fees for passengers who were inconvenienced. The following day, Bastian posted a second video in which he apologized again, saying, “This is not who we are.” He went on to say that, although systems had been restored, Delta would have some cancellations throughout the following day. Customers responded favorably to the updates and videos, appreciating Delta’s frankness.
  • Fallback options can be limited. When the BCP does not work properly, organizations that are highly dependent on IT often find that it is virtually impossible to process business transactions manually. During the Delta outage, some people who had opted to have their boarding passes sent to their phones found that the passes had disappeared when they arrived at the airport. For flights that departed, people with printed boarding passes were able to board. Gate agents wrote passenger names on paper and presumably entered the information into the reservation system when it was eventually restored.

While the reservation system was down, Delta was not able to sell tickets, even on flights that were about to depart with empty seats. When an unavailable computer system is the only repository of operational data, very little can be accomplished manually, and the enterprise (along with its revenue) grinds to a halt.

  • Architectural weaknesses can cause additional problems during an outage. Most large carriers still run reservation systems built on Transaction Processing Facility (TPF), a specialized IBM operating system. Originally created in the 1960s and still supported by IBM, TPF can handle tens of thousands of transactions per second. Although it is highly reliable, it takes time to master and is not well understood beyond the airlines, some hotels and a few credit card processors.

Delta’s passenger service system, Deltamatic, handles ticketing, reservations, standby lists and other passenger-centric functions. The 52-year-old system is closely integrated with TPF. According to Bob Edwards, United Airlines’ former CIO, many airlines continue to operate TPF and other older systems, but have developed modern user interfaces to make things easier for agents. This allows the older systems to continue to assign seats, change reservations and perform other functions for passengers without forcing agents to memorize hundreds of two-to-four-letter codes.

Unfortunately, when the Delta reservation system failed, TPF and Deltamatic failed to synchronize properly with the newer user interface. Employees were forced to use Deltamatic directly for multiple hours until resynchronization was complete.

  • Funding must be allocated for DRP and BCP. Unfortunately, recovery plans are expensive, and they create no new services that either increase revenue or cut costs. This often makes it difficult to get adequate funding, especially during times of financial hardship. According to Scott Nason, American Airlines’ former CIO, during the years when the airlines were close to bankruptcy, most carriers demanded that any investments have short paybacks. Understandable, but unfortunate. Frankly, executive management in most enterprises expects these plans to be in place, but (as with insurance) doesn’t want to pay for them until they are needed. By then, it is too late for them to be effective.
  • Recovery plans should continue to evolve. As business and IT systems become ever more complex, it becomes increasingly difficult to conceive of every possible problem and to test all potential scenarios thoroughly. However, without careful planning and testing, timely recovery from an outage is virtually impossible. Since both business and IT environments are constantly changing, the enterprise’s recovery plans must be funded, updated, and tested on a regular basis.

Low-probability, high-impact events can happen to any enterprise. And they do. So don’t fall prey to the platitude, “It can’t happen here.” The more prepared you are, the less your business will suffer.

Bart Perkins is managing partner at Louisville, Ky.-based Leverage Partners Inc., which helps organizations invest well in IT. Contact him at BartPerkins@LeveragePartners.com.

This story, "Outages at Delta and Southwest Airlines hold valuable management lessons," was originally published by Computerworld.
