This week’s high-profile resignation of Optus CEO Kelly Bayer Rosmarin, in the wake of the Australian telco’s massive outage that left 10 million Australians and 400,000 businesses without phone or internet for up to 12 hours earlier this month, underscores the stakes involved in setting an IT strategy for business resilience.

At an Australian Senate inquiry last week, Lambo Kanagaratnam, the telco’s managing director of networks, told lawmakers that Optus “didn’t have a plan in place for that specific scale of outage.” Bayer Rosmarin herself admitted that prior to the outage she carried a spare SIM card from competitor Vodafone, and that since the outage she has also carried a second spare SIM from rival Telstra.

During the outage, Optus failed to connect 228 triple-0 emergency calls, including one from the colleague of a man suffering a heart attack.

The network outage, which shows the vulnerabilities in interconnected systems, is a reminder that, despite sophisticated systems, things can and will go wrong, and it offers CIOs some important lessons for taking prudent action now.

As dramatic and widespread as the Optus outage was, such incidents are far from isolated anomalies and happen to many organizations with differing levels of severity. The cost of such outages is also increasing, according to Uptime Institute’s Annual Outage Report 2023.

For CIOs, handling such incidents goes beyond just managing IT systems. It demands a blend of foresight, strategic prioritization, and effective disaster recovery plans. The Optus outage provides a prompt for assessment, offering IT leaders insights into how to strengthen defenses and how to better respond when things go wrong.
Here are some of the key lessons of this latest high-profile IT outage.

Adopt a protocol to test updates first

Initial reports from Optus connected the outage to “changes to routing information from an international peering network” in the wake of a “routine software upgrade.” Parent company SingTel has since disputed that explanation, saying safety systems in Optus’ routers were at fault, not the software upgrade.

In her Senate testimony, Bayer Rosmarin stated that the root cause was that the company’s routers “hit a fail-safe mechanism, which meant that each one of them independently shut down,” an event she said was “triggered by the upgrade on the SingTel international peering network.”

Be that as it may, the outage underscores an important point: Before rolling out updates, particularly organization- or network-wide updates, it’s advisable to test them on an internal system before deploying them to the network. “It’s what they call ‘fat fingers,’” says telecommunications analyst Paul Budde.

“If there is an error in it, you want the network to recognize it and filter it out or you can get this cascading effect across the whole system,” Budde says. “And if the whole network is down, technicians will have problems just getting into the system. Then the question becomes: What is your redundancy?”

In the case of Optus, the fix involved a system reset of more than 100 devices in 14 sites across Australia.
In all, a core group of 150 engineers and technicians worked to remedy the outage, “while 250 other workers and five international companies also provided support,” according to a report from ABC News based on Senate inquiry documents.

Map weak points and address them

Gabby Fredkin, head of data and analytics at IT research and advisory firm Adapt, says it is vital to map your company’s infrastructure, segment services so they can stand alone in the event of an outage, identify weak points, and stress-test those weak points to understand any vulnerabilities in the system.

“It’s easier said than done,” Fredkin concedes.

Still, networks are only as robust as their weakest points, and when there’s a single point of failure, especially if it relates to critical infrastructure, it can result in crippling system-wide outages. At the very least, CIOs must know where these single points of failure exist in their systems to help ensure redundancy and provide context for making decisions around priorities and budget.

“You may not be able to have redundant paths across your entire network; it’s just too expensive.
But when major outages happen to your organization or others, it’s an opportunity to review the risk versus the cost,” says Matt Tett, managing director of Enex Test Lab.

“It is worth reviewing the budget and considering whether it’s good to have more dual loading on the network to save a bit of pain in the future,” he says.

Plan for inevitable outages

Even if they’re not overseeing vast networks like Optus’, IT leaders and their executive counterparts must plan for outages, whether their own or those of their service providers, as even small or localized outages can disrupt the business and its customers.

“It’s important to review your business continuity plans and ensure you’ve got some kind of backup, where possible, to continue with [business as usual],” says Tett.

This business continuity plan might include processes for reverting to paper-based systems, shifting to cellular coverage instead of internet, equipping executives and key staff with dual-SIM phones so they can switch networks and keep communicating, or whatever else is relevant to the organization.

“It’s like having a flight manual so that if you lose a significant part of the technology you can try and ensure there are some offline ways to continue functioning,” he says.

Spark the disaster recovery conversation

CIOs can use these headline-making incidents to spur conversations with their infrastructure leaders to review their disaster recovery plan. “Don’t wait for something to happen.
It should be an ongoing, systematic approach to look at where vulnerabilities lie,” says Fredkin, who cites Netflix’s Chaos Monkey, which creates random outages in its production environment, as a key component of the streaming media giant’s strategy for improving the resiliency of its complex systems.

“Causing chaos in their system allows them to expose weak points, see how things might pan out, and plan and run drills of what could happen,” he says.

Conversations around disaster recovery need to involve the CFO and CEO to map the risks of being offline and of losing customer trust, as well as the costs to mitigate those risks. “How one company is impacted can differ substantially to the way another company’s impacted, so you’ve got to take that into account too,” Fredkin says.

Understand third-party risks

According to Uptime, managed digital infrastructure services, including cloud, colocation, telecom, and hosting companies, account for a growing proportion of outages today. As such, IT leaders must be aware of, and know how to manage, third-party vendor risks, says Budde, “particularly in a technological landscape where cost-saving measures and outsourcing have become common.”

For software or hardware updates, it’s vital to have a list of critical vendors along with the timing and nature of updates. CIOs need to look at whether it’s feasible to roll out updates to some customers and not others or to parts of their infrastructure and not others, Fredkin says. They also need to find “a way you can do some testing so it doesn’t impact the entire production environment,” he adds.

“Having good relationships with the people who provide the hardware and the software is crucial.
Knowing when something, like an update, is coming ahead of time, and having some sort of control over when that update is pushed through to your organization can be very beneficial,” he says.

Make the case for IT modernization

As unfortunate as they are, headline-grabbing outages often offer the opportunity for IT leaders to make their own case for IT modernization, Fredkin advises. Although not expressly the case with Optus, when systems go offline, it is often related to a legacy technology issue, and these incidents can help motivate buy-in at the leadership and board level to update systems to ensure they’re secure and resilient at speed and at scale, he says.

“When CIOs are making a modernization use case, they need to have the stakeholder buy-in for the business to come along the journey,” he says.

Modernization of complex, mission-critical functions can take two to three years to complete, so there needs to be a way of ordering and prioritizing efforts as well. “Think of it like a traffic-light system,” Fredkin says, looking at what is crucial and critical, and what is urgent. “What are the biggest gaps in the system? And in terms of the longer-term refresh, that’s a different prioritization, because some things need to be done in a specific order,” he says.

“It’s that classic waterfall mentality, which still has a very big place when it comes to redesigning critical infrastructure,” he adds.

Consider the larger picture

Whether they originate with your systems or are the result of connected networks, outages can impact a wide range of businesses at once.
As such, IT leaders might want to consider thinking beyond their organization’s four walls, Budde says.

“A tailored disaster and resilience plan needs to include compliance with industry standards and regular review of IT systems and protocols to ensure robustness, particularly in response to potential network stress and security threats,” he says, adding that such efforts might need to go further than just your organization, depending on your industry.

“We may need some out-of-the-box thinking and start looking at nationwide solutions and industry-wide solutions in how organizations can assist each other in these situations,” he says.

Overlook communications at your peril

Last, but by no means least, organizations need a comprehensive communications playbook for when outages or disruptions occur, regardless of whether those outages originate with them.

“It’s vital to have clear, concise communication about any outages or issues,” says Enex Test Lab’s Tett. This communication should go up the chain to the CEO as well as outward to customers and the media to provide as much clarity as possible about the situation.

“The first thing organizations need to think of is how to clearly communicate with their customers, even if it’s not them that’s causing a disruption. And the second is, if they can’t communicate with their customers because of network outages, have a strategy in place to be able to communicate via the media,” he says.

The playbook should also include some kind of time frame to help manage expectations around downtime and restoration of business as usual. “Whether it’s a few hours or 48 hours, be open and transparent,” says Tett.