Halamka on Beth Israel's Health-Care IT Disaster
The easiest thing to do would be to cut the links, eliminating the potential for spanning tree loops. But that would isolate the outlying hospitals. Instead, the IS team, along with Callisma engineers, chose a more complex option. They would try converting from switching to routing between the core network and the outlying hospitals. That would eliminate spanning tree issues while keeping those hospitals connected.
They tried for seven hours, and, for arcane reasons that have to do with VLAN Trunking Protocol (VTP), they never got the routing to work. The network flapped all day.
Around midmorning, as Halamka was explaining the routing strategy to CareGroup executives in an ad hoc meeting, a patient, an alcoholic in her 50s, was admitted to Beth Israel Deaconess’s ICU. Dr. Daniel Sands, a primary care physician and director of the hospital’s clinical computing staff, saw her. She had what Sands calls "astounding electrolyte deficiencies," a problem common to people who drink their meals. In fact, Sands says, "It was incredible she was alive.
"I needed to be careful with this woman. I needed to try treatments based on lab reports and then monitor progress and adjust as I went," recalls Sands. "But all of a sudden, we couldn’t operate like that. Usually I get labs back in less than an hour; they were taking five hours, and here I have a patient who could die. I was scared." (The patient would survive.)
At 4 p.m., Halamka met with a minicrisis team that included the head of nursing, the heads of the lab and the pharmacy, and hospital COO Dr. Michael Epstein. "Even then," Halamka says, "I’m still saying, ’We’re one configuration change away,’ and my assumption is things will be back up soon."
But his team was tense and frustrated. CareGroup’s help desk had been flooded with calls. They were hearing everything from "I can’t check my e-mail" to "I don’t know if the blood work I just requested went through."
At 3:50 p.m., Beth Israel closed its emergency room. It stayed closed for four hours, until 7:50 p.m., according to Massachusetts Department of Public Health documents.
It was at the 4 p.m. meeting that COO Epstein says he realized "this was more than a garden-variety down-and-up network." Clinical users, like Sands, were signaling that they were worried. Epstein and Halamka, along with hospital executives and network consultants, decided to take extreme measures. They called Cisco Systems, the hospital’s San Jose, Calif.-based equipment and support vendor. Cisco responded by triggering its Customer Assurance Program (CAP), a bland name that belies how rare and how serious CAPs are. CAP means Cisco commits any amount of money and every resource available until a crisis is resolved.





