Halamka on Beth Israel's Health-Care IT Disaster
Sitting in his office three weeks after the crash, Halamka appears relaxed and self-possessed. There’s another reason he’s opening up, talking now about the worst few days of his professional life at CareGroup. "It’s therapeutic for me," he says, and then he begins reliving the disaster.
Wednesday: The Network Flaps
On Nov. 13, 2002, a foggy, rainy Wednesday, Halamka was alone in his office at Beth Israel when he noticed the network acting sluggishly. It was taking five or 10 seconds to send and receive e-mail. Around 1:45 p.m., he strolled over to the network team to find out what was up.
A few of his 250 IT staff members, who range from low-level administrators to senior application developers, had already noted the problem. They told him not to worry. There was a CPU spike, a sudden surge in traffic. RCA, one of the core network switches, was getting pummeled. From where, they didn’t know. It might have to do with a consultant who was working on RCA, preparing it for a network remediation project.
"We happened to have had a guy in there," recalls Russell Rusch of Callisma, the company leading the remediation project. "We knew [the hospital] had had similar incidents in the past few months." Those previous CPU spikes lasted anywhere from 15 minutes to two hours, he says. Then they worked themselves out. Like indigestion.
Halamka’s team decided to begin shutting down virtual LANs, or VLANs. They would turn off switches to isolate the source of the problem, much in the same way one would go around a house shutting off lights to find out which one was buzzing. Halamka thought the plan sounded reasonable.
It was a mistake.
Shutting switches forced other switches to recalculate their traffic patterns. These calculations were so complex that those switches gave up doing everything else.
Traffic stopped. The network was down.
Within 15 minutes, by 2 p.m., the team reversed course and turned all the switches back on. A sluggish network, they figured, was preferable to a dead one.
For the rest of the day and into the night, the network flapped, a term Halamka uses to describe the network’s state of lethargy dotted by moments of availability and, more often, spurts of dead nothing. The team searched for the cause. Around 6 o’clock, when most of the doctors, nurses, staff and students left, the network settled down. Finally, at 9 p.m., the IT staff found its gremlin: a spanning tree protocol loop.
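
For readers who want a feel for why turning off switches forced the others to recalculate, here is a minimal sketch in Python. It is not CareGroup's topology and not the real spanning tree protocol, which elects a root and weighs path costs by exchanging messages between switches. It assumes a made-up four-switch network (the names S1, S2 and S3 are hypothetical; only RCA comes from the account above) and uses a plain breadth-first search to pick which redundant links stay active and which are blocked to prevent loops.

```python
from collections import deque

# Hypothetical topology: RCA is the root; S1, S2, S3 are invented names.
LINKS = [("RCA", "S1"), ("RCA", "S2"), ("S1", "S2"), ("S1", "S3"), ("S2", "S3")]


def spanning_tree(links, root):
    """Return the links a breadth-first tree from `root` keeps active."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    active = set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sorted so the result is deterministic for this illustration.
        for peer in sorted(neighbors.get(node, ())):
            if peer not in seen:
                seen.add(peer)
                active.add(frozenset((node, peer)))
                queue.append(peer)
    return active


def blocked_links(links, root):
    """Redundant links that must stay blocked, or traffic would loop."""
    return {frozenset(link) for link in links} - spanning_tree(links, root)


def without_switch(links, switch):
    """The topology that remains after one switch is turned off."""
    return [link for link in links if switch not in link]


if __name__ == "__main__":
    print("blocked before:", sorted(sorted(l) for l in blocked_links(LINKS, "RCA")))
    remaining = without_switch(LINKS, "S1")
    print("active after S1 is shut off:",
          sorted(sorted(l) for l in spanning_tree(remaining, "RCA")))
```

In this toy network, the S2-to-S3 link starts out blocked. Shut off S1 and that same link has to be brought into service, which means every remaining switch must work out a new loop-free layout. On a real network that work is repeated each time the topology changes.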





