Even More Tales of Technology Terror: Personal Stories of Tech Disaster
From earthquakes to worldwide email disruption to business processes that won't stay dead, we round up personal tales of IT terror.
The Day the E-Mail Stood Still, and the Man Who Caught the Blame…
Back in the mid-90s, Brad Knowles was senior Internet mail administrator at America Online, at the time the largest online service provider in the world. But with great power comes great responsibility…
It’s known as Black Wednesday, August 10, 1996, the day all of AOL's routers went down, and no one could get any packets to our systems—they all just got thrown away. But computers could still contact our backup name servers at ANS (a subsidiary of AOL that ran all of our external WAN connections), so they knew who all of our mail servers were and how many IP addresses we had listed.
Now, it's important to know that the Internet RFCs require that you wait at least two minutes when you start to set up a standard TCP/IP connection before you finally declare the other end to be dead. The standard practice is also to attempt to connect to each of the IP addresses you know for a given name, usually in the sequence in which you received them. At the time, the standard practice for mail servers was that you contacted all listed mail servers for a given domain before you gave up.
Now, step back and do the math for seven names with seven IP addresses each, and two minutes per IP address:
7 x 7 x 2 = 49 x 2 = 98
So, just making one delivery attempt to a single user at AOL was taking 98 minutes to time out. Then another 98 minutes to time out for the next user or the next message for a user at AOL.
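For readers who want to check the arithmetic, here is a minimal Python sketch of the same calculation. The function and parameter names are made up for illustration. It assumes every connection attempt runs to its full timeout and that the sender tries every address of every listed mail server in turn, as described above.

```python
def worst_case_timeout_minutes(mail_server_names: int,
                               ips_per_name: int,
                               timeout_per_ip_min: int) -> int:
    """Minutes one delivery attempt takes to fail when every IP is unreachable.

    Assumes attempts are made one after another, never in parallel,
    and that each one waits out the full connection timeout.
    """
    return mail_server_names * ips_per_name * timeout_per_ip_min


# Figures from the story: seven names, seven IP addresses each, two minutes per IP.
minutes_per_attempt = worst_case_timeout_minutes(7, 7, 2)
print(minutes_per_attempt)
```

With the figures from the story, this gives the same 98 minutes per delivery attempt.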
At the time, most sites were running Sendmail. They were set to rerun their queue once an hour, and many sites would typically have just the one queue runner process. Each time you'd start up a queue runner, if you had even a single message queued up to a single person at AOL, that process would sit there and spin its wheels for at least 98 minutes trying to talk to the AOL mail servers before giving up—and it would block and not do anything else while it was spinning its wheels. But less than 60 minutes after that happened, another queue runner would get fired up—and would almost certainly hang on the same message going to AOL, or on another message going to AOL.
Do that often enough, and you get enough queue runners hung up to AOL that your queue is clogged and you're not getting mail through to anywhere else in the world. Do that long enough, and you've got so many queue runners hung up to AOL that you run out of RAM and swap space and your mail servers crash.
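The pile-up is easy to see in a rough Python model. This is a simplified sketch, not how Sendmail actually schedules its work. It assumes a new queue runner starts every 60 minutes, that each runner works through its AOL-bound messages one at a time at 98 minutes apiece, and that the outage lasts the whole time. The function name and parameters are hypothetical.

```python
def hung_runners(elapsed_min: int,
                 aol_messages: int,
                 start_interval_min: int = 60,
                 per_message_min: int = 98) -> int:
    """Count queue runners still stuck on AOL mail at a given moment.

    Simplified model: a runner starts every `start_interval_min` minutes
    and stays blocked for `aol_messages * per_message_min` minutes.
    """
    blocked_for = aol_messages * per_message_min
    count = 0
    start = 0
    while start <= elapsed_min:
        # A runner launched at `start` is still hung if its blocked window
        # has not yet ended.
        if elapsed_min < start + blocked_for:
            count += 1
        start += start_interval_min
    return count


# One queued AOL message: two runners overlap after the first hour.
print(hung_runners(elapsed_min=60, aol_messages=1))
# Ten queued AOL messages, ten hours in: every runner started so far is still stuck.
print(hung_runners(elapsed_min=600, aol_messages=10))
```

In this model, a single AOL message never ties up more than two runners at once. Ten queued AOL messages leave every runner started in the first ten hours still blocked, and each of those processes is holding memory the whole time.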
Well, that’s what happened, and I was personally blamed for taking out all Internet e-mail across the entire world. As a result, angry spammers publicly handed out my private telephone numbers and people were asked to complain directly to me. I was also told about at least one business that went bankrupt because it was waiting on a time-critical RFP to come in and it didn't get its bid into the system in time, so it lost the contract.
Once we finally did come back up, it literally took days for us to recover and to catch up to all the backlog that was created for us on the Internet—and it took the rest of the world a few more days beyond that to recover from the rest of their backlog.