If you follow either baseball or IT infrastructure, I’m sure you’re aware of the crash of the Colorado Rockies’ web server, shortly after they started selling tickets for the World Series games. The site is back up, although the company isn’t admitting to the actual source of the problem other than to say that they were “the victim of an external, malicious attack that shut down the system.”
Journalists are known for shoving their microphones (virtual and otherwise) into the middle of a disaster and asking the victims, “How do you feel?” I haven’t called anyone at the Rockies or their partner Paciolan (the service that actually runs the e-commerce site). Aside from the fact that they’re in the middle of a technology and PR firestorm, with the company located in Irvine, California, I suspect they have another kind of fire to put out right now. (How kind of me, huh?)
Instead, I’ve found myself contemplating the IT and business issues. By its nature, ticket sales for the World Series must be the worst kind of load testing nightmare imaginable: a sputter of traffic followed by several million people who want to witness whether Matt Holliday will actually be able to touch home plate this time. But if the ticket sales site was the victim of a malicious attack (and my cynical side whispers, “Sure sounds better to say that than to admit they screwed up, doesn’t it?”) — could they have done anything to mitigate the problem?
You probably don’t have to cope with millions of angry customers… but if your server went down at a critical time, the result might be equally devastating.
What—if anything—could the IT folks at Paciolan have done to prevent this from happening? Was this a failure of load testing? Did they choose the wrong architecture? What lessons can you learn from their misfortune?
The architecture, first: The company doesn’t say much about the technology it’s using, but thanks to a smart correspondent on the Software QA Forums, it’s easy enough to find out from their job search listings: currently Pick/Universe but moving to J2EE, with some uncertainty about the database to use.
I doubt, though, that the problem had anything to do with their platform. According to performance testing consultant Roland Stens, the site probably suffered a Distributed Denial of Service (DDoS) attack, or automated scripts hitting the site trying to snatch tickets. And, says Stens, they tried to minimize this by blocking IP addresses that made repeated requests for the same information:
“Alves explained that those who saw a ‘page cannot be displayed’ message had ‘IP addresses that we blocked due to suspicious/malicious activity to our website during the last 24 to 48 hours. As an example, if several inquiries came from a single IP address they were blocked.’”
Said Stens, “This, effectively, also blocked traffic from sites with multiple users that use only one external IP address”—which describes most companies. Stens also pointed out that Paciolan’s online ticketing brochure promotes its solution with the line “Power up your ticket sales, with the industry’s most robust online ticketing solution.” That claim, he remarked, has been obviously and painfully contradicted.
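Stens’s objection is easy to demonstrate. Here’s a minimal sketch (not Paciolan’s actual code — the class, threshold, and addresses are invented for illustration) of naive per-IP blocking, showing how it punishes a whole office full of legitimate users sitting behind one NAT address:

```python
from collections import defaultdict

# Hypothetical version of the blocking Alves describes: refuse any IP
# address that makes more than a fixed number of requests.
class NaiveIPBlocker:
    def __init__(self, max_requests=5):
        self.max_requests = max_requests
        self.counts = defaultdict(int)

    def allow(self, ip):
        """Return True if this request should be served."""
        self.counts[ip] += 1
        return self.counts[ip] <= self.max_requests

blocker = NaiveIPBlocker(max_requests=5)

# Six different employees behind one corporate NAT all appear to the
# server as the single external address 203.0.113.7.
results = [blocker.allow("203.0.113.7") for _ in range(6)]
# The sixth perfectly legitimate customer is turned away.
```

The counter can’t distinguish one robot making six requests from six humans making one each — which is exactly why whole companies found themselves staring at “page cannot be displayed.”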
DDoS attacks are not rare. Brian Karas, who has built several high-availability server farms, offers several specific suggestions. “This can be a particular weak point in many modern web servers because the content is often dynamically generated or pulled from a database server (as opposed to being a ‘flat’ file of static content),” Karas said.
While DDOS attacks are hard to prevent entirely, there are a few steps you can take to minimize their impact. Working from the ground up, Karas suggests:
- Make sure your router and/or network has up-to-date firmware with some form of DDoS protection. Most of the mid- to high-end gear from companies like Cisco has at least some rudimentary DDoS detection and mitigation capability built in.
- Use a CDN, or Content Distribution Network, to distribute the load on your website. This is basically a fancy way of saying spread the load out to as many servers as reasonable. A typical webpage has over a dozen “objects” that have to get downloaded. The MLB page has a lot of data; serving images from a dedicated server, scripts and style sheets from another, and the HTML text from a third will help spread the traffic load across more processors. This can also be expanded to house servers at multiple datacenters, so that your incoming bandwidth is not a single choke point either.
- If you’re anticipating a high traffic load (like many people suddenly trying to buy tickets), set up a dedicated subdomain on a new server (many data centers will lease you extra capacity for a short term) to feed those requests. Something like tickets.mlb.com — and make it easy for people to find that link on the front page of your website. Speaking of the front page, you can flatten it out temporarily and move the multimedia and graphical content further “inside” your site, so that people coming to the main page just to buy tickets don’t force your server to build and send content those visitors aren’t really interested in.
- Use technology available on the two most common web servers (Apache and IIS) to add some functionality to detect when automated or high-volume requests are being made, and react accordingly. In many cases, it is easy to tell these automated robotic requests from a “real” user, and feed them a basic “Go Away” webpage. You can also detect repeated requests from the same IP address (as in the case of a machine launching robotic attacks), and requests from IP subnets that are not likely to be valid (for example, an IP subnet from Siberia is probably not a relevant visitor). You don’t have to ignore them, but you can send them a lighter-weight page.
- Have a test environment that can be easily converted to help handle production traffic spikes. It’s a Good Idea to test new content before releasing it on the main site. Your test network should be a close mirror of your production network anyway, so set those servers up in a way that they can be called into action to share the load, if required.
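A smarter throttle than a raw per-IP counter is worth sketching. The version below (my illustration, not Karas’s code; the class name and limits are invented) uses a sliding time window, so a burst of robotic requests gets the “Go Away” page while the same address is welcome again once the burst passes:

```python
from collections import defaultdict, deque

# Sketch of per-IP throttling in the spirit of Karas's suggestion: allow
# at most max_requests per IP within a sliding window of window_seconds.
class SlidingWindowLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        q = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # candidate for the lightweight "Go Away" page
        q.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)

# Four requests inside one second: the fourth is refused.
burst = [limiter.allow("198.51.100.9", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
# A second later the window has expired and the address is served again.
later = limiter.allow("198.51.100.9", now=2.0)
```

Because the block is temporary and scoped to a time window, a shared corporate NAT address recovers on its own instead of staying blacklisted for 24 to 48 hours.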
Stens has a few lessons to add:
- You need to make sure that problems with one client do not affect ALL your other clients. (He pointed out that, based on at least one article, all of Paciolan’s clients were affected by the outage. Bad idea.)
- You need to test the extreme scenarios, understand what the bottlenecks are and understand how your system will fail under stress.
- Your system needs to have the capacity for peaks, not just for average load.
- You need to be able to temporarily scale up your capacity when an event like the Rockies ticket sales is planned.
“Sure, this is expensive and will take time to get right,” admitted Stens. “But in the end what will be a larger cost: A loss of confidence and reputation and impact to all your clients or building a solution and strategies based on facts obtained by load testing?”
I couldn’t agree more.