IDG News Service — Twitter's persistent and disruptive service outages have entered a second week as the company scrambles to bring its site availability back to acceptable levels.
After multiple incidents brought Twitter.com and its platform for third-party applications down several times last week, the company said on Friday that it had identified the causes and had taken concrete steps to resolve the problem.
Specifically, Twitter blamed errors in planning, monitoring and configuring its internal network, and said that in response it had doubled the capacity of that network, sharpened its monitoring and improved its load balancing.
"By bringing the monitoring of our internal network in line with the rest of the systems at Twitter, we'll be able to grow our capacity well ahead of user growth. Furthermore, by doubling our internal network capacity and rebalancing load across the internal network, we're better prepared to serve today's tweets and beyond," wrote Jean-Paul Cozzatti from Twitter's engineering team on the company's official blog.
However, problems continued throughout the weekend and into Monday morning, as acknowledged on the official Twitter Status blog, with the site returning its notorious "fail whale" error message.
Not even at its halfway point yet, June is already the worst month in terms of downtime for Twitter since October of last year, according to Web performance monitoring company Pingdom. So far this month, Twitter has been down for 3 hours and 3 minutes.
Twitter, launched in March 2006, had frequent and lengthy outages in 2007 and the first half of 2008, but then steadily improved its site uptime by beefing up and revamping its systems. In 2009, it had very solid months but also bad ones, like August, when it was down for more than 6 hours, according to Pingdom.
"If you look at the type of outages out there, they seem to be largely related to relatively new or fast-growing services. Often fast changes are harder to manage, especially by small, new startup teams that have not yet built lots of operational discipline and maturity," said IDC analyst Al Hilwa.
Such services often face patterns of use they don't fully understand and workload peaks of unknown scope for which they don't have defined response plans, he said via e-mail. "This is also a by-product of new architectures often used as the back-end of the cloud services supporting these types of new social networks or Web sites," Hilwa said.
The predominant architectures for handling such scale require splitting the workload across many engines and having them collaborate as a distributed system. In the long run such architectures will mature, but for now operators are clearly challenged to provide the right level of robustness, he said.