by Kim S. Nash

Facebook Technology Infrastructure Needs Constant Care

Aug 28, 2008 | 8 mins
Data Center, Developer, IT Leadership

Jonathan Heiliger, the top technology exec at the huge social networking site, talks about his efforts to build a technology operations team at Facebook that can handle both millions of users worldwide and a restless, creative culture inside the company.

Started in a dorm room four years ago, the social networking site Facebook now claims to be the fourth most-trafficked site in the world. Ninety million active users pound on 10,000 servers every day, uploading millions and millions of pieces of information in a given month. For example, “friends,” who socialize in 21 languages, add 500 million photos per month.

At last count, Facebook stored 6.6 billion photos total, more than any other photo site. Roughly 400,000 developers and entrepreneurs have built 25,000 applications for the platform, and about 140 new applications are added per day. (For more on the appeal of Facebook’s application user interface, see “Why Microsoft Should Bring a Facebook-like Look to SharePoint” and “Web 2.0: Companies Gain Competitive Edge with Social Networking Tools.”)

Overall there are 25 terabytes of cached data available to help Facebook’s 2,000 databases serve up user requests.

Yeah, the infrastructure fairly boils over with activity and Jonathan Heiliger is the lucky VP of technical operations who gets to stir the pot. Heiliger, who has run technology for several start-ups and advised venture capital firm Sequoia Capital, also directed site engineering for Wal-Mart’s website. He joined Facebook in October 2007 to oversee its technology set-up, which many of its 600+ employees tinker with continuously. Whew! It’s a good thing Heiliger lists as an interest (on his LinkedIn profile!) “anything 24 x 7.”

CIO Senior Editor Kim S. Nash recently interviewed Heiliger to talk about his work at Facebook.

You’ve done a lot of startups in the past. What lessons from that experience do you bring to Facebook?

Jonathan Heiliger: The decisions you make early on tend to leave a lasting impression. It’s difficult to change the way a startup is started. One of the challenges or opportunities that drew me here was going from a purely engineering-driven culture (writing software for users [for] sharing information) to now operating this truly large infrastructure. Those are two very different things. [Early on] you make tradeoffs in IT to speed development that can often lead to disaster when you have to operate the result five years later.

Jonathan Heiliger, Facebook’s VP of technical operations

What was the first thing you wanted to accomplish when you got to Facebook?

Heiliger: I spent the first three months coming up to speed. It was the longest coming-up-to-speed process I’d been through because most of my prior experience was at much younger companies. When I joined Facebook, there were 300 employees. [In the past] typically, I was among the first 10 employees. I knew where the bodies were buried, what cultural challenges there were. At Facebook, I had to figure it out.

What did you figure out?

Jonathan Heiliger: There’s not a lot of formal process and structure in place. Here, [the culture dictates that] you can’t dip a toe. You have to dive in headfirst and wrestle crocodiles. My first mission here was to build credibility and explain what technology operations does. Until that point, it was ambiguous what engineering, IT and operations each did.

How did you draw the lines between them?

Jonathan Heiliger: We’re constantly looking at the lines. It’s not static at Facebook. Most IT organizations love to control change. I threw that out when I came. We’re not going to try to control change, but speed it up. We trained people to be pushers. We do a major release once a week and minor releases every few days. Recently, we made some underlying changes, to the photo gallery layout, for example, or by starting to use more Ajax calls on the site so that page refreshes are more seamless.

We created a 24×7 team in operations to be the stewards of site reliability. Instead of calling them a NOC or help desk, we called them the “site reliability group.” If someone wants to do something dumb or push bad code, the team can revert it. There are 20 people, split between Palo Alto, Calif., and London, to follow the sun. No one has to work graveyard shift.

One thing we try to balance is, since Facebook is first and foremost a technology company, we don’t want to stifle change and innovation. We’d rather innovate and have a little mess to clean up than run something reliable but stale. That’s tough to do at a bank, I imagine. The IT organization at a large bank wouldn’t have that flexibility—there’s other people’s money at stake, regulatory oversight.

What’s an example?

Jonathan Heiliger: Over the last two years, we have made a concerted effort to improve the push tool so site updates are seamless to users. Every couple of weeks, someone checks in some bad code, or there’s a bad database call, or we fail to do a full design review, push it into production and see user impact. The site might start running slower, or users in one geography will have issues. Site reliability isolates the problem, then reverts the component or reverts the whole thing back to the previous known good state.

At Wal-Mart, we had the belief that we only roll forward, never back. Once you make schema changes in the database, it’s difficult to pull back. If you pushed buggy code into production, you had to fix it in production, with the user impact covered in the press.

Here, it’s the opposite approach. We know there’s going to be broken things that happen fairly regularly. We are ready. We have emotional shields for them.
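
The revert step Heiliger describes can be illustrated with a small sketch. This is not Facebook's push tool, whose internals the interview does not describe; the `Deployment` class, its method names, and the component names are all hypothetical. It only shows the idea of keeping a "known good" snapshot and restoring either one component or the whole site from it.

```python
class Deployment:
    """Hypothetical model: live component versions plus the last known good state."""

    def __init__(self, initial: dict[str, str]) -> None:
        self.live = dict(initial)        # versions currently serving users
        self.known_good = dict(initial)  # snapshot to fall back to

    def push(self, changes: dict[str, str]) -> None:
        """Deploy new versions; the known good snapshot is left untouched."""
        self.live.update(changes)

    def mark_good(self) -> None:
        """Record the current live state as the new known good snapshot."""
        self.known_good = dict(self.live)

    def revert_component(self, name: str) -> None:
        """Roll a single component back to its known good version."""
        if name in self.known_good:
            self.live[name] = self.known_good[name]
        else:
            # The component did not exist in the snapshot, so remove it.
            self.live.pop(name, None)

    def revert_all(self) -> None:
        """Roll the whole site back to the known good snapshot."""
        self.live = dict(self.known_good)


site = Deployment({"photos": "v1", "chat": "v1"})
site.push({"photos": "v2", "chat": "v2"})
site.revert_component("photos")  # only photos goes back to v1
```

The sketch deliberately ignores database schema changes, which, as Heiliger notes about his Wal-Mart experience, are much harder to pull back than code.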

You changed the basic Facebook interface a few weeks ago. What sort of things happened with that rollout?

Jonathan Heiliger: That’s a massive change. Similar to how we rolled out Chat, we turned the new interface on gradually, some percentage of users at a time.

How did you roll out the chat feature?

Jonathan Heiliger: We had the technology running for about a month [detecting who was online] before we had the user interface visible. We turned it off several times, found a bunch of bugs that way. You can’t discover that in a QA environment. You need millions of people pounding on it every day. But the actual rollout is gradual.
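
Running the back end before the interface is visible is often handled with two independent switches. The following is a minimal sketch under that assumption, not a description of how Facebook built Chat; `FeatureState`, `handle_request`, and the response fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FeatureState:
    """Two independent switches for one feature (hypothetical)."""
    backend_enabled: bool = False  # server-side work runs under real load
    ui_visible: bool = False       # users can actually see the feature


def handle_request(state: FeatureState, user_id: int, online_users: set[int]) -> dict:
    """Do the hidden presence tracking, and report whether to draw the chat UI."""
    if not state.backend_enabled:
        # Kill switch: skip the new code path entirely.
        return {"presence_tracked": False, "show_chat": False}
    online_users.add(user_id)  # back-end work happens even while the UI is hidden
    return {"presence_tracked": True, "show_chat": state.ui_visible}


online: set[int] = set()
state = FeatureState(backend_enabled=True)   # back end on, UI still hidden
response = handle_request(state, 7, online)
```

Turning `backend_enabled` off corresponds to "we turned it off several times," while `ui_visible` stays false until the back end has proven itself under real traffic.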

Are gradual rollouts an approach enterprises should take with big software deployments?

Jonathan Heiliger: That control, that knob, gives operations and development organizations a lot of confidence. You can turn up the heat and if there are issues, only a certain percentage of employees have been affected. It’s a mentality shift. In some large enterprise apps, you can’t necessarily control technology changes to a subset of users. They all have to be using the same iteration at the same time. But in other instances, you can.
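
One common way to build such a knob is to hash each user into a stable bucket from 0 to 99 and enable the feature for buckets below the current percentage. The interview does not say how Facebook implements it, so the sketch below is an assumption; the function names and the feature name `new_profile` are hypothetical.

```python
import hashlib


def bucket(user_id: int, feature: str) -> int:
    """Map a user to a stable bucket in the range 0-99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: int, feature: str, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id, feature) < percent


# Turning the knob from 5 to 25 keeps everyone who already had the feature
# and adds more users; turning it to 0 switches the feature off for all.
early_users = [u for u in range(1000) if is_enabled(u, "new_profile", 5)]
wider_users = [u for u in range(1000) if is_enabled(u, "new_profile", 25)]
```

Because the bucket depends only on the user and the feature name, a user keeps the same experience between requests, and raising the percentage never takes the feature away from someone who already has it.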

What’s going on at Facebook to keep the company and culture flexible?

Jonathan Heiliger: The product we’ve built encourages people to be open and share information. A lot of decisions here—design reviews, PR strategy, what servers to buy—are often open for informal debate and input from across the employee base.

We built tools on top of the Facebook platform, including one called Ideas. Any employee can create an idea by category—social, office, product. There’s a discussion tool with a star rating. One star is a really bad idea. Five stars is “I’m gonna quit if we don’t do this.” Ideas are anything from, “I think we should have a chat feature on the site” to “Can we replace sodas with juice in the fridge?” We encourage public comment.

We also live-blog. There’s a person who transcribes any large company meeting, monthly presentations from different departments and weekly Q&As with the management team.

It comes down to management’s willingness and desire to keep pushing toward openness and creativity.

Why did you build these tools rather than buy them? There’s no shortage of blog and chat tools out there.

Jonathan Heiliger: A couple of good reasons and some not-so-good reasons. We’re a technology company and we like to write software. But really, these tools are integrated with the Facebook interface, so it makes it that much easier for employees to use. One thing I’ve seen at a lot of other companies is they have a pea soup style: lots of tools and Web forms and e-mail in-boxes. It’s difficult for an employee with an HR question to know: Do I e-mail or walk over? Do I have to fill out a comment form? For us, it’s all Facebook. Employees use Facebook every day. They don’t have to launch a browser window to go to different URLs to communicate.

What does your infrastructure look like?

Jonathan Heiliger: Our entire Web site is run on free software. We are a large MySQL site, second or third behind Yahoo, which is No. 1. We are also a PHP site. We have half a dozen open source projects. Another piece is Memcached, and we’re a significant contributor to that project.

What have you contributed lately?

Jonathan Heiliger: One example is Thrift, which is a language-independent network library that allows different software and systems to communicate without developers having to do rewrites of network application layers. That’s gotten a respectable following among Web companies.