Started in a dorm room four years ago, the social networking site Facebook now claims to be the fourth most-trafficked site in the world. Ninety million active users pound on 10,000 servers every day, uploading millions and millions of pieces of information in a given month. For example, "friends," who socialize in 21 languages, add 500 million photos per month.

At last count, Facebook stored 6.6 billion photos total, more than any other photo site. Roughly 400,000 developers and entrepreneurs have built 25,000 applications for the platform, and about 140 new applications are added per day. (For more on Facebook's application user interface appeal, see "Why Microsoft Should Bring a Facebook-like Look to SharePoint" and "Web 2.0: Companies Gain Competitive Edge with Social Networking Tools.")

Overall there are 25 terabytes of cached data available to help Facebook's 2,000 databases serve up user requests.

Yeah, the infrastructure fairly boils over with activity, and Jonathan Heiliger is the lucky VP of technical operations who gets to stir the pot. Heiliger, who has run technology for several start-ups and advised venture capital firm Sequoia Capital, also directed site engineering for Wal-Mart's website. He joined Facebook in October 2007 to oversee its technology set-up, which many of its 600+ employees tinker with continuously. Whew! It's a good thing Heiliger lists as an interest (on his LinkedIn profile!) "anything 24 x 7."

CIO Senior Editor Kim S. Nash recently interviewed Heiliger to talk about his work at Facebook.

CIO.com: You've done a lot of startups in the past.
What lessons from that experience do you bring to Facebook?

Jonathan Heiliger: The decisions you make early on tend to leave a lasting impression. It's difficult to change the way a startup is started. One of the challenges or opportunities that drew me here was going from a purely engineering-driven culture (writing software for users [for] sharing information) to now operating this truly large infrastructure. Those are two very different things. [Early on] you make tradeoffs in IT to speed development that can often lead to disaster when you have to operate five years later.

[Photo: Jonathan Heiliger, Facebook's VP of technical operations]

What was the first thing you wanted to accomplish when you got to Facebook?

Heiliger: I spent the first three months coming up to speed. It was the longest coming-up-to-speed process I'd been through because most of my prior experience was at much younger companies. When I joined Facebook, there were 300 employees. [In the past] typically, I was among the first 10 employees. I knew where the bodies were buried, what cultural challenges there were. At Facebook, I had to figure it out.

What did you figure out?

Jonathan Heiliger: There's not a lot of formal process and structure in place. Here, [the culture dictates that] you can't dip a toe. You have to dive in headfirst and wrestle crocodiles. My first mission here was to build credibility and explain what technology operations does. Until that point, it was ambiguous what engineering, IT and operations each did.

How did you draw the lines between them?

Jonathan Heiliger: We're constantly looking at the lines. It's not static at Facebook. Most IT organizations love to control change. I threw that out when I came. We're not going to try to control change, but speed it up. We trained people to be pushers.
We do a major release once a week and minor releases every few days. Recently, we made some underlying changes: to the photo gallery layout, for example, or starting to use more Ajax calls on the site so that page refreshes are more seamless.

We created a 24x7 team in operations to be the stewards of site reliability. Instead of calling them a NOC or help desk, we called them the "site reliability group." If someone wants to do something dumb or push bad code, the team can revert it. There are 20 people, split between Palo Alto, Calif., and London, to follow the sun. No one has to work the graveyard shift.

One thing we try to balance is, since Facebook is first and foremost a technology company, we don't want to stifle change and innovation. We'd rather innovate and have a little mess to clean up than run something reliable but stale. That's tough to do at a bank, I imagine. The IT organization at a large bank wouldn't have that flexibility: there's other people's money at stake, regulatory oversight.

What's an example?

Jonathan Heiliger: Over the last two years, we have had a concerted effort to improve the push tool so site updates are seamless to users. Every couple of weeks, someone checks in some bad code, or there's a bad database call, or we fail to do a full design review, and we push it into production and see user impact. The site might start running slower or a geography of users will have issues. Site reliability isolates the problem, then reverts the component or reverts the whole thing back to the previous known good state.

At Wal-Mart, we had the belief that we only roll forward, never back. Once you make schema changes in the database, it's difficult to pull back. If you pushed buggy code into production, you had to fix it in production, with user impact covered in the press.

Here, it's the opposite approach.
We know there are going to be broken things that happen fairly regularly. We are ready. We have emotional shields for them.

You changed the basic Facebook interface a few weeks ago. What sort of things happened with that rollout?

Jonathan Heiliger: That's a massive change. Similar to how we rolled out Chat, we turned the new interface on gradually, some percentage of users at a time.

How did you roll out the chat feature?

Jonathan Heiliger: We had the technology running for about a month [detecting who was online] before we had the user interface visible. We turned it off several times, found a bunch of bugs that way. You can't discover that in a QA environment. You need millions of people pounding on it every day. But the actual rollout is gradual.

Should enterprises take a gradual approach to big software rollouts?

Jonathan Heiliger: That control, that knob, gives operations and development organizations a lot of confidence. You can turn up the heat and if there are issues, only a certain percentage of employees have been affected. It's a mentality shift. In some large enterprise apps, you can't necessarily control technology changes to a subset of users. They all have to be using the same iteration at the same time. But in other instances, you can.

What's going on at Facebook to keep the company and culture flexible?

Jonathan Heiliger: The product we've built encourages people to be open and share information. A lot of decisions here (design reviews, PR strategy, what servers to buy) are often open for informal debate and input from across the employee base.

We built tools on top of the Facebook platform, including one called Ideas. Any employee can create an idea by category: social, office, product. There's a discussion tool with a star rating. One star is a really bad idea. Five stars is "I'm gonna quit if we don't do this."
Ideas are anything from "I think we should have a chat feature on the site" to "Can we replace sodas with juice in the fridge?" We encourage public comment.

We also live-blog. There's a person who transcribes any large company meeting, monthly presentations from different departments and weekly Q&As with the management team.

There's a combination of management's willingness and desire to continue to push toward openness and creativity.

Why did you build these tools rather than buy them? There's no shortage of blog and chat tools out there.

Jonathan Heiliger: A couple of good reasons and some not-so-good reasons. We're a technology company and we like to write software. But really, these tools are integrated with the Facebook interface, so it makes it that much easier for employees to use. One thing I've seen at a lot of other companies is they have a pea-soup style: lots of tools and Web forms and e-mail in-boxes. It's difficult for an employee to know, if they have an HR question, do I e-mail or walk over? Do I have to fill out a comment form? For us, it's all Facebook. Employees use Facebook every day. They don't have to launch a browser window to go to different URLs to communicate.

What does your infrastructure look like?

Jonathan Heiliger: Our entire Web site is run on free software. That varies from a large MySQL site (we're second or third to Yahoo, which is No. 1). And we are also a PHP site. We have half a dozen open source projects. Another is the Memcached project. We're a significant contributor to the project.

What have you contributed lately?

Jonathan Heiliger: One example is Thrift, which is a language-independent network library that allows different software and systems to communicate without developers having to do rewrites of network application layers. That's gotten a respectable following among Web companies.
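
For readers who want to picture the rollout "knob" Heiliger describes earlier in the interview (turning a feature on for some percentage of users at a time), here is a minimal Python sketch. It is not Facebook's implementation, which the interview does not describe. The function names (`rollout_bucket`, `is_enabled`), the feature name and user IDs, and the hash-based bucketing scheme are all assumptions chosen for illustration.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for one feature.

    Hypothetical helper: hashing the feature name together with the user ID
    means the same user lands in different buckets for different features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the feature is on for this user at the given percentage.

    Raising `percent` only ever adds users, and setting it to 0 turns the
    feature off for everyone, which is the "knob" behavior.
    """
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    # Illustrative user IDs only; turn the knob from 0 to 100 percent.
    users = [f"user{i}" for i in range(10_000)]
    for percent in (0, 5, 25, 100):
        enabled = sum(is_enabled(u, "chat", percent) for u in users)
        print(f"{percent:>3}% knob -> {enabled} of {len(users)} users enabled")
```

Because a user's bucket is derived from a hash rather than stored, the same person stays in or out of the rollout from one request to the next, and operations can widen or shut off exposure by changing a single number.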