by Matt Heusser

Continuous Deployment Done In Unique Fashion at Etsy.com

Feature
Mar 30, 2012 (8-minute read)

With customers and suppliers on seven continents, there is no good time for system maintenance -- and little room for error -- at Etsy.com. Yet by the time you finish reading this article, the company's continuous deployment process will have let a developer push code to production without anyone else's approval.

If you visit 55 Washington St. in Brooklyn and take the elevator to the fifth floor to the main offices of Etsy.com, don’t expect to see the typical reception station and front desk. Instead, you leave the elevator to see bare walls and this greeting:

Welcome to Etsy

The office is about 200 feet down the hall, on the right. Just follow the signs.

Don’t mind the dogs; the office is pet friendly.

Nothing about Etsy is typical.

Founded in 2005, Etsy is more than an online marketplace. It is a platform for independent, creative businesses, where people sell handmade and vintage goods and supplies. By the end of 2011, Etsy had grown to handle 800,000 active sellers and 13 million members, with more than 260 employees supporting a digital store that processes credit card transactions, PayPal and electronic funds transfers. And new programmers are expected to write code that deploys to production on their very first day.

Here is how they do it.

Infrastructure as Craft

When you walk into the offices, two words come to mind: fluid and personal. Noah Sussman, the company’s architect for quality assurance, greets me at the door with a huge “over here” grin and a wave, and immediately starts talking about the office environment, perhaps the most open I have ever seen.

The technical team has its desks strewn about in rough rectangles to create “hallways” to the walls. The team needs those hallways as it grows to fill an office that appears, at first, to be one huge room, with a few side offices used as conference rooms. (The conference rooms have nicknames, like “Kung Foo Fighters” and “Slim Jim Morrison.” My favorite was the spaceship room, “Pjörk,” which included art designed to create a “mustache mural.”)

Another word that comes to mind is playful.

Yes, that is a set of mustaches as art on the wall. It’s in the spaceship room.

But the Etsy staffers are also completely serious about their work. They share both traits, playfulness and seriousness, with their customer base: people who are trying to earn side money, if not pay the rent, by designing the handbags, walking sticks and handmade chocolates that have made Etsy famous in the artisan and sustainable business scene. As one anonymous forum poster once put it: “It’s hard to take Etsy seriously; they sell hand made walking sticks and tchotchkes.” To which Etsy CEO Chad Dickerson replied: “Last year, we sold 24 million of those tchotchkes through our site.”

In the middle of the creative chaos we saw the desk of CTO Kellan Elliott-McCrea, a former architect at Flickr. Sitting next to him was Michael Rembetsy, director of technical operations. Their desks are no larger than anyone else’s, and they have no name plates or titles, but I did notice one thing.

They were standing directly in front of a huge display, made up of six flat panels, that shows up-to-the-minute metrics on site performance.

Performance Monitoring in a Nutshell

Sussman starts explaining the statistics.

“The whole thing is pretty much open source end-to-end,” he said. “Operations have known about Nagios as a tool for years, but it’s only lately that the quality folks are starting to realize that it is a QA tool. Basically, our logger spits UDP packets out to a Node.js server, then we use Graphite to generate the graphs. You basically just encode what you need to know into a URL and show that URL on a screen; the software generates the graph. If you don’t want to put the server up yourself, I understand that Librato has the toolset available for rent as a service.”
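
For readers who want a concrete picture of the pipeline Sussman describes, here is a minimal Python sketch. It is not Etsy's code. It assumes the plain-text counter format that statsd accepts (`name:value|c`) and Graphite's `/render` URL endpoint; the function names, host, port and metric names are hypothetical.

```python
import socket
from urllib.parse import urlencode


def format_counter(name: str, value: int = 1) -> str:
    """Build a statsd-style counter line such as 'logins:1|c'."""
    return f"{name}:{value}|c"


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one counter over UDP (host and port are assumptions)."""
    payload = format_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def graph_url(base: str, target: str, since: str = "-1h") -> str:
    """Encode a graph request into a Graphite render URL."""
    query = urlencode({"target": target, "from": since, "format": "png"})
    return f"{base}/render?{query}"
```

Because UDP is connectionless, the application never waits on the metrics server, which is presumably why it suits logging from production code. Pointing a wall monitor at the string returned by `graph_url` is the "show that URL on a screen" step.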

He added, “We track everything: number of logins, number of login errors, dollars of transactions through the site, bugs reported on the forum — you name it. We batch these up and aggregate the numbers into 10-minute increments, then show the graphs. A vertical line here is a deploy to production.”
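
The batching step can be pictured with a small Python sketch. It only illustrates the idea of rolling raw events up into fixed windows and is not how statsd or Graphite is implemented; the function name and the use of Unix timestamps are assumptions.

```python
from collections import Counter

BUCKET_SECONDS = 600  # the 10-minute window mentioned in the quote


def bucket_counts(timestamps, bucket_seconds=BUCKET_SECONDS):
    """Count events per window, keyed by the window's start time in seconds."""
    counts = Counter()
    for ts in timestamps:
        # Round each timestamp down to the start of its window.
        counts[int(ts // bucket_seconds) * bucket_seconds] += 1
    return dict(sorted(counts.items()))
```

Each key in the result is one point on the x-axis of a graph, and each value is the height of the line at that point.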

Waving at a monitor, he continued, “We’ve averaged 26 deploys to production per day for the past month or so. Yeah, January is a pretty slow month. The vertical lines here are deploys. So basically, you watch the wall. If there’s a sudden spike or valley immediately after a deploy, we’ve got a problem.”
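
At Etsy the check is done by people watching the wall, but the rule of thumb Sussman gives can be written down. The Python sketch below is one hypothetical way to flag a sudden spike or valley after a deploy; the function name, the window size and the 50 percent threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def deploy_looks_suspicious(series, deploy_time, window=3, threshold=0.5):
    """Return True if the metric's mean shifts by more than `threshold`
    (as a fraction of the pre-deploy mean) right after a deploy.

    `series` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    before = [v for t, v in series if t < deploy_time][-window:]
    after = [v for t, v in series if t >= deploy_time][:window]
    if not before or not after:
        return False  # not enough data to judge
    baseline = mean(before)
    if baseline == 0:
        return mean(after) != 0
    return abs(mean(after) - baseline) / abs(baseline) > threshold
```

A check like this is only as good as its threshold, which is one reason a human glance at the graph remains useful.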

It’s Rembetsy’s turn to speak up. “Mistakes happen,” he said. “We find them, fix them, and move on. The important thing is to learn something from the process, and never make the mistake again in the future.”

Sussman was quick to add: “If pushing is easy enough, then pushing a fix will be too.” He noted that Etsy has made its implementation, called statsd, freely available and open source.

Shifting the Risk Profile

The quality model for Etsy is cutting edge, but not unique. New developers are expected to push code to production on day one. Not just commit code: push it to production.

By this time, other folks had started to notice the strange gentlemen in the office, asking questions and pointing at things. Sussman waved over two members of his team, Michele D’Netto and LB Denker. Denker explained the testing-in-production strategy, which they built on top of an A/B split framework.

The Etsy approach boils down to this: When developers want to push a feature, they use an employee mailing list to request feedback. Then the programmer pushes the code. It’s now on a production server, but through the magic of configuration flags and the A/B split framework, the only people who will execute that code are logged-in employees.

After reasonable testing and feedback, the programmer can promote the code to a larger group, or to a random portion of users, which eventually inches up to 100 percent.
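
A staged rollout of this kind can be sketched in a few lines of Python. This is an illustration of the general technique, not Etsy's framework: the `FeatureFlag` class, its fields and the hashing scheme are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    employees_only: bool = True   # stage 1: only staff see the feature
    percent: int = 0              # stage 2: share of all users, 0-100


def bucket_for(flag_name: str, user_id: int) -> int:
    """Map a user to a stable bucket in 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: FeatureFlag, user_id: int, is_employee: bool) -> bool:
    if is_employee:
        return True  # logged-in employees always get the new code path
    if flag.employees_only:
        return False
    return bucket_for(flag.name, user_id) < flag.percent
```

Hashing the user ID, rather than rolling dice on each request, keeps every user in the same bucket, so anyone who saw the feature at 10 percent still sees it at 50 percent.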

“It’s not just the GUI,” Rembetsy said. “We often add in features that are ‘dark,’ or behind the scenes. Right now, for example, we are adding a second logger. We can add storage, logs, even swap databases, and actually go ahead and duplicate reads and writes to the second DB, measuring performance. Forget about performance worries: by the time we cut over, we’ve seen the thing work in production for a month.”
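
The duplicated reads and writes Rembetsy mentions can be pictured as a thin wrapper around two data stores. The Python sketch below is a hypothetical illustration that uses plain dictionaries as stand-ins for databases; the class name and its counters are assumptions, not anything Etsy described.

```python
import time


class ShadowedStore:
    """Serve from `primary`; mirror every call to `shadow` and record timings."""

    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow
        self.timings = {"primary": [], "shadow": []}
        self.mismatches = 0
        self.shadow_errors = 0

    def _timed(self, label, func, *args):
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.timings[label].append(time.perf_counter() - start)

    def write(self, key, value):
        self._timed("primary", self.primary.__setitem__, key, value)
        try:
            self._timed("shadow", self.shadow.__setitem__, key, value)
        except Exception:
            self.shadow_errors += 1  # the shadow must never break production

    def read(self, key):
        result = self._timed("primary", self.primary.get, key)
        try:
            if self._timed("shadow", self.shadow.get, key) != result:
                self.mismatches += 1
        except Exception:
            self.shadow_errors += 1
        return result
```

Users only ever see answers from the primary store. The shadow store's timings, mismatches and errors are the evidence a team would review before cutting over.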

I asked Sussman, if they have this great framework, why the graphs? I mean, shouldn’t they know things will work? “Well, sure, in theory,” he said. “And even, most of the time, in practice, too. But humans make mistakes. It is always possible to think you put a feature in dark, miss a config flag, and blam, it goes out to the world. So you watch the graphs, you figure out the problem, you fix it, and you drive on.

“Those people pushing to production on day one aren’t pushing a new credit-card feature; they are pushing a standard, pre-defined change to the ‘about’ page to add their picture. Instead of fearing change, we get people used to it. The risks change,” he said, gesturing at the monitors, “but we take steps to address the risks. It’s a different way of developing software.”

Regulations, Security and Privacy

As we continued our tour, we dropped by the security and privacy team, whose area is identical to every other small group of cubicles, except that the lights are noticeably dimmer. As we talked, one man, sitting in his chair, looked at us with authority and a little suspicion. It was Nick Galbreath, a director of engineering at Etsy, whose practice includes security and privacy.

Suddenly my colleague, Peter Walen, got very excited. Pete was just visiting for the trip, but his background is in financial systems, including PCI compliance, the international standard for credit cards, and he said that most of this would never fly under most interpretations of PCI. How was this possible?

“Yes, and you’ve got a point,” Galbreath said. “Our financial systems are under a lock-box, and handled in an entirely separate way than the rest of the software. Of course, once you lift that out, you also lift out the regulatory burden, and we get to do a lot more creative things. In fact, the real challenging part of the job is preventing and eliminating fraud. We design our own systems and security protocols for that.”

Finding Staff

The major development language for Etsy is PHP. I asked Denker, the continuous deploy framework lead, if it’s cheaper and easier to hire PHP developers. She was a little incredulous: “Oh, no, we don’t necessarily hire PHP developers. PHP is so easy to learn that we can hire any strong programmer.”

I was about to ask what they do if the programmers get stuck when Sebastian Bergmann, the creator of PHPUnit, the unit testing framework for PHP, walked by our space. It seems that Etsy hires Bergmann and other leaders to consult by walking around, teaching the staff while doing the work. The following week, Rasmus Lerdorf, the creator of PHP, would be in the office.

It was getting to be time for lunch, so we headed off to a place called “Rice,” about a block away, that sells, well, dishes with rice in them. After lunch, Sussman walked me to the pickup spot for the cab. He offered to wait, to make sure the cab arrived on time, before we headed off to the airport. “It’s one thing to move fast and take risks,” Sussman said. “But you’ve got to follow through and make sure things work out.”

Somehow, I suspected that he might say something like that.