Lessons Learned at Invite-Only Performance Testing Conference
Starting with data and using it to get to the heart of the matter isn't the way IT conference sessions go. Then again, the invitation-only Workshop on Performance and Reliability isn't your typical IT conference. The real value is what happens when it's 'open season' on the presenter and the real 'sense making' can begin.
By Matthew Heusser
Explaining the Workshop on Performance and Reliability is a bit like peeling back an onion. On the surface, there’s a three-day workshop, but peel back the layers and you see an approach to learning that’s vastly different.
With a 30-attendee limit, WOPR is a bit like an open space conference that gives every participant an opportunity to speak. The format is different, though, based on first-person experience. Sessions start with data points and move backward to meaning. It’s something that Eric Proegler, one of the event’s organizers, calls “sense making.”
It’s the kind of thing that’s hard to explain—which is why I find myself in New York City’s Garment District at WOPR 20 to learn firsthand. Liquidnet, the global trading network located in Midtown, is hosting the event. Given that the company has only 300 employees, with a large number of trades done through automation, serving as a WOPR host makes sense—especially for this WOPR, which has a theme of cutting-edge techniques in performance testing.
Paul Holland, the lead facilitator, explains that each presentation goes into significant depth about the process from the presenter’s perspective. He encourages us to ask “clarifying questions” as the talk progresses (about, say, disk space, number of users or type of processor) but to hold questions about opinions or reasoning for the “open season” after the presentation. That’s the sense-making process where the group discusses the data, what rules of thumb to draw—and when those rules of thumb might not apply.
Performance Testing Applications on Hadoop: Avoid Bottlenecks
John Meza, performance testing manager at ESRI, gives the first presentation. ESRI makes spatial mapping software that programmers can use through a variety of APIs, including Java. Given that Java is probably the most popular language used on Hadoop, the open source version of Google’s MapReduce algorithm, customers could use ESRI’s software on Hadoop.
That’s just the kind of new application that needs to be performance tested, and it’s the focus of Meza’s experience report. Concerned for the product’s reputation, Meza wanted tests to give him confidence that ESRI’s software will perform more quickly, more accurately and scale better than its competition when running on Hadoop.
To do that, Meza had to set up a Hadoop cluster and Linux servers, download Hadoop, compile it and so on. The process was so slow and painful, though, that he gave up and instead downloaded Cloudera Manager, a free Linux distribution that came pre-assembled and preconfigured with Hadoop. With Cloudera running, Meza generated realistic geographic test data and wrote a program to look for patterns in that data using MapReduce.
After discussing Hadoop’s implementation of MapReduce, Meza tells us his performance concerns. First, that the individual mappers would take too long to process the data, and second, that the network (the oval below) would become saturated when too much data was sent across the wire. (The search can be distributed, but communication about the results will grow as the data set grows.)
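To make the network concern concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is not Hadoop code and not Meza's test program; the function names and the sample records are hypothetical. It only shows that every intermediate pair passes through the shuffle, which is the step that crosses the network on a real cluster.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, value) pair per input record (hypothetical region/count data)."""
    for region, value in records:
        yield region, value

def shuffle(pairs):
    """Group values by key and count how many pairs had to be moved.

    On a real cluster this grouping sends data between machines,
    so the count grows with the size of the data set.
    """
    groups = defaultdict(list)
    moved = 0
    for key, value in pairs:
        groups[key].append(value)
        moved += 1
    return groups, moved

def reduce_phase(groups):
    """Combine the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

records = [("north", 1), ("south", 2), ("north", 3)]
groups, moved = shuffle(map_phase(records))
totals = reduce_phase(groups)
```

In this toy run the shuffle moves three pairs; doubling the input doubles that number, which is the growth the parenthetical above describes.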
The testing at ESRI was a little different: The group wasn’t testing Hadoop as much as the interactions with ESRI’s SDK. If the SDK did not create a bottleneck, or if the bottleneck was greater with competing software, then testing was a success. That’s what Meza found. (Of course, it was possible to saturate the network, but the testing showed that ESRI was not the cause of the saturation.)
After the presentation, it’s time for open season:
I ask how Meza measured network performance; he used Ganglia and found that the biggest bandwidth spike occurred during the shuffle of data.
Meanwhile, Doug Hoffman asks where the biggest bottlenecks occur in testing. Meza’s limiting factor at this point is simple disk space; Hadoop performs so well that to slow it down would require a significant amount of additional disk capacity. Beyond the disk, generating enough unstructured data for Hadoop to process in a meaningful way can take considerable time.
After Meza, we hear about Mieke Gevers’ experience performance testing a new, cloud-based backup and synchronization feature for a client. For legal reasons, the physical backup solution needed to be stored on-premises, inside Germany. Instead of a public application, the company built its own storage cloud, using the open source OpenStack framework, then integrated the existing application with that cloud storage.
Mieke mentions two specific challenges related to the technology. First, the company had no experience with OpenStack, so simply building and configuring the lab was a challenge. This meant Mieke’s performance results from yesterday might be invalid, or at least not reproducible, because the lab was configured differently today. Second, inexperience meant the initial project schedule was a best-case scenario. The performance testing was fine, but it kept exposing functional problems with the application.
Eventually, the company chose an extended beta and incremental release to a few people. This made it easier to ramp up performance demand over time. The lesson: New technologies have a learning curve, while estimates without data are, well, estimates made without data.
Quality Performance Testing Is Probably Not Free
After Mieke comes Richard Leeke, a consulting director and performance tester from New Zealand. Leeke tells the story of grafting performance testing onto a large government website.
The biggest problem with that approach is design for testability: there wasn’t much. For example, the website used form field IDs that were dynamic, generated at run time, rather than static. That means that a performance tool that plays back recorded traffic, substituting data, will fail when it sends the right data to the wrong IDs.
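One common way around this problem is to read the field names out of each freshly served page instead of replaying the recorded ones. The sketch below is an illustration only, not the approach Leeke's team took: the page markup, the field names and the rule "fields keep the same order" are all assumptions.

```python
from html.parser import HTMLParser

class FieldCollector(HTMLParser):
    """Collect the name attribute of every <input> in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)

def build_payload(page_html, values_in_order):
    """Pair test values with the field names found in this response.

    Assumes the fields appear in a stable order even though their
    names change on every page load.
    """
    collector = FieldCollector()
    collector.feed(page_html)
    if len(collector.names) < len(values_in_order):
        raise ValueError("page has fewer input fields than test values")
    return dict(zip(collector.names, values_in_order))

# Hypothetical page whose field names were generated at run time.
page = '<form><input name="f_8271" type="text"><input name="f_8272" type="text"/></form>'
payload = build_payload(page, ["Ada", "Wellington"])
```

A replay tool that sent the names captured during recording would miss these fields; building the payload from the current response keeps the data aligned with whatever names the server generated this time.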
Leeke explains that the company evaluated eight test tools on paper before doing a funded proof of concept that eventually settled on Visual Studio Load Test. Like other tools, Visual Studio Load Test struggled to replace the form fields with the ones that would be generated, even though, conceptually, these could be predicted in code.
To solve the problem, Leeke’s team built its own application to create Visual Studio Test files. Using the application, a senior performance tester could write what he was about to do, capture the traffic (from Fiddler), then encode the rules to transform the Fiddler recording into an executable test.
To test subsequent builds of the application under test, a more junior tester could “follow the steps,” as it were, to first record and then use the existing transformation rules to generate a new, valid, Visual Studio Test file. According to Leeke, the process reduced the time to create a test suite from 600 hours to 150, while the time to re-record for a new build dropped down to 15 hours or so. This means a new test run can happen in days, not weeks.
Besides the tool, on which Leeke sought input, the second “cutting edge” aspect of his talk focused on infrastructure. The government structured the system on three physical machines running a total of five virtual machines. Usually when performance testing, Leeke would try to test at full projected load with N-1 machines, but the existence of the VMs meant that if one physical machine went down, two VMs could be lost. It also meant another layer of debugging to find out if the problem is the virtualization layer, the physical machine, a specific resource on the machine and so on.
Of course, there were problems. At high load, Leeke found that the VMs, as configured, seemed to suddenly “go wild.” The problem didn’t make sense, so the team ran a number of experiments on different VMs, looking for a configuration that didn’t have the problem and working backwards to isolate the cause. Eventually, they realized that removing “Network Interrupt Coalescing” kept performance linear at high load.
Performance Engineering at Facebook: Orders of Magnitude Larger
Leeke’s perseverance amazes me; he never gave up. Facebook’s Goranka Bjedov tells us that, when she hears what others have to go through to actually test, she appreciates her job much more. When Bjedov speaks about performance at Facebook, the first word that comes to mind is scale, as in 160 million newsfeed stories created every half hour.
Traditional performance testing on Facebook would probably require several full-sized data centers to generate the load, not to mention a test lab made up of, well, a couple more data centers. So the social media giant takes a different approach.
Facebook monitors the load on all systems. For performance testing, the load balancer will redirect more traffic to a specific subsystem until performance begins to degrade, that is, until schedule-based latency hits 100 ms, a point Facebook considers max load.
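The idea of raising load until latency reaches a limit can be sketched in a few lines. This is a toy model, not Facebook's tooling: the function names, the step sizes and the latency curve are all assumptions; only the 100 ms limit comes from the talk.

```python
def find_max_load(measure_latency_ms, start_rps, step_rps,
                  limit_ms=100.0, ceiling_rps=1_000_000):
    """Raise load step by step; return the last load whose latency stayed under the limit.

    Returns None if even the starting load is already at or over the limit.
    """
    load = start_rps
    last_ok = None
    while load <= ceiling_rps:
        if measure_latency_ms(load) >= limit_ms:
            return last_ok
        last_ok = load
        load += step_rps
    return last_ok

def toy_latency_ms(load_rps, capacity_rps=1000.0, base_ms=20.0):
    """Hypothetical latency curve that climbs steeply as load nears capacity."""
    utilization = min(load_rps / capacity_rps, 0.999)
    return base_ms / (1.0 - utilization)

max_load = find_max_load(toy_latency_ms, start_rps=100, step_rps=150)
```

With this made-up curve the search stops at 700 requests per second, because the next step (850) pushes latency past 100 ms. In production the measurement would come from monitoring rather than from a formula.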
The thing is, Bjedov’s talking about live traffic, on live production servers.
This takes a moment to sink in, so she elaborates: “There’s no such thing as a test environment. If your test environment is not the real environment, why bother? I can log on and find out the maximum available throughput in a cluster. How do I do that? I tell the router to send additional load to a cluster until it reaches performance limits…There really isn’t a way to do traditional performance and load testing.”
The biggest issue for Facebook, as far as traffic is concerned, is a “black swan,” an event that is hard to predict and that concentrates the data in a single place. The two biggest events Bjedov can name are Hurricane Sandy, when the cable from Europe to the New York data center was severed, and “one time Justin Bieber got a haircut.”
Because of its sheer number of engineers, Facebook is changing continuously. Bjedov therefore needs to test continuously and also monitor rollouts in production. One secret: New Zealand. It’s separate from the Americas and Europe physically and has a population small enough to not have a huge impact yet large enough to be measurable. It’s a common test bed for new features.
Putting Software Testing Methodologies to the Test
By the end of the conference, we’ve heard from Becky Clinard about testing cloud hosted legacy applications, Robert Binder about simulating an automated trading system and Julian Harty about using mobile analytics results for quality improvement. Each presenter stands up to scrutiny, to a kind of poking and prodding designed for learning, not sales, that would never happen at a “presentation” conference.
The process reminds me of Robert Austin, the Harvard Business School professor whose Carnegie Mellon University Ph.D. research became the basis for Measuring and Managing Performance in Organizations. Austin argues against management by spreadsheet and against the balanced scorecard, because they are lousy representations. Instead, he suggests periodic, incredibly deep project reviews.
After three days of extended experience reports, deep-dive Q&A and analysis, I think I may have glimpsed what Austin was getting at, and how “sense making” can apply to a larger group.
As I walk out, I run into Mais Ashkar, one of the organizers. She tells me WOPR 21 will be in October 2013 in San Francisco, with a theme of technical debt in system performance. I politely ask for an invitation.