Diary of a Product Testing Nightmare

What really goes on in a serious high-end product review?


You might guess that we are cursing Juniper at this point, and you'd be right. More importantly, these XFPs are a problem. SFPs for gigabit Ethernet used to be expensive, but they are now less than $100 each and everyone has a ton of them lying around. The least expensive flavors of XFPs are still around $1,000 each, and they're so new that people don't have jars full of them in their labs. If we don't have the right XFPs, we're not going to be able to go to Home Depot or Fry's to get them.

Fortunately, there are two Spirent TestCenter chassis in the SPoC lab, filled with 10Gbps ports. We again liberate some lab gear ("easier to apologize...") and borrow a half-dozen XFPs from Spirent's equipment to get the Juniper gear working, at last!

I dropped a note to Juniper wondering where the rest of the boards for this chassis are, and the reply is disenchanting: "You need them today? I was under the impression that this was going to take several weeks." Clenching our teeth, we smile and remind them the original test plan was for three days.

There's another problem as well: This chassis has two management cards in it, and I can't seem to get them to synchronize. I exchanged e-mail with a Juniper technical contact, and we couldn't seem to get past vocabulary issues. Like many new devices, this one has a different term for every piece, the only obvious one being "I/O Card." I say one thing, he says another, and we are both confused. However, he agrees to come over when the rest of the hardware is installed to help me figure it out.

David started a basic test, and I linked up the SRX to the NSM Express management system. We decided that we were overjoyed to have this small level of success. In the meantime, Juniper's Glen Gibson, one of the product managers, sent us an e-mail: They wouldn't actually have the rest of the parts for the chassis that night, but would get them to us first thing the next morning. Since David and I both had to get up at 4 a.m. that morning to catch our flights to San Jose, we called it a day at 6 p.m. and escaped to Los Gatos to have dinner with some friends.

Feb. 3, 9 a.m.

We arrived at the Spirent SPoC lab and are convinced Tuesday is going to be much better than Monday. We're wrong, but that won't be clear until after lunch.

Now that we had packets flowing between all ports on the SRX, we wanted to do some simple sanity checking to be sure we had a valid test configuration, so David launched one of our tests to get a preliminary read on things. The numbers came back and ... our 80Gbps testbed was giving us about 8Gbps of traffic. Uh oh. While this could have been Juniper's fault, David was immediately suspicious of our Spirent configuration. Fortunately, we were in Spirent-central. In fact, people had been coming through the conference room all day to say hello to David. His depth of experience with the Spirent gear has made him a minor deity there, and while I'm just chopped liver on the other side of the table, David the Celebrity gets a big hello from what seems like almost everyone in the building.

This worked to our advantage when we found a problem, and some top technical support people showed up immediately to figure out what was going on. The problem was quickly isolated to the 10G interfaces. It turned out that the configuration of these interfaces is not very straightforward. In fact, it was downright unpleasant. Actually, it was painful in a Michael O'Donoghue "plunge 15-inch steel needles with very sharp points into our eyes" kind of way. We discovered that we needed to completely rebuild our test plan. Most of David's homework was down the drain: We needed to start over and re-enter the test, but now the test setup was suddenly seven times longer than it had been. Instead of 16 networks, all talking to each other, he had to configure everything for 112 networks.

David and I huddled for a few minutes and redesigned our addressing scheme to accommodate the new configuration. Then he entered a Zen-like state, clicking and typing to enter the new test plan.

While David focused on this tedious task, I had my own demon to confront: NSM. Juniper's management system is very familiar to me, as we've been running it at Opus One for as long as it has existed. But when the SRX was integrated into NSM, someone left out a considerable number of nuts, washers and screws, with the result that much of what I know about NSM and about JunOS doesn't help.

This new-and-improved version of NSM does push firewall policy to the SRX just fine, but everything else is either as difficult as can be (such as managing the device); just plain doesn't work (looking at logs); or balks at what we want to do (getting the IPS configuration into place).

I called for help and Juniper agreed to come over.

We were interrupted by the arrival of Juniper's best and brightest, carrying an enormous pile of cardboard boxes, each of which contains a $100,000 card. With Juniper's help, we slide these into our chassis. This also gives Juniper a chance to figure out why I'm blathering like a heatstroke-addled tourist in Acapulco about two management cards that aren't in sync. Indeed, our chassis had two cards, and they weren't in sync, but that's not the problem. The problem was that this chassis doesn't support two management cards. This also explains why the Juniper team didn't understand what I was asking, since our chassis wasn't configured properly. Glen from Juniper slides the errant card into his briefcase with a "these are not the droids you're looking for" motion. Crisis averted.

Feb. 3, noon

Chris Chapman, Spirent's SPoC manager, sprung for pizza as it was clear that David and I were not leaving this conference room for lunch. Unfortunately, just as David finished rebuilding the test configuration we started running into other problems: The management system he is using has become unstable and he can't get the test to start. David invoked his favorite tool, tcpdump, and immediately pointed the finger at the SPoC lab network.

Debugging this test management problem gave me a rare insight into the dangers of being locked in a room with a bunch of really smart people. If it were me, and my network, I know how I would figure it out. But we had me, David, Chris Chapman and a handful of Spirent people, and everyone had a different -- and often contradictory -- idea on how to debug. We spent the next three hours stumbling over each other, reproducing the same experiments, and generally getting in each other's way.

The problem, it turns out, was nobody's fault, and everybody's. When Chris Chapman put together the lab, he ordered boxes and boxes of very short and very long patch cables. The cables are all the same color, and the long ones are all capable of stretching from one side of the room to the other. Thus there are a lot of one-ended patch cables sticking out in front of each switch, and it's very easy to confuse them since they all look alike. Chris' laudable goal of maximum flexibility, in this case, resulted in maximum confusion, as someone had accidentally built a loop into the network.

In the good old days of 10Mbps Ethernet, this kind of problem would have been easy to detect, because the whole network would have gone down and nothing would have worked. Unfortunately, the dozen or so gigabit switches that Chris used to build the SPoC lab are so fast that they actually work pretty well, even in a network which is furiously looping packets and thrashing its forwarding databases.

Feb. 3, 3 p.m.

With the network de-looped, David and I were back on track, and he hit the "Run" button on the Avalanche software while I paced anxiously behind. This, certainly, was going to work.

Following the trend of the week, it didn't.

Spirent escalates up the totem pole and ace Spirent troubleshooter Jeff Brown joins us in the conference room. He is sure that we have somehow corrupted the test plan and decides to re-enter it himself, from scratch. David, in frustration, rips his shirt off and screams "I am not an animal! I am a human being!" (For the record, this spontaneous outburst seems to have sprung from some dark place. David swears he's never seen "The Elephant Man.")

Meanwhile, the team from Juniper appeared to help me through our problems with NSM. Rob Cameron, a Juniper technical marketing engineer, generously walked me through his shortcuts for handling the SRX, while Sanjay Agarwal, an NSM product manager, explained how they got into this particular state. I'm not happy that NSM has such primitive management of the SRX, but it was nice to know how they got trapped in this particular cul-de-sac.

While Spirent's master of disaster re-enters our test plan, David comes up with an alternative idea: Sitting in the SPoC lab are a couple of Spirent TestCenter chassis, packed to the gills with 10G cards. Even one of these high-horsepower generators has more than enough ports to test our SRX 5800, and actually can go faster than the SRX. The only problem is the protocol: To get up to 160Gbps, Spirent TestCenter had to offer stateless IP or UDP/IP traffic.

Earlier, David and I had explicitly rejected a UDP test as not interesting: We wanted to test the SRX with traffic that at least bore a passing resemblance to what people actually send through their networks. UDP numbers are good for data sheets, but they don't really help people understand how these things will work in production networks, where TCP represents 95% or more of Internet traffic. But because we had two full days of failure and were starting to get anxious about getting numbers on paper, we methodically disconnected our 16 tangled fiber cables, running them over to Spirent TestCenter.

As David generated a configuration for the Spirent TestCenter chassis, I was in my own private Idaho: arguing with NSM about the IPS policy. Juniper's gurus came through and applied a good dose of ball-peen hammer to the SRX and NSM's update configuration so that we were finally able to push an IPS policy to it from NSM. Unfortunately, there were internal database inconsistencies that caused an IPS policy push to return errors. Short answer: NSM thinks that the SRX has IPS signatures that it doesn't. To set up our initial policy, I had simply checked two boxes, telling NSM to push a policy with all signatures Juniper regards as "critical" and "major." Seemed like the simplest and easiest policy you could imagine, as well as a good starting point based on Juniper's recommendation. Well, clicking only two boxes, that's too easy. And it won't work.

To keep NSM from throwing errors, I had to go through dozens of categories, individually adding signatures until I got an error, then backing out and trying again. It was like trying to parallel park a Ford F350 pickup between two Jaguars, except someone has filled the cab of the truck with marshmallows and you only know you've gotten close when the car alarm goes off on one of the Jags.

Feb. 3, 7 p.m.

We confirmed what we already feared: Jeff Brown's noble effort to re-enter the test configuration from scratch didn't fix the problem with our 16 Avalanches and their 8x10 color glossy photographs with the circles and arrows and a paragraph on the back of each one. As a consolation prize, we sent Jeff away with an iPhone charging cord.

However, I finally had a good policy in the SRX, without errors, and thought I understood what I needed to do to drive the SRX around.

I was wrong, of course, but in the twilight hours of two days of failure, I was happy for a little self-deception.

Feb. 3, 8 p.m.

I left David trying to coax any performance result out of the Avalanche appliances -- I thought he'd actually be happy to simply report that we had "link up" on 16 ports -- and I took off for Singapore-style Chili Crab in Palo Alto. David gave up at 8:30 and missed only the first round of Tiger beers.

Feb. 4, 9 a.m.

David and I were back in the conference room. David ran a set of Spirent TestCenter runs at different UDP packet sizes -- each took about 30 minutes to do a binary chop search -- and we at least had some numbers to report. Unfortunately, they were numbers we think that no one cares about; we essentially confirmed that Juniper was under-reporting its performance on its data sheet (a common strategy nowadays). We got 137Gbps and Juniper claimed 120Gbps. Big deal. David came up with significantly slower numbers for small packet sizes, which cheered him up considerably, but we both knew that it was a hollow victory: You don't buy this monstrosity of a box to push only 64-octet UDP packets, and if you do, you probably have bigger problems to worry about.
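
For readers who haven't watched a throughput test grind along, the "binary chop" idea can be sketched in a few lines of Python. This is only an illustration, not Spirent TestCenter's actual algorithm or API: the function `find_max_rate`, the stand-in `fake_device`, the 0.5Gbps resolution and the 160Gbps upper bound are all assumptions made for the example, and the 137Gbps ceiling is simply the figure we measured plugged in as a pretend limit. Each "trial" here is instant; in the lab, every one is a real traffic run.

```python
from typing import Callable


def find_max_rate(
    passes: Callable[[float], bool],
    low: float,
    high: float,
    resolution: float,
) -> float:
    """Binary-chop search for the highest offered rate that still passes.

    `passes(rate)` runs one trial at `rate` and returns True if the device
    forwarded the traffic without loss. The search assumes that if a rate
    fails, every higher rate fails too.
    """
    best = low
    while high - low > resolution:
        mid = (low + high) / 2
        if passes(mid):
            best = mid   # device kept up: search the upper half
            low = mid
        else:
            high = mid   # device dropped traffic: search the lower half
    return best


# Hypothetical stand-in for a real trial: pretend the device forwards
# cleanly up to 137 Gbps and drops packets above that.
trial_log = []


def fake_device(rate_gbps: float) -> bool:
    trial_log.append(rate_gbps)
    return rate_gbps <= 137.0


max_rate = find_max_rate(fake_device, low=0.0, high=160.0, resolution=0.5)
print(f"highest passing rate found: {max_rate} Gbps in {len(trial_log)} trials")
```

The point of the chop is that the number of trials grows with the logarithm of the search range, which matters when each trial means minutes of real traffic and the whole search has to be repeated for every packet size.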

Feb. 4, 11 a.m.

Spirent brought out its toppest troubleshooting top gun, Thuy Pham, a diminutive engineering manager who doesn't take any grief from anyone, and she declared that the problem was our test plan. We were ramping up too fast; the Avalanches can't do that. We needed to start slow and build up to full speed. Sure, leave it to a woman to figure that out.

David obediently changes our ramp-up time from 1 minute to 5 minutes, brings the Avalanches some chocolates, and suddenly things begin to happen. She was right, and now we have numbers. Or at least we have the ability to get numbers. Unfortunately, these tests take forever to run: about 20 minutes a pass, which meant that we weren't going to have a lot of time to try the 18 different test scenarios I wanted to run through the firewall.

Our first test brought to life our worst fears: the SRX 5800 is faster than our 16 Avalanches, at least when we ran stateful traffic through it. We were offering an HTTP load of nearly 80Gbps, the maximum we could do, and that's about what we were getting out.
