Diary of a Product Testing Nightmare

What really goes on in a serious high-end product review?

Just like sausage, the making of a serious high-end product review might be tough to stomach.

But we're constantly pitched by vendors who want their products tested faster than we can commit to, and we're occasionally queried by readers wanting to know why or how we chose a particular methodology to exercise a given product. Finally, we are always called upon to correct our mistakes.

In our recent test of Juniper's SRX, we did indeed make a mistake when testing it under an attack load. We're not making excuses for that mistake, but as is typical with large-scale testing, there was a comedy of errors -- on our part and by others -- throughout the testing cycle. What follows is a description of what really goes on during big-box testing.

October 2007

I got a phone call from Doug Dooley, one of the marketing managers at Juniper. He wanted to talk about something, under NDA, called Australia, a hardware platform that would be to high-end security what Cisco's Catalyst is to enterprise networking.

It was a fascinating product line. Folks like Crossbeam had been going for high performance with parallel processing, but there was a considerable management overhead in keeping their chassis running. Doug was talking about an entirely new approach, something we hadn't seen since the peak of the dot-com days when folks like CoSine were trying to slip firewalls into their high-end routers, without much success.

The Juniper team was promising performance of 150Gbps -- completely blowing the competition out of the water, and without some of the ugliness in other products. This was not just going to be fast, but it was also going to be manageable in a way that had never been done before at those speeds.

Things got quiet after that. Juniper introduced enterprise switches, and came up with a new high-speed intrusion-prevention system in April 2008. But Australia was still down under until July, when Juniper was ready to go on the record. The platform had a name: SRX, and there would be two models, the 5600 and the 5800. When Juniper promised speeds of 60Gbps on one and 120Gbps on the other, I knew I was out of my league.

There's only one person in the public testing business who has the chops to handle that kind of speed: David Newman, at Network Test. David, Christine Burns (the testing editor at Network World) and I got on a call. Christine took the bait: The biggest firewall in the world? How could she not want to publish a test of that? David and I were hired: Go get an SRX and tell the world how fast it goes.

Roger Fortier, then the PR guy handling tests at Juniper, was our go-between. He set up a meeting where we could quiz the technical people and get a solid overview of the platform. We looked at the options, and decided to give the 5600 a spin, rather than the studlier 5800. Our reasoning: Network World is aimed at enterprise network managers more than service providers and the 5600 was much more likely to be on our readers' shopping lists than the über-box 5800.

Thanksgiving Day 2008

We began to arrange all of the components we'd need to test the 5600: David had to be there. I had to be there. The hardware, which was just beginning to be manufactured on a different continent, had to be there. And we knew that we'd need to use Spirent's SPoC (Spirent Proof of Concept) lab in Sunnyvale, because neither of us had the equipment in our labs to stress a firewall that goes 60Gbps with stateful traffic flowing through it. We settled on a test date: the first week of February. All that was missing was a final "go" from Juniper.

Which didn't come. And didn't come.

Jan. 16

We found out that our main contact there had been laid off and the project was being dumped into the laps of other folks in Juniper's PR department. Fortunately, Juniper got on the ball and promised to have the hardware there, along with one of its new NSM Express management appliances (previously Netscreen System Manager but now renamed Network and System Manager to show that it could handle JunOS devices like this one), ready to rock and roll, when we showed up.

Which was mostly true. The Juniper folks said they'd send over a chassis that didn't have all of its SPC cards in it, but there was enough to get started. They would bring the rest over from their own labs, just around the corner from Spirent. When? We asked for dinner time on the first day, because we figured we'd spend much of the first day just getting everything set up.

With all of these parameters in mind, David and I designed a test methodology. We were going to approach this beast as a firewall, not as a router, so we looked at typical stateful firewall metrics: maximum number of TCP connections, maximum TCP connection rate and, of course, TCP goodput. To keep things fair, we used a variation on the same methodology I had used for Network World in the past, where we built very "content-full" HTTP streams and sent them through the firewall. We wanted to test with the firewall on, with network address translation on, and of course with all flavors of the IPS turned on.

To add a wrinkle to the test, we also planned to add between 1% and 3% attack traffic to the mix and see how the firewall behaved. David went off and pre-built test configurations, while I boned up on my JunOS command-line syntax and upgraded the version of NSM in our test lab in Tucson to the latest version to see how it was going to look.
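For readers unfamiliar with these metrics, here is a minimal sketch (not the actual test tooling, and all numbers are hypothetical) of how the three stateful-firewall metrics and the attack-traffic mix described above are derived from raw load-generator counters:

```python
# Illustrative sketch only: derive stateful-firewall metrics from hypothetical
# load-generator counters. Names and numbers are assumptions, not Spirent APIs.

def goodput_gbps(app_bytes_delivered: int, seconds: float) -> float:
    """TCP goodput: application-layer bytes successfully delivered per
    second, expressed in Gbps (retransmissions/overhead assumed excluded)."""
    return app_bytes_delivered * 8 / seconds / 1e9

def connection_rate(conns_completed: int, seconds: float) -> float:
    """TCP connection setup rate, in connections per second."""
    return conns_completed / seconds

def attack_sessions(total_sessions: int, attack_fraction: float) -> int:
    """How many attack sessions to blend in for a given fraction (1%-3%)."""
    return round(total_sessions * attack_fraction)

# Example: 60 seconds of steady-state traffic (hypothetical counters)
print(goodput_gbps(450_000_000_000, 60))   # 60.0 Gbps of goodput
print(connection_rate(18_000_000, 60))     # 300000.0 connections/sec
print(attack_sessions(1_000_000, 0.02))    # 20000 attack sessions at 2%
```

The point of separating goodput from raw throughput is that a firewall under attack load can keep forwarding bits while dropping or resetting the sessions that matter; measuring delivered application bytes catches that.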

Feb. 2, 9 a.m.

David and I met at the Spirent SPoC lab. We were given a beautiful conference room for the week and pointed at an enormous wooden crate. "This came for you," the Spirent folks said. Opening the crate and disassembling the chassis so it was light enough to carry into the SPoC lab across the hallway, we confirmed that we had 16 10G ports. Woohoo! This was going to be a fast box!

But we ran into problem No. 1 right away: this was not an SRX 5600. This was its bigger brother, the SRX 5800. Juniper had shipped us the wrong box, or at least not the box that they had agreed to send. So now we had a problem -- we had only spec'd a testbed to really punish the firewall up to 80Gbps. The SRX 5800 (as we would find out later) goes far faster than that. What's the point of this test?

Rather than abort the test after we'd gone to all this trouble, we decided to push ahead and see what we could find out. We continued unpacking.

After lifting the chassis into place and replugging all the cards and power supplies, we discovered our next problem: Juniper hadn't shipped enough high-amperage power cords. The company sent two, and the chassis needed three to power on. Fortunately, a Cisco Catalyst 6509 on the floor of the SPoC lab was powered off, so we liberated one of its power cords and turned everything on. ("Liberated" is a term I learned from Grace Hopper, who used it as part of her lecture explaining that it is generally easier to apologize than it is to ask permission.) Then we began to label and patch the 16 10G ports to the 16 Avalanche 2900s that Spirent had installed for us.

David dove into the application that controls all of the Avalanches at once, while I started working on making the firewall and management system work.

Feb. 2, 11 a.m.

I discovered that the management system was not a factory-fresh one and needed to be reinstalled. There was good news and bad news. The good news is that Juniper makes this really easy: You boot it up, interrupt the boot, then tell it to re-install. The bad news is that it takes a long time. I kicked off the reinstallation and then turned to the SRX to see if I could get a basic configuration into it so that David could start some sanity checking.

Although we were sitting in a conference room, the SPoC test lab was across the hallway and we had left the door open for easy access. I was concerned because the fans on the SRX didn't seem to be behaving well: they were getting louder and softer every few minutes. When I looked at the serial console, I found something that drove a cold stake through my heart: The system was in a continuous reboot cycle. That's what was making the fans go crazy.

I dragged David into the lab to see if we could find something loose or misinstalled, and we quickly zeroed in on the power supplies: Two out of three of them were dead! As the system booted, it slowly powered on each card in the chassis. At some point there were more cards turned on than the single working power supply could handle; then some internal breaker tripped and the whole thing came crashing down.

But two dead power supplies? That's pretty weird, and a very unusual failure, to boot. I stared at them for a while and then noticed something strange: the one power supply that worked was the one plugged in using the Cisco power cord. The ones that didn't come up had the Juniper power cords. Bad power cords? That's not possible, not two of them at once. Obviously, something else had to be wrong. We moved power cords around and quickly confirmed that it was indeed the power cords. Then we looked at them carefully.

The Cisco power cord had never been unplugged from the power outlets, hung high above our heads, while we plugged in the Juniper power cords. But there was a difference we discovered on closer inspection: The Cisco power cord was jacked into a 220V twist-lock outlet. The Juniper cords were normal 110V power cords. In fact, they were completely illegal: 20-amp power cords on one end with 15-amp plugs on the other. These cords didn't go with this device and, frankly, they can't go with any device. They are for some other, mythical device that has a 20-amp connector but only draws a maximum of 15 amps.

I pulled down the Cisco power cord and looked at it: a very standard L6-20 20-amp connector. OK, no problem; we were in the heart of Silicon Valley. David and I hopped into the rental car and headed over to Home Depot to pick up some L6-20 plugs, enough tools to change them and a cheap voltmeter to make sure we knew what was going on.

Feb. 2, 1 p.m.

US$68 later -- which Juniper still owes me -- we were back at the Spirent SPoC lab and I was busy cutting the ends off of the bad power cords so we could put the correct plugs on. A few minutes later we discovered another problem: the SPoC lab didn't have L6-20 20-amp receptacles, but instead L5-30 30-amp receptacles. The difference is important, but anyone who has spent time in a machine room knows that, in a pinch, you can squeeze an L6-20 plug into an L5-30 receptacle -- which is what Spirent's techs had done for the first plug, and what we did for plugs 2 and 3. I didn't feel good about doing it, but it was getting late and so far we hadn't moved a single packet.

Along with replugging the power cords, we got an opportunity to test the stability of the SRX chassis. To get the power cords to reach the 30-amp outlets, we had to slide the SRX just a tiny bit off center. At which point, the SRX became too heavy for its rack mounting, ripping a shelf out of the rack and plunging the SRX to the floor of the SPoC lab in a rather abrupt, noisy and dramatic fashion. No one was hurt, including the SRX. We rearranged things a bit to pretend that we had meant it to look that way and continued on.

Feb. 2, 3 p.m.

David had a basic configuration in the Spirent Avalanche systems, and I had a basic one in the SRX via the JunOS command line. I had given up on NSM for the short term (while it reinstalled itself) until we at least got things going, so for now we simply had a firewall that passed all packets. Unfortunately, it didn't actually pass all packets. It only passed most packets. David pulled some statistics and it was immediately clear what was wrong: some interfaces were working fine, while others forwarded no packets at all.

Because there were a lot of patch cables and fiber patches running around, we immediately suspected human error in the wiring. That's the most common cause of this sort of thing, and we've been there, done that and gotten the T-shirt when it comes to human error. We checked everything: Did we know which boxes were which? Were the fibers crossed? Had anyone mistyped an IP address or network mask? We ran around, traced a ton of fiber and put labels on everything, but found no errors. Everything should have worked, but it didn't.

David launched another test and I dug deeper into the SRX to see if any error counters were climbing. There were no errors in the counters. But there were also no packets on some of them. Big, fat, decidedly unsexy zeros. In fact, it turned out that the SRX wasn't seeing "link up" on all ports.

Back we went into the SPoC lab to squat in front of the machine and think. Then I noticed something. The ports that weren't working used one kind of XFP; the ports that were working used a different kind. (XFP is the 10Gbps version of an SFP: the incredibly expensive optical transceiver.) Aha! I pulled out one of the non-working XFPs and read the label carefully. It didn't say much (bad XFP manufacturer, bad, bad!), but there was a part number -- and Google is your friend. This wasn't a short-haul 850nm transceiver suitable for a 10-foot fiber patch in a machine room; this XFP was designed for long-haul networks. These were the wrong XFPs.

Someone at Juniper had made a fourth mistake: wrong chassis, wrong power cords, wrong management system -- and now wrong XFPs.
