Diary of a Product Testing Nightmare

What really goes on in a serious high-end product review?


David pushed "Run" on the Avalanches, and I started my countdown. Because the Avalanches take about 10 minutes to ramp up to the full steady-state load we'd configured, we chatted and waited for the SRX to get busy. When it did, I kicked ThreatEx in the pants, and suddenly our graphs got very interesting. The SRX was not taking the attack traffic well. As the ThreatEx box was ramping up to 660Mbps, the SRX was ramping down, down, down, to ... well, to essentially zero. Nothing was getting through. We had brought this poor box to its knees with less than a gigabit of traffic. We thought.

This was good news. Not because the SRX couldn't take the load, but because it simplified our testing: I didn't have to figure out how much attack traffic the SRX was seeing, because clearly it couldn't take it at all.

Feb. 4, 1 p.m.

After a Chinese take-out lunch, we were ready to take a closer look at the SRX. We repeated the test from this morning, verifying that the SRX indeed could not take our 660Mbps of ThreatEx traffic. I hunted through the CLI to confirm that some signatures were being triggered, just to verify that it was the IPS that was at fault. We looked at the IPS processors, and all were pegged at 99% load. Indeed, the SRX was on the verge of collapse.

"OK," we said, "what can it take?" We tried a binary search and I resetted the ThreatEx to 10Mbps of attack traffic.

Our poor SRX couldn't even handle that. With 10Mbps of attack traffic, the SRX did not go to zero, but it was certainly near that number. I checked my numbers and checked the SRX one more time: Yep, it was pegged. It was busy. We had 15 co-processors all at 99% CPU load, and it was not processing any traffic. We repeated the test with an eye on the clock: We had to leave the lab around 5 p.m., there were a ton of patch cables to put away, and we had to get the monster chassis back into its packing case.

The SRX was nearly dead in the water, so we decided that we were truly out of time. We had killed it, and now at least we had something to write about. We thought.

David and I started unplugging, coiling cables and generally undoing the mess we had made over the previous four days. Rushing out the door to the airport, we weren't happy, but at least we had some numbers.

In fact, we hadn't killed it at all. What we did was fill up the NAT table. In an earlier test, we had turned on NAT, but I had forgotten to turn it off. Because we were concentrating all of our attack traffic on a single interface, everything was being NATed to a single IP address. That meant we could have a maximum of about 65,000 sessions open at once, roughly one per available source port on that address.

With our 10Mbps test, we were opening about 4,000 sessions a second. In about 15 seconds, we had filled up the NAT table on that interface. No traffic was going to get through there, period. That was utterly predictable.
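
For readers who want to check the arithmetic, here's a minimal back-of-the-envelope sketch in Python. The 4,000-sessions-per-second rate and the roughly 65,000-entry limit are the figures from our test; the exact port count is my assumption.

    # Back-of-the-envelope: how fast NAT to a single address fills up.
    # Assumes one translation entry per new session and ~65,535 usable
    # source ports on the single translated address.
    nat_capacity = 65_535          # max concurrent translations on one IP
    sessions_per_second = 4_000    # session setup rate during the 10Mbps test

    seconds_to_exhaust = nat_capacity / sessions_per_second
    print(f"NAT table full after ~{seconds_to_exhaust:.0f} seconds")
    # Prints ~16 seconds, in line with the roughly 15 seconds we observed.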

Now, you could argue that filling up the NAT table on one interface was no reason for the entire SRX to grind to a halt. After all, each interface should have its own NAT table, and what we were doing on one pair of interfaces shouldn't have bothered the others. That's true, and if I had an SRX 5800 in my lab with 16 Avalanches and a ThreatEx box, I'd spend a lot more time trying to figure out why NAT exhaustion on one interface brings the whole box down.

Unfortunately, Juniper found this problem in trying to replicate our results long after Network World had already sent the performance numbers to the presses. So 175,000 copies of Network World are out there saying that the SRX falls apart when its IPS is under attack.

That may be true, or it may be false. However, it's clear that the configuration we used (as supplied to us by Juniper), where we were NATing and exhausting the NAT table, was not a realistic configuration. Anyone using this device would not NAT to a single address, but would NAT to a huge range of addresses. That's how I would have set it up in NSM -- if I could. But NSM is utterly unusable when it comes to the SRX, and I didn't bother to craft the proper NAT policy because the one Juniper gave us seemed to work.
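
To put a rough number on that: the same port arithmetic shows why a pool of addresses changes the picture. This is only an illustrative sketch; the 254-address pool is a hypothetical example, not the configuration Juniper supplied.

    # Illustrative only: concurrent-session headroom scales with the number
    # of addresses in the source NAT pool (~65,535 source ports per address).
    ports_per_address = 65_535

    for pool_size in (1, 254):     # one address vs. a hypothetical /24-sized pool
        capacity = pool_size * ports_per_address
        print(f"{pool_size:>4} address(es): ~{capacity:,} concurrent sessions")
    # 1 address    -> ~65,535 sessions (what our borrowed config allowed)
    # 254 addresses -> ~16.6 million sessions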

So we have to give Juniper the benefit of the doubt.

Is there something wrong with the SRX? Yes. If you fill up the NAT table on one set of interfaces, the whole box comes to a shaky halt. Juniper needs to fix that. That's a bug, for sure. The system doesn't throw an error; it doesn't notify NSM; it just stops passing traffic. The fact that we applied the wrong NAT configuration might invalidate the performance-under-attack results, but then there's still this basic design flaw.

We weren't trying to test the NAT; we were trying to test the IPS. And we don't know what the IPS does when it's under attack.

Will we be able to find out? I don't know. Getting the material together for this test was hard enough. And now Juniper doesn't have any incentive to cooperate: The test is published, and the best possible result that could come out of a re-test is that we say "the box doesn't get any slower." So probably not, though hope springs eternal: We'd be delighted to do a retest if Juniper is willing. But we don't expect it to be much easier next time.

Big boxes. They're a pain to test.

This story, "Diary of a Product Testing Nightmare" was originally published by Network World.

Copyright © 2009 IDG Communications, Inc.
