Software Testing Lessons Learned From Knight Capital Fiasco

Knight Capital lost $440 million in 30 minutes due to something the firm called a 'trading glitch.' In reality, poor software development and testing models are likely to blame. Here's what your company can do to avoid similar embarrassment and huge losses.

It took only one defect in a trading algorithm for Knight Capital to lose $440 million in about 30 minutes. That $440 million is three times the company's annual earnings.

The shock and sell-off that followed caused Knight Capital's stock to lose 75 percent of its value in two business days. The loss of liquidity was so great that Knight Capital needed to take on an additional $400 million line of credit, which, according to the Wall Street Journal, effectively shifted control of the company from the management group to its new creditors.

Knight Capital was regulated by the Securities and Exchange Commission, routinely audited and PCI compliant. If that bug could affect Knight, it could happen to any company. At least that's what Knight Capital CEO Thomas Joyce seemed to imply in an interview with Bloomberg Television. "Technology breaks. It ain't good. We don't look forward to it," he says, adding, "It was a software bug....It happened to be a very large software bug."

Thomas Joyce
"Technology breaks. It ain't good. We don't look forward to it," Knight Capital CEO Thomas Joyce says. (Image courtesy of Bloomberg)

This incident wasn't the first of its kind. In 2010, something caused the Dow Jones Industrial Average to drop 600 points in roughly five minutes in what is now known as the "flash crash." Nasdaq blamed the disastrous Facebook IPO on a similar technical glitch.

Mistiming, Bad Orders Crash High-Frequency Trading Algorithm

In early June 2012, the New York Stock Exchange (NYSE) received permission from the SEC to launch its Retail Liquidity Program. The RLP, designed to offer individual investors the best possible price, even if it means diverting trades away from the NYSE and onto a so-called "dark market," was set to go live on Aug. 1. This meant trading houses had roughly a month and a half to scramble to write code to take advantage of this new feature.

The Knight Capital incident happened in the first 30 minutes of trading on Aug. 1. Something went very wrong in the code that had been introduced overnight. The code itself was a high-frequency trading algorithm designed to buy and sell massive amounts of stock in a short period of time. A combination of mistiming and bad orders led to disastrous results.

Beyond admitting a software defect, the staff at Knight Capital have been reluctant to discuss exactly what caused the defect. They aren't alone—the majority of financial-related inquiries for this article led to responses such as "No comment," "I can't comment" or "We cannot comment on this story."

One technologist at a financial services company, who asked to remain anonymous, suggests two possibilities. It could have been the standard rush to production without proper testing. Parse the statements from Knight Capital carefully, the technologist says, and it's possible that the program that went into production was actually a test program—one designed to simulate trade requests and evaluate whether they went through properly. Nanex conducted an analysis of the trades last week and came to the same conclusion.


Rick Lane, CTO of Trading Technologies in Chicago, agrees that the problem might be a test program in production—or, possibly, a configuration flag that wasn't ready for production and should have been turned off. He points out that these trading algorithms are developed incredibly quickly, as they are designed to chase fleeting opportunities, and that good change management may take a backseat to speed.
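A stale configuration flag of that kind is easy to picture. Here is a minimal, hypothetical sketch of the failure mode Lane describes—the flag name, config format and order-routing function are invented for illustration, not Knight's actual system:

```python
# Hypothetical illustration: a production config in which one leftover
# flag routes orders through a test-order simulator instead of the
# live gateway. All names here are invented for the example.

PROD_CONFIG = {
    "environment": "production",
    "use_test_order_generator": True,  # should have been False before go-live
}

def route_order(order, config):
    """Send an order to the simulator or the live gateway based on config."""
    if config.get("use_test_order_generator"):
        return f"SIMULATED: {order}"   # harmless in a lab...
    return f"LIVE: {order}"            # ...the path production should take

print(route_order("BUY 100 XYZ", PROD_CONFIG))  # prints "SIMULATED: BUY 100 XYZ"
```

The point of the sketch is that the code itself can be entirely correct; the defect lives in one boolean that change management never caught.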

"The scary thing is this happens more often than people think, and not just by trading shops," Lane says. "In September 2010 the Chicago Mercantile Exchange ran a program that accidentally injected test orders into [its] production system—and the CME doesn't even have the kind of time pressure that these trading shops have."

Adding Retrospective to Development Process Can Reduce Errors

Jeff Sutherland, a co-author of the Agile Manifesto who helped formalize the Scrum methodology, adds a third possibility: the team may have been using a development method prone to error.

Sutherland, also a former U.S. Air Force pilot, recommends an external assessment, much like the process the National Transportation Safety Board uses for airplane accidents. Without some assessment, he says, we may never know what went wrong—and we run the risk of trying to prevent the wrong problem.

New York Stock Exchange
Only a thorough assessment of Knight Capital's software development lifecycle will tell us what happened at the New York Stock Exchange on Aug. 1, 2012, experts say. (Image courtesy of Ryan Lawler via Wikimedia Commons)

George Dinwiddie, principal consultant at iDIA Computing, also recommends an assessment. Any company can assess its organization using a tool called a retrospective, Dinwiddie says. The retrospective is a formal "look back" process that considers what is actually happening, what the risks are and how the team can improve.

In the Army, retrospectives are called "after-action reviews." The latest thinking in software, though, is to have the conversation before the software is deployed in order to catch and fix the problem. The Agile Retrospective Resource Wiki provides a host of options.

One effective method I recommend is to ask what is going right, what is going wrong, and what we (the team) should do differently. Team members create cards to list what they would like to talk about, then vote by placing a dot on the cards to decide what to talk about. The team discusses the two most heavily dotted items in each category.


When there is a problem, another anonymous source points out, someone in the organization usually knows about it but may not feel safe enough to bring it up. Retrospectives provide that safety: not only an open door, but group consensus as well. Someone can raise an issue and get support. That's hard to turn a blind eye to.

4 Ways to Improve Software Testing and Reduce Risk

After the retrospective, your team may come up with a list of risks and issues such as those (theoretically) identified in the Knight Capital case. If so, consider these four techniques to reduce risk.

  1. Improve change and configuration management. Keeping test and production code in different "sandboxes" is one popular practice to reduce risk. At Zappos, the team has a separate, and more rigorous, process for code that will touch customer-sensitive data and financial information. Etsy, meanwhile, deploys all code to production but mitigates that risk with technique #2.
  2. Improve production monitoring. Lane suggests using automated processes to detect and send alerts regarding errors. Jeffery Reeves, editor of InvestorPlaceMedia, recommends having actual humans watching transaction volumes, using personal judgment. Companies with a large amount of automated transactions would do well to have both.
  3. View testing as a high-level risk management process. "In many cases, trading companies feel there is not enough time for traditional testing due to the compressed time of these fleeting opportunities," Lane says. Ironically, many bugs would not be found by traditional testing, as they are configuration risks. The software may have worked correctly but ran in the wrong place, at the wrong time, or was test code running in production. The type of testers who "click this, click that, make sure these numbers match," as Lane describes them, could not find such bugs. Better risk management is necessary. The SEC may create regulations to require automation review policies for trading software, but the concept applies to any software with business risk attached.
  4. Increase internal controls on high-volume transactions. It is possible that the Knight Capital failure could have been averted by a single button that a human needed to click, especially when the volume reached a certain level. Such a control would not be at the program level but, instead, at the public API level. Until such external controls exist, we would do well to build them into our gateway programs.

Knight Capital may never be transparent enough for us to conduct an assessment of what went wrong, or to even see a retrospective report. That shouldn't stop your organization. This could be an opportunity to examine your systems and how they interoperate while determining the value of investing time and energy in risk management.

It's hard work, and it's not eye-popping, but good risk management is likely to keep your company off the CNN, Wall Street Journal or Financial Times home page. That just might turn out to be a most excellent thing.

Matthew Heusser is a consultant and writer based in West Michigan. You can follow Matt on Twitter @mheusser, contact him by email or visit the website of his company, Excelon Development. Follow everything from CIO.com on Twitter @CIOonline, on Facebook, and on Google +.
