Can New Software Testing Frameworks Bring Us to Provably Correct Software?
Sixty-five years after the birth of ENIAC, software controls airplanes, pacemakers and missile systems, and it's buggy. Nonetheless, provably correct software has come a long way, and a variety of emerging software testing frameworks and methodologies are poised to further advance the cause.
By Matthew Heusser
We can get CPUs that are consistent and reliable, and we can get compilers that work on top of them that are provably correct. Pascal creator Niklaus Wirth described how to write a correct compiler in his book, Compiler Construction, back in 1996. It should be a simple step from there to create “functionally correct” programs that can, for any input, produce the correct output.
The only problem is that no one seems to be able to actually do it.
From Complete Testing to Proofs
Dutch computer scientist Edsger W. Dijkstra is renowned for his contributions to programming languages and for his famous letter “Go To Statement Considered Harmful.”
Within testing circles, he is known for observing that exhaustively testing the addition of two 32-bit integers, at one test per microsecond, would take hundreds of thousands of years, far longer than all of recorded human history. (The problem isn’t the 32 bits but that the two inputs together create 64 bits, or 2^64 combinations.) This leads to Dijkstra’s famous quotation: “Testing can show the presence of errors, but not their absence.”
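The combinatorial arithmetic behind Dijkstra’s argument is easy to check for yourself. This short sketch assumes the rate quoted in retellings of the argument, one test per microsecond:

```python
# Back-of-envelope check of Dijkstra's combinatorial argument:
# two 32-bit inputs give 2**64 distinct input pairs to test.
cases = 2 ** 64                      # 18,446,744,073,709,551,616 pairs
per_second = 1_000_000               # one test per microsecond (assumed rate)
seconds = cases / per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:,.0f} years")         # roughly 585,000 years
```

And that is for a single addition instruction; add a third 32-bit input and the wait multiplies by another four billion.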
Most programs are more complex than adding two numbers together. This leads some academics and practitioners to abandon testing and to pursue “proof of correctness” instead. These proofs can demonstrate that a requirements document, probably expressed in symbols, is correctly implemented in code.
Wirth argues that proofs of correctness should be included in undergraduate computer science programming courses, while C.A.R. Hoare developed what became Floyd-Hoare logic. Sidney L. Hantler and James C. King, meanwhile, provide an excellent introduction to proving the correctness of programs, making it clear that the size of the proof will grow much faster than the code itself.
In other words, while a 10-line procedure can be proven correct as an academic exercise, it’s just not feasible for a real, industry computer program. This problem of exploding complexity in code is not new. The NATO Software Engineering Conference identified it as the “software crisis” back in 1968.
While full “proof of correctness” may not be feasible, there have been attempts to move in that direction, to provide more rigor and correctness around a program before it gets to testing. A few of the more popular ways to address this include transformational specification, behavior-driven development, model-based testing and Fred Brooks’ “bronze bullets.” Each is discussed below.
Transformational Specification: Impractical for Software Development
If we could define our software in a precise, clear way, it may be possible to take that definition and transform it, over a series of steps, into the actual program. This turns out to be remarkably similar to what a compiler does—and compilers can, in theory, be demonstrated as correct.
This technique is sometimes called transformational specification. Its challenge: That precise definition will need to be interpreted by some sort of program. In order to make the specification precise, it needs to be written in code, and that code will probably have errors.
In the academic literature, a formal transformational derivation of a program to find the largest element in an array of integers requires roughly 600 small transformation steps, expressed in predicate calculus. That is complex enough that a graduate student in computer science might use the technique to, say, add up the change in your pocket or sort and count it, but it can’t be used for practical software development.
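To see the gap those 600 steps are closing, here is a deliberately lightweight stand-in: the “largest element” requirement stated as a predicate, and a candidate implementation checked against it on random inputs. (The predicate, function names and random check are illustrative; this is testing against a spec, not the formal derivation itself.)

```python
import random

def is_max(xs, m):
    """The spec, stated as a predicate: m is in xs and no element exceeds m."""
    return m in xs and all(x <= m for x in xs)

def largest(xs):
    """A candidate implementation, written by hand rather than derived."""
    m = xs[0]
    for x in xs[1:]:
        if x > m:
            m = x
    return m

# Random checks show the spec/implementation gap without closing it:
# they sample the input space, where a transformation proof covers all of it.
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(1, 20))]
    assert is_max(xs, largest(xs))
print("1,000 random cases passed")
```

Passing a thousand random cases builds confidence; only the full chain of transformations would make it a proof.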
Behavior-Driven Development: Code That Looks Like English
Like transformational specification, behavior-driven development (BDD) also separates the spec from the code. Like its older cousin, BDD begins by writing examples of what the software will do in a high-level language that looks more like English than most traditional code.
Here’s an example in Given/When/Then format:
Given a logged-in user
When I add item 12345 into my shopping cart
And I click order
Then I see the order screen with item 12345
To implement this, the transformational program would need to know what the shopping cart is, what logged in means, what the order button is and so on. These types of Given/When/Then statements can be very valuable and can be compiled, with a human coming along afterward to define the keywords “shopping cart,” “logged in” and so on.
These examples (but not a programmatic spec), defined up front, mean the programmer can know that, if the code delivered to acceptance testing meets some minimum standard of quality, then it can “work” under precise conditions. In some cases, the checks can run automatically with every build, so the build system can find major regression errors immediately after commit.
Model-Based Testing: Intelligent Automation for Big, Obvious Errors
A typical Web application has various states: a home page, failed login, search results and so on. If we know the previous state and the given input, then it should be possible to predict what state the software will transition to next. If we take those states and create a map in code, then we could create a test program to take “random walks” through the software under test, sending in randomly selected input to make sure the correct transitions occur. The term for this is model-based testing, or MBT.
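A random walk of this kind is straightforward to sketch. The model below is a hypothetical, tiny login flow (the states, actions and “system under test” are all stand-ins); the walk repeatedly picks a legal action, asks the system to perform it and checks the resulting state against the model:

```python
import random

# Hypothetical model of a tiny login flow:
# (current state, action) -> expected next state.
MODEL = {
    ("home", "login_ok"): "logged_in",
    ("home", "login_bad"): "failed_login",
    ("failed_login", "login_ok"): "logged_in",
    ("failed_login", "login_bad"): "failed_login",
    ("logged_in", "search"): "search_results",
    ("logged_in", "logout"): "home",
    ("search_results", "logout"): "home",
}

def system_under_test(state, action):
    """Stand-in for driving the real application; here it obeys the model."""
    return MODEL[(state, action)]

def random_walk(steps=1000, seed=42):
    rng = random.Random(seed)
    state = "home"
    for _ in range(steps):
        # pick a random legal action from the current state
        actions = [a for (s, a) in MODEL if s == state]
        action = rng.choice(actions)
        expected = MODEL[(state, action)]
        actual = system_under_test(state, action)
        assert actual == expected, (
            f"from {state}, action {action} gave {actual}, expected {expected}")
        state = actual
    return state

random_walk()
print("walk completed with no transition errors")
```

Against a real application, `system_under_test` would click buttons or call APIs and read back the observed state; any disagreement with the model is logged as a candidate bug.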
Unlike transformational specification, there has been a significant amount of industry work on MBT. In particular, Harry Robinson, an engineer on the Bing team at Microsoft, has published articles and tutorials, including his paper on intelligent test automation.
One challenge with MBT is something called the Oracle Problem: The test software needs to know if the software under test is correct. In the case of a mortgage calculator or insurance claims processing software, for example, that might mean re-implementing the calculations themselves in the test software. In some cases, this can add significant cost to a project. (Think about the cost of writing Bing, or Google, in order to compare search results.)
It’s worth noting that MBT has made inroads in areas where system behavior is predictable and specific, such as a firewall or a piece of hardware that routes TCP/IP packets. Vendors such as Conformiq produce tools for this market but often struggle to expand into general commercial software.
Cem Kaner, a professor of software engineering at the Florida Institute of Technology, suggests a slightly different approach to his students: If possible, use a competing product as the oracle. If you are testing an open-source spreadsheet, you might throw a random formula into the software under test, then into Microsoft Excel, and compare the results. Create a set of rules for generating the formulas, automate and log the process, run it 1 million times over a week and you have a workable model-based approach.
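The competing-product oracle fits in a loop of only a few lines. In this sketch, a hand-rolled insertion sort plays the software under test and Python’s built-in `sorted` plays the competing product (a deliberately small stand-in for the spreadsheet-versus-Excel comparison; all names here are illustrative):

```python
import random

def my_sort(xs):
    """The software under test: a simple insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def oracle_run(trials=10_000, seed=7):
    """Kaner-style oracle loop: generate a random input, ask both
    products, compare. Python's sorted() is the 'competing product'."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
        if my_sort(xs) != sorted(xs):
            failures += 1
            print("MISMATCH on input:", xs)  # log the failing case
    return failures

print("failures:", oracle_run())
```

Run it overnight, keep the logged failing inputs, and each mismatch is a reproducible bug report against one product or the other.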
In many cases, though, no competing product will be available, the product may not be open to automation or the two products may have different rules for determining the correct answer. In that case, MBT tends to find “big and obvious” defects such as crashes. This can improve reliability, but it tends to be expensive, and it certainly won’t make the software provably correct.
Fred Brooks, ‘Bronze Bullets’ and Small Improvements
Brooks argues that any radical change in a specific activity, such as eliminating testing by making software proven correct, can only modestly decrease the cost of delivering an entire system. Instead of trying to automate the task of programming, something he equates to finding a “silver bullet” to slay the werewolf of runaway costs, Brooks suggests a collection of “bronze bullets,” or small improvements to each activity that combine for a large effect.
For proof of correctness, that means moving from a single technique to a set of techniques, each designed to improve confidence in the software—preferably in a way that can be automated and repeated with the next build.
Dave Nicolette, an independent consultant and agile coach in Cleveland, recently published his list of techniques in Delivering Provably Correct Code. Like Brooks, Nicolette gives up on the idea of formal proof and instead suggests four other approaches.
Leveraging higher-level frameworks. Once a tag cloud library is demonstrated correct (enough) for a half-dozen customers, testing the next application that uses it becomes a matter of checking examples and inputs, not measuring font sizes with a ruler and calculator. The same can be said of a search indexer, GPS code library or key-value NoSQL tool. Focusing on delivering the right inputs to a library, then checking outcomes and edge conditions, can be much less expensive than writing it from scratch. Nicolette suggests frameworks, code generators and libraries as three places to start.
Programming language constructs. Nicolette starts with an example of declaring a typed list in Java, to ensure that the list only has members of a specific type, and goes on to recommend static code analysis and automated unit tests. Static code analyzers can, for example, find errors such as an assignment in a comparison (“if (a = b)” instead of “if (a == b)”) and are often free and easy to add to a continuous integration framework. Automated unit tests, most commonly written in an xUnit framework, provide examples of a class in operational use and are also straightforward to add to a build framework.
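Python’s standard library ships `unittest`, a member of the xUnit family, so the “examples in operational use” idea needs no extra tooling. The function under test here is a hypothetical discount calculator invented for the example:

```python
import unittest

def apply_discount(price, percent):
    """Function under test (hypothetical example for illustration)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)

class ApplyDiscountTest(unittest.TestCase):
    """xUnit-style examples of the function in operational use."""

    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_returns_price(self):
        self.assertEqual(apply_discount(99.99, 0), 99.99)

    def test_out_of_range_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    # exit=False so the test run doesn't terminate the interpreter
    unittest.main(argv=["example"], exit=False, verbosity=0)
```

Wired into continuous integration, these examples rerun on every commit, which is exactly the repeatability Brooks’ bronze bullets call for.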
Collaborative work environments. Even if we could create a provably correct requirements document, there still remains the risk that the requirement contains an error, or that an architect misunderstood a phrase in an English document as he translated it into a program. Collaborative environments address this problem—not through code, but by having people sit together and talk through issues and examples before putting their hands on a keyboard. These types of environments reduce the risk of defects through conversation, examples, questions, answers and feedback.
Service orchestration. Instead of thinking of a Web page that displays search results, we can think of two features: a Web service that performs a search and a bit of code that formats search results in HTML and displays them to a browser. Breaking the coupling of the work and testing the components individually greatly reduces the complexity of each piece. In essence, two divided by two becomes less than one.
Yes, service orchestration might fail when the service is down because, say, a SysAdmin accidentally changes a firewall or router rule—but formal proofs don’t offer a solution there, either. There is a chance, though, that the problem could be caught in a collaborative environment. These techniques support each other.
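The search-page split described above can be sketched in a few lines. The catalog, function names and markup are invented for illustration; the point is that each half has its own small contract that can be tested without the other:

```python
# Hypothetical split: a search "service" and an HTML formatter,
# each simple enough to test against its own contract independently.
import html

CATALOG = ["red shoes", "blue shoes", "red hat"]  # stand-in data store

def search(query):
    """The service: returns matching items, no presentation logic."""
    return [item for item in CATALOG if query in item]

def render_results(results):
    """The presentation piece: formats results as an HTML list."""
    items = "".join(f"<li>{html.escape(r)}</li>" for r in results)
    return f"<ul>{items}</ul>"

# Two small, independent checks replace one large, coupled one.
assert search("red") == ["red shoes", "red hat"]
assert render_results(["red shoes"]) == "<ul><li>red shoes</li></ul>"
print(render_results(search("red")))
```

The service can be exercised with thousands of queries and the formatter with a handful of result shapes, neither test suite caring how the other side works.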
What’s Next? Better Frameworks, More Collaboration
In his keynote address at the 2009 Google Test Automation Conference, Wirth said industry advances in proofs have been disappointing and the problem of testing is likely to be with us forever. Over the past decade, these problems have become more complex, as desktop software for one product line or operating system has given way to a half-dozen Web browsers and dozens of mobile device manufacturers with hundreds of model numbers.
Testing problems are likely to be with us for years to come. Nicolette’s suggestions of more powerful frameworks, more collaborative environments and the separation of work into services that can be tested independently may at least create an environment where code demonstrates that it can work under precise conditions.
While provably correct code may be a Pied Piper not worth following, organizations today can make strides to radically improve software quality before it gets to exploratory testing. That is something to celebrate—and to keep moving forward.