When it comes to budgeting for cloud software, it's important to have some solid data about the cost of deploying a "zero-feature" update, the likelihood of encountering latent bugs, and the level of effort required for simple developer overhead and housekeeping. While there's some good data and solid advice out there from the Standish Group, as I mentioned in a recent article, I haven't seen any data that's particularly modern or really focused on the harsh realities of cloud software development.
Why can't we just extrapolate statistics from 50 years of on-premises software development? Cloud development really is different:
- It's loosely coupled. Web services act as components. That loose coupling is a huge benefit, but it means that components you rely on may evolve without you knowing about it. All of a sudden, there are new, often unspecified behaviors that, though they may not be bugs, certainly will contribute to them.
- Cloud code is multilingual. It's not unusual for a single application to leverage four or more languages. This means there's no single tool for comprehensive debugging in your own application, let alone in the other Web services you may depend on.
- Cloud code tends to be poorly documented, as I wrote recently. In fact, the Clean Code guys actually advocate no comments at all.
- Logging and troubleshooting data is skimpy, and typically has to be enabled for (brief) dedicated periods.
- Finally, the system is a moving target, not a fixed one. Rules and workflows seem to evolve endlessly. System administrators can change thresholds, constraints and allowable values in ways that make code misbehave.
What this means is that deploying a "zero-functionality" release – just adding some debug statements, for example – can trip across a lot of bugs, meaning hours of hilarity for your developers.
While the generalities of this article hold true for almost any cloud platform, the specifics here are based on Salesforce.com experience. I welcome commentary and amendments to this content based on Amazon Web Services, Microsoft Azure, Google and other cloud application development experience.
What's the Probability of a Bug Evolving?
The idea here is that, at T0, the system runs properly and all unit tests pass. At T1 through TN, a sys admin makes configuration changes to the system(s) that may introduce new behaviors. What are the relevant configuration changes? Sys admins can do a surprising number of things that provoke system issues, even when there are no changes to code, including modifications to the following:
- Field constraints
- Picklist values
- Record types
- Page layouts
- Workflow thresholds and formulae
- Workflow field updates
- Validation rules
- Lead and case routing rules
- Formula calculations
- Field permissions
- Object permissions
- User roles
- User profiles
- User groups
- Custom Settings
- iFrame "drop ins"
- Creation of new objects
- Installation or upgrade of add-ins (installed packages)
And there are probably seven other things I forgot to mention.
How do you know when one of these changes has occurred? Salesforce has a Setup Audit Trail that provides a lot of clues. Or, if you use some sort of configuration management tool on your systems, you can do a diff every week to see what's changed. Great.
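A minimal sketch of that weekly diff, assuming you can export the configuration into a flat mapping of setting names to values. The snapshot format, the setting names and the `changed_settings` helper are all hypothetical, made up for illustration. They are not part of any Salesforce API or of any particular configuration management tool.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel for "setting not present in this snapshot"


def changed_settings(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Return the sorted names of settings that were added, removed, or altered."""
    names = before.keys() | after.keys()
    return sorted(
        name for name in names
        if before.get(name, _MISSING) != after.get(name, _MISSING)
    )


# Hypothetical snapshots taken one week apart.
last_week = {
    "Lead.Status.picklist": ["Open", "Working", "Closed"],
    "Case.Priority.required": False,
    "Opportunity.validation.close_date": "CloseDate >= TODAY()",
}
this_week = {
    "Lead.Status.picklist": ["Open", "Working", "Nurture", "Closed"],
    "Case.Priority.required": True,
    "Opportunity.validation.close_date": "CloseDate >= TODAY()",
    "Account.Region__c.field": "new custom field",
}

changes = changed_settings(last_week, this_week)
print(len(changes), changes)
```

The count of changed settings is a rough stand-in for the number of individual configuration changes made since the last full test run.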
Here's my formula for the probability that a bug or anomalous behavior has crept into your cloud system somewhere, even if it hasn't been noticed yet:
Probability of bug = NC * # custom objects * # SysAdmins / 100
"NC" is the number of individual changes that have been made from the above list since the last time somebody ran all the test code. Of course, the probability is capped at one – and you reach that ceiling pretty quickly.
The above formula is just a wild approximation, but the point is that it won't take many weeks at all before some serious bugs have invisibly evolved.
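To see how quickly the ceiling arrives, here is the formula above as a small Python sketch. The function and parameter names are my own labels for the terms in the formula, and the sample numbers are invented for illustration.

```python
def bug_probability(num_changes: int, num_custom_objects: int, num_sysadmins: int) -> float:
    """NC * # custom objects * # SysAdmins / 100, capped at 1."""
    if min(num_changes, num_custom_objects, num_sysadmins) < 0:
        raise ValueError("counts must be non-negative")
    raw = num_changes * num_custom_objects * num_sysadmins / 100
    return min(1.0, raw)


# Invented example: 2 changes, 10 custom objects, 2 sys admins.
modest = bug_probability(2, 10, 2)    # 2 * 10 * 2 / 100 = 0.4

# One more change and twice the custom objects: 3 * 20 * 2 / 100 = 1.2, capped at 1.
capped = bug_probability(3, 20, 2)

print(modest, capped)
```

Even in a small org, a handful of changes is enough to hit the cap.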
The first line of defense, of course, is some sort of configuration control board that actually thinks through the consequences of any changes to the system metadata before they are made, then applies other changes to accommodate them. Fat chance, I know.
The next line of defense is to run all your unit tests every week and record the results. Salesforce.com can actually do this at the touch of a button. It's not painful at all. What will be painful, though, is the realization that someone, somewhere, has set up a bunch of sand-traps for your developers. Fix those errors as early as you can, so the work can be done in a relaxed and productive way. Stressed-out developers make more mistakes, driving up costs.
What's That 'Nothing' Code Change Really Going to Cost?
Salesforce requires your internal developers to pass all unit tests and cover at least 75 percent of the code as a precursor to deploying anything. While it's easy to scam the system and do only the most basic of unit tests, that turns out to be a false economy: You want to do positive and negative result testing, not just blind exercising of all the code paths.
Of course, the more thorough the test, the more likely that you'll find a bug introduced by the evolving system configuration. This is a good thing, because it traps the errors before the user finds them. Paying this tax early helps you avoid penalties later.
Here's a simple (and therefore inaccurate) model for the cost of that "nothing" code change:
Cost = (# outstanding execution error bugs * 200 + # outstanding unexpected results bugs * 400) * average age of bugs (# months) * # of development teams
The units, unfortunately, aren't yen or Turkish lira but, rather, dollars or euros.
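The cost model above works out as follows in a short Python sketch. The function and parameter names are labels I've chosen for the terms in the model, and the sample inputs are invented for illustration.

```python
def nothing_change_cost(
    execution_error_bugs: int,
    unexpected_results_bugs: int,
    avg_bug_age_months: float,
    dev_teams: int,
) -> float:
    """Cost, in dollars or euros, of a "nothing" code change per the model above."""
    per_month_per_team = execution_error_bugs * 200 + unexpected_results_bugs * 400
    return per_month_per_team * avg_bug_age_months * dev_teams


# Invented example: 3 execution errors, 2 results bugs, average age 2 months, 2 teams.
# (3 * 200 + 2 * 400) * 2 * 2 = 1400 * 4 = 5600
cost = nothing_change_cost(3, 2, 2, 2)
print(cost)
```

Because bug age and team count are multipliers, letting the same five bugs sit for six months instead of two would triple the figure.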
Why so expensive? For each bug you see in the initial run of testing, there are a couple more behind it, masked by tests that failed to complete. Further, a bug in somebody else's code may surface in yours, making the debugging chore fairly convoluted, particularly if that original developer is now gone.
Why are the "results" bugs more expensive than simple execution errors such as index-out-of-bounds? Results bugs often involve troubleshooting and fixing data problems as well as coding issues. The more teams you have to work with, the more expensive it all gets, thanks to coordination and finger-pointing. Older bugs cost more because the developers have had more time to forget, and more data has become corrupted by anomalous processing.
None of this should come as a surprise, and none of it is the fault of any particular product. Some products make things a bunch easier – but the root cause here is loosely coupled development and the inherently "centerless" model of cloud development and execution cycles.
The best practices come straight from the formulas above. Have fewer system admins, minimize the number of custom objects, have fewer independent developer teams, do full system tests weekly and repair the bugs as soon as you can.