What happens when a telephone company loses its customer billing data, or when an online brokerage firm’s performance degrades so badly that stock-trading transactions are delayed? What happens when a retail store’s point-of-sale system goes down, or when customers looking for bank loans online find that the bank’s system is down?
In today’s Internet economy there is little loyalty, and customers will readily defect if their needs for high availability, reliability and performance are not met. When mainframe data is lost or corrupted, or cannot be accessed quickly, the consequences can be catastrophic. People typically associate “threats to business continuity” with disasters such as hurricanes, floods or terrorist attacks. These calamitous events can wipe out entire data centers, severely disrupting business.
But other types of events, such as human errors, application errors, and delays in reacting to changing conditions in the mainframe environment can also disastrously disrupt business continuity. Human errors can wipe out critical data. Application errors can stuff erroneous data into business-critical databases. A delay in responding to a change, such as a spike in workload, can drag down the performance of a business-critical application. According to leading industry analysts, as much as 80 percent of all unplanned downtime is caused by software problems or human error. Companies must be prepared to deal with these types of events.
Gone are the days when the mainframe was monolithic, walled off from the outside world and accessed only by a small number of skilled IT personnel. Today, the mainframe is a vital and tightly integrated resource in the enterprise IT infrastructure. As part of a complex, multitiered, services-oriented architecture, the mainframe must interoperate with a variety of other resources. For example, a single SAP landscape can include mainframes, multiple servers and hundreds or thousands of database tables.
Additionally, the mainframe must meet today’s demand for 24/7 operation. Maintenance windows have virtually disappeared, forcing operations staff to perform maintenance tasks (such as deploying bug fixes and upgrading or adding applications) while the system is operational, or within the very limited downtime windows available for tasks that cannot be done while systems are up and running.
What’s more, today’s Internet environment has opened up the mainframe to thousands, even millions, of outside users—employees, business partners and customers. As a vital component of business processes, the mainframe must participate in business-to-business transactions with systems outside the enterprise, such as those in supply and distribution chains.
The resulting complexity has increased the potential for human error and for errant code in applications. It increases the likelihood that people will make operational errors that can cause data loss. Complexity also increases the probability of coding errors in applications, which can result in the contamination of business-critical databases. For example, an administrator updating a database makes a single keystroke error in a batch update job, causing the update to run with the wrong input data set and contaminating critical business data.
The applications that rely on this data continue to operate with it, and the error may go undetected until end users report it. Bank customers report that erroneous transactions have been made to their accounts. Truck drivers for a parcel delivery company report that the dispatching system has sent them to incorrect locations. Complexity also makes it far more difficult for the operations staff to react quickly to changing conditions in the IT environment, which can result in performance degradation.
Traditional Methods Come Up Short
Traditional data and IT component protection methods, such as backup, recovery and data mirroring, are not sufficient by themselves to ensure business continuity in the event of human or application errors. Consider, for example, the case of a critical database contaminated by an application that makes an extraneous entry into the database each time a specific type of transaction occurs. Traditional methods simply back up the contaminated data, perpetuating the problem as the application continues to operate with bad data.
Moreover, the use of traditional manual system-management methods is no longer viable. Complexity has increased to a level well beyond the capabilities of even the most skilled IT professionals, creating a high risk of error. For example, when attempted manually, the orderly restart of complex, interrelated systems involves dozens of time-sensitive procedures. A slipup in any one of them can result in cascading system outages that require time-intensive recovery. In fact, one of the major sources of human error is the use of system-management processes that rely on manual procedures.
Manual processes also can increase reaction times to changes in the IT environment, with the potential to cause a performance slowdown. And they are difficult if not impossible to audit, exposing the organization to the risk of regulatory noncompliance.
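The restart problem described above is, at bottom, a dependency-ordering problem, which is exactly the kind of task automation handles reliably. The sketch below is a minimal illustration, not any vendor’s implementation; the system names and dependencies are invented for the example. It computes a restart order that is guaranteed to respect every dependency, something that is error-prone when tracked by hand:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems that
# must already be running before it can be restarted.
DEPENDENCIES = {
    "db2": set(),                   # database subsystem comes up first
    "mq": {"db2"},                  # messaging needs the database
    "cics": {"db2"},                # transaction server needs the database
    "billing_app": {"cics", "mq"},  # application tier restarts last
}

def restart_sequence(deps):
    """Return a restart order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

print(restart_sequence(DEPENDENCIES))
```

An automation routine built on this idea never forgets a prerequisite and never varies its sequence from one restart to the next, which is precisely the repeatability the manual process lacks.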
Intelligent Automation Provides a Solution
To ensure business continuity, enterprises must augment their existing disaster recovery mechanisms and traditional manual system-management processes with more complete coverage. Intelligent automation provides the answer. This practice consists of software-driven routines that automatically perform IT service management functions and make decisions based on business impact and business policy.
Intelligent automation brings with it several major advantages. It masks the complexity of the IT infrastructure and helps ensure that IT system-management processes are performed in a repeatable, consistent and timely fashion using best practices. The result is a dramatic reduction in the risk of error, along with an auditable trail to ensure regulatory compliance.
Intelligent automation brings a wealth of other benefits, too, including:
- Minimizing failed changes to IT infrastructure. Intelligent automation helps ensure consistency of the change process, resulting in a dramatic reduction in the number of change failures and preventing many unnecessary outages. Intelligent automation also increases the speed and efficiency of the change process, freeing valuable time for highly skilled IT staff. Also, if a change does not produce the expected result, the solution can back out the change. It can also maintain an audit trail of the process to support regulatory compliance.
- Maintaining service quality as conditions change. Intelligent automation enables staff to quickly address problems, eliminating the “think time” associated with manual problem diagnosis and resolution. It leverages system-monitoring tools, keeping a close watch on the environment and responding automatically and immediately to out-of-threshold conditions. Response is intelligent and based on policy and business impact. With intelligent automation, the IT staff can also move proactively to head off problems before system outages occur or performance degrades.
- Detecting and eliminating contaminated data. Traditional approaches to prevent data loss, such as backup, recovery and data mirroring, do not ensure business continuity in the case of contaminated data. They simply create copies of the contaminated data, perpetuating the problem. Although applications remain in operation, they are now operating with bad data. Intelligent automation permits IT staff to quickly identify and back out the contaminated data from the database, and exclude it from the backup process.
- Preserving skills of experienced mainframe personnel. Intelligent automation eliminates the mundane, repetitive day-to-day tasks that soak up much of the mainframe operations staff’s valuable time. It provides automated, preprogrammed responses to problems to ensure successful resolutions without requiring IT staff intervention. In addition, intelligent automation preserves the knowledge of mainframe specialists by encapsulating it in software-driven automation routines.
- Ensuring successful and speedy recovery from disasters. Intelligent automation can augment in-place disaster recovery mechanisms to automatically synchronize the recovery process with the correct sequence. By doing so, intelligent automation speeds the recovery process, and it ensures success by enforcing the correct recovery sequence. Intelligent automation also can base the recovery sequence on business priorities, recovering the most business-critical systems first.
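The last point, recovering the most business-critical systems first, can be sketched in a few lines. This is an illustrative toy, not a product feature; the service names and business-impact scores below are invented:

```python
# Hypothetical recovery policy: each service carries a business-impact
# score, and the automation recovers the highest-impact services first.
SERVICES = [
    {"name": "payroll_batch", "impact": 3},
    {"name": "online_banking", "impact": 10},
    {"name": "internal_wiki", "impact": 1},
    {"name": "order_entry", "impact": 8},
]

def recovery_order(services):
    """Sort services so the most business-critical recover first."""
    return [s["name"] for s in sorted(services, key=lambda s: -s["impact"])]

print(recovery_order(SERVICES))
# → ['online_banking', 'order_entry', 'payroll_batch', 'internal_wiki']
```

In practice the impact scores would come from the business policy and service model described below, rather than being hard-coded, but the principle is the same: the recovery sequence is computed from business priority, not from whichever system an operator happens to reach first.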
Part of a Broader Solution
It’s important to implement a business continuity solution that addresses the entire IT infrastructure, including distributed components, and not just the mainframe. Generally, this can be accomplished in the context of a broader business service management (BSM) strategy.
A BSM strategy is based on enabling enterprises to align IT with important business goals. BSM solutions permit IT to make decisions, including intelligent automation decisions, based on their business impact. By implementing a business continuity solution in the context of the broader BSM strategy, IT can maximize its contribution to business value. Doing so requires a solution that can associate IT infrastructure components with the business services they support—for example, by indicating which servers and databases support which applications.
The right solution can present a consolidated view of the entire IT infrastructure that shows all deployed assets (hardware, software and network components), their locations, configurations, their associated users (employees, business partners and customers), and their physical and logical interrelationships. This consolidated view not only masks the complexity of the infrastructure but also helps the staff make more intelligent decisions related to business continuity.
Moreover, business continuity involves several BSM disciplines, including change and configuration management, service impact and event management, incident and problem management, and infrastructure and application management. As a result, it’s vital that the business continuity solution operate in concert with solutions that support these various disciplines.
By implementing intelligent automation across the IT infrastructure, based on a sound BSM strategy, companies can move their IT mind-set beyond the simple, reactive disaster recovery plans of yesteryear and into a more dynamic model of proactive engagement with IT performance and stability. The result, of course, is happier customers, and a healthier business overall.
Ralph Crosby is the chief technology officer for the Mainframe Service Management Business Unit at BMC Software. He is responsible for setting the strategic technology direction for the entire portfolio of IBM Mainframe products.