Riddle me this: All your core performance metrics are glowing green, but customers on the other end of the network are still cursing your service. How can IT get to the root of the problem? Network World Editor in Chief John Dix put the question to two experts: Tony Davis, vice president of solution strategy at CA Technologies, who once faced that challenge working for FedEx on the company's website and now shares what he learned with CA customers, and Jimmy Cunningham, senior manager of tech support enterprise monitoring at BlueCross BlueShield of South Carolina.
We're here to talk about improving customer experience so Tony, why don't we start with you because you talk a lot about the concept of Business Service Reliability. Explain what that's all about.
DAVIS: Business Service Reliability is a top-down approach to how IT should operate compared to traditional models. It is an actual execution strategy that emphasizes the reliable creation, production care and feeding, and mathematical measurement of a business service. It then takes that business service and translates it into real-time customer experience. We provide solutions to automate this transformation strategy, but the secret sauce is in the methodology used to implement the strategy and the automation and measurements used to verify success.
How did you get involved with BlueCross?
DAVIS: The CA Account Executive for BlueCross BlueShield South Carolina had heard about the concepts of Business Service Reliability and stepped up to provide the program at no charge as a way to ensure long-term success with the investments they had made in CA solutions. I first met with the executive in charge of the enterprise strategy for reliability, and he introduced me to Jimmy, who was leading the initiative, and we have been partnering ever since.
So Jimmy, when Tony plugged in you were already neck deep in an effort to improve customer experience by getting a better handle on the performance of core services. Explain that effort.
CUNNINGHAM: Five years ago we didn't have a unified approach to monitoring system performance or system availability. Our CIO, Steve Wiggins, used to say we were flying blind in that we would deploy our applications and often find out from customers that we had pieces and parts that were broken. He was tired of customers telling us we had something wrong before we knew it was wrong, so he decided to form the Enterprise Monitoring System (EMS) group to rectify that, pulling in people from around the company. One of the first things our group did was take control of the main monitoring tools that existed inside BlueCross. Prior to that, every group would buy and deploy tools as it saw fit, meaning two groups could end up buying and deploying similar tools independently.
We were tasked with consolidating the tools and, at the same time, we invested in CA's Customer Experience Manager (CEM) and Introscope, both of which are now part of the company's Application Performance Management product, to augment and replace some existing tools. CEM does customer experience monitoring and allows us to monitor HTTP(S) traffic and see the elapsed time and experience our customers have as they use our websites and desktops. Introscope is a Java deep-dive analysis tool that works in tandem with CEM to provide detailed metrics on the services and programs supporting our applications and desktops. We also use Introscope to monitor MQ, MQ Broker, DB2 calls to the host, and more. When we had all of that in place our group built standards around the tools and worked with various levels of management to figure out how to deploy the tools in a holistic fashion to help monitor applications.
What was the stated goal?
CUNNINGHAM: We had two main goals. One was to improve our mean time to resolution (MTTR). We wanted to be able to find problems within our applications or our infrastructure and already be working on them when the customer called, so we could tell the customer we were on it and give an estimated completion time. And whether we were minutes ahead of the customer call or hours ahead, the important thing was being ahead.
Then at the same time, we planned to marry the monitoring group with capacity planning so we could improve our mean time between failures (MTBF). As the monitoring team gathered data and fed it to the capacity team, we could start doing predictive troubleshooting, by saying, "OK, you're starting to have a problem here. If you address it now you might not fail." And that way we could increase the time between application failures.
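Both measures have standard definitions. As a minimal, illustrative sketch (the incident log below is hypothetical, not BlueCross's actual data), MTTR and MTBF can be computed from a list of failure-start and service-restored timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure_start, service_restored) pairs.
incidents = [
    (datetime(2013, 3, 1, 9, 0),  datetime(2013, 3, 1, 9, 45)),
    (datetime(2013, 3, 8, 14, 0), datetime(2013, 3, 8, 16, 30)),
    (datetime(2013, 3, 20, 2, 0), datetime(2013, 3, 20, 2, 20)),
]

# MTTR: average time from failure to resolution.
repair_times = [end - start for start, end in incidents]
mttr = sum(repair_times, timedelta()) / len(repair_times)

# MTBF: average uptime between the end of one incident and the
# start of the next.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}")
print(f"MTBF: {mtbf}")
```

Improving MTTR shrinks the first average; improving MTBF grows the second.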
When you started to dig in and pursue that first goal to identify problems early, did you find about what you expected to find?
CUNNINGHAM: One of the things we discovered was that groups with their own monitoring tools would do what we call focused monitoring. For example, Server A would have a problem so they'd put a monitor on Server A, but Server B wasn't having the problem so they didn't put it on B. They focused the monitoring where something had occurred to try to stop it from occurring again.
And when we came in, one of the things we said was, "If you're going to have it on A, you might as well have it on B. They're mirrors of each other. If it's going to happen to A, it could happen to B, so let's get the big picture."
At BlueCross BlueShield of South Carolina we have a fairly healthy virtual machine environment. We're one of the biggest zLinux [Linux compiled to run on IBM mainframes] shops in the world (top 1% according to IBM), so we have tons of guests running on mainframes and all of our data lives on the mainframe. Anything that starts off on a webpage or a desktop has to go to the host to get its data. So we cross a lot of infrastructure.
And if that's a 10-step process, what we found was people had deployed monitors on three of the steps, and the other seven were, "Well, they work so we don't really need to know about them right now." We came in and said, "OK, tell us the 10 important steps and we'll watch all 10. That way we'll let you know as soon as something happens." Again, my group builds automated monitors; we're not actually sitting at our desks watching things. We're building the automated monitoring solutions to feed our support people.
How many people are in the group and where did they come from?
CUNNINGHAM: We have 10 in Monitoring and five in Capacity Planning. We got a couple of people from the infrastructure group and we got some of our team from what we call LCAS (Leveraged Core Application Systems), and they're the ones that code and maintain the apps on the non-host part of the environment. And we got a couple of people from the host side. We were trying to get some experience from the different silos so everybody would be represented.
Was it clear how you were going to reach your goals?
CUNNINGHAM: It was a staged process. The first stage was to get visibility into our big apps. Our CIO went through our app list and said, "This is your A priority, this is your B priority, this is your C priority." And we had just gotten CEM in, so one of the first things we did was start instrumenting CEM to watch the A priority apps as they came into our system. So CEM started to give us visibility into those apps. Then we started to work with our customers to say, "OK, what do you want to know about this app?" That was stage one. That took about a year.
Stage two was happening in the background as we were developing our standards for holistic monitoring. We went to internal IS customers, and asked, "Tell me about your app, tell me the important pieces. Tell me where they live, tell me how you use them, so we can deploy our entire toolset and watch your app as holistically as possible."
We started monitoring the heck out of stuff, generating thousands of tickets that went to the support areas. We went from flying blind to flying in a snowstorm. That was our "let's monitor everything" phase. We were ticketing everything. I mean, anytime the system hiccupped, we'd ticket. So if you could picture it, we had a little ball of monitoring, it mushroomed up huge, and then we settled back down to somewhere in between where we can say, "Now we're monitoring your important pieces. We know which domino is the main one and if it falls something has happened," and we are continuing to refine that process today.
Tony, is that common when customers add a lot of instrumentation, they initially get buried?
DAVIS: I would say so. And that's sort of my mission. Instead of going down a path that puts you into a snow-blind situation, what if we design your monitoring around the core business services? That inherently cuts out some of that noise.
So Jimmy, how long did it take to tone down the noise level so you could actually make some progress on the important stuff?
CUNNINGHAM: The first year we were deploying monitors like crazy, so we were constantly adding to the snowstorm. We were killing our support people, and we realized we couldn't keep that up. So we brought together our top app development managers and our app support guys and a couple of infrastructure support guys and said, "OK, how do we make this better? What do we do to give you on-time, relevant information that helps you put your finger on problems and send your guys off to the right spot to fix it?" And we started refining our overall process of how we gathered requirements.
How did you achieve it?
CUNNINGHAM: We went to the systems experts and the top support guys for each app, and between the three of us, figured out how to refine the requirements gathering process so the monitoring data and output allows them to jump in there and fix something before it actually stops working, or as quickly as possible after it stops. The EMS team works with those two areas to set what specifically should be monitored and the monitoring thresholds.
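One common way to encode that kind of refined threshold, sketched here purely as an illustration (the rule and its parameters are assumptions, not the EMS team's actual configuration), is to open a ticket only when a metric breaches its threshold for several consecutive samples, so a single hiccup doesn't generate a ticket:

```python
# Illustrative noise-reduction rule: ticket only on a sustained
# breach, not a one-sample spike.

def should_ticket(samples: list, threshold: float, consecutive: int = 3) -> bool:
    """True if the last `consecutive` samples all exceed the threshold."""
    if len(samples) < consecutive:
        return False
    return all(s > threshold for s in samples[-consecutive:])

# One spike in otherwise-normal response times: no ticket.
print(should_ticket([120, 950, 130], threshold=800))  # False
# Sustained breach: ticket.
print(should_ticket([850, 900, 950], threshold=800))  # True
```

Tuning `consecutive` and `threshold` per app is exactly the kind of requirements conversation described above.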
Is everything fully instrumented at this point?
CUNNINGHAM: No, never is. We're constantly modifying and growing as apps are developed. Five years ago we broke our list down into A, B and C, and we're in the B's now.
How far along are you in integrating the monitoring systems? Is that done?
CUNNINGHAM: Yes. We originally had CA's CEM and Introscope and we upgraded to CA APM to ensure the performance and availability of business-critical applications, transactions and services, as well as the end-user experience of customers that access our online services. At the same time, we bought CA Cross-Enterprise APM to gain 24/7 monitoring of business transactions on the mainframe. By providing CA APM with this data on a single pane of glass, we now have true cross-platform APM monitoring, and that was the final piece of the puzzle that stitched everything together. Because prior to that, we had a lot of tools in the non-host world, and once you hit the host, monitoring sort of disappeared. You threw it over to the host and you knew stuff happened on the host and you knew you got an answer back, but that was about it. The host itself is well monitored, but there was no integration between what was happening there and what was happening in the non-host world. So getting that cross-platform APM monitoring set up gives us that bridge, and that's been huge for us.
Tony talks a lot about user experience. Do you look at it from that end as well?
CUNNINGHAM: Absolutely. Internally people were saying, "Nobody's complaining." And our CIO would come back with, "Well, just because they are not complaining doesn't mean everything is great; we have to find out what their experience is like, and measure it so we can figure out how to improve it." And that was one of the things he wanted to fix. He wanted to know what their experience was, so we could make the experience better. So when they do call and complain, we know it's a legitimate complaint, because generally they're happy with us.
What goes into that calculation of user experience?
CUNNINGHAM: Primarily it's response time. From the time they click until they get their answer back, what was that time frame? But we also divide our tickets into three categories: availability tickets -- how much were we available? Reliability tickets -- was the whole app available and responsive? And capacity tickets. So in generating those tickets and gathering the metrics, we can determine how available we were, how reliable we were, and whether we had enough capacity to meet demand.
Capacity performance is a key metric because BlueCross is a low-cost claims processor so we don't have a whole lot of extra MIPS and I/O lying around. We try to run as lean as possible and we're always asking, "Are we meeting our requirements without having a whole lot of resources just sitting around idling?"
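To make the three ticket categories concrete, here is a minimal hypothetical rollup (the ticket fields and numbers are invented for illustration, not BlueCross's actual schema): availability tickets feed a straightforward uptime percentage, while reliability and capacity tickets are tallied separately.

```python
# Hypothetical monthly rollup of ticketed outage minutes.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def availability_pct(outage_minutes: float, total_minutes: int = MINUTES_PER_MONTH) -> float:
    """Percentage of the period the service was fully up."""
    return 100.0 * (total_minutes - outage_minutes) / total_minutes

# Example tickets: (category, minutes of impact)
tickets = [
    ("availability", 25),  # service completely down
    ("reliability", 40),   # service up but slow or partially failing
    ("capacity", 10),      # demand exceeded provisioned resources
]

# Only full outages count against availability; reliability and
# capacity impact are tracked as separate measures.
down = sum(m for cat, m in tickets if cat == "availability")
print(f"Availability: {availability_pct(down):.3f}%")
```

Running lean, as described above, means watching the capacity bucket closely even when availability looks fine.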
Tony, coming back to you -- given you work with companies in different industries, can you compare and contrast what Jimmy is talking about here to what you're seeing in other shops?