Operating your business in the cloud is fundamentally different from operating on premises. And when operations differ, so too do strategies for containing costs.
Financially speaking, a datacenter requires a large capital expenditure for the building, additional capital expenditures for the servers and software licenses, and smaller but significant operating expenditures for powering the servers and cooling systems, and for maintenance and management.
In the cloud, there are no capital expenditures. Instead, there are significant operating expenditures, billed for server virtual machine instances, storage, network traffic, software licenses, and other niggling details.
From a cost management perspective, there are significant benefits in shifting computing load to the cloud — but there are also significant risks.
When someone wants a new server rack in your data center, there are purchase orders to approve and justifications to ponder, and the process is fully managed. It requires permission. It also takes 6 months at many companies. Once the rack has been installed, nobody pays attention to how heavily it is or isn’t used, unless its load is so heavy that it doesn’t perform well. Yes, that’s inefficient cost-wise — hence the push for VMs and containers (such as Docker) in your data center to increase server utilization.
If someone wants a new cluster of virtual servers in the cloud, it might take a few minutes to spin them up. While you might have a policy that requires management approval for new cloud resources or applies quotas to each department’s cloud resources, pretty much everybody with access to your cloud accounts can create what they want when they want, and ask for forgiveness later — if management even finds out.
Whether this freedom is good or bad depends on your point of view. From the perspectives of business agility and devops, it’s good. From the perspective of financial management, it can be good if done right, but otherwise it’s a potential disaster.
In this article, I’ll discuss how to avoid “cloud sticker shock.” I’ll start with individual technical tactics for optimizing cloud expenditures, and end with the topic of cloud spending management.
According to Michael Liebow, global managing director of Accenture Cloud, cloud services can lead to a “zombie apocalypse” — not human zombies, but zombie servers. Zombie servers have little or no utilization: They cost you money but don’t do much of anything.
Liebow and his colleagues also write elsewhere about orphans, which are services left over after the resources that used them have been deleted, and gluttons, which are oversized VMs. These three pathological conditions can easily inflate your cloud bill by 20 to 40 percent, if not managed properly.
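To make the zombie and glutton categories concrete, here is a minimal Python sketch that labels VMs from utilization figures. The `VmUsage` record, the sample fleet, and both thresholds are hypothetical values chosen for illustration; they are not Accenture's definitions. Orphans are left out because detecting them requires attachment data rather than utilization.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    name: str
    avg_cpu_pct: float   # average CPU utilization over the review window
    peak_cpu_pct: float  # highest CPU utilization over the review window


def classify(vm: VmUsage, zombie_avg: float = 2.0, glutton_peak: float = 30.0) -> str:
    """Label a VM using assumed thresholds (tune these for your own estate)."""
    if vm.avg_cpu_pct < zombie_avg:
        return "zombie"    # almost no work done: candidate for shutdown
    if vm.peak_cpu_pct < glutton_peak:
        return "glutton"   # never gets near capacity: candidate for downsizing
    return "ok"


# Hypothetical fleet, for illustration only.
fleet = [
    VmUsage("build-01", avg_cpu_pct=0.4, peak_cpu_pct=1.2),
    VmUsage("api-02", avg_cpu_pct=8.0, peak_cpu_pct=22.0),
    VmUsage("db-03", avg_cpu_pct=45.0, peak_cpu_pct=91.0),
]
report = {vm.name: classify(vm) for vm in fleet}
print(report)
```

In practice the utilization figures would come from your cloud provider's monitoring service, and CPU alone is a crude signal; memory and network usage matter too.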
Finding underutilized assets in the cloud in a timely manner isn’t easy or automatic. Bills from cloud providers only come on a monthly basis, and may contain more than a hundred million lines of charges for a large enterprise with a sizable cloud estate. If you wait until you get the bill to act, you may find steep charges for VMs and other services that have been idle for 30 days and should have been shut down or downsized long ago.
It’s even harder when you have to manage multiple clouds with multiple accounts each. The good news is that you can usually pull billing information from your cloud providers electronically on a daily basis; the bad news is that you may need to license or develop new tools to manage your cloud estate.
One way to reduce spending on cloud resources that you expect to use for one or more years is to pre-purchase your base capacity at a discount. Each cloud provider does this a little differently, and changes its billing policies periodically. Be warned: This is a confusing area, even when the provider claims to be transparent about pricing.
Amazon explains its pre-purchase plan as follows:
“Reserved Instances provide you with a significant discount (up to 75%) compared to On-Demand instance pricing. In addition, when Reserved Instances are assigned to a specific Availability Zone, they provide a capacity reservation, giving you additional confidence in your ability to launch instances when you need them.
“For applications that have steady state or predictable usage, Reserved Instances can provide significant savings compared to using On-Demand instances.”
Amazon recommends Reserved Instances for:
Applications with steady state usage
Applications that may require reserved capacity
Customers that can commit to using EC2 over a 1- or 3-year term to reduce their total computing costs
As a concrete example, consider a compute-optimized c4.8xlarge VM instance in the N. Virginia region running Linux, which costs $1.591 per hour on-demand and offers 36 virtual CPUs and 60GB of memory. If you reserve the instance for a year and pay entirely up front, your rate goes down to $0.947 per hour, a 40% savings. Do the same for a standard 3-year term, and the rate goes down to $0.621 per hour, a 61% savings. For a convertible 3-year term, which allows you more flexibility, the rate is $0.739 per hour, a 54% savings. Pay less up front, and the effective rate goes up a little, but the difference is roughly in line with the time value of money.
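AWS c4.8xlarge Linux (36 CPU, 60GB), effective hourly rates when paid entirely up front
On-demand: $1.591 per hour
Reserved Instance (1 year): $0.947 per hour (40% savings)
Reserved Instance (3 years): $0.621 per hour (61% savings)
Reserved Instance (3 years, convertible): $0.739 per hour (54% savings)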
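The savings percentages quoted for the c4.8xlarge follow directly from the hourly rates, and the same ratio tells you how busy an instance must be before a reservation pays off. The sketch below uses only the rates given in this article; the break-even reasoning assumes you pay the reserved rate for every hour of the term whether the instance runs or not, while on-demand charges accrue only for hours used.

```python
ON_DEMAND = 1.591  # $/hour, c4.8xlarge Linux, as quoted in the article

# Effective hourly rates with all-upfront payment, as quoted in the article.
RESERVED = {
    "1-year standard": 0.947,
    "3-year standard": 0.621,
    "3-year convertible": 0.739,
}


def savings_pct(reserved_rate: float, on_demand_rate: float) -> int:
    """Percentage saved versus on-demand, rounded to a whole percent."""
    return round((1 - reserved_rate / on_demand_rate) * 100)


def break_even_utilization(reserved_rate: float, on_demand_rate: float) -> float:
    """Fraction of the term the instance must run for the reservation to win."""
    return reserved_rate / on_demand_rate


savings = {plan: savings_pct(rate, ON_DEMAND) for plan, rate in RESERVED.items()}
break_even = {plan: break_even_utilization(rate, ON_DEMAND) for plan, rate in RESERVED.items()}

for plan in RESERVED:
    print(f"{plan}: {savings[plan]}% savings, breaks even at {break_even[plan]:.0%} utilization")
```

By this reckoning, a 1-year reservation only beats on-demand pricing if the instance would otherwise run about 60% of the time or more, which is why reservations suit base capacity rather than bursty loads.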
Customers have the flexibility to change the Availability Zone, the instance size, and networking type of Standard Reserved Instances. Convertible 3-year Reserved Instances provide additional flexibility, such as the ability to use different instance families, operating systems, or tenancies over the Reserved Instance term.
Azure has a similarly sized VM (fewer CPUs, more RAM) in its general-purpose D32-v3 instance, which offers 32 virtual CPUs and 128GB of memory and costs $1.60 per hour on demand. Azure doesn’t offer reserved instances as such: Instead, it offers an Enterprise agreement with an upfront monetary commitment that lowers the price, although the discount levels are not published.
Google offers an n1-standard-32 VM with 32 virtual CPUs and 120GB of memory for $1.52 per hour with a monthly sustained use discount. You don’t have to commit to extended use to get a sustained use discount: Instead, it is applied automatically to the incremental minutes over the 25%, 50%, and 75% usage levels.
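A sustained use discount of this kind works like a tiered rate: each additional quarter of the month is billed at a lower multiple of the base price. The following Python sketch shows the mechanism only. The tier multipliers, the base rate of $1.00 per hour, and the 100-hour "month" are assumptions chosen to keep the arithmetic easy to follow; they are not figures from this article and should be checked against Google's current price list.

```python
# (upper bound of usage tier as a fraction of the month, multiplier on base rate)
# The multipliers are illustrative assumptions, not published prices.
TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def sustained_use_cost(base_rate: float, hours_used: float, hours_in_month: float = 730.0) -> float:
    """Cost for the month when each usage tier is billed at its own multiplier."""
    usage = min(hours_used / hours_in_month, 1.0)
    cost = 0.0
    lower = 0.0
    for upper, multiplier in TIERS:
        if usage > lower:
            # Only the slice of usage that falls inside this tier gets this rate.
            portion = min(usage, upper) - lower
            cost += portion * hours_in_month * base_rate * multiplier
        lower = upper
    return cost


# Hypothetical $1.00/hour VM over a 100-hour "month" to keep the numbers readable.
full_month = sustained_use_cost(1.0, 100, hours_in_month=100)
half_month = sustained_use_cost(1.0, 50, hours_in_month=100)
print(full_month, half_month)
```

The point to notice is that the discount applies only to the incremental usage in each tier, so a VM that runs half the month gets a smaller overall discount than one that runs all month.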
Google also offers a committed use discount for VMs, which you can activate by purchasing commitment contracts for one or three years. Any resources that have committed use discounts applied do not qualify for sustained use discounts. With committed use discounts, VM prices can be up to 57% less expensive than regular VM prices. Discounts apply to the aggregate number of vCPUs or memory within a region so they are not affected by changes to your instance’s machine type. There are no upfront costs for committed use discounts. Committed use discounts are applied to your bill every month. The catch is that you are billed for your commitments whether or not you use them.
Spot and low-priority instances
Amazon EC2 Spot instances allow you to bid on spare Amazon EC2 computing capacity. Since Spot instances are often available at a discount compared to on-demand pricing, you can significantly reduce the cost of running your applications, grow your application’s compute capacity and throughput for the same budget, and enable new types of cloud computing applications.
Spot instances are run when your bid price exceeds the Spot price, and offer 50-90% discounts compared to on-demand instances. With Spot instances, you will never be charged more than the maximum price you specified. While your instance runs, you are charged the Spot price that is in effect for that period. If the Spot price exceeds your specified price, your instance will receive a two-minute notification before it is terminated, and you will not be charged for the partial hour that your instance has run.
If you include a duration requirement with your Spot instances request, your instance will continue to run until you choose to terminate it, or until the specified duration has ended; your instance will not be terminated due to changes in the Spot price. At the moment I checked, a Spot instance for a c4.8xlarge VM with Linux costs $0.3591 per hour in the N. Virginia region, compared to $1.591 per hour on-demand.
AWS c4.8xlarge Linux (36 CPU, 60GB): $0.3591 per hour Spot, $1.591 per hour on-demand
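The Spot rules described above can be sketched as a small simulation: the instance runs while the Spot price stays at or below your maximum, you pay the Spot price in effect for each hour, and the hour in which the price rises above your maximum is not charged. This is a simplified model that works in whole hours and ignores the two-minute notice; the price series and the $0.50 maximum are hypothetical.

```python
from typing import Iterable, Tuple


def simulate_spot(hourly_spot_prices: Iterable[float], max_price: float) -> Tuple[int, float]:
    """Return (hours_run, total_charge) for a Spot request with a maximum price.

    Simplified model: one price per hour; the instance is terminated in the first
    hour whose Spot price exceeds max_price, and that hour is not billed.
    """
    hours_run = 0
    total_charge = 0.0
    for spot_price in hourly_spot_prices:
        if spot_price > max_price:
            break  # outbid: instance terminated, partial hour not charged
        total_charge += spot_price  # billed at the Spot price, not at max_price
        hours_run += 1
    return hours_run, total_charge


# Hypothetical hourly Spot prices in $/hour.
prices = [0.36, 0.38, 0.41, 0.75, 0.40]
hours_run, total_charge = simulate_spot(prices, max_price=0.50)
print(hours_run, round(total_charge, 2))
```

Note that the instance does not restart automatically in this model once the price drops back, which is why Spot capacity suits interruptible work such as batch jobs.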
Azure calls its equivalent of AWS Spot instances “low priority.” When I checked, a low-priority D32-v3 instance in the Eastern region cost $0.345 per hour, versus $1.60 per hour on demand. I wasn’t able to select this option in my account, however.
Azure D32-v3 instance: $0.345 per hour low-priority, $1.60 per hour on demand
Google’s equivalent of Spot instances is called “preemptible instances.” A preemptible VM is an instance that you can create and run at a much lower price than normal instances. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity, so their availability varies with usage. A preemptible n1-standard-32 instance in the Northern Virginia region currently costs $0.3424 per hour, versus the full price of $1.712 per hour and the sustained use price of $1.52 per hour. According to Google, the preemption rate typically varies in the range of 5% to 15% per seven days per project.
Google Cloud n1-standard-32: $0.3424 per hour preemptible, $1.52 per hour with sustained use, $1.712 per hour full price
Underutilized servers and limited space for new racks led enterprises to turn some dedicated servers in their data centers into hosts for VMs. Then, when memory utilization in VM hosts became an issue, they turned some of their VMs into hosts for containers.
The basic difference between virtualizing with VMs and virtualizing with containers is that, in addition to the application software, a VM contains a full operating system and a full set of virtualized hardware, while a container contains only parts of an operating system, some libraries, and the application software. Both VMs and containers offer some isolation from other applications; VMs offer more isolation and better security, albeit at a high cost in memory usage.
RAM is one of the most expensive resources to lease in the cloud, and containers typically need only one-third of the RAM to run the same software as a VM. This makes running your cloud estate in containers an attractive cost proposition, as long as the reduced isolation isn’t a problem.
Until fairly recently, container use was limited to Linux-based systems, and orchestration, tool support, and instrumentation for containers were lacking. None of that is really a problem anymore, and moving your loads to containers in the cloud is a good way to streamline your operations and reduce your cloud expenditure.
At AWS, there is no extra charge for running containers; you pay only for the underlying VMs and storage. Azure lets you create and use containers directly from a pool, and charges $0.0025 per instance created, plus $0.0000125 per GB-second and $0.0000125 per core-second. For example, if you run three containers simultaneously for a month, and each container uses 1GB of memory and 2 cores, you’ll pay less than $300 per month for them (about $292 for a 30-day month).
Container size: 1GB, 2 cores
Cost per GB-second: 1 x $0.0000125
Cost per core-second: 2 x $0.0000125
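The Azure container arithmetic can be checked with a few lines of Python. The rates are the ones quoted in this article; the 30-day month and the function name are assumptions made for the sketch.

```python
CREATE_FEE = 0.0025          # $ per container instance created (from the article)
PER_GB_SECOND = 0.0000125    # $ per GB of memory per second (from the article)
PER_CORE_SECOND = 0.0000125  # $ per core per second (from the article)

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def container_monthly_cost(memory_gb: float, cores: float,
                           seconds: int = SECONDS_PER_MONTH) -> float:
    """Cost of one container that is created once and runs for `seconds`."""
    per_second = memory_gb * PER_GB_SECOND + cores * PER_CORE_SECOND
    return CREATE_FEE + seconds * per_second


# Three containers, each with 1GB of memory and 2 cores, running all month.
total = 3 * container_monthly_cost(memory_gb=1, cores=2)
print(f"${total:.2f} per month")
```

Because billing is per second, the same function shows how quickly the cost falls if the containers run only during working hours: pass a smaller `seconds` value.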
Google Container Engine runs clusters of container nodes under Kubernetes, on top of Compute Engine VM instances; you pay for the VMs. There is a small charge for Kubernetes management, $0.15 per hour for clusters of six or more nodes.
Serverless cloud computing, or more accurately Functions as a Service, has the potential to drastically decrease the cost and effort involved in putting loads into the cloud. AWS Lambda, Bluemix OpenWhisk, Google Cloud Functions, and Azure Functions all offer a model where the developer defines a function to run on demand, creates triggers for the function, and sets a memory allotment for the function. The cloud infrastructure takes care of allocating a container for the function whenever it needs to run, so the developer doesn’t have to worry about capacity or scalability.
Serverless costs are typically based on the number of triggers (often a negligible charge), execution time, and the amount of memory used. Runtime for a single function invocation is limited to 5 to 10 minutes, depending on the platform, but sub-second runtimes are more common. Most platforms also limit the number of functions that can run simultaneously, and have a bundled free capacity for functions available each month.
As a rule of thumb, using serverless functions is cheaper than running a small VM if the aggregate function execution time is less than half a million seconds per month, or roughly 20% of the month. The numbers vary somewhat by provider, and whether you compare functions to full-priced or discounted VM instances.
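The rule of thumb comes from a simple break-even calculation: divide the monthly cost of the VM by the per-second cost of the function. The sketch below uses hypothetical prices (a small VM at $8.50 per month and functions at $0.000017 per GB-second with a 1GB allotment); these are illustrative assumptions, not quotes from any provider. It also ignores per-request charges and any free monthly allowance.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def break_even_seconds(vm_monthly_cost: float, price_per_gb_second: float,
                       memory_gb: float) -> float:
    """Aggregate execution time at which functions cost the same as the VM."""
    return vm_monthly_cost / (price_per_gb_second * memory_gb)


# Hypothetical prices, for illustration only.
seconds = break_even_seconds(vm_monthly_cost=8.50,
                             price_per_gb_second=0.000017,
                             memory_gb=1.0)
fraction_of_month = seconds / SECONDS_PER_MONTH
print(f"{seconds:,.0f} seconds, or {fraction_of_month:.0%} of the month")
```

Halving the memory allotment doubles the break-even time, so right-sizing function memory matters as much as right-sizing VMs.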
When doing your cost analysis, you also need to include developer and operations time, which is usually lower for serverless functions than for VMs because more of the administration has been pushed to the cloud provider. Factoring in the development and operations costs, using serverless functions can be cheaper than running a small VM even if the aggregate function execution utilization is 75% per month.
One downside of using serverless functions is the complexity of billing. If you host your functions in a VM, that VM generates one billing line per month no matter how many times the functions are called. If your serverless function is called 4 million times a month, there will be 4 million events in the billing log.
Cloud spending management
Given the complexity of cloud pricing, and the differences from data center management, many companies will need to adopt new tools for IT spending management in the cloud. An Internet search for “cloud spending management,” “cloud cost management,” or “cloud management platform” will turn up at least half a dozen viable possibilities, along with a bunch of irrelevant results. While you may be able to manage your use of a single cloud platform with its native facilities for resource tagging and reporting, it’s hard to stay on top of usage and costs if you use two or more cloud providers.
No matter what tool you use, a major key to managing your cloud spending is to tag your resources. If you can look at a billing line item and see immediately that it’s for development and test related to the Fizzpop product, you’re in far better shape than if all you know is the serial number of the VM. If that tagging carries into your reporting and management tools, so that you can zoom in on a “zombie” in a graph and see its intended purpose and group affiliation, you’re in a position to close the loop and actually do something about the unutilized server VM.
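As a rough illustration of why tags matter, the following Python sketch rolls up billing line items by tag and surfaces spending that nobody has claimed. The record layout, resource names, tag keys, and costs are hypothetical; real billing exports differ by provider.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical billing line items; real exports have many more fields.
line_items = [
    {"resource": "vm-1a2b", "cost": 412.50, "tags": {"product": "Fizzpop", "env": "dev-test"}},
    {"resource": "vm-3c4d", "cost": 128.00, "tags": {"product": "Fizzpop", "env": "prod"}},
    {"resource": "vm-5e6f", "cost": 97.25, "tags": {}},
]


def cost_by_tag(items: List[dict], key: str) -> Dict[str, float]:
    """Total cost per value of the given tag key; missing tags go to 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in items:
        totals[item["tags"].get(key, "untagged")] += item["cost"]
    return dict(totals)


by_product = cost_by_tag(line_items, "product")
by_env = cost_by_tag(line_items, "env")
print(by_product)
print(by_env)
```

The "untagged" bucket is often the most useful line in such a report, because it is where forgotten resources tend to accumulate.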
Cloud cost management isn’t easy, and it can’t really be done with the ITIL processes and tools that most IT organizations have in place for their data centers. To manage your cloud estate effectively, you need to monitor your costs on a daily basis and intervene as necessary by shutting down resources, downsizing them, or putting them on a schedule instead of running them all the time.
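To see what putting a resource on a schedule is worth, consider the fraction of the week it no longer runs. This minimal Python sketch assumes compute is billed only while the VM is running; attached storage typically keeps billing while a VM is stopped, so the real savings will be somewhat lower.

```python
HOURS_PER_WEEK = 7 * 24


def schedule_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost saved by running only on a schedule."""
    scheduled_hours = hours_per_day * days_per_week
    return 1 - scheduled_hours / HOURS_PER_WEEK


# A dev/test VM that runs 12 hours a day on weekdays only.
saved = schedule_savings(hours_per_day=12, days_per_week=5)
print(f"{saved:.0%} of compute cost saved")
```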