by Sharon Florentine

Why you need a systems reliability engineer

Apr 26, 2017
Careers, Developer, Enterprise Applications

How can you make sure the software your company builds today will stand the test of time? Hire an SRE.

How can you ensure that the software and services you build today can deliver what your customers and consumers need in the future? If this is a question you think your organization should be asking, then you might need a systems reliability engineer (SRE). SREs are software engineers who focus on the reliability and uptime of applications and services, not just in the short term but also with an eye to scalability and long-term use.

Sometimes referred to as a “site reliability engineer” or “services reliability engineer,” this engineering role is one that’s finding its footing as DevOps practices take hold in IT departments, says Jason Hand, DevOps evangelist and incident and alerting specialist with VictorOps. The role is most prevalent in cloud services, SaaS, PaaS and IaaS companies whose clients rely on them to keep those services available 24/7/365, he says. For organizations that depend on uptime, availability and reliability, an SRE is a logical talent add, as every minute of downtime chips away at the bottom line, Hand says.

What is an SRE?

What does an SRE do? They’re something of a developer-sysadmin hybrid — concerned both with development and coding and with the seamless operation of software and applications.

“For us, we already have teams of engineers building the back-end stuff as well as the front-end stuff, but we also have incorporated SREs who work with these teams throughout the software development pipeline and software development lifecycle to make sure the dev teams have at least one eye on scalability; on the future. That’s especially important during design and initial development so that our solutions can handle loads now, yes, but also six months, a year, two years from now. We don’t want to have to yank out parts of the codebase to rewrite, so we have to make sure we are anticipating future needs now,” Hand says.

The SRE acts as a middleman and a diplomat, balancing the needs of the development teams — which want to create, test and release new products, features, updates and fixes as quickly as possible — and the needs of the business stakeholders — who want to make sure the products and services work flawlessly and can handle increasing customer demands and requirements, says Hand.

“It’s a role and a school of thought that has evolved out of the DevOps perspective to be a combination of these two disparate groups. There’s often issues aligning the needs of these two types of roles, even though they want the same things: applications, software and services that are easy to maintain and deliver great availability, reliability and scalability,” Hand says.

As Patrick Hill, a site reliability engineer for Atlassian, explains in a blog post, SREs mediate the age-old power struggle between developers and operations teams by removing “the debate over what can be launched and when.”

“The underlying problem goes like this: Dev teams want to release awesome new features to the masses, and see them take off in a big way. Ops teams want to make sure those features don’t break things. Historically, that’s caused a big power struggle, with Ops trying to put the brakes on as many releases as possible, and Dev looking for clever new ways to sneak around the processes that hold them back,” Hill says.

With an SRE mindset and dedicated engineers, these problems can be all but eliminated, because development teams and operations staff agree in advance on an error threshold that must be met before a product can be launched, Hill explains.
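One common way to express such an agreed threshold is an error budget: the share of failed requests a service is allowed in a given window before launches are paused. The sketch below is only an illustration under that assumption; the class name, field names and the 99.9 percent target are hypothetical and do not come from Hill or Atlassian.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical launch gate based on an agreed reliability target."""

    slo_target: float     # agreed success rate, e.g. 0.999 for 99.9 percent
    total_requests: int   # requests served in the measurement window
    failed_requests: int  # requests that failed in the same window

    def allowed_failures(self) -> float:
        # Failures the agreed target tolerates for this traffic volume.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures() - self.failed_requests

    def can_launch(self) -> bool:
        # Dev may ship while budget remains; otherwise launches pause.
        return self.remaining() > 0


# Assumed numbers for illustration only.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print("Launch allowed:", budget.can_launch())
```

Because both teams sign off on the target ahead of time, the launch decision becomes a lookup rather than a negotiation.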

SRE as a movement and SREs as roles fit best into mature organizations and larger enterprises, says Hand; in smaller organizations, there’s a lot of overlap between what traditional software developers do and the responsibilities of infrastructure engineers, sysadmins and other operations titles.

What’s in a name?

“It’s not so much about the title; they could be called DevOps engineers, sysadmins, systems reliability engineers, site reliability engineers. It’s more about what they’re doing in that role. In smaller enterprises, everyone has to wear a lot of hats, and yes, it should be everyone’s responsibility to ensure that code is robust and reliable and scalable across the board, but in some cases the workload is just too great and you’d need a dedicated role,” Hand says.

For smaller organizations that don’t have the resources to hire a dedicated SRE, a good stopgap is to train existing staff in the processes and procedures needed to succeed in a sysadmin or operations role and to upskill them on automation tools such as Chef, Ansible and Puppet, says Stephen Zafarino, senior director of recruiting at IT recruiting and staffing firm Mondo.

“Especially with SREs commanding premium salaries for this specialized skillset, some clients might do better to offer professional development to their existing talent pool. Another option is finding a freelancer or a contractor who can come in and advise, or work on projects on an ad-hoc basis,” Zafarino says.

While the demand for SREs isn’t huge right now, Zafarino says, as more organizations move to the cloud and get software and other services and solutions from third-party providers, the role will become more common and more in-demand.

“Right now the responsibilities are currently being handled by other roles in IT departments (DevOps engineers, software engineers, sysadmins) and we only have a few clients who are filling dedicated SRE roles. But as teams start to integrate DevOps and adopt those principles, it’s going to get hotter as the roles become more defined,” he says.
