
At the Intersection of SRE & DevOps

Site Reliability Engineering (SRE) is a relatively new addition to the DevOps world, and distinguishing SRE from DevOps can be confusing. That’s why Justin McCarthy, CTO and co-founder of strongDM, recently sat down with Alan Shimel at DevOps.com and a panel of technology experts to discuss SRE, its relationship to DevOps, and why it’s so hard to pin down a definition of the role.

The full panel included:

  • Tim Yocum—Director, Operations at InfluxData
  • Greg Leffler—Observability Practitioner Director at Splunk

So, what exactly is SRE, and why does it matter? Here’s the recap:

SREs: Fanatics about Uptime

“SRE is […] really just a practice within the current DevOps evolution.”

— Tim Yocum, InfluxData

SRE is more than a buzzword; it’s an integral part of production. While most DevOps engineers want to focus on writing new features, closing tickets, and moving on, site reliability engineers are plotting out the future of the product in a way that’s sustainable instead of just doing break/fix work.

At smaller organizations like strongDM and InfluxData, SRE is often distributed throughout DevOps. Even at larger companies like Splunk, SRE is more of a mentality than a title. But however you define the role, everyone agrees: you need SRE-type people embedded within your teams, and close enough to the product to understand it.

Where Does Observability Fit In?

“If you are [...] architecting reliability into your services, or architecting resilience into your world, you have to have observability as part of that. There’s no way to do it without it.”

— Greg Leffler, Splunk

While old-school monitoring required you to know what you wanted to monitor, observability has you instrument everything, and then use that data to figure out what's going on. Whether you’re looking at access observability, application data, or Twitter sentiment, SREs help you prioritize what matters most, like keeping your customers happy or meeting your service-level objectives (SLOs).

Measuring SRE

“If you're in this pretend world where we say everything is five nines all the time, well, either you're incredibly rich, or incredibly brilliant and incredibly lucky.”

— Greg Leffler, Splunk

Using SLOs as a metric for SRE lets you drop the need for perfection. Service-level objectives focus on setting expectations within the team. 

Site reliability engineers can use SLOs as internal currency to keep everyone focused on what really matters, like building more reliable applications, without getting bogged down in business-side metrics like SLAs.
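
To make the "five nines" in the quote above concrete, here is a minimal sketch of the arithmetic behind an availability target. It was not part of the panel discussion; the function name `allowed_downtime_minutes` and the 365-day year are assumptions chosen for illustration. It converts an availability percentage into the downtime that target permits per year.

```python
# Illustrative sketch only: converts an availability target into allowed downtime.
# The function name and the 365-day year are assumptions, not from the panel.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime an availability target permits in a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Two nines through five nines
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why a looser, explicitly agreed SLO is usually the more realistic target.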

Final Thoughts On Hiring

“Bust through that 'ten years of SRE experience' or '200 years of Kubernetes experience' job requirement.”

— Justin McCarthy, strongDM

It’s really difficult to recruit for this position, especially as HR often writes job descriptions with excessive requirements. This is compounded when managers search for new hires with the SRE job title. Often, people in the SRE role don’t even consider themselves SREs. 

So what’s the best way to hire for the role? Drop the bullet points, and focus on the people. Anyone who enjoys troubleshooting and appreciates accumulating knowledge should be a part of the talent pipeline. 

While the role may be fuzzy, the need is clear. In a complex world of cloud-forward microservices and ephemeral infrastructure, SRE thinking is a crucial component of your IT environment.

Did you miss the panel? No worries, you can still check out the replay. And if you have a growing team that needs access to Kubernetes clusters, AWS accounts, databases and all of your infrastructure, come on over to strongDM for a free demo.
