Site Reliability Engineering (SRE) is a relatively new addition to the DevOps world, and distinguishing SRE from DevOps can be confusing. That’s why Justin McCarthy, CTO and co-founder of strongDM, recently sat down with Alan Shimel at DevOps.com and a panel of technology experts to discuss SRE, its relationship to DevOps, and why it’s so hard to pin down a definition of the role.
The full panel included:
- Tim Yocum—Director, Operations at InfluxData
- Greg Leffler—Observability Practitioner Director at Splunk
So, what exactly is SRE, and why does it matter? Here’s the recap:
SREs: Fanatics about Uptime
“SRE is […] really just a practice within the current DevOps evolution.”
— Tim Yocum, InfluxData
SRE is more than a buzzword; it’s an integral part of production. While most DevOps engineers want to focus on writing new features, closing tickets, and moving on, site reliability engineers plan the future of the product in a sustainable way rather than just doing break/fix work.
At smaller organizations like strongDM and InfluxData, SRE is often distributed throughout DevOps. Even at larger companies like Splunk, SRE is more of a mentality than a title. But however you define the role, everyone agrees: you need SRE-type people embedded within your teams, and close enough to the product to understand it.
Where Does Observability Fit In?
“If you are [...] architecting reliability into your services, or architecting resilience into your world, you have to have observability as part of that. There’s no way to do it without it.”
— Greg Leffler, Splunk
While old-school monitoring required you to know what you wanted to monitor, observability has you instrument everything, and then use that data to figure out what's going on. Whether you’re looking at access observability, application data, or Twitter sentiment, SREs help you prioritize what matters most, like keeping your customers happy or meeting your service-level objectives (SLOs).
“If you're in this pretend world where we say everything is five nines all the time, well, either you're incredibly rich, or incredibly brilliant and incredibly lucky.”
— Greg Leffler, Splunk
Using SLOs as a metric for SRE lets you drop the need for perfection. Service-level objectives focus on setting expectations within the team.
Site reliability engineers can use SLOs as internal currency to keep everyone focused on what really matters, like building more reliable applications, without getting bogged down in business-side commitments like service-level agreements (SLAs).
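To make the "internal currency" idea concrete, here is a minimal sketch of the arithmetic behind an availability SLO. It was not part of the panel discussion: the function names, the 30-day window, and the request counts are all illustrative assumptions, and real teams may define their budgets differently.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window.

    slo_target is a fraction such as 0.999 (hypothetical "three nines").
    The 30-day default window is an assumption, not a standard.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    for target in (0.999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"SLO {target}: {minutes:.2f} minutes of downtime per 30 days")
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

Under these assumptions, a 99.9% target leaves about 43 minutes of downtime in a 30-day window, while "five nines" leaves under half a minute, which is one way to read Leffler's point about the cost of perfection. A team that has spent only a quarter of its budget has room to ship riskier changes; a team that has overspent it has a reason to prioritize reliability work.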
Final Thoughts On Hiring
“Bust through that 'ten years of SRE experience' or '200 years of Kubernetes experience' job requirement.”
— Justin McCarthy, strongDM
It’s really difficult to recruit for this position, especially because HR often writes job descriptions with excessive requirements. The problem is compounded when managers search only for candidates who already hold the SRE job title, because people doing SRE work often don’t even consider themselves SREs.
So what’s the best way to hire for the role? Drop the bullet points, and focus on the people. Anyone who enjoys troubleshooting and appreciates accumulating knowledge should be a part of the talent pipeline.
While the role may be fuzzy, the need is clear. In a complex world of cloud-forward microservices and ephemeral infrastructure, SRE thinking is a crucial component of your IT environment.
Did you miss the panel? No worries, you can still check out the replay. And if you have a growing team that needs access to Kubernetes clusters, AWS accounts, databases and all of your infrastructure, come on over to strongDM for a free demo.