Chris Becker, SRE, Betterment
Chris Becker is an SRE at Betterment. Previously, he did similar work on Warby Parker's Infrastructure team. At Betterment, he earned the label APT (advanced persistent threat) thanks to consistently tripping alarms with his peculiar scripts and commands.
In this talk, he discusses how Betterment's approach to server access controls evolved as the team grew exponentially. With more people and keys to manage, the SRE team needed to find ways to automate more and reduce the maintenance overhead.
Advantages of strongDM for Betterment
As Betterment searched for a more scalable solution, they evaluated tools against three criteria:
- supports a BeyondCorp model for remote access
- includes an audit trail for compliance reporting
- is easy to deploy and maintain
The team initially deployed HashiCorp's Vault to manage SSH keys. While it was a very powerful product, maintaining it proved to be too much of a burden.
The team then deployed strongDM.
Ease of Deployment
Because strongDM does not require a daemon, deployment meant installing two gateways (for HA) on a bastion and, thanks to VPC peering, that was it. strongDM's gateways intelligently route all database and SSH sessions through the lowest-latency path.
Betterment runs ASGs (Auto Scaling groups) for mission-critical services, so thousands of boxes spin up and down throughout the day, and any access control system needs to support that ephemeral infrastructure. strongDM made that process easy and automated. Since everyone likes Lambda these days, the team wrote some Lambda functions as the glue to keep strongDM's server inventory up to date. Additionally, strongDM's gateways auto-register as they spin up using three tools: Packer to build gateway images, CloudInit to register (whenever an instance boots, it runs a set of scripts that can be customized on a per-boot or per-start basis), and Terraform to deploy the infrastructure.
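As a rough illustration of that Lambda glue, here is a minimal sketch of a handler that reacts to Auto Scaling launch and terminate notifications. The talk does not show Betterment's actual code, so this is an assumption throughout: the event fields follow the notifications Auto Scaling publishes to EventBridge, and register_server and remove_server are hypothetical hooks standing in for whatever strongDM command the real function runs.

```python
# Sketch only: register_server / remove_server are hypothetical hooks,
# not strongDM APIs. The event shape assumes Auto Scaling's EventBridge
# notifications (source "aws.autoscaling", instance ID in detail.EC2InstanceId).
from typing import Callable, Optional, Tuple

LAUNCH = "EC2 Instance Launch Successful"
TERMINATE = "EC2 Instance Terminate Successful"


def plan_action(event: dict) -> Optional[Tuple[str, str]]:
    """Map an Auto Scaling event to ("add" | "remove", instance_id), or None."""
    if event.get("source") != "aws.autoscaling":
        return None
    instance_id = event.get("detail", {}).get("EC2InstanceId")
    if not instance_id:
        return None
    detail_type = event.get("detail-type")
    if detail_type == LAUNCH:
        return ("add", instance_id)
    if detail_type == TERMINATE:
        return ("remove", instance_id)
    return None  # some other Auto Scaling event; nothing to do


def make_handler(
    register_server: Callable[[str], None],
    remove_server: Callable[[str], None],
):
    """Build a Lambda-style handler around the two inventory hooks."""

    def handler(event: dict, context: object) -> dict:
        action = plan_action(event)
        if action is None:
            return {"action": "ignored"}
        verb, instance_id = action
        if verb == "add":
            register_server(instance_id)
        else:
            remove_server(instance_id)
        return {"action": verb, "instance_id": instance_id}

    return handler
```

Keeping the decision (plan_action) separate from the side effects keeps the glue boring: the mapping can be tested with plain dictionaries, and the hooks can be swapped without touching it.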
Anytime Betterment runs a Terraform plan, it looks up the strongDM gateway instance and pulls the latest one. That makes it easy to deploy security updates to the gateways or to change their parameters.
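The "pull the latest one" step can be pictured with a small sketch. It assumes the lookup targets the most recent Packer-built gateway image, and it is written in Python rather than Terraform purely for illustration. The image dictionaries mirror the Name, CreationDate, and ImageId fields that EC2's describe_images call returns. The "sdm-gateway-" prefix is a made-up naming convention, not something from the talk.

```python
from typing import Iterable, Optional


def latest_image_id(images: Iterable[dict], name_prefix: str) -> Optional[str]:
    """Return the ImageId of the newest image whose Name starts with name_prefix."""
    candidates = [
        img for img in images if img.get("Name", "").startswith(name_prefix)
    ]
    if not candidates:
        return None
    # CreationDate is an ISO-8601 UTC timestamp in a fixed format,
    # so plain string comparison matches chronological order.
    newest = max(candidates, key=lambda img: img["CreationDate"])
    return newest["ImageId"]


# Hypothetical sample data; the names and IDs are illustrative only.
SAMPLE_IMAGES = [
    {"Name": "sdm-gateway-001", "CreationDate": "2019-03-01T10:00:00.000Z", "ImageId": "ami-aaa"},
    {"Name": "sdm-gateway-002", "CreationDate": "2019-04-15T09:30:00.000Z", "ImageId": "ami-bbb"},
    {"Name": "other-service-9", "CreationDate": "2019-05-01T08:00:00.000Z", "ImageId": "ami-ccc"},
]
```

With the sample data above, latest_image_id(SAMPLE_IMAGES, "sdm-gateway-") picks the April image and ignores the newer but unrelated one. Shipping a security update then amounts to building a new image and running the plan again.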
On AWS, instances occasionally didn't terminate cleanly: they would get shut down, get stuck somewhere, and never really finish. So Betterment wrote a Lambda function that iterates over the relays and the servers in the strongDM inventory. Because of the way the team names them, they can infer some information about each one in AWS. The code lists all of the boxes, iterates over them, and if a server is not in AWS, runs the strongDM admin command to delete it. And because the instance ID is definitely going to be unique within each account, they can use it as a shortcut to delete that box from inventory.
The engineers at strongDM created a JSON output, so Betterment can load the inventory straight into an object, iterate over it, and see right away how many instances are in inventory. From there, they can loop over the entries and, for instance, delete a server from strongDM if it's not running. Instances in known bad states (shutting down, stopping, stopped, or terminated) should not be in the strongDM inventory, because they won't be connectable. So essentially, all Betterment has to do is use Boto within the Lambda function to try to access some information about each instance. If there is an exception, they can infer that the instance is not there; if they see that it's shutting down or in a similar state, they can likewise treat it as gone and delete it from inventory. And then the inventory is nice and green, and happy.
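The cleanup pass described above could look roughly like the sketch below. It is not Betterment's code and rests on several assumptions. The inventory JSON is taken to be a list of objects with a "name" field. Each name is taken to embed the EC2 instance ID. The deletion itself (the strongDM admin command) is left to the caller, since the talk does not show its syntax. The boto3 call and the InvalidInstanceID.NotFound error code are standard EC2 behavior.

```python
import json
import re
from typing import Callable, List, Optional

# EC2 state names for the "known bad states" listed in the talk.
BAD_STATES = {"shutting-down", "stopping", "stopped", "terminated"}

# Assumption: server names embed the instance ID, e.g. "web-i-0123456789abcdef0".
INSTANCE_ID_RE = re.compile(r"(?<![0-9a-z])i-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")


def instance_id_from_name(name: str) -> Optional[str]:
    """Extract an EC2 instance ID from an inventory name, if one is present."""
    match = INSTANCE_ID_RE.search(name)
    return match.group(0) if match else None


def ec2_state(instance_id: str) -> Optional[str]:
    """Return the EC2 state name, or None if AWS does not know the instance."""
    import boto3  # imported lazily so the pure logic below runs without AWS
    from botocore.exceptions import ClientError

    try:
        resp = boto3.client("ec2").describe_instances(InstanceIds=[instance_id])
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "InvalidInstanceID.NotFound":
            return None  # the exception tells us the box is not there
        raise  # throttling, permissions, etc. must not look like "gone"
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            return instance["State"]["Name"]
    return None


def find_stale(
    inventory_json: str, state_of: Callable[[str], Optional[str]]
) -> List[str]:
    """Return names of inventory entries whose instance is gone or in a bad state."""
    stale = []
    for server in json.loads(inventory_json):
        instance_id = instance_id_from_name(server["name"])
        if instance_id is None:
            continue  # no instance ID in the name; leave the entry alone
        state = state_of(instance_id)
        if state is None or state in BAD_STATES:
            stale.append(server["name"])
    return stale
```

In a Lambda function, one would call find_stale(inventory_json, ec2_state) and pass each returned name to the delete command. Passing the state lookup in as a parameter keeps the decision logic testable without AWS credentials. Treating only a "not found" error as absence is a deliberately cautious choice in this sketch, so that an unrelated API failure cannot empty the inventory.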
Because strongDM deconstructs each protocol (e.g., MySQL, MongoDB, SSH) instead of just forwarding the request along, it logs every query and SSH command, making it easy for the SRE team to answer who did what, when, and where.
Automate the Boring Stuff and Be Boring
Chris also discussed the team's most important mantras:
- If Betterment's SRE team has to repeat the same task three times, they write a script.
- They're big fans of using boring technology. All the tools in this talk were written in Bash or Python; there's no need to be clever or to add complexity. Just get it done quickly.
strongDM - A Really Nice Unix Citizen
According to Chris, strongDM is "a really nice Unix citizen." What he means is that it's easy to pipe strongDM's output directly into files and other tools, and that it's nicely deployable, because it's just one Go binary that can be dropped anywhere they want.