About This Episode
This episode we sit down with Will Charczuk, Engineering Group Lead at Blend. Will oversees the service management, runtime & alerting, and operations sub-teams. The crew talks in-depth about rapid deployment in a highly secure environment.
About The Hosts
Justin McCarthy is the co-founder and CTO of strongDM, the database authentication platform. He has spent his entire career building highly scalable software. As CTO of Rafter, he processed transactions worth over $1B collectively. He also led the engineering teams at Preact, the predictive churn analytics platform, and Cafe Press.
Max Saltonstall loves to talk about security, collaboration and process improvement. He's on the Developer Advocacy team in Google Cloud, yelling at the internet full time. Since joining Google in 2011 Max has worked on video monetization products, internal change management, IT externalization and coding puzzles. He has a degree in Computer Science and Psychology from Yale.
About Token Security
At Token Security our goal is to teach the core curriculum for modern DevSecOps. Each week we will deep dive with an expert so you walk away with practical advice to apply to your team today. No fluff, no buzzwords.
Justin McCarthy 0:00
Welcome to another episode of Token Security. This is Justin McCarthy from strongDM --
Max Saltonstall 0:11
And I'm Max Saltonstall from Google Cloud.
Justin McCarthy 0:14
And joining us today is Will Charczuk from Blend for part two of our conversation, and he's going to be talking to us about rapid deployment in a highly secure environment. Hey Will!
Justin McCarthy 0:00
In many environments, controls that you put in place end up being pretty additive or accretive: you end up adding them far more often than removing them. Have you encountered any cases where you put a control in place and then ended up relaxing it?
Will Charczuk
The one that comes to mind is the PII log thing, which is basically: we started out super strict, like, don't log anything other than a very clean web request entry. Then we relaxed that, collecting exceptions with enough of a grace period to understand what was being collected. And now we're starting to relax that even further to allow arbitrary logging. But it's done in a compartmentalized way, such that only certain users have access to production, they're only given time-based access to it, and it's as automated as possible. So basically, they're allowed to see PII, potentially. We don't explicitly log it, but we make sure that we know who was looking at what logs when, and that helps us understand more fully the potential vectors for stuff going wrong. But that's the process that actually affects us the most.
Will Charczuk 1:00
One of the things that is just really hard in super secure environments is debugging your code in production. It's really hard to get a sense of what's running and why. The more logs we can give people, the better and faster they'll be able to develop. But all of that comes with a concern, right? You have a bunch of data sitting around that's potentially really sensitive. So that's been slowly relaxed over time. But the key is basically building in controls and systems that allow us to have careful auditing and controls around what you can take out of that system.
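The combination described here, time-based access plus a record of who looked at which logs when, can be sketched in a few lines. This is a minimal illustration only, not Blend's tooling: `LogAccessBroker`, the user and stream names, and the 30-minute window are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class LogAccessBroker:
    """Hypothetical broker: grants expire, and every read attempt is recorded."""

    grants: dict = field(default_factory=dict)       # user -> expiry time
    audit_trail: list = field(default_factory=list)  # (user, stream, time, allowed)

    def grant(self, user, minutes, now):
        """Give a user access to production logs for a limited window."""
        self.grants[user] = now + timedelta(minutes=minutes)

    def read(self, user, stream, now):
        """Record the attempt first, then allow it only inside the window."""
        expiry = self.grants.get(user)
        allowed = expiry is not None and now < expiry
        self.audit_trail.append((user, stream, now, allowed))
        if not allowed:
            raise PermissionError(f"{user} has no active grant for production logs")
        return f"<contents of {stream}>"  # placeholder for a real log fetch


# Example with a fixed clock so the behaviour is easy to follow.
start = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
broker = LogAccessBroker()
broker.grant("alice", minutes=30, now=start)
broker.read("alice", "payments-debug", now=start + timedelta(minutes=5))
```

Recording the attempt before the permission check means denied reads show up in the audit trail as well, which is usually what a reviewer wants to see.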
Max Saltonstall 1:30
Have you tried any other strategies around letting people access logs without the PII: tokenizing it, or other anonymization techniques?
Will Charczuk 1:41
Yeah, we have. We have an entire stream of logs which are basically whitelisted logs. The key is making sure they're as type-safe as possible, so the fields are fully enumerated, and then having a strategy for each field on how to anonymize it. The common way to do that is you HMAC the values, so that you can do grouping and identification while the value stays anonymized. And so we have that as well. The tussle for us is that building the tooling to ensure that correctness isn't trivial. It's one of those things where you're always worried about the fields you missed, or some part of the data where the thing shows up in a bunch of contexts that screws up your obfuscation, and that kind of stuff. So what we've tried to do is set up different categories of logging, where some are totally safe and whitelisted. We can be very confident that they're going to be safe, and we can keep those on longer retention as a result. But then there are logs that we just treat with kid gloves, and they're break-glass in case there's a really pernicious issue that isn't surfaced by the other whitelisted logs.
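A minimal sketch of the whitelisted-field approach described above, assuming a per-field policy table and a secret HMAC key. The field names, the `FIELD_POLICY` table, and the demo key are hypothetical; in practice the key would come from a secret store.

```python
import hashlib
import hmac

# Hypothetical policy: every loggable field is enumerated with a strategy.
FIELD_POLICY = {
    "route": "plain",      # safe to log as-is
    "status": "plain",
    "user_email": "hmac",  # sensitive: log a keyed digest instead
}


def anonymize(event: dict, key: bytes) -> dict:
    """Return a copy of the event containing only whitelisted fields."""
    safe = {}
    for name, value in event.items():
        strategy = FIELD_POLICY.get(name)
        if strategy == "plain":
            safe[name] = value
        elif strategy == "hmac":
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            safe[name] = digest.hexdigest()
        # Fields missing from the policy are dropped rather than logged.
    return safe


demo_key = b"demo-key-not-for-production"
entry = anonymize(
    {"route": "/loans", "status": 200, "user_email": "a@example.com", "ssn": "000-00-0000"},
    demo_key,
)
```

Because the digest is keyed and deterministic, two entries for the same email get the same token and can be grouped, while the raw value never reaches the log. Unlisted fields such as `ssn` are dropped, which is the conservative default when the worry is "the fields you missed".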
Justin McCarthy 2:46
That's cool, because you mentioned HMACing some of those values. And I realize we've talked about a few ways to do one-way transforms. Do you have any role for actually encrypting values that you can get back out of the system? And just in general, where does encryption live in your environment?
Will Charczuk 3:04
We have a couple of different things we have to do with potentially sensitive data. The two main use cases are storage and querying. The one-way, somewhat consistent hashing of stuff is what we use for querying. And then for storage, we try to have as small a number of records controlled by a key as possible. So we try to do basically per-field encryption. It's kind of extreme, and it affects throughput and stuff. But it's also one of those things where, if you were to compromise one of those keys, it would affect one value, not an entire database. For those two use cases, the key strategy is effectively the same, though, which is, regardless of whether we're using reversible encryption or something that effectively creates a hash of the value, both of those are done as close to perfect as possible. So yeah, we use transit encryption for that stuff.
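The "one key controls as few records as possible" idea can be illustrated with per-field key derivation. This is an assumption for illustration only: the conversation does not say how the per-field keys are produced. The sketch derives a distinct 32-byte key for each field from a master key using HKDF-SHA256 (RFC 5869) built on the standard library; the `derive_field_key` helper and the field identifiers are hypothetical, and the cipher itself is left to a vetted encryption library.

```python
import hashlib
import hmac


def hkdf_sha256(master: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt or b"\x00" * 32, master, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_field_key(master: bytes, field_id: str) -> bytes:
    """Hypothetical helper: one derived key per (table, row, column) identifier."""
    return hkdf_sha256(master, info=field_id.encode("utf-8"))


master_key = b"\x01" * 32  # placeholder; a real master key lives in a key manager
ssn_key = derive_field_key(master_key, "applications/42/ssn")
income_key = derive_field_key(master_key, "applications/42/income")
```

A leaked derived key exposes only the one field it protects. The master key remains a single point of compromise, which is one reason to keep it inside a dedicated service rather than in application code.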
Max Saltonstall 3:58
So where do you see things going next, especially on the infrastructure and operations side, where you're working with the product teams and with the infosec teams? What's the big thing that people should be looking out for, or expecting to change about that world?
Will Charczuk 4:14
One of the things that's been nice is the advent of really programmable and extensible cluster orchestration technologies, namely: bring me a Docker image, and I will run it for you on a node somewhere that can host it. We can do a lot more with that now. We do everything from individual builds of images, to the actual services themselves, to small low-throughput databases, and that kind of stuff. What then ends up developing is that you have all these services, and you have ad hoc methods for how you enforce what is effectively a directed graph of which services can talk to each other. We want to standardize how that's done. A first step is just having an agreed-upon convention, that you will take this JWT and enforce that this other service can talk to your service, but then having an entirely separate control structure around that, to enforce it based on configuration that's all audited, right? So if I am a Blend developer, and I spin up a new service, and I want to say only service X can talk to my service, I can register that in a central registry that enforces it, not just with certificates and other stuff like that, but potentially even at the network layer and some other things like that. The project that we're sort of invested in is called SPIFFE, S-P-I-F-F-E, and we're trying to figure out exactly what form that will take. But that's the next big thing for us: figuring out inter-service authorization, and then eventually fine-grained control within the services themselves, so that if a loan officer at one bank needs access to records in some downstream system, you can carry that identity through and say that they only have access to certain records in my system as well.
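The "directed graph of which services can talk to each other" can be sketched as a small default-deny registry. This only illustrates the idea and is not SPIFFE's API: the `ServiceRegistry` class is hypothetical, and the IDs merely follow the SPIFFE ID shape (`spiffe://<trust domain>/<path>`) with a made-up trust domain.

```python
class ServiceRegistry:
    """Hypothetical central registry of allowed caller -> callee edges."""

    def __init__(self):
        self._allowed_callers = {}  # callee ID -> set of caller IDs

    def allow(self, caller: str, callee: str) -> None:
        """Register one directed edge: caller may talk to callee."""
        self._allowed_callers.setdefault(callee, set()).add(caller)

    def is_allowed(self, caller: str, callee: str) -> bool:
        """Default deny: anything not explicitly registered is refused."""
        return caller in self._allowed_callers.get(callee, set())


registry = ServiceRegistry()
service_x = "spiffe://example.org/service-x"
my_service = "spiffe://example.org/my-service"
registry.allow(service_x, my_service)
```

The edges are directional, so allowing service X to call a service does not let that service call service X back. Keeping the graph in one registry is what lets the same configuration drive several enforcement points, such as certificate checks and network policy.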
Justin McCarthy 5:54
So Will, if you could go back to the beginning, you were starting from scratch and you wanted to define this Goldilocks zone on the trade-off between speed and security, and maybe you could think of this as advice for someone in that situation: what are the first one, two, and three things that you would advise someone to invest in, if they knew they had to go fast but needed to do it securely?
Will Charczuk
I would invest in a little bit of tooling up front to leverage some kind of secure config and credential store, the idea being that stuff will pay dividends long term because you can lean on it. Even if your team gets big, it's still a thing you're going to need, right? So do that early, and set up process around it early, because it's really easy just to have a passwords file sitting on someone's laptop, but you don't want to do that ever. So just spend a little extra time standing up an instance of a good vault or secret store and leverage that from day zero. And then make sure that you have a good story around, I mean, database encryption.
Will Charczuk 7:00
If I had to start over again, I wouldn't really do too much different, because once you have a good system for secrets management and configuration, you can do the key management stuff really easily, and there are a lot of products out there to leverage for that. Once those basics are set up, you can use whatever system you want as far as how the actual deploys go. But the other thing that I would come back to is to make sure you have at least one touch point before something goes to production, just to get a sense of what exactly is going out and what the implication of the change is. CI/CD is this really buzzy thing, and it really is freeing, because you can just commit code and it'll end up in production eventually. But you do want to have at least one manual touch point, just to make sure that your code is what you think it is, and that there isn't anything else you're missing. That's a big one. Otherwise, there's so much tooling and so much technology being thrown at the deployment automation space right now. I would just make sure that existing tools will do what you need them to, and consider whether you can cobble together some script system to make it work for yourself, versus going ham on a full-blown product that promises a whole bunch of features. That's a nuanced one, but it's kind of important, because you want to understand exactly the pros and cons of the tools that you use and keep things as simple as possible. For instance, if you have a CI/CD tool that has its own user database, you need to understand the implications of that: it's one more thing you have to configure per employee you bring on, and it's one more thing that could potentially be compromised. So it's important to keep things as simple as possible and add tools as they're needed over time. That's it.
That's really general, I think, but I think it kind of holds true if I did the contrapositive: I'm starting a company over again today, what would I do?
Max Saltonstall 8:49
Awesome. Thank you. It sounds like you're taking a very intentional approach to, you know, how do we want to solve these problems to really maintain our velocity? So I just want to hear your thoughts on addressing technical debt, whether it's that of the product developers and product teams, or of your infrastructure teams. And then how are you sharing that kind of tech debt burden with the infosec teams and the service owners?
Will Charczuk 9:12
Yeah, I mean, tech debt's always one of those super hot-button topics, because it's pretty universal that teams have stuff kicking around that they'd like to refactor or change or get rid of completely. The way that we approach it is kind of holistic. We try to make sure that we have quarterly goals, we have things we want to do as long-term goals, and we try to align things to that. That's how we start the prioritization discussion: if we want to sunset a particular system or process, what does that actually get us, and how does that march us towards our goal? In terms of the way the infrastructure works, it's nice that, if you have clear boundaries and responsibilities between infrastructure and developers, you can swap stuff out a lot more easily. So if your contract with developers is "give us a Dockerfile and we'll run it for you", then how that Dockerfile gets run, or whatever other type of container technology you want to bring, is up to you. If you have a clear abstraction where it's easy to swap components in and out, tech debt affects you a little bit less, and you can leverage that to stand up two independent systems that do effectively the same thing and test them out, or try different methodologies for how to do certain processes. So the way that we approach tech debt is from a design standpoint first: have clear abstraction boundaries that you enforce kind of rigorously, because that will pay dividends later in terms of how you want to approach these changes. But I mean, tech debt's always a problem. There's always stuff kicking around that we're constantly trying to get prioritized with the teams that have to do the work to action it, right? So the way that I approach it is I try to explain the exact amount of work that's going to be involved, so it's not vague.
Like, if you have to do something in your code to change how it's deployed, here's what you actually have to do, and here are some examples of other teams that did it, so that you have some context for how much it's going to cost. And then explain the relative merits: explain why we need to do it and what it's going to affect. An example is, if we're moving from one cluster orchestrator to another, the old cluster is costing us money while other things are still using it, right? And we don't want to pay for the same thing twice. So that's an easy argument to make, which is: we have a budget, we'd like to use it for other cool stuff, let's try to sunset this other system, and this is how long it's going to take. And then you have to kind of trust that your counterparts are going to evaluate that fairly and make time for it. That's sort of how we approach it.
Max Saltonstall 11:39
Cool. Thank you. I like it. The other question that I was struggling to formulate: you are trying to enable these development teams, these product teams, to move quickly. And yet every single product they're making that's externally facing is dealing with a large amount of PII. So how does this become an operations or infrastructure problem of mocking this out, or creating and maintaining fake data that's real enough that you can test new features? Or is this a problem that's entirely on the developers, or on risk management? How does that get addressed at Blend?
Will Charczuk 12:14
So it's an interesting thing to dive into. Basically, we try to stay out of some of these things as much as possible, because it's all going to be domain-specific, and you kind of have to do what works for your team. But there are some things we try to pitch in with. The big one is, before we run big migrations on our database, we'll snapshot the production data in a secure environment, run the migrations on that snapshot of production data, and then have acceptance tests on the migrations, right? That's something that we provide the teams as a way to test changes they need to push out. That's the kind of stuff where it really is situationally dependent, in terms of how much we're involved as an infrastructure team versus giving you tools to do it. This is less from a security standpoint and more from an exhaustive testing standpoint: one of the biggest things that we struggle with from a scale perspective is that we need to run headless web browser tests for a lot of different permutations of feature flags, because all customers are kind of unique, but our code needs to be the same. So we have to make sure that those work, and that they can talk to a live system to do full integration tests. And they also consume a bunch of resources. So for those kinds of things, we just have to work with teams to make sure that we're accommodating them as much as possible, even if their use cases are a little bit extreme.
Outside of those cases, where it really is something they couldn't solve with the tools that we've already given them, we sort of leave it to teams to figure out. If you want to run these tests, but you don't want to have real data to run them, and you want to do it in a sandbox, that's where you really have to lean on your institutional knowledge and on having really capable developers, right? Because they have to figure out a way to do that in isolation, in a representative fashion that's going to catch errors before production. That is super hard, and there isn't a clear right answer for it.
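The cost of testing "a lot of different permutations of feature flags" is easy to see by enumerating them. A minimal sketch, assuming each flag has a small set of possible values; the flag names here are hypothetical.

```python
from itertools import product


def flag_permutations(flags: dict):
    """Yield one dict per combination of flag values, in a stable order."""
    names = sorted(flags)
    for values in product(*(flags[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical flags: two booleans and one two-valued setting.
flags = {
    "new_dashboard": [False, True],
    "e_signature": [False, True],
    "locale": ["en", "es"],
}
combinations = list(flag_permutations(flags))  # 2 * 2 * 2 = 8 configurations
```

Each additional boolean flag doubles the number of configurations, so a full headless-browser run per combination grows quickly. That is consistent with the point above about these tests consuming a lot of resources.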
Max Saltonstall 13:58
Justin McCarthy 13:59
Will, it sounds like you've described your deployment architecture as featuring microservices. And I know a lot of folks have been rebalancing the division of microservices; there's a sort of swing of the pendulum going on right now. Part of that pendulum is due to actually securing and credentialing all those microservices. So can you talk specifically about how microservices either have common elements or don't have common elements? For example, do they all use the same secure configuration and secret store? Just talk a little bit about how you guys are dealing with that tension today.
Will Charczuk 14:34
The benefits of microservices are pretty well understood at this point, which is that you can have more agency for teams to own their product or feature. But yeah, the downside, as you sort of hinted at, is that you actually have to bring a whole lot of tooling to have a service work securely, right? That's what ends up producing what we call "mini-liths", which are maybe feature-dependent, but they're kind of lumped together with other features. You end up with these medium-sized services, because people don't want to have to deal with standing up a new database and backing it up and having disaster recovery if they have to restore a backup. And what happens if that's multi-tenant? You have to do a per-tenant restore, and all this kind of stuff. So what we try to do is, for the things that you can anticipate ahead of time that are going to be common for almost every service, we try to build tools, so you don't have to think about it. An easy one is terminating TLS. We ran into an issue where we needed to turn off certain TLS versions. If you have different microservices each terminating TLS themselves, pushing out code changes to all of them is really cumbersome and kind of annoying. What we do is just have a thin proxy that sits in front of your service, does TLS termination for you, and then talks over localhost to your app. That's a nice example, but we try to do that as much as possible. If you don't need a massive database, we let you spin up a database yourself. It runs on the cluster with everything else, and we park the credentials for you in a place that the common libraries you use will know to look, but then we handle snapshotting it for you. And we give you facilities to go encrypt fields as you need.
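Centralizing TLS termination means the minimum protocol version is set in one place instead of in every service. A minimal sketch using Python's standard `ssl` module; the function name and the choice of TLS 1.2 as the floor are assumptions, and the conversation does not say which versions were turned off or what the proxy is written in.

```python
import ssl


def proxy_tls_context(min_version: ssl.TLSVersion = ssl.TLSVersion.TLSv1_2) -> ssl.SSLContext:
    """Build the server-side TLS context a terminating proxy would use."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Handshakes that offer only older protocol versions are refused.
    context.minimum_version = min_version
    # A real proxy would also call context.load_cert_chain(...) with its
    # certificate and key, then forward decrypted traffic to the app on localhost.
    return context


tls_context = proxy_tls_context()
```

Raising the floor later is a change to one default in the proxy, rather than a code change pushed to every service.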
So the key there basically is, yeah, the thing that pulls people back to larger services is that sort of tooling inertia, and we try to just have as much tooling available for people as possible. That takes a lot of work. It's not something that happens easily; you have to really devote a lot of time and care to it. That's why we also think about the teams at Blend as being internal customers, because we have to have a constant feedback loop with them, like, "hey, the tool that you wrote is kind of cumbersome, it doesn't quite do what we need it to do, can you add some features to it?" And then we have to prioritize that work as well. So I think that's something else to highlight as far as best practices for building the tooling that allows you to do a lot of services in a standardized way: make sure you have feedback loops, make sure you have regular check-ins with the teams that are using your tools, so that you can get a sense of what's working and what's not. A lot of the features that end up getting used by a lot of teams came from one person anecdotally mentioning something they needed. But some things are pretty easy to anticipate. All services are going to need a place to store their config and credentials. You can do some things for teams right off the bat, like encryption at rest and some other things that are just standard, right? In terms of traffic encryption, you can do that for teams as well. And then for database encryption, you basically have to at least have good examples: this is the library and its use, this is how you use it, here's where it pulls its context so it knows how to operate. Educate people on how it works, but then have working examples to work off of, so that they can build it themselves securely.
Then, finally, the one that we're investing a lot in right now is standardizing how logging works and looks, and making sure that some subset of the things that you do all look the same across all the services. So if you have a concept of an audit, which is a principal doing something in our system, it's nice having all those look the same. Similarly for web requests or other types of requests. So that's another thing to think about deeply: make sure that one team's service doesn't emit logs as JSON while another team's service emits logs in, like, a tabular format that no one's seen since the 80s.
Make sure all your log messages look the same, because it will make parsing easier. Great.
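One way to make audits and web requests "look the same" across services is a shared envelope that every event type is wrapped in. A minimal sketch, assuming JSON as the common format; the envelope keys and helper names are hypothetical, not Blend's schema.

```python
import json
from datetime import datetime, timezone


def format_event(kind: str, service: str, fields: dict, now: datetime = None) -> str:
    """Wrap any event in one shared envelope so parsers need a single schema."""
    now = now or datetime.now(timezone.utc)
    envelope = {
        "timestamp": now.isoformat(),
        "service": service,
        "kind": kind,
        "fields": fields,
    }
    return json.dumps(envelope, sort_keys=True)


def audit_event(service: str, principal: str, verb: str, resource: str, now: datetime = None) -> str:
    """An audit: a principal doing something in the system."""
    fields = {"principal": principal, "verb": verb, "resource": resource}
    return format_event("audit", service, fields, now)


def web_request_event(service: str, method: str, path: str, status: int, now: datetime = None) -> str:
    """A web request entry using the same envelope as every other event."""
    fields = {"method": method, "path": path, "status": status}
    return format_event("web.request", service, fields, now)


line = audit_event("loans-api", "user:123", "read", "application:42")
```

Since every line shares the same top-level keys, a log pipeline can route on `kind` and `service` without knowing which team wrote the service.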
Max Saltonstall 18:20
Thanks very much for joining us today, Will, and telling us a bit about infrastructure, operations, security, and the developers, and how it all fits together at Blend. Really appreciate it.
Will Charczuk 18:28
Yeah, thanks for having me for sure.