About Token Security
Welcome to another episode of Token Security. Our goal is to teach the core curriculum for modern devsecops. Each week we will go deep with an expert on a specific topic so you walk away with practical advice to apply to your team today. No fluff, no buzzwords.
About This Episode
This episode we sit down with Will Charczuk, Engineering Group Lead at Blend. The crew talks in-depth about rapid deployment in a highly secure environment.
About The Hosts
Max Saltonstall loves to talk about security, collaboration and process improvement. He's on the Developer Advocacy team in Google Cloud, yelling at the internet full time. Since joining Google in 2011 Max has worked on video monetization products, internal change management, IT externalization and coding puzzles. He has a degree in Computer Science and Psychology from Yale.
Justin McCarthy is the co-founder and CTO of strongDM, the database authentication platform. He has spent his entire career building highly scalable software. As CTO of Rafter, he processed transactions worth over $1B collectively. He also led the engineering teams at Preact, the predictive churn analytics platform, and Cafe Press.
Justin McCarthy 0:00
Welcome to another episode of Token Security. This is Justin McCarthy from strongDM --
Max Saltonstall 0:11
And I'm Max Saltonstall from Google Cloud.
Justin McCarthy 0:14
And joining us today is Will Charczuk from Blend and he's going to be talking to us about rapid deployment in a highly secure environment. Hey Will!
Will Charczuk 0:24
Hey, how's it going?
Max Saltonstall 0:25
Thanks for joining us. Today the topic that we're going to be talking about is security in your deployment process. But I would like to get just a little bit of a picture of the environment that you work in. So could you tell us a little bit about what your day to day is about and what Blend is all about, and then we'll move on to how security fits in?
Will Charczuk 0:46
Blend is a company that makes SaaS software for banks to help make the mortgage origination process faster, easier, and more secure. We sell a white-labeled product that banks use to onboard prospective buyers in the process of applying for a mortgage. So naturally, when you're doing that, you're touching a lot of very sensitive information, and we take security really seriously as a result. The space that I personally operate in, the team that I lead, is infrastructure. If you think of a product development organization as three sort of layers, we're the bottom-most layer: we're responsible for making sure that your code runs in production securely, that it's configured well, and that you can monitor and diagnose problems. So that ends up catching a lot of different concerns. Our three main pillars are basically the deployment and management of production, logging and eventing, and then active monitoring and alerting.
Max Saltonstall 1:44
And you also maintain dev environments or QA environments alongside the production infrastructure?
Will Charczuk 1:48
Yeah, so our general topology is that we have a sandbox environment for development, which is in its own sort of account, partitioned from our production setup. And then within the production setup, we have multiple different partitions for different stages of the release pipeline: a pre-production or staging environment, a release environment, and then a testing or user acceptance environment for customers to train their people on our software. So all those are effectively different sort of cuts of the same account, but they're separate in terms of how we deal with them.
Max Saltonstall 2:23
Can you tell me a little bit about the other groups within Blend's engineering org that you work most closely with?
Will Charczuk 2:29
Yeah, so infrastructure is sort of responsible for the architecture and implementation of things that are runtime concerns for developers. But we interface heavily with our infosec and infosec engineering teams, who are both risk analysts: they look at processes and things that we're building and tell us if they're really bad or not. But they also have agency to implement some processes and tools themselves, where they would be the most motivated team to implement them, if that's what makes sense.
Max Saltonstall 2:57
So you said infosec does the risk management. What other teams do you work closely with?
Will Charczuk 3:01
Naturally, our internal customers are the vertical product teams and service-owning teams that have to deploy their code. So, you know, when we make changes, or we need to implement new processes, we have to go to these teams and get their buy-in that it's not going to be super disruptive, that it's not going to add work or do anything adverse for them. So, yeah, we sort of operate as a company within a company. Essentially, our counterparts are the teams that have to build the actual things that our customers are either deploying or using.
Justin McCarthy 3:34
So Will, since you highlighted the dev and staging environments, I'd like to actually head over there for a second, specifically regarding some of the differences related to securing those environments. I know around the office here, we often talk about live customer data as being radioactive, and we go to great lengths to make sure that that radioactive material doesn't leak into any non-live environment, right? So all our security controls apply to production. And then when we're working in a development or staging environment, we make very sure that none of the credentials, none of the secrets, and none of the log information has leaked in. Can you talk to us a little bit about segmentation?
Will Charczuk 4:14
Yeah, of course. So like you said, one of the things that we did early was to have completely separate accounts, and credentials for those accounts, for each of those different environments. The key differentiator between them is, like you highlighted, production can have production data or PII, and sandbox can't. And what's actually tricky about that is, it's a bit more pernicious than you think. So basically, if I'm a developer at Blend, and I'm testing what we call assets connectivity, which is basically I need to log into my bank account and get my transactions and stuff like that, you really shouldn't do that in sandbox, because that is effectively PII. That's stuff that, if it leaked out of our sandbox database, you would be in trouble. So we have to have aggressive alerts and aggressive controls to make sure that, even if it's not customer data and it's all our own data, it doesn't end up in sandbox. But yeah, the primary sort of criteria is that sandbox is for you to test stuff. But we have to have aggressive controls so that no backups from production make it there, so we have separate keys to enforce that, and so that no other sources of potentially dangerous information end up in that environment as well.
Justin McCarthy 5:18
If we were a member of your team, and we just opened up some of your deployment automation, you know, there's always going to be a whole bunch of configuration variables that are environment specific. And then there are going to be others that are both environment specific and sensitive. So how do the developers on your team see that boundary between sensitive credentials and not? And how do they work on the system, when those are some of the bits that they actually need to change?
Will Charczuk 5:42
Yeah, so it's kind of situational. But let's say that, you know, the most sensitive passwords that we deal with are provider credentials. So things that we use to authenticate: "I am a service of Blend, I'm talking to a trusted provider." For those kinds of credentials, we'll typically require that they issue two of them to us. One is used in production, one is used in sandbox. And for the sandbox one, we typically will ask for very specific controls around it, such that there's a whitelist, or there's only a restricted set of functionality that's available to the sandbox ones. And then typically, the way that these things are stored is we have a secure credential store that your developer laptop can reach out to if you're on our sandbox VPN. And the idea is basically that you can fetch the code from Git, you can fire it up, and it'll reach out and try to grab credentials for you using your... we use a couple of different things. But you want to basically have a unique identifier for each developer accessing the credential store --
Max Saltonstall 6:41
-- and when you say service providers, you're talking about banks or other financial service providers?
Will Charczuk 6:45
Yeah, it's a big mix of banks, credit providers, you know, third parties that interface with the bank, so your clouds or do these, that kind of thing. Yeah, all of those are super, as you'd call it, radioactive credentials, just because you could basically grab them, then build your own little kit app and fetch data on behalf of Blend.
Max Saltonstall 7:03
So how do you make sure that developers aren't able to introduce any kind of malicious use cases like that?
Will Charczuk 7:11
So that comes down to the code review process. Basically, when you ship code that has visibility into the actual secure credential stores, that has to be reviewed not just by your tech lead on your team and the people that you work with. And they don't explicitly sign off on it; we keep track of owners, basically, for modules. But also infosec will need to review it. Basically, you have to opt out of infosec review. So if you have a service and it's, like, managing go links at Blend, that's not something that needs infosec review, necessarily, but everything else basically needs sign-off from someone on infosec, to take a peek at it before it merges. But yeah, the idea is basically you want to make sure that code is not just reviewed, but audited, so that you know who introduced the changes and when they did that --
Justin McCarthy 7:59
So then it sounds like you actually have a lot of automation in place. But you also mentioned these review steps, which are very familiar. So overall, how do you balance security risk and velocity? Because it sounds like being able to deploy at speed is an important value to the team. But obviously, you can't do that without some of these checks and balances.
Will Charczuk 8:18
It's definitely been one of those things where we've had to take a lot of different concerns and boil them down to a process that works for us. There are a couple of different key features of this. One is each team is empowered to release at the velocity that they feel they need to. So you don't want to have a big release train held up by a major overhaul of something when you could ship a trivial bug fix, right? So we try to partition the work streams and allow teams to ship at the velocities that they feel are appropriate. That comes down to, like, microservice architecture, and it's why you have all this automation on top of how you deploy code. But then, you know, the manual review steps are just one of those things that we found we wanted: to make sure that we had at least a couple of touch points where a human is interacting with the process, to make sure that things look okay, and that you have that sort of escape hatch to say, "No, we have to stop, this looks really bad," before it goes out. And that's just balancing the concerns of, you know, development velocity is really good, and we really want to do as much as we can to ensure that, but we don't really have any second chances with major catastrophes. So, you know, we really need to make sure that never happens, right? And we're purposefully sacrificing a lot of velocity for that. That's something that teams are sort of on board with.
Justin McCarthy 9:38
Are there any minor catastrophes or near misses that come to mind that you could share with us?
Will Charczuk 9:44
Nothing that's actually been super bad. But it's one of those things where, like, you know, "Oh, the Airbrake library captures your entire environment when it sends over exceptions." That is scary. And so that requires you to basically say, well, we didn't get burned by this, but if anyone ever stores the database password as an environment variable, that could have been really bad. Those are the kinds of things where velocity also kind of works against you. You kind of need to live in these tools for a long time to understand them fully before you can really start to ramp up velocity. Yeah, that kind of stuff is just something we're always trying to be super vigilant about and guard against, which is basically, you know, review any module you add to your code as if it was a normal code review. Don't just blindly add things with npm, or to package.json, or whatever.
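The concern Will raises here, an error reporter shipping the whole process environment along with an exception, can be illustrated with a minimal Python sketch. This is not Airbrake's API and not Blend's implementation; the names `scrub_environment`, `build_error_report`, and `SENSITIVE_MARKERS` are hypothetical, and matching on variable names is only one possible heuristic.

```python
import os

# Hypothetical markers: any variable whose name contains one of these is masked.
SENSITIVE_MARKERS = ("PASSWORD", "SECRET", "TOKEN", "KEY", "CREDENTIAL")


def scrub_environment(env, markers=SENSITIVE_MARKERS, placeholder="[REDACTED]"):
    """Return a copy of env with the values of sensitive-looking variables masked."""
    scrubbed = {}
    for name, value in env.items():
        if any(marker in name.upper() for marker in markers):
            scrubbed[name] = placeholder
        else:
            scrubbed[name] = value
    return scrubbed


def build_error_report(exc, env=None):
    """Assemble an error payload, scrubbing the environment before it leaves the process."""
    env = dict(os.environ) if env is None else env
    return {
        "error_type": type(exc).__name__,
        "message": str(exc),
        "environment": scrub_environment(env),
    }
```

Calling `build_error_report(ValueError("boom"), {"DATABASE_PASSWORD": "hunter2", "PATH": "/usr/bin"})` would mask the password value while leaving `PATH` intact. A name-based filter like this can miss secrets stored under innocuous names, which is one reason reviewing what a dependency captures still matters.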
Justin McCarthy 10:31
So it sounds like you have an infosec team in place. I imagine that that's the team that's, let's say, looking at the interesting events in the security event log. How do you make some of those events real for the developers? And I'll say, this is definitely a situation that I've faced before: if you actually do a decent job, real events are kind of few and far between. So how do you make it visceral for the team that's actually building, when you might actually have just a trickle of authentic attacks?
Will Charczuk 11:04
For example, what we always try to build in any sort of context is: what would happen if your bank account and Social and everything else that you, as a person, identify yourself with leaks? And that's a pretty easy thing to empathize with. As developers, you never want to have that happen. Where it gets trickier is basically explaining the importance of keeping credentials out of insecure locations, or other things like that. That is trickier, and you kind of have to start with education. So we typically try to have classes when people join Blend on, you know, here are some security best practices, here's why credentials are important to secure and what they let you do. And I mean, for some people, it's obvious that, like, yeah, I could just take our credentials, build a small kit app, exfiltrate the entire set of accounts that we have active at that moment, and then that's a huge problem, right? So, you know, what it really comes down to is both empathy and knowing where there are gaps in that education, sort of why you care about it --
Max Saltonstall 12:02
Do you pepper that process with drills of any type as well?
Will Charczuk 12:06
For developers, typically no. For infosec, yes. We try to do crash drills, like, you know, if we had to shut down certain components of our infrastructure, how would we do it? We do disaster recovery; that's more for the continuation of the business kind of thing, which is, if we lost an entire data center, how long would it take us to get it back up, that kind of stuff. But then also, this is a fun one that the infosec team does, and it seems pretty basic, but they'll try to send out phishing emails to the employees and see what happens, and the response rate's really scary. So it's a good thing for them to proactively do, to ensure that if someone ever actually tried that, the infosec channel will get a billion notifications from people reporting it.
Max Saltonstall 12:48
It's pretty safe to say they have tried it against your employees already.
Will Charczuk 12:53
I would probably assume so, yeah. But I haven't heard of any incidents in particular. I'm sure everything's fine.
Max Saltonstall 13:01
Yeah, it seems like there are some interesting interdependencies between your infrastructure group, the information security teams, and the product owners. How do you navigate those interdependencies when they could really slow things down?
Will Charczuk 13:18
Yeah, so what it comes down to is relative priority. Some things are going to be critical, what we call zeros, and those get floated right to the top and they get actioned immediately: we stop whatever we're doing, and we go address it. And then there's the sort of stuff that we have to basically work through as a team, like we prioritize product work or anything else, which is, you know, this is going to eventually help keep something secure, it's going to be a better process. An example that's kind of esoteric, but maybe more common, is we have a build machine. That build machine was in the past responsible for doing everything from the CI builds, to make sure your code compiles, to, you know, jobs on it that let you stand up a new machine in production. They're not quite the same; it's really two different sort of instances of the same system. But the idea is the same, which is, basically, there's a single system that has a lot of responsibilities, and you want to make sure that developers only have as much responsibility or impact as they need. And so we had to basically chop this system up into multiple components, which was extra work, and it requires us to maintain multiple instances of what looks like the same thing. But it's important to compartmentalize your responsibilities as much as you can. And that was a classic case of: it added a bunch of extra work for us, but the relative merits of it were easy to communicate. And that was one of those things that, you know, we're happy to do for infosec. Those requests come from them, and it's just something that we need to prioritize.
Max Saltonstall 14:48
It sounds like a good compromise. I feel like a lot of companies are struggling with that, you know, velocity, security, velocity, security, and the risk management often falls more on that security side, as you mentioned. Are there other compromises that you feel like you're digging through right now, where you're not quite sure how they're going to land, that you can share?
Will Charczuk 15:07
There are a couple of things that we're just growing more accustomed to over time. And I'm not sure how much this affects every team. But well, actually, this one would. So we don't let you emit PII in logs, or at least we really try to be super strict about it. And what that means is basically, we have linters that go through and look for logging messages and have them all highlighted, so you can see exactly what's being logged. But then also we have effectively different sort of buckets that logs end up in, as far as where they're stored and searched and the facilities that will let you search through them. And for the longest time, it was just really hard to debug stuff from production, because you couldn't really log stuff; you had to rely on sort of vague counters and metrics to figure out what was going on. And what we're doing is basically slowly building in infrastructure to allow us to view more and more logs, and standardize the whitelisting process and standardize how that works. It's more that the requirement was set, and we're sort of adding tooling to make it more ergonomic for developers. Another one is we have entirely separate machines we use to access production. So they're Chromebooks, they're managed by a central policy, they don't have your tools on them or anything. So it's a little bit tough to transition to try and do work on them. And really, you should only be using them for a short amount of time at a given moment. But that's something that we're slowly adding things to, and adding processes around, to make that a bit more ergonomic in terms of how you use it. Yeah, I don't know if that's what you want me to touch on, but I'm happy to go into other examples.
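The idea of whitelisting what may appear in logs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `ALLOWED_FIELDS` and the function `redact_log_record` are hypothetical, and nothing here reflects Blend's actual tooling.

```python
# Hypothetical allowlist: only these structured-log fields are emitted verbatim.
ALLOWED_FIELDS = frozenset({"event", "request_id", "status", "duration_ms"})


def redact_log_record(record, allowed=ALLOWED_FIELDS, placeholder="[REDACTED]"):
    """Return a copy of a structured log record with non-allowlisted values masked.

    Keys are kept so that a reader can still see which fields were present,
    but any value not explicitly allowlisted is replaced with a placeholder.
    """
    return {
        key: (value if key in allowed else placeholder)
        for key, value in record.items()
    }
```

The design choice in this sketch is default-deny: a new field is masked until someone adds it to the allowlist, which trades debuggability for safety in the way the conversation describes.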
Justin McCarthy 16:38
Let's think about something that's a struggle, right? Something that is a pull in two or more directions, and it's not quite clear how it's all going to come down, maybe because of, you know, competing priorities or competing, orthogonal metrics. And so there's no simple way to just say, here's the right answer.
Will Charczuk 16:58
I mean, we've actually been pretty lucky. Our infosec team explains stuff pretty well, and the stuff that they ask for is usually pretty reasonable. Another example might be: we have a VPC, or a sort of separate private network, that we operate in, in production. And we're eventually going to need to install an egress filter into it, which is basically preventing nodes within that private network from being able to talk to anybody on the internet, the idea being basically to prevent exfiltration of data. That process is kind of disruptive. Like, you need to install something from Docker, and you have to whitelist Docker. That's been a collaborative effort. And it's one of those things where it's incredibly necessary, but it's also one of those things where you're like, we really need to be super careful about this, because this could, you know, basically kill entire processes inadvertently. And that's the one I'm most sort of keeping my eye on, just because it has such a major implication in terms of how just basic code functions. Like, I totally get them, and the motivation behind it, which is protecting us from exfil and then eventually stuff for PCI, but it's also one of those things where, like, it's a very tough thing to tune and make sure it works. It doesn't really affect our team too much directly, other than when I try to go do an image build and suddenly we can't fetch things from a repository because the way it chops up packets is wrong for how we're doing the filtering. It's just like, who would have thought of that? But it ends up biting us. And so that's a weird thing. Like, if you are an engineer and you want the simplest process possible that will have the fewest, you know, vectors for failure, seeing something that will balk at packet size changes makes your spidey sense go off. But it's always sort of in the context of: here's the business reason why you have to do it, here's, you know, why this is super important. And so you sort of have more, again, empathy for why that needs to sort of happen.
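The filter Will describes, which limits which outside hosts production nodes may talk to, comes down to an allowlist decision on outbound connections. A minimal Python sketch of that decision is below. It is an assumption-laden illustration: real filters of this kind usually operate at the network layer rather than in application code, the hostnames in `ALLOWED_HOSTS` are placeholders, and `is_egress_allowed` is a hypothetical name. The packet-size problem he mentions is not modeled here.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts that production nodes may reach.
ALLOWED_HOSTS = frozenset({"registry.example.com", "packages.example.com"})


def is_egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only if the URL's host is explicitly allowlisted.

    Anything that cannot be parsed into a hostname is denied, so the
    default answer for unknown destinations is "no".
    """
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts
```

Under these assumptions, an image build that pulls from `https://registry.example.com/v2/` would be permitted, while an upload to an unlisted host would be refused. That is why every legitimate dependency source has to be whitelisted before the filter is switched on.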