Speaker 0
0:00 – 10:00
Hello. I'm Ryan Cook, and this is Civic Tech Chat, a show that looks at the way technology, politics, and policy impact the world around us. The tools we use, the way services are delivered, and how we talk about and set policy all shape our society. We'll gather around and have a chat about these things together and more. Before we get started, I do wanna let you all know that we've started a Discord for the podcast. There will be a link with an invite down in the episode description. Do feel free to go check that out. It's a small community right now, but I'm hoping to grow it. It's a great way to reach out to me and let me know things that you might want us to cover or to just hang out and talk about civic tech. Anyway, let's go ahead and start the show. Alright, folks. Today, we're gonna talk about the CrowdStrike incident that happened on July 19. Now that we're about, oh, close to two weeks from the incident, I thought it'd be a good idea to kinda gather what information I can and put it together into a sort of audio briefing for you. So first, let's talk about why you should care about this. As I mentioned, on July 19, there was a major IT outage, and it was global in scale. Something like over 5,000 commercial flights were impacted, which, if you're someone who was trying to take a flight that day, I'm sure you have a personal experience to weigh in on there. All kinds of other businesses were affected though as well, and services were disrupted. Everything from stores trying to check out goods to hospitals trying to give critical medical care. It's one of those things where it's difficult to quantify the damages, but an early estimate from the Anderson Economic Group based in Michigan puts it at somewhere greater than a billion dollars. This problem reveals systemic weaknesses in the way we handle our information systems. A problem with one vendor causing this big of an impact across the globe is probably not a great situation.
So let's talk about what CrowdStrike is. It's a company that makes a series of IT security and management products. In the past, they've been known for detecting security breaches, like the Sony Pictures hack of 2014 and a series of Russian cyberattacks on the Democratic National Committee in 2015 and 2016. Folks might remember seeing the headlines for those. They also serve a bunch of enterprise customers, including 500 in the Fortune 1000, and that's according to their own website. The product of theirs at the center of all of this with the CrowdStrike outage is called Falcon. It seeks to be a sort of center point for a collection of services covering things like antivirus, threat detection, and real-time monitoring. So what happened? You might be wondering. Well, from what I can tell and what I can see kinda gathering from sources online, CrowdStrike pushed an update for that aforementioned Falcon product. It uses a kernel-mode device driver running in what's referred to as ring zero. So you can imagine you have a computer, and the space this application has access to effectively gives it complete control over that individual computer or endpoint. By comparison, other software that might run on your machine, whether it's office processing software, a web browser, or something that runs chat like Discord, all runs in user space, where it has a certain amount of permissions to be able to operate, but it doesn't have the ability to change certain system functions. As you might imagine, a piece of software having this level of access to a machine can create some risk if something were to go wrong, like what happened recently. According to CrowdStrike's post-incident review, a content configuration update was released which resulted in the Windows crash that folks encountered.
That's why if you were at the airport that day, for example, you might have looked around trying to find when your flight time was gonna be and just saw that dreaded Windows blue screen of death. That same report indicates that this content had an undetected error that managed to get past layers of unit testing, integration testing, other forms of automated testing, and a staged rollout. This all leads to the question of, well, why does Falcon have this sort of kernel-level access if it's so risky? Microsoft has asserted that there is a 2009 European Commission ruling that is in part to blame for the situation. It requires that third-party products must be able to interoperate with Windows on an equal footing with Microsoft's own offerings. As a result, it's possible Microsoft might try to use the situation as leverage to make an argument against the conclusion of that ruling, as they might try to claim that the ruling is what forces them to be in a position to allow these third-party companies to have that kernel-level access. But on the other hand, other operating systems like, say, macOS have managed to provide API access for things like antivirus providers, and they're able to operate without having to get that ring zero access that we've talked about. So if you're thinking, hey, it's a bit concerning that we're relying on something with this kind of systemic issue for critical infrastructure, I would say you're in the right headspace. I think this is something that we're gonna be having to worry about for some time. As the outage happened, as you might imagine, folks were working on trying to get some sort of fix in place, as we needed things to be up and running again. And we might be wondering what that is. So the initial fix that was put out was a set of instructions that were manual steps to be performed on each individual computer.
Effectively, you had to go to that endpoint and boot it up into something called safe mode, which is kind of like a maintenance mode for Windows. And you would then have to go in and delete a specific file that had a naming pattern that I believe had the .sys extension. Once you did that activity, you were then supposedly able to reboot and things returned back to normal. Now as you might imagine, if you're running a large organization, like say an airline, it probably takes a lot of humans and a lot of time to get around to thousands upon thousands of endpoints to do this repair, which perhaps is some explanation for the length of time the outage went on for. Microsoft later also released a recovery tool, which was meant to automate some of these steps, where you could create a bootable USB stick, you know, plug it in, boot off of it, and it would kind of run through those steps that I just described. There's also kind of a goofy public relations thing that happened after this fix kinda went out, which is, reportedly, CrowdStrike sent out a bunch of $10 Uber Eats gift cards. Though it was reported that there was potentially a false positive on the fraud side with Uber Eats, where some folks weren't able to redeem them. CrowdStrike was saying this was potentially because of just the large number of people trying to use the same setup all at once. I wasn't able to confirm that particular back and forth. So as we kinda get to the end of this summary of what happened, I think something to think about is what lessons can we learn? As folks, many of us are out here trying to build our own tools to serve folks out there in the world. What are things we could do to avoid putting ourselves in this kind of situation? The advice I would give is, for your own projects, make sure you're investing time in your layers of automated and manual testing.
Really consider where areas of risk could be and try to align your testing strategy to cover those specifically. Do those kind of pre-mortem activities. Ask yourself where can things go wrong? What impact can those events have? In particular, if there are business-critical functions that your app must do in order to be considered up and running, those are the areas where you really want to spend time on those expensive tests. For example, if you need to write end-to-end tests for something, they should align with the risks to it. I would also say it's important to try to build systems such that changes are easy to undo or roll back. And this is something that applies both to technical systems and people systems. A big outage is laborious to resolve if there's no easy way to step back and undo the thing. Something you can do is really look into your technical architecture, your systems design, and ask yourself, you know, if I push something out to prod and it goes wrong, is it something that I can roll back easily? Can I simply do another deploy and it comes out? Or do the changes we push, due to the nature of how the system is put together, put us in a situation where we have to go only forward? And similarly, I would also have you consider your change management process with that. Are you able to move quickly when something goes wrong? If you do have to fail forward, are you able to do so in a way where you can push out a change quickly, or are there many obstacles in your team's way as they do that work? Again, as you're thinking about these things, really focus on those areas of high risk like we talked about before. Thanks, folks, for listening in here. This was a little bit of a briefing I wanted to put together on CrowdStrike. Coming up soon, we should have some more interview content as we continue into the content year. Thank you for listening. You can follow us on Twitter using the handle at civic tech chat.
Visit us on the web at civictech.chat, or subscribe to us for content updates wherever it is you download your podcasts.