UPDATED as of 2 August 2024
On 25 July 2024, CrowdStrike released a preliminary Post-Incident Report (PIR). It confirms the broad contours the community suspected over the incident weekend, but remains frustratingly vague on technical specifics and doesn’t address the deeper business process questions of how this could happen to an org like CrowdStrike.
You’ve probably heard of the great CrowdStrike outage of July 2024. At the time of writing we don’t have a detailed postmortem yet, but for those just catching up: on 19 July 2024, CrowdStrike pushed a faulty content update (that is, a detection update rather than a software update to the agent per se) to customers. Because the CrowdStrike EDR agent is implemented via a highly privileged kernel-mode driver that loads early in the boot process - an early-load anti-malware (ELAM) driver - this caused millions[1] of Windows endpoints around the world to enter blue screen of death (BSOD) loops, crippling a large share of global IT infrastructure in the worst IT outage to date.
As mentioned above, we currently don’t have a full postmortem of how this happened. There are both micro and macro questions here. The micro question is what exactly the software flaw was, while the macro question is a process or organizational question of how this made it through QA. The best working hypothesis for the former is currently that a logic bug in the content update - a channel file - caused the kernel driver to attempt to read from an invalid memory address, although the precise nature of the flaw is hard to ascertain as CrowdStrike obfuscates their channel files to protect against competition.[2] The latter is much harder to answer, or even to discuss rationally, in the absence of more detail from CrowdStrike. There is some speculation by former CrowdStrike employees that QA has gone downhill in recent years, but nothing concrete yet. (UPDATE as of 2 August 2024: this is still true; the preliminary PIR is frustratingly vague.)
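To make the working hypothesis a little more concrete, here is a minimal sketch of the general class of bug: a parser that trusts a count declared inside a content file and reads past the end of the data. Everything in it is an assumption made for illustration. The file layout (a 2-byte count followed by 4-byte entries) and the function names are hypothetical, and it is not CrowdStrike's channel file format or driver logic, neither of which is public in detail.

```python
import struct
from typing import List, Optional

# Hypothetical content layout (NOT CrowdStrike's real channel file format):
#   2-byte little-endian entry count, followed by that many 4-byte entries.


def parse_entries_unchecked(blob: bytes) -> List[int]:
    """Trusts the declared count, like a parser missing a bounds check."""
    (count,) = struct.unpack_from("<H", blob, 0)
    return [struct.unpack_from("<I", blob, 2 + 4 * i)[0] for i in range(count)]


def parse_entries_checked(blob: bytes) -> Optional[List[int]]:
    """Validates the declared count against the real length before reading."""
    if len(blob) < 2:
        return None
    (count,) = struct.unpack_from("<H", blob, 0)
    if len(blob) < 2 + 4 * count:
        return None  # reject malformed content instead of reading past the end
    return [struct.unpack_from("<I", blob, 2 + 4 * i)[0] for i in range(count)]


good = struct.pack("<HII", 2, 10, 20)  # declares two entries, carries two
bad = struct.pack("<HI", 2, 10)        # declares two entries, carries one

print(parse_entries_checked(good))  # well-formed content parses normally
print(parse_entries_checked(bad))   # malformed content is rejected

try:
    parse_entries_unchecked(bad)
except struct.error as exc:
    # In user-mode Python the bad read is a catchable exception.
    print("unchecked parser failed:", exc)
```

The important difference is where the failure lands. In this user-mode sketch the out-of-range read is just an exception that the caller can handle. In a kernel-mode driver, an equivalent read from an invalid address typically triggers a bug check, that is, a BSOD, and if the driver loads the same content again on every boot, the machine ends up in a crash loop.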
Putting both the micro and macro questions about this specific incident aside, a broader question that many people outside the security industry have quite reasonably asked is how this could happen at all. Within infosec, the incident has reignited foundational discussions about the risks involved in different approaches to endpoint security. A common thread in both of these parallel discussions is “why is CrowdStrike running so much in kernel mode?”
In this series, my goal is to help answer that question, or more precisely to provide technical readers outside of security, or in a different subfield of security, with the context to think productively about the tradeoffs involved with modern EDR. Most articles I have seen about this topic so far tend to be either technical deep-dives into Windows EDR internals targeted at my fellow offensive security types, which rapidly lose sight of the big picture, or, conversely, big-picture thinkpieces that are so vague and non-technical as to be unhelpful or outright misleading. I hope I can strike a balance here that appeals to a range of readers, although my imagined “target reader” is a technical professional or enthusiast outside of infosec.
The series will be structured as follows. Part 1 will introduce, from a highly general perspective, the “how” and “why” of modern EDRs - how they are structured, and why they are structured that way. Here, I hope to introduce the problem space independent of a specific operating system (OS) or implementation, although I will touch on various specifics. If you read only part of this series, read this part and part 4.
In part 2, we will go deeper into the technical details of how mainstream Windows EDRs work. Since this is a topic that has already received a lot of open-source coverage, my focus here will be on synthesizing and presenting the information with an eye to understanding the tradeoffs that led to the CrowdStrike incident and, once again, to thinking productively about the design decisions implicated.
In part 3, we will turn to the technical details of EDR on Linux and macOS. This is an area that is under-covered in the open source/public literature. Once again, however, my goal here is not to provide a comprehensive review, but rather to introduce the subject enough to think intelligently about the questions of EDR design and tradeoffs raised by the CrowdStrike incident.
Finally, in part 4, we will bring the previous parts together to tentatively explore the question of whether and/or when the kernel-mode-heavy approach of mainstream Windows EDRs like CrowdStrike is worth the inherent risks.
Footnotes
1. https://blogs.microsoft.com/blog/2024/07/20/helping-our-customers-through-the-crowdstrike-outage/
2. The best summary so far is Michael Cahyadi’s.