CrowdStrike Outage Explained in Plain English

23 Jul 2024

Let’s break down the CrowdStrike outage in non-technical terms!


A LinkedIn follower asked me to opine on the CrowdStrike outage and describe, in plain terms, how the issue we see unfolding could have happened!

Technical Explanation

What is being described is a bit technical – a faulty update by CrowdStrike caused a kernel driver error that resulted in Windows Blue Screens of Death.

Let’s translate and explain.

Cybersecurity tools like CrowdStrike need deep access into the operating system to monitor other applications, data, connections, users, and even various OS telemetry:
1. To detect suspicious and malicious activity
2. To interdict, contain, and eradicate attackers and their methods when detected

That means these tools need some kernel access. The kernel is the deep core of an OS. It is where all the control resides. Think of it as the powerful brain...
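To give a flavor of what that kernel access looks like, here is a minimal, purely illustrative sketch of a Windows kernel-mode driver that asks the OS to notify it whenever a process starts – the kind of hook an endpoint sensor relies on. This is a generic example, not CrowdStrike’s actual code:

```c
/* Minimal illustrative Windows kernel-mode driver that registers a callback
 * so the OS notifies it whenever a process is created or exits. This is the
 * kind of "deep access" an endpoint security sensor needs.
 * Generic sketch only, not CrowdStrike's code. */
#include <ntddk.h>

static VOID ProcessNotify(PEPROCESS Process, HANDLE ProcessId,
                          PPS_CREATE_NOTIFY_INFO CreateInfo)
{
    UNREFERENCED_PARAMETER(Process);
    if (CreateInfo != NULL) {
        /* A real sensor would inspect the image path, parent process,
         * command line, etc. and decide whether this looks malicious. */
        DbgPrint("Process %p started: %wZ\n", ProcessId, CreateInfo->ImageFileName);
    }
}

static VOID DriverUnload(PDRIVER_OBJECT DriverObject)
{
    UNREFERENCED_PARAMETER(DriverObject);
    /* Unregister the callback when the driver unloads. */
    PsSetCreateProcessNotifyRoutineEx(ProcessNotify, TRUE);
}

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(RegistryPath);
    DriverObject->DriverUnload = DriverUnload;
    /* Ask the kernel to call ProcessNotify for every process create/exit. */
    return PsSetCreateProcessNotifyRoutineEx(ProcessNotify, FALSE);
}
```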

CrowdStrike

The CrowdStrike Falcon product is an endpoint agent (don’t worry about whether it is an EDR, MDR, or XDR – that is all marketing gibberish to try and sell stuff).
An endpoint agent is a program that resides on the PC, server, or device. It runs locally to do its security thing!

Cybersecurity Threats

Well, as attackers are constantly adapting and finding creative new ways to exploit systems, endpoint protections need to be updated to keep pace.
The cadence of these updates can vary greatly.

Back in the day, anti-malware was only run once a week or month and rarely updated. Nowadays, such products may receive new instructions several times a day.
Some endpoint products, which look for general anomalies, may not need updates for months or longer.

Anyways, CrowdStrike sent out a flawed update. And that is where we saw the problems begin.

Although that is not where the problems began… more on that in a minute.

Bad Update and BSODs

Okay, the flawed update, once delivered and loaded by the system, began doing some privileged things (remember, it has some deep access) that weren’t proper, which caused a critical condition detected by the OS.

For such critical conditions, the standard response for Windows is to halt all functions and display the dreaded Blue Screen of Death (BSOD). 
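To picture why a flawed update can take down the whole machine rather than just one program, here is a purely hypothetical sketch of kernel-mode code that blindly trusts an update file. The structure and logic are invented for illustration; this is not the actual Falcon defect:

```c
/* Purely hypothetical sketch: kernel-mode code that blindly trusts the
 * contents of an update file. Not the actual CrowdStrike defect. */
#include <ntddk.h>

typedef struct _UPDATE_RECORD {
    ULONG FieldCount;   /* how many entries the update claims to contain */
    PVOID Fields[8];    /* ...but only 8 slots actually exist            */
} UPDATE_RECORD;

VOID ProcessUpdateRecord(const UPDATE_RECORD *Record)
{
    ULONG i;
    /* BUG: a malformed update can claim more fields than exist, so this
     * loop walks off the end of the array and dereferences garbage.
     * In a normal program that crashes one application; in kernel mode,
     * Windows detects the invalid access and halts with a bugcheck,
     * i.e. the Blue Screen of Death. */
    for (i = 0; i < Record->FieldCount; i++) {
        ULONG value = *(ULONG *)Record->Fields[i];  /* invalid read */
        DbgPrint("Field %lu = %lu\n", i, value);
    }
}
```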

The BSOD is a throwback to one of the early versions of Windows, and the thinking was that if a program was doing something really naughty, the system should “fail safe” to protect the data and stop the harmful activities.

That was great 20 years ago; it was all we could really do given the limitations of hardware and software, but that architecture has never really evolved the way we needed it to. Remember, attackers have rapidly evolved, but the BSOD has largely remained the same.

Widespread Impacts

So, with CrowdStrike being one of the biggest cybersecurity firms and having a large user base, when the bad update went out, the BSOD screens began lighting up like Christmas at the Griswolds’ (that’s a reference to Christmas Vacation, Chevy Chase, go look it up).

To add insult to injury, security tools often load very early in the boot cycle so they can observe what other programs are doing at startup. Which means that even if you reboot, the problem reappears before it is easy to do anything about it: a nice cycle of Blue Screens and reboots.

To fix it is not necessarily easy or straightforward. There are instructions and tools from CrowdStrike and Microsoft now available, but even some of those require a tech person to visit the device. Not exactly fast or scalable.

Problem Origins

Okay, let’s go back to where the problem actually originated.

CrowdStrike is a behemoth of a security software company, well regarded in the community (well, until last week anyways).

They do updates, just like all endpoint security products, so this is not their first rodeo.
They likely have a mature software pipeline, which is standard fare for software companies.
Which means they have developers who create the updates. Those engineers have processes and tools to check for errors. Sometimes stealthy problems get past them, the kind that would not get noticed.

Then the code moves from the Devs to QA – that is the Quality Assurance team, which has a set of test scripts to validate functionality, backwards compatibility, performance, etc.

They are often tasked with finding those well-hidden or subtle issues, then reporting them back to the Devs for correction and resubmission into the QA testing channel, where the whole round of testing begins again.

This cycle continues until the code passes and approval is given to move the code from Pre-Production to Production where it is pushed out to the world.
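To make the “checking” part a bit more concrete, here is a hypothetical sketch of the kind of automated sanity check a release gate might run before an update is allowed out the door. The file format, names, and thresholds are invented for illustration; this is not CrowdStrike’s actual pipeline:

```c
/* Hypothetical sketch of an automated "release gate" check run before an
 * update is promoted to Production. File format and names are invented. */
#include <stdio.h>
#include <string.h>

#define EXPECTED_MAGIC  "UPD1"   /* invented file signature        */
#define MIN_UPDATE_SIZE 64       /* invented minimum sensible size */

/* Return 0 if the update file looks sane, nonzero otherwise. */
int validate_update(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) { fprintf(stderr, "cannot open %s\n", path); return 1; }

    char magic[4] = {0};
    size_t n = fread(magic, 1, sizeof magic, f);

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);

    if (n != sizeof magic || memcmp(magic, EXPECTED_MAGIC, 4) != 0) {
        fprintf(stderr, "REJECT: bad file signature\n");
        return 1;
    }
    if (size < MIN_UPDATE_SIZE) {
        fprintf(stderr, "REJECT: file too small (%ld bytes), possibly corrupt\n", size);
        return 1;
    }
    printf("PASS: %s (%ld bytes)\n", path, size);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <update-file>\n", argv[0]); return 2; }
    /* A nonzero exit code blocks the release, which is the whole point. */
    return validate_update(argv[1]);
}
```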
Okay, that was a lot. 

But here is the issue. This was not a well-hidden or minor bug that would go unnoticed. A BSOD is about as big as you can get!

This should have been easily caught by the Devs and absolutely caught by any QA team with a pulse. Really, even a new QA person on day 1 would notice this!

So, how did it get past all those checks? Nobody at CrowdStrike is saying (which is a little odd).

But this is where the problem actually began! …and that is why over 8 million devices, and the digital services that depend upon them, need to be fixed.

Although I won’t speculate right now about what might have caused the bad code to get past all those checks and balances to be pushed, I might publish something later if people are interested in my thoughts on the matter.

So, I hope this helps explain how we got into this mess, given the current data we possess, and why so many are frustrated.


Thanks for watching! Be sure to subscribe for more Cybersecurity Insights!
Follow me on LinkedIn: https://www.linkedin.com/in/matthewrosenquist/ and on my YouTube channel for more Cybersecurity Insights: https://www.youtube.com/CybersecurityInsights
