Unpacking the CrowdStrike outage from a QA perspective
The recent CrowdStrike outage affected millions of people around the world — and the QA industry is still watching and analyzing the aftermath.
Six days after the incident, flights are still being canceled, and IT folks are still restoring computers from the Blue Screen of Death (BSOD). I am unsure of the exact financial damage done by the disaster, but early estimates put it at around $5.4B for Fortune 500 companies alone.
What we can interpret from the CrowdStrike report
CrowdStrike finally published their report. What a read!
Essentially, their app is designed to detect and prevent attacks. To successfully prevent attacks, the app operates at a very low level, specifically at the kernel level of the operating system. When an app works at such a low level and there’s a bug, the whole OS might halt, which is exactly what happened last week.
The technicalities of the issue are quite simple. Their app has two ways of detecting attacks: the main heuristic-rich engine “with AI” does most of the work on its own, and an additional layer of detection comes through regular config updates, where the configs are essentially pattern-matching instructions. The main engine loads these new configs as they are released, checking each one for validity before loading it.
This time, a bug in the validation module let a problematic config slip through, and loading it brought the system down:
When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).
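To make that failure mode concrete, here is a minimal, purely illustrative sketch in Python (nothing like CrowdStrike's actual code, which runs inside a Windows kernel driver): one interpreter trusts the config blindly and reads out of bounds on malformed content, while the other checks bounds and degrades gracefully.

```python
# Purely illustrative, user-space analogue of the failure -- not CrowdStrike's
# code. A "channel file" is modeled as a list of pattern records, and each
# record as a list of string fields.

EXPECTED_FIELDS = 21  # assumed field count the interpreter was written for

def interpret_record(record: list[str]) -> str:
    # Naive interpreter: trusts that validation caught anything malformed and
    # indexes blindly. A short record raises IndexError here -- the user-space
    # analogue of an out-of-bounds memory read. In kernel mode there is no
    # safety net above you, so the whole OS halts (BSOD).
    return record[EXPECTED_FIELDS - 1]

def interpret_record_defensively(record: list[str]) -> str | None:
    # Defensive interpreter: re-checks bounds at the point of use and skips
    # the malformed record instead of taking the system down.
    if len(record) < EXPECTED_FIELDS:
        return None  # log and ignore the bad content
    return record[EXPECTED_FIELDS - 1]

malformed = ["pattern"] * 20  # one field short of what the code expects
print(interpret_record_defensively(malformed))  # None -- handled gracefully
# interpret_record(malformed) would raise an unhandled IndexError
```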
The approach of having a core functionality-rich system with loadable modules for updates or additional configs is far from new in IT.
In gaming, SAP-like programs, and other enterprise software, various configurations, templates, and even Domain-Specific Languages (DSLs) have been used for ages. This separation is chosen when it is the cheaper, more efficient way to deliver flexibility and fast updates. For example, updating the entire SAP system for all users would mean downloading gigabytes of data and installing the new version over several hours or even days, whereas updating a configuration file involves transferring only a few kilobytes and can be done without significant disruption.
However, this modularization always comes at a cost. For example, parsing and integrating the downloaded config needs additional testing of its own; without it, a fresh config can corrupt the whole system, and the modularization does damage instead of adding value.
It’s generally safe to say that the cost of modularization should be lower than the benefit achieved from it.
When comparing these costs, additional testing should always be accounted for. It’s QA 101. If the main app crashes because the loaded module is faulty, then the whole modularization wasn’t worth it.
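As a rough sketch of what that additional testing looks like in practice, here is a hypothetical config loader (`load_config` is my invention, not any real CrowdStrike API) together with a test that feeds it deliberately broken inputs and asserts the host application survives every one of them:

```python
# Sketch of the extra testing that modularization demands: treat the config
# parser as an attack surface and prove it never brings the host app down.
import json

def load_config(raw: bytes) -> dict | None:
    """Parse a config update; return None (and keep running) on bad input."""
    try:
        config = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(config, dict) or "patterns" not in config:
        return None
    return config

BROKEN_INPUTS = [
    b"",                      # empty file
    b"\x00\xff\xfe\x00",      # binary garbage
    b'{"patterns": ',         # truncated JSON
    b'{"unexpected": true}',  # valid JSON, wrong shape
]

def test_loader_never_crashes():
    for raw in BROKEN_INPUTS:
        assert load_config(raw) is None  # rejected, but the app survives

test_loader_never_crashes()
print("loader rejected all malformed configs without crashing")
```

A fuzzer generates the same kind of inputs automatically; the point is that the parser is treated as untrusted input handling, not as a trusted channel from your own build system.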
How did CrowdStrike allow this to happen?
What’s interesting is that CrowdStrike seems to understand how testing works in general, or at least they claim to: they say that for the main app they employ dogfooding, unit testing, integration testing, performance testing, and stress testing. They also say they deploy the main app to early adopters first and let customers choose which version to update to, so each customer decides whether to run the most up-to-date version or a slightly older one (which will likely detect fewer attacks but has already been exercised by other companies).
However, they didn’t pay much attention to testing the modularization approach they have; they simply didn’t test the config parser and loader well enough.
The proof is in their planned mitigation measures:
“How Do We Prevent This From Happening Again?
Software Resiliency and Testing
- Improve Rapid Response Content testing by using testing types such as:
  - Local developer testing
  - Content update and rollback testing
  - Stress testing, fuzzing and fault injection
  - Stability testing
  - Content interface testing
- Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
- Enhance existing error handling in the Content Interpreter.
Rapid Response Content Deployment
- Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
- Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
- Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
- Provide content update details via release notes, which customers can subscribe to.”
In other words, they are only now starting to test whether the new update modules even parse correctly, and they will also start doing canary deployments to reduce the risk of undiscovered bugs causing major disruptions.
This means they will now pay more to support their modularization and, in return, will get a lower risk of failure.
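The canary part is also easy to picture. Here is a minimal sketch of a staggered rollout, assuming deployment happens in waves and halts automatically when crash telemetry from the current wave exceeds a budget; all names, thresholds, and the fake telemetry are illustrative, not CrowdStrike's:

```python
# Minimal sketch of a staggered ("canary") rollout: deploy in waves, watch
# crash telemetry, and stop before the blast radius grows.

WAVES = [0.01, 0.05, 0.25, 1.00]  # fraction of the fleet targeted per wave
CRASH_BUDGET = 0.001              # max tolerated crash rate before halting

def deploy_to_fraction(update_id: str, fraction: float) -> None:
    print(f"deploying {update_id} to {fraction:.0%} of the fleet")

def observed_crash_rate(update_id: str, fraction: float) -> float:
    # Stand-in for real telemetry; pretend the update crashes every host.
    return 1.0

def rollout(update_id: str) -> bool:
    for fraction in WAVES:
        deploy_to_fraction(update_id, fraction)
        if observed_crash_rate(update_id, fraction) > CRASH_BUDGET:
            print(f"halting {update_id}: crash budget exceeded, rolling back")
            return False  # only the first small wave was ever affected
    return True

rollout("channel-file-291")
```

With logic like this, a config as broken as Channel File 291 would have hit a small slice of machines instead of nearly the whole fleet.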
CrowdStrike’s response is disappointing
Given that CrowdStrike clearly knows how to implement testing procedures, the only explanation for why they didn’t test the modularization is quite sad: I believe they decided to cut costs even further.
This scary thought is somewhat supported by the way they handled damage control: Kevin Benacci, a CrowdStrike spokesperson, confirmed to TechCrunch that CrowdStrike sent Uber Eats gift cards to "teammates and partners who have been helping customers through this situation".
Dr. Deming, the Father of Quality, once said, “Minimizing costs in one place can often lead to maximizing costs in another.” CrowdStrike minimized the cost of testing, and now they are paying much more in reputation damage. CrowdStrike also minimized the cost of dealing with the reputation damage, and this hits them even harder.
I would love to see their postmortem go deeper, analyzing why so little money was spent on testing and whose decision that was. I would hope to see some real accountability from management. Unfortunately, nothing suggests I will:
- CrowdStrike’s managers have decided they can’t be held accountable for anything.
- CrowdStrike CEO George Kurtz was once the CTO at McAfee, and under his watch a security update from that antivirus firm crashed tens of thousands of computers worldwide. It was a very similar case, and nothing happened to Kurtz’s career; he’s now a CEO.
Quality starts at the very top, with the managers and the CEO. If management — as Nassim Taleb argues in his book “Skin in the Game” — is not held accountable for the misconduct their companies commit, nothing will improve.
Sad times.