How the sunk cost fallacy is sinking your tool stack
In a postmortem, CrowdStrike revealed that an issue in its own testing tools was the cause of the world’s biggest IT outage. A set of testing/QA protocols was supposed to validate new content before release, but the validator itself had a bug, so it missed the flaw that caused the outage.
Errors like these reveal just how much we depend on our tooling and force us to face the reality of how rarely we ensure that our tools — especially our security and QA tools — are actually working.
Despite the critical role tools play in our work, companies are often lax about their tool stacks because they fall prey to the sunk cost fallacy: they overemphasize the discomfort of trying new tools and underestimate the costs of keeping their current ones.
If you’re not careful, you can sink more and more money into your sunk costs without realizing what you’re losing.
Why companies throw good money after bad
The sunk cost fallacy is a thought pattern that happens when you invest money, energy, or time into something and then resist cutting your losses when that effort isn’t working out because you don’t want to lose your investment.
If you’ve purchased a ticket to a movie, for example, you might sit in the theater and watch the whole thing even though you know within the first thirty minutes that it won’t be good. The logic is false but tantalizing: You paid for a ticket, so you should watch the whole thing and “get your money’s worth,” right? But no, if you know it’s going to be bad and you can’t get a refund, all you’re doing by watching the rest of the movie is losing time as well as money.
The sunk cost fallacy has its origins in economics. In brief, a sunk cost is an expense that can’t be recovered. In economic theory, sunk costs shouldn’t influence future decisions because businesses and rational actors should only consider future costs, investments, and returns.
But, of course, humans aren’t rational actors. In the 1970s, psychologists Amos Tversky and Daniel Kahneman (of “Thinking, Fast and Slow” fame) popularized the idea of cognitive biases, a topic that bloomed in the following decades. Richard Thaler, an economist and behavioral scientist, borrowed from both fields and coined the term “sunk cost fallacy.”
Keeping both origins in mind is important because the sunk cost fallacy affects people in personal and systemic contexts. Often, a web of people suffering from the sunk cost fallacy can lead to organization-wide assumptions about what everyone should believe and what expenditures teams should maintain.
Jason Cohen, founder of WP Engine and Smart Bear, explains using the hypothetical example of a company building a new manufacturing plant. The company spends $20 million on the new plant but then discovers it will cost another $10 million to finish. As it ponders the choice, a new opportunity emerges: retrofit an old plant for only $2 million.
“From a completely rational perspective,” Cohen writes, “They should abandon the original project. The $20m they’ve spent can’t be recovered (let’s say), so it’s now just a choice between spending $2m or $10m. Duh.”
But, the sunk cost fallacy comes into play and is then amplified by conflicting incentives in the business. “You can imagine the internal politics of someone standing up and saying, ‘I’m responsible for this hugely wasteful endeavor, and now I want you to trust my judgment as I pull the plug and do something completely different,’” Cohen writes.
This same dynamic plays out in every organization around the world. At a baseline, people tend to resist cutting their losses, so when you add in company politics, it can be difficult to even propose big reinvestments.
How companies slip into sunk-cost thinking
The sunk cost fallacy, like other cognitive biases, seems to have a simple solution: Now that you know about it, can’t you just recognize it and not fall for it?
In reality, the sunk cost fallacy tends to come from numerous other cognitive distortions and ways of thinking. Especially in an organizational context, no single person is committing the fallacy; it’s an emergent result of other forces.
Region-beta paradox
The region-beta paradox describes how people tend to recover more quickly from worse experiences than better ones. This is often because a worse situation triggers a reaction strong enough to actually resolve the situation, whereas a mediocre situation might not be bad enough to trigger a solution.
A common idiom that gets at this idea is the frog in boiling water metaphor. In this metaphor (which isn’t how real frogs actually behave), a frog dropped into boiling water will immediately jump out, whereas a frog in water that’s slowly heating up will stay there until it boils.
In practice, an example of this is code quality issues that appear small at first but build up over time. There are many moments when it’s okay for developers to take shortcuts and QA teams to approve them, but if teams don’t reassess tech debt often enough, those shortcuts can come back to haunt them.
If the code quality of the very first commit were disastrous, the team would likely reject it outright; but when code quality erodes bit by bit, you can end up surprised by the same sort of disaster later on.
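As a minimal sketch of that dynamic (the metric and both thresholds are made up for illustration), a per-change check never fires, while a comparison against an older baseline does:

```python
# Illustrative only: weekly lint-warning counts, oldest to newest.
weekly_warnings = [110, 112, 115, 114, 119, 123, 128, 131, 137, 142]

PER_WEEK_LIMIT = 10   # a per-change threshold like this never trips...
DRIFT_LIMIT = 25      # ...but a check against an older baseline does

deltas = [b - a for a, b in zip(weekly_warnings, weekly_warnings[1:])]
drift = weekly_warnings[-1] - weekly_warnings[0]

print(f"largest single-week jump: {max(deltas)} (limit {PER_WEEK_LIMIT})")
print(f"drift since the baseline: {drift} (limit {DRIFT_LIMIT})")

if max(deltas) <= PER_WEEK_LIMIT and drift > DRIFT_LIMIT:
    print("No single week looked alarming, but the trend does. Time to reassess.")
```

Swap in whatever your team already measures (lint warnings, build times, escaped defects); the point is to compare against a baseline old enough to make the slow slide visible.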
Normalization of deviance
Normalization of deviance is the process by which people learn to accept a certain level of errors, mistakes, or flaws. The foundation of this bias is sound: People are not perfect, and some error rate has to be allowable (even a 99.9% success rate, for example, still means one failure in every thousand attempts).
Dan Luu, a former senior software engineer at Twitter, explains, “You can see this kind of thing in every industry.” In healthcare, for example, we can expect doctors and nurses to know the danger of germs, but we also need to remind them over and over again to wash their hands.
If you want humans to be perfect or near-perfect, then you can rarely leave them to their own devices.
Closer to home, Luu compares this dynamic to how companies try to introduce better coding practices. “If you tell people they should do it, that helps a bit,” he writes. “If you enforce better practices via code review, that helps a lot.”
You can imagine the normalization of deviance, then, as a natural regression to the mean. If you want software teams to maintain extremely high levels of quality, then you need to reinforce those practices consistently and keep them close to the work itself.
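What does keeping practices close to the work look like? One option is to make the standard executable. Below is a minimal sketch of a pre-merge quality gate; the report file names, JSON fields, and thresholds are assumptions for illustration, and you would wire it into whatever CI system and tooling your team already runs:

```python
#!/usr/bin/env python3
"""Minimal quality-gate sketch: fail the build when agreed-upon floors are crossed."""
import json
import sys

MIN_COVERAGE = 85.0      # hypothetical team-agreed floor, not a universal standard
MAX_LINT_WARNINGS = 20   # hypothetical warning budget; ratchet it down over time


def main() -> int:
    # Assumes earlier CI steps wrote these summary files; adapt to your own tools.
    with open("coverage-summary.json") as f:
        coverage = json.load(f)["total_percent"]
    with open("lint-summary.json") as f:
        warnings = json.load(f)["warning_count"]

    failures = []
    if coverage < MIN_COVERAGE:
        failures.append(f"coverage {coverage:.1f}% is below the {MIN_COVERAGE}% floor")
    if warnings > MAX_LINT_WARNINGS:
        failures.append(f"{warnings} lint warnings exceed the budget of {MAX_LINT_WARNINGS}")

    for failure in failures:
        print(f"quality gate: {failure}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

The specifics matter less than the placement: the check runs on every change, so the standard never depends on someone remembering to hold the line.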
Normalcy bias
Normalcy bias is the tendency to doubt, without evidence, that a situation can change once it has gone one way for long enough. Often, this results in people ignoring outright threat warnings because they’re convinced that normalcy will prevail.
At its worst, normalcy bias has caused people to understate the possibility of disaster, leading them to under-prepare for natural disasters and market crashes.
In less dramatic terms, software teams can often encounter normalcy bias when they assume, based on what appears to be evidence, that their code is secure and free of vulnerabilities. The evidence often feels correct because, if nothing has happened yet, you don’t know what you don’t know. But when testing or security tools do flag something that might be amiss, normalcy bias can tempt you to skim the warning and dismiss the possibility that something is actually wrong.
A related bias is self-serving bias, which occurs when people confuse the quality of an outcome with the quality of the decision that caused the outcome.
Marianne Bellotti, author of “Kill It with Fire: Manage Aging Computer Systems (and Future-Proof Modern Ones),” explains, “when things go well, we overestimate the roles of skill and ability and underestimate the role of luck. When things go poorly, on the other hand, it’s all bad luck or external forces.”
Between normalcy bias and self-serving bias, it’s very tempting to assume that a currently good situation is a result of your practices and not luck (that could easily turn around tomorrow). As Bellotti puts it, “Success and quality are not necessarily connected.”
The results of sunk cost and sinking tooling
Despite the universality of the cognitive biases we’ve talked about so far, the consequences don’t land evenly; a single overlooked flaw can cascade into a global outage (as we saw with CrowdStrike). If you’re aware of the potential to slip into these biases, take stock of the risks and red flags that follow.
Alarm fatigue
A 2022 study shows that over half of developers experience flaky tests on at least a monthly basis. If tests are this unreliable, QA teams shouldn’t be surprised if developers start experiencing alarm fatigue.
Alarm fatigue occurs when people experience too many alarms and too much noise, which causes them to disengage from warnings altogether. This effect is most studied in medical research because doctors and nurses are often surrounded by a wide range of beeps, boops, and even chirps.
The problem with alarm fatigue is that, over time, too many alarms can create worse results than having just a few (even if that smaller set of alerts doesn’t cover everything). For QA teams, the best thing to do is look at the signal-to-noise ratio (the proportion of reliable tests to flaky ones) and reassess whether the tools they’re using, or the test suites they’ve built, are doing more harm than good.
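One way to start is simply to measure the noise. Here is a rough sketch (the data shape, test names, and threshold are hypothetical) that flags tests that both passed and failed on recent runs, a crude but useful proxy for flakiness:

```python
from collections import defaultdict

# Hypothetical recent CI history: (test name, passed?) pairs.
# In practice, pull this from your CI system's test reports.
runs = [
    ("test_checkout_flow", True), ("test_checkout_flow", False), ("test_checkout_flow", True),
    ("test_login", True), ("test_login", True), ("test_login", True),
    ("test_export_report", False), ("test_export_report", False), ("test_export_report", False),
]


def flaky_tests(runs, min_runs=3):
    """Flag tests that both passed and failed in the window, with their failure rate.
    Tests that always fail are broken rather than flaky, so they aren't flagged here."""
    history = defaultdict(list)
    for name, passed in runs:
        history[name].append(passed)

    report = {
        name: results.count(False) / len(results)
        for name, results in history.items()
        if len(results) >= min_runs and any(results) and not all(results)
    }
    return dict(sorted(report.items(), key=lambda kv: kv[1], reverse=True))


print(flaky_tests(runs))  # {'test_checkout_flow': 0.333...}
```

A report like this won’t fix anything by itself, but it turns “the tests feel noisy” into a number you can track, quarantine against, and compare before and after a tooling change.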
Burdensome work leads to low morale
When you treat your current tooling as a sunk cost, perhaps dressing it up as “good enough,” you can easily end up hiding the poor experience people have using those tools. A poor experience, however, doesn’t just weigh on individual testers or developers; over time, it damages team-wide productivity.
Recent Atlassian research shows that 69% of developers lose eight hours or more per week to inefficiencies, and less than half of developers think their organizations prioritize developer experience. This is dangerous, not just because of the sheer inefficiency, but because two out of three developers consider leaving their companies when the developer experience isn’t satisfactory.
Of course, developer experience has a much wider remit than the QA team’s, but the QA team can contribute by resisting sunk-cost thinking and improving the QA tooling in use. A poor developer experience is death by a thousand cuts, so even incremental fixes can add up to a meaningful difference.
Gambler’s fallacy and unpredicted disaster
The last fallacy we want to address is the gambler’s fallacy. As the name implies, people fall into a gambler’s perspective, assuming that a random event is more or less likely to occur simply because of how frequently it has occurred in the past.
In gambling terms, this is often a “positive” bias because gamblers are liable to think they’re “due” after a string of bad cards. In reality, the next hand is just as random as the last. In QA and security terms, this is often a more “negative” bias because teams will assume they’re safe from an adversarial attack just because they haven’t experienced one yet (or in a long while). In reality, precedent explains little; the vulnerability may have been there all along, and no one has happened to exploit it yet.
Though we can’t peer behind the scenes of the CrowdStrike disaster, it’s reasonable to suspect something like the gambler’s fallacy was at play. Combined with the other biases we’ve talked about here (the sunk cost fallacy, the heaviest of all), it’s easy for any team to slip into thinking they’re safe when they’re not.
Opportunity cost: The tradeoff you can’t account for
Throughout this article, we’ve focused on the risks and costs of the sunk cost fallacy, but we haven’t talked about the biggest one of all: Opportunity cost.
Opportunity cost is the cost you weather when you choose to stick with the status quo instead of searching for something better. Unlike the other costs, which you can often calculate by comparing current conditions against better conditions, opportunity cost is almost impossible to predict at the outset.
If you’ve stuck with the same tools for years, you might not know just how much better this or that category of tooling has gotten since you first built your tech stack. If you’re not keeping up with the market, you might not realize a niche tool built for your exact use case has gotten traction.
If you’ve accepted “good enough,” you might just not fully realize how good things could be when you get a better feedback loop spinning, when alarm fatigue is down, when the developer experience is up, and when testing processes are trusted.
As Bellotti writes, “It is easy to build things, but it is difficult to rethink them once they are in place.” But the ability to rethink them is the only way to keep improving and stop sinking.