Flaky tests: How to avoid the downward spiral of bad tests and bad code

If you’ve ever been to the hospital, you’re likely very familiar with how loud it can be. Beeps, boops, dings, and other sounds ring almost constantly. As a patient, it’s a bit of an annoyance, but for medical professionals, it can be deadly.

The problem is called alarm fatigue, and the ECRI Institute has identified it as one of the top health technology risks for over a decade. You don’t need to be a medical professional to see why: The sheer volume of alerts and alarms — plus the fact that many are unreliable or miscommunicate urgency — means people start ignoring them over time.

Developers, though rarely in as dangerous a situation, face a similar problem with tests. The more flaky, unreliable, and non-deterministic a test suite is, the more likely developers are to feel a version of alarm fatigue and the less likely they are to trust their tests.

As a result, testers need to treat flaky tests as an urgent priority. Every flaky test puts two things at stake: the consequences of that one test misfiring and the trustworthiness of the whole test suite.

What are flaky tests?

Flaky tests aren’t merely tests that don’t work. Flaky tests are non-deterministic, meaning they can pass or fail against the same code without that code being changed.

If you were a student and failed an exam, for example, you would have reason to distrust your teacher if you resubmitted the same answers and got a passing grade. Passing is one thing, and failing is another, but non-determinism means a tame problem has become a wicked problem.

Generally, flakiness takes one of three forms:

Random flakiness: This form includes tests that pass or fail when you rerun them despite having changed nothing.
Environmental flakiness: This form includes tests that pass on one developer’s machine but fail on another developer’s machine.
Branch flakiness: This form includes tests that pass a PR but fail once a developer merges that PR into main.

A 2022 study showed that over half of developers experience flaky tests every month. And this isn’t a problem companies can control through sheer technical skill alone. According to a 2021 survey, 41% of the tests at Google that passed and failed at least once were flaky. The same survey put the figure at 26% for Microsoft.

Consequences of flaky tests

The vast array of work addressing test flakiness – ranging from academic institutions to tech giants – hints at the consequences of flakiness. Each flaky test appears, at first glance, to be relatively low-risk, but the consequences of flaky tests can quickly compound.

Testing consequences

Each individual test might waste a little time and effort, but as flakiness compounds, the entire test suite can suffer.

As Adam Bender, principal software engineer at Google, writes, “If test flakiness continues to grow, you will experience something much worse than lost productivity: a loss of confidence in the tests.” Eventually, developers will experience a form of alarm fatigue and start testing less.

A 2022 study on the developer experience of test flakiness supports this: Developers who experience flaky tests more often are more likely to ignore “potentially genuine test failures.” According to Bryan Lee, product manager at Datadog, flaky tests can “spread apathy if unchecked.”

“We’ve talked to some organizations that reached 50%+ flaky tests in their codebase,” Lee continues, “and now developers hardly ever write any tests and don’t bother looking at the results. Testing is no longer a useful tool to improve code quality within that organization.”

Developer consequences

Every time a test needs to be re-run, a developer spends time they shouldn’t have to spend, and every re-run also costs computational resources that add up over time.

If re-running a test doesn’t work, developers can be pulled from a programming flow state and into fiddling with tests. The more a developer has to do this, the less they can get done and the more time projects can take.

A single flaky test might not seem like a big problem, but flaky tests quickly compound. Bender explains, “As the number of tests grows, statistically, so will the number of flakes. If each test has even a 0.1% of failing when it should not, and you run 10,000 tests per day, you will be investigating 10 flakes per day. Each investigation takes time away from something more productive that your team could be doing.”
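Bender’s arithmetic is easy to check. The sketch below is illustrative only: it assumes every run flakes independently with the same probability, which real suites rarely satisfy, and the function names are our own.

```python
def expected_flakes(runs_per_day: int, flake_probability: float) -> float:
    """Expected number of spurious failures per day."""
    return runs_per_day * flake_probability


def chance_of_any_flake(runs_per_day: int, flake_probability: float) -> float:
    """Probability that at least one run fails spuriously in a day."""
    return 1 - (1 - flake_probability) ** runs_per_day


runs = 10_000
p = 0.001  # a 0.1% chance per run, as in the quote

print(expected_flakes(runs, p))      # about 10 investigations per day
print(chance_of_any_flake(runs, p))  # above 0.9999
```

Under these assumptions, a day without a single spurious failure is practically impossible, even though each individual test looks almost perfectly reliable.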

Delivery and quality consequences

Of course, the efforts of both testers and developers are meant, ultimately, to deliver a high-quality product for users. And in the end, poor testing leads to poor quality or delayed delivery.

Lee articulates these consequences best, writing,

“It will be the end-users of your product that bear the brunt of this cost. You’ve essentially outsourced all testing to your users and have accepted the consequences of adopting the most expensive testing strategy as your only strategy.”

8 common causes of flaky tests

A large body of research has studied the causes of flakiness. To provide a broad overview of the causes, we’re relying on a 2022 meta-study that analyzed 651 articles, including work from academic researchers, companies, and individual developers.

Concurrency: This category includes all flakiness caused by concurrency-related bugs, such as race conditions, data races, and atomicity violations. Async-wait, when a test makes an asynchronous call and doesn’t wait the way it should before proceeding, is also included here. As the researchers write, “This category accounts for nearly half of the studied flaky test fixing commits,” making it the most common cause of flakiness.
Test order dependency: Generally, developers and testers tend to assume they can run tests in any order they want and get the outcomes they expect. But in practice, as the study points out, “tests may exhibit different behaviour when executed in different orders.” Tests often expect a particular state (a shared in-memory or external state, for example) and behave unexpectedly if the actual state doesn’t match.
Network: Network issues, ranging from connection issues to bandwidth, are a frequent cause of flaky tests. Local network issues can include problems with sockets, and remote issues can include failed connections to remote resources. In one study focusing on Android projects, network issues caused 8% of flaky tests.
Randomness: This category covers tests, or code a test covers, that rely on randomness. If a test doesn’t “consider all possible random values that can be generated,” the study explains, flakiness can result. This category is most applicable to machine learning projects.
Platform dependency: If tests are designed and written for particular platforms but run on different platforms, flakiness can result. “Platform,” here, can range from the wrong hardware and OS to the wrong component in a more complex stack.
Dependencies on external state/behavior: If a test depends on external state or behavior, such as an external database or a third-party library, and that dependency changes, flakiness is likely to result.
Hardware: Similar to randomness, this category is most applicable to machine learning projects, which sometimes require specialized hardware that can produce nondeterministic results — leading to flakiness.
Time: In February 2024, because of the leap year, PlayStation players trying to run a sports game ran into time issues. Time affects testing in a variety of ways, including time zones, daylight saving time, and other synchronization issues.

Screenshot of tweet from @EAHelp explaining a workaround for the 2024 leap year issue
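To make the time category concrete, here is a minimal sketch of a leap-day bug and a deterministic test for it. The helper names are hypothetical and are not taken from any of the incidents or studies mentioned above. The key idea is that the test passes the date in rather than reading the system clock, so it behaves the same on every run.

```python
import datetime


def same_day_next_year_naive(today: datetime.date) -> datetime.date:
    # Bug: February 29 does not exist in most years, so this raises
    # ValueError whenever `today` is a leap day.
    return today.replace(year=today.year + 1)


def same_day_next_year(today: datetime.date) -> datetime.date:
    # Fall back to February 28 when the target year has no February 29.
    try:
        return today.replace(year=today.year + 1)
    except ValueError:
        return today.replace(year=today.year + 1, day=28)


def test_renewal_date_on_leap_day():
    # The date is an explicit input, not datetime.date.today(), so the
    # edge case is exercised on every run instead of once every four years.
    leap_day = datetime.date(2024, 2, 29)
    assert same_day_next_year(leap_day) == datetime.date(2025, 2, 28)
```

A test that called `datetime.date.today()` internally would pass for years and then fail on a single day, which is exactly the kind of failure that is hard to reproduce afterward.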

Tips to prevent flaky tests

Some flakiness is likely inevitable. Remember, even some of the most resource-rich tech giants, such as Google and Microsoft, haven’t eliminated flakiness. Your best bet is to focus on prevention. As flakiness pops up, you can shift to identification and remediation (which we’ll cover in the section following this one).

Write smaller tests

Bender, among many other testing experts, recommends writing small, narrow tests. “Small tests,” Bender writes, “must run in a single process. In many languages, we restrict this even further to say that they must run on a single thread.”

As a result, the code running the test is running in the same process as the tested code. You can’t connect a test process to a server or run a database alongside, however, so this advice is most effective for unit tests.

The benefit is that it’s easier to figure out which tests are flaky and why. Bender writes that “even a few of them fail[ing] nondeterministically” can create “a serious drain on productivity.”
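As a rough illustration of a small test in this sense, the sketch below uses Python’s standard `unittest` module. `InMemoryUserStore` and `rename_user` are hypothetical names invented for this example; the point is only that the test and the code under test share one process and one thread, with no server or database involved.

```python
import unittest


class InMemoryUserStore:
    """Hypothetical stand-in for a database-backed store."""

    def __init__(self):
        self._users = {}

    def save(self, user_id: int, name: str) -> None:
        self._users[user_id] = name

    def get(self, user_id: int):
        return self._users.get(user_id)


def rename_user(store, user_id: int, new_name: str) -> bool:
    """Code under test: rename a user only if they already exist."""
    if store.get(user_id) is None:
        return False
    store.save(user_id, new_name)
    return True


class RenameUserTest(unittest.TestCase):
    def setUp(self):
        # A fresh store per test: no shared state, no network, no threads.
        self.store = InMemoryUserStore()

    def test_renames_existing_user(self):
        self.store.save(1, "Ada")
        self.assertTrue(rename_user(self.store, 1, "Grace"))
        self.assertEqual(self.store.get(1), "Grace")

    def test_missing_user_is_not_created(self):
        self.assertFalse(rename_user(self.store, 2, "Grace"))
        self.assertIsNone(self.store.get(2))
```

Because each test builds its own store in `setUp`, the two tests can run in any order without affecting each other.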

Quarantine flaky tests

Even more important than preventing flaky tests is preventing the risks and consequences of retaining flaky tests in your test suite. As such, quarantining flaky tests as soon as you find them is a useful step to take.

The meta-analysis we referenced earlier also looked at how a range of companies handled flaky tests: Google, Flexport, and Dropbox all quarantine flaky tests. Flexport even built a tool that automatically quarantines flaky tests.
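The companies above use their own tooling, which we don’t reproduce here. As a minimal sketch of the idea, assuming Python’s `unittest`, a quarantine marker can be a thin wrapper around the standard skip decorator. The `quarantined` helper and the ticket id are placeholders.

```python
import unittest


def quarantined(ticket: str):
    """Skip a known-flaky test while keeping the reason visible in reports."""
    return unittest.skip(f"quarantined as flaky, see {ticket}")


class CheckoutTest(unittest.TestCase):
    @quarantined("TICKET-123")  # placeholder ticket id
    def test_payment_timeout(self):
        self.fail("stands in for a test that fails intermittently")

    def test_total_is_sum_of_items(self):
        self.assertEqual(sum([5, 7]), 12)
```

The quarantined test still shows up as skipped in every run, with a pointer to its ticket, so it stays visible instead of being silently deleted.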

Quantify and monitor flakiness

Since some proportion of flakiness is inevitable, the prevention of flaky tests is best done from a high-level perspective that focuses on noticing and addressing patterns.

Facebook, for example, uses a statistical metric called the Probabilistic Flakiness Score (PFS) to quantify the amount of flakiness in their test suites. Over time, developers and testers can test the tests and monitor their ongoing reliability.

GitHub and Spotify use similar methods, with GitHub using metrics to determine flakiness levels per test and Spotify using a tool called Flakybot to help developers determine flakiness per PR.
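None of these companies’ metrics are reproduced here. As a much simpler stand-in, the sketch below computes, for each test, the share of commits on which it both passed and failed. The record format and function name are assumptions made for the example.

```python
from collections import defaultdict


def flaky_commit_ratio(runs):
    """runs: iterable of (test_name, commit_sha, passed) tuples.

    Returns {test_name: fraction of commits where the test both passed
    and failed}. Mixed results on one commit mean the outcome changed
    although the code did not.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    for test_name, commit, passed in runs:
        outcomes[test_name][commit].add(passed)
    return {
        test: sum(len(seen) == 2 for seen in by_commit.values()) / len(by_commit)
        for test, by_commit in outcomes.items()
    }


history = [
    ("test_login", "a1", True),
    ("test_login", "a1", False),   # same commit, different result
    ("test_login", "b2", True),
    ("test_search", "a1", True),
    ("test_search", "b2", False),  # consistent on this commit: a real failure
    ("test_search", "b2", False),
]
print(flaky_commit_ratio(history))
# {'test_login': 0.5, 'test_search': 0.0}
```

Tracking a number like this over time makes it possible to notice when a test, or the whole suite, is getting less reliable.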

How to identify and fix flaky tests

Given the inevitability of flakiness, the ability to identify flaky tests, document them, and fix them is an essential part of a tester’s work and an essential part of reinforcing the trust developers should have in the test suite.

Identify flaky tests with dynamic and static methods

Methods for identifying flaky tests fall into one of three categories: dynamic, static, and hybrid.

With dynamic methods, developers and testers run and re-run tests while switching variables. They might, for example, adjust the environment, mix up the test execution order, or change the event schedules. As they do so, they can often surface flakiness they can then fix.
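One of these dynamic techniques, changing the execution order, can be sketched in a few lines. This is a toy harness with hypothetical names, not a description of any particular tool: it runs the same checks forward and in reverse against fresh shared state and reports any check whose outcome differs between the two orders.

```python
def find_order_dependent(checks, make_state):
    """Run `checks` in the given order and in reverse; report mixed outcomes.

    checks: dict of name -> callable(state) raising AssertionError on failure.
    make_state: creates fresh shared state for each ordering.
    """
    names = list(checks)
    seen = {name: set() for name in names}
    for order in (names, list(reversed(names))):
        state = make_state()
        for name in order:
            try:
                checks[name](state)
                seen[name].add("pass")
            except AssertionError:
                seen[name].add("fail")
    return sorted(name for name, outcomes in seen.items() if len(outcomes) == 2)


def check_adds_item(state):
    state.append("item")
    assert "item" in state


def check_starts_empty(state):
    # Implicitly assumes it runs before check_adds_item.
    assert state == []


suspects = find_order_dependent(
    {"check_adds_item": check_adds_item, "check_starts_empty": check_starts_empty},
    make_state=list,
)
print(suspects)  # ['check_starts_empty']
```

Running forward and in reverse puts every pair of checks in both relative orders, which is enough to expose pairwise dependencies. Real tools typically also try random orders.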

With static methods, developers and testers analyze the tested code without running the tests. Instead, they use machine learning techniques and tools to pattern-match the tests they’re looking at to tests that are likely to be flaky.

Flaky tests detection approaches tree. Level one, three boxes: Dynamic, Hybrid, Static. Under Dynamic: Reruns, program repair, differential coverage. Under Static: pattern matching, machine learning, model checking, type checking

Source

As the diagram above shows, a hybrid approach combines the two.

Capture as much information as possible

As you identify flaky tests, capture as much information and context about each test as you can.

Lee, for example, writes that you should “Save every scrap of information that can help you find the root cause of the flakiness.” This includes the CI pipeline history, event logs, memory maps, profiler outputs, and more. If you’re running end-to-end tests (sometimes called UI tests), you’ll also want to take screenshots.
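What “every scrap” means depends on your stack. As a starting point, the sketch below bundles a few details that are cheap to collect from inside a Python test run. The field names and the `capture_failure_context` helper are assumptions for illustration; CI history, profiler output, and screenshots would come from other tools.

```python
import json
import platform
import sys
import time
import traceback


def capture_failure_context(test_name, exc, seed=None, extra=None):
    """Bundle details about one failed run into a JSON-serializable dict."""
    return {
        "test": test_name,
        "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "random_seed": seed,
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "extra": extra or {},
    }


try:
    # Stands in for a real intermittent failure.
    raise TimeoutError("no response from payment stub after 5s")
except TimeoutError as exc:
    record = capture_failure_context("test_checkout", exc, seed=1234)

print(json.dumps(record, indent=2))
```

Saving a record like this alongside the CI logs gives whoever picks up the ticket the environment, the seed, and the full traceback for the exact run that failed.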

Document test flakiness and context

Once you have the information captured, document it in a ticket. If you can fix it right away, do so — speed is always a priority for returning a test to service. If you can’t fix it immediately, add as much information to the ticket as you can so anyone looking at it has context.

As Lee writes, the plus side to a swelling number of testing tickets is that “too many open tickets are a good indicator that some time needs to be set aside to improve the test suite’s quality.” The practice of documentation is good for every individual ticket, but over time, the sheer amount of documentation can indicate larger issues.

GitLab, a company that practices radical transparency, provides many examples of flaky tests in its docs.

Screenshot from GitLab showing multiple spec failures

The above example, which is only a partial screenshot of the entire issue, describes the issue, estimates how difficult it will be to reproduce, and suggests its severity.

Avoiding the flaky test downward spiral

When developers encounter a flaky test, alarms rarely go off. Many simply rerun the test; if that doesn’t work, they either quarantine the test or try to fix it. If they can’t fix it, the test remains off to the side. The problem is that these tests don’t wait — they fester.

We outlined the consequences above, which range from affecting the covered code to the product as a whole. However, these consequences feed into a feedback loop with potentially greater consequences.

As Lee writes, “Fighting flakiness can sometimes feel like trying to fight entropy; you know it’s a losing battle, and it’s a battle you must engage in again and again and again.”

But there’s a difference between a battle you’ve won and a battle you’ve lost. You might never “win” against test flakiness, but you can achieve a certain victory by refusing to lose.