The test pyramid and its discontents, or, why no one agrees on anything

How did one model become so prominent in testing and development when no one agrees on how to implement it? And what should we do with this model now?

When it comes to testing, developers can agree on only two things: One, we should probably test, and two, no one agrees on how to define each test type or how many of each to run.

Despite this lack of consensus, there’s a model everyone reaches for: The test pyramid. In theory, the test pyramid should settle the debate. 

Google the term, and a few simple diagrams come up that place unit tests in the lowest, widest section of the pyramid; integration tests in the middle; and end-to-end tests in the highest, narrowest section. Read a few articles (including our own), and you’ll see how the test pyramid models the proportions of test types: many unit tests at the base, fewer end-to-end tests at the top.
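To make the layers concrete, here’s a minimal, runnable sketch in Python. The functions and the order scenario are hypothetical, not drawn from any of the sources discussed below:

```python
import sqlite3

def compute_total(prices, tax):
    """Pure logic: the natural target of a unit test."""
    return round(sum(prices) * (1 + tax), 2)

def save_order(conn, total):
    """Crosses a real boundary (SQL): a target for an integration test."""
    conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))
    return conn.execute("SELECT total FROM orders").fetchone()[0]

def test_compute_total_unit():
    # Unit: no I/O, runs in microseconds -- the wide base of the pyramid.
    assert compute_total([10.0, 20.0], tax=0.1) == 33.0

def test_save_order_integration():
    # Integration: exercises real SQL against an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (total REAL)")
    assert save_order(conn, 33.0) == 33.0

# An end-to-end test (driving a real browser against a deployed stack)
# would sit at the narrow top; it's too environment-dependent to sketch here.
```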

The test pyramid is not a complex model, but after years of implementation across teams, companies, tools, and methodologies, there’s discontent. Some argue for different proportions of tests; some argue for a focus on test size and scope instead of type; and some argue for different models entirely. 

Many of these disagreements are instructive and useful, but most swap one model for another when the real problem is using a model at all. If we dig into the origins of the test pyramid and walk through the biggest arguments in this endless debate, we can see a better way to think about testing. 

Origins of the test pyramid

Over the decades, the test pyramid has become a best practice, a normative model that development teams follow or diverge from but reference nonetheless. But when we look at its origins, we can see that the confusion that tends to dominate the test pyramid conversation now was baked in from the start. 

The origin story

Martin Fowler, one of the co-authors of the Agile Manifesto, participated in some of the first discussions that led to the test pyramid in 2003 and 2004. In his retelling of the history, these discussions ran roughly in parallel with the work of Jason Huggins, who arrived at the same idea independently in 2006. Fowler credits Mike Cohn’s 2009 book Succeeding With Agile with popularizing the model.

Fowler himself has since become a major touchstone for understanding the test pyramid. Even now, his 2012 post is one of the most frequently linked sources for understanding the test pyramid (including the frequently copied diagram below).

[Diagram: Fowler’s test pyramid. Center: a pyramid labeled, top to bottom, UI, Service, Unit. Left: an arrow running from a rabbit at the bottom to a turtle at the top. Right: an arrow running from a cent sign at the bottom to dollar signs at the top.]

Further refinement came from Google and its well-known testing team. In Software Engineering at Google, Principal Software Engineer Adam Bender retells the story of testing at Google. There are stories about Testing on the Toilet, a pamphlet and restroom stall campaign encouraging developers to test, and the Beyoncé rule, which states, “If you liked it, then you shoulda put a test on it.” 

Tucked in the middle of these stories are two diagrams that contributed to a general understanding of the test pyramid. 

[Diagram: two anti-patterns. Left: an ice cream cone labeled, top to bottom, manual tests, automated GUI tests, integration tests, unit tests. Right: an hourglass labeled, top to bottom, unit, integration, E2E.]

In these diagrams, Bender lays out two anti-patterns, both deviations from the test pyramid: the ice cream cone, with too much at the top, and the hourglass, with too little in the middle.

Here, we see the two primary purposes of the test pyramid: One, a way to model what a test suite should look like, and two, a way to show how a test suite can deviate from that ideal form. 

The original disagreements

As even the original test pyramid models show, consensus never extended beyond the shape of the pyramid (and even that didn’t last).

Fowler’s pyramid included unit, services, and UI tests; Huggins’ pyramid included unit tests, functional tests, and UI tests; and Google’s pyramid included unit, integration, and E2E tests alongside rough percentages of each. 

[Diagram: Google’s test pyramid, split into three sections labeled top to bottom: E2E 5%, Integration 15%, Unit 80%.]

The confusion goes beyond the labeling. Fowler, for example, doesn’t emphasize test types; instead, he emphasizes scope, writing, “The essential point is that you should have many more low-level UnitTests than high level BroadStackTests running through a GUI.” Google, too, adds nuance beyond the diagrams, with Bender explaining that tests vary in both size (small, medium, and large) and scope (narrow, medium, and large).

Across these examples and many others, there’s tension between the nuance and the model. Bender can write, for example, “When considering your own mix, you might want a different balance,” but you can safely assume more people have adopted the 80%, 15%, and 5% split than have really thought through Bender’s more granular advice. 
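For what it’s worth, that granular advice can be made operational. Here’s a sketch using pytest markers; the small/medium/large names echo Google’s convention, but the markers themselves are our own invention, not a pytest built-in:

```python
# conftest.py -- register the hypothetical size markers so pytest won't warn.
def pytest_configure(config):
    for size in ("small", "medium", "large"):
        config.addinivalue_line("markers", f"{size}: test size budget")

# test_sizes.py -- tag tests by the resources they may touch, not by type.
import pytest

@pytest.mark.small
def test_parse_cents():
    # Small: single process, no I/O, sub-second.
    assert int("150") == 150

@pytest.mark.large
def test_checkout_against_staging():
    # Large: real external dependencies; run nightly, not on every commit.
    pytest.skip("requires a deployed staging environment")
```

A CI pipeline could then run pytest -m small on every commit and reserve the large suite for a schedule: a split by size rather than by type.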

Part of what’s fueling this confusion is the history of testing itself, which the pyramid can’t really capture. Test types rose to prominence at different times, and developers adopted and implemented them in different ways.

Take integration testing as an example. 

Here, again, Fowler provides some useful history: Integration testing arose in the 1980s when waterfall development was prominent. Back then, one team would design modules and pass them off to the developers, who would build them in isolation before handing them over to a QA team for testing. The point, then, was simple: Integration tests tested whether those separately developed modules worked together as expected.

With hindsight, however, Fowler sees confusion: “Looking at it from a more 2010s perspective, these conflated two different things: testing that separately developed modules worked together properly and testing that a system of multiple modules worked as expected.” 

Now, writes Fowler, we really have “narrow” integration tests and “broad” integration tests. That means, he writes, “There is a large population of software developers for whom ‘integration test’ only means ‘broad integration tests,’ leading to plenty of confusion when they run into people who use the narrow approach.”
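In code, the narrow flavor looks something like this sketch (the PaymentGateway adapter and its HTTP client are hypothetical):

```python
import unittest.mock as mock

class PaymentGateway:
    """Thin adapter over an external payment API (hypothetical)."""
    def __init__(self, http):
        self.http = http

    def charge(self, amount):
        resp = self.http.post("/charge", json={"amount": amount})
        return resp["status"] == "ok"

def test_gateway_charge_narrow():
    # Narrow: only this module and its immediate boundary are exercised;
    # the remote service is replaced with a test double.
    http = mock.Mock()
    http.post.return_value = {"status": "ok"}
    assert PaymentGateway(http).charge(42) is True
    http.post.assert_called_once_with("/charge", json={"amount": 42})

# A broad integration test of the same behavior would boot the real
# service (or a containerized copy) and run a full checkout against it,
# crossing every module boundary at once.
```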

Rest assured, the other two layers of the pyramid have similar confusion. Ham Vocke, Principal Software Engineer at Stack Overflow, for example, writes, “If you ask three different people what ‘unit’ means in the context of unit tests, you'll probably receive four different, slightly nuanced answers.” And the Google testing team, in a post now captured in the Software Engineering at Google book, argues that the industry should “Just Say No to More End-to-End Tests.”
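Vocke’s point is easy to demonstrate. In this sketch (the classes are hypothetical), both tests are arguably “unit tests”: the first treats the class as the unit and stubs its collaborator, while the second treats the behavior as the unit and brings the real collaborator along:

```python
import unittest.mock as mock

class Tax:
    def rate(self, region):
        return {"EU": 0.2}.get(region, 0.0)

class Pricer:
    def __init__(self, tax):
        self.tax = tax

    def total(self, amount, region):
        return amount * (1 + self.tax.rate(region))

def test_pricer_solitary():
    # "Unit" = the class: the collaborator is stubbed out.
    tax = mock.Mock()
    tax.rate.return_value = 0.2
    assert Pricer(tax).total(100, "EU") == 120.0

def test_pricer_sociable():
    # "Unit" = the behavior: the real collaborator comes along.
    assert Pricer(Tax()).total(100, "EU") == 120.0
```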

If it offers more confusion than clarity, then why do people keep using the test pyramid at all? 

Test types, scope, quantity, and proportion: Why the debate never ends

Few developers enjoy writing tests, but even fewer would want to go back to a time before tests. 

Back in 2006, Jeff Atwood, co-founder of Stack Overflow, wrote, “The general adoption of unit testing is one of the most fundamental advances in software development in the last 5 to 7 years.” And Tim Bray, formerly a distinguished engineer at AWS, reflected on twenty years of programming and wrote,

“I’m pretty convinced that the biggest single contributor to improved software in my lifetime wasn’t object-orientation or higher-level languages or functional programming or strong typing or MVC or anything else: It was the rise of testing culture.”

Yet despite the significance of testing, research from Gergely Orosz suggests that writing tests is one of the primary use cases for developers adopting GitHub’s Copilot tool.

In other words, testing is important… but we don’t want to do it. 

This tension is what drives the endless debate, the urge to finally get testing right once and for all. But so far, this debate has only resulted in more confusion, more pyramids, and, for many developers, more resentment about having to test and having to test “correctly.”

[Image: various versions of the test pyramid, messily arranged.]

The debate, however, is not black and white. If it were a matter of adopting one model or another, the industry would likely settle on a standard and move on. The debate continues because there is disagreement on every single level. 

Few people agree on how to measure the size of a test. The Google team, for example, considers resource usage (memory, processes, time, etc.) an important factor in determining test size, but other teams think about size purely in terms of code coverage.

Few people agree on the definition of certain test types, especially integration tests. We covered a few of the disagreements above, but also consider the point Chris Coyier, co-founder of CodePen, makes: The best way to define integration tests might just be to see them as the middle of the spectrum between unit tests (“little itty bitty sanity testers”) and end-to-end tests (“big beefy (cough; sometimes flaky) tests”).

Few people agree on the proportions of each test type. Integration tests often feel like the forgotten stepchild in many testing conversations, but Kent C. Dodds, the creator of EpicWeb.dev, says developers should write more integration tests than any other type. 

That leads to a different model, a “testing trophy.” 

[Image: Kent C. Dodds’ tweet showing the testing trophy, labeled top to bottom: end-to-end, integration, unit, static.]

Dodds is not alone in letting a different balance lead to a different model. Spotify’s testing team, for example, recommends its own shape: a testing honeycomb.

[Diagram: Spotify’s testing honeycomb, separated into three horizontal sections labeled top to bottom: integrated, integration, implementation detail.]

Some people use the test pyramid shape but add different layers. The web developer Alex Party, for example, sticks with the pyramid but replaces the layers with developer interface tests, consumer interface tests, and user interface tests. Others stick with the shape and types but add new dimensions, such as cost and speed or confidence. 

[Diagram: a test pyramid labeled, top to bottom, e2e tests, integration tests, unit tests, flanked by arrows: “more integration” (top) to “more isolation” (bottom) on the left, and “slower, more expensive” (top) to “faster, cheaper” (bottom) on the right.]

The takeaway from this high-level survey is the confusion itself: Something is wrong with the model.

Against models and toward tools

Despite the confusion inherent to the original test pyramid formulations, the test pyramid became a best practice, and best practices have their own gravitational pull. 

As Oleksii Holub, a dev tooling consultant, writes, “Aggressively popularized ‘best practices’ often have a tendency of manifesting cargo cults around them, enticing developers to apply design patterns or use specific approaches without giving them a much-needed second thought.” 

The test pyramid is the safe option, the easy option, the likely-to-work-fine option, the “nobody got fired for buying IBM” option. But unlike other best practices, which are often easier to either implement or ignore, the test pyramid occupies a messy middle ground between broad guidelines and in-context advice.

On one end of the spectrum is Dodds’ advice, for example. Whether you agree with his points or not, his recommendations are fundamentally effective because they’re thought-through and deliberately broad. When he writes, “Write tests. Not too many. Mostly integration,” you can read, learn, and take it or leave it. 

On the other end of the spectrum is advice that’s too narrow and too specific to cite here. It’s the advice of people on your team, the advice codified in internal documentation, the advice of the senior developer who says, “Oh yeah, that test is always flaky.”

The test pyramid is somewhere in the middle: Too broad to apply in a useful way but narrow enough to create confusion when it conflicts with reality. 

Seb Rose, co-author of The BDD Books, reaches back to a classic quote from George Box, which states, “All models are wrong, but some are useful.” But as Rose argues, “The pyramid is not a model. It’s not a framework. It’s not even a heuristic. It’s something much weaker than that — it’s a metaphor.”

A metaphor can be evocative, and a metaphor can be poetic, but it shouldn’t model what anyone should or should not do. And yet, it does. 

Rose writes that he’s “seen organizations attempt to gather statistics about what proportion of tests being created fall into each category.” Worse, he’s seen organizations “set acceptable values that they have ‘derived’ from the pyramid and have attempted to enforce them through automated build rules.”
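What does that enforcement look like? Something like this hypothetical CI gate, which is entirely our sketch of the kind of rule Rose criticizes, not any organization’s actual build script; it counts test files per directory as a crude proxy for proportions:

```python
# check_pyramid.py -- a hypothetical build gate that fails if the suite
# drifts from ratios "derived" from the pyramid.
import sys
from pathlib import Path

TARGETS = {"unit": 0.80, "integration": 0.15, "e2e": 0.05}
TOLERANCE = 0.05

counts = {
    kind: sum(1 for _ in Path(f"tests/{kind}").rglob("test_*.py"))
    for kind in TARGETS
}
total = sum(counts.values()) or 1

for kind, target in TARGETS.items():
    actual = counts[kind] / total
    if abs(actual - target) > TOLERANCE:
        sys.exit(f"{kind} tests are {actual:.0%} of the suite; target is {target:.0%}")
print("Suite matches the pyramid. (But does it test anything useful?)")
```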

Martin Sústrik, in a provocatively titled article about the “unit test fetish,” captures why. In a “hierarchically structured company,” he writes, “Progress on project has to be reported up the command chain and with a mainly spiritual activity such as software development it's pretty hard to find any hard metrics.”

But luckily, he continues, there’s an easy solution: “Report number and/or code coverage of unit tests.” As with most easy solutions, however, it creates another, bigger problem: “Once you start reporting code coverage of unit tests, you'll be immediately pressured to improve the metric, with the goal of achieving 100% coverage.” 

Tim Bray writes warmly about himself and others being “test infected,” but that warmth can freeze over if the test infection shifts to a test-coverage or test-pyramid infection.

Earlier, we covered some of the history of the test pyramid, and despite early confusion, you can see the reasons why it emerged. But those reasons have changed. As Holub writes, “Whenever we extrapolate experiences into guidelines, we tend to think of them as being valid on their own, forgetting about the circumstances that are integral to their relevancy.” 

Many modern application frameworks now offer API layers that developers can use for simulation and testing. Tools like Docker let developers run tests against real-world dependencies. A whole market of test automation tools has made it far easier to write and run automated tests. The slow, resource-intensive tests the test pyramid warned us about are now faster, lighter, and more effective.
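As one example, a library like Testcontainers (sketched here in its Python flavor, assuming testcontainers[postgres] and SQLAlchemy are installed and a Docker daemon is running) can spin up a disposable, real database for a single test:

```python
import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_query_against_real_postgres():
    # Starts a throwaway Postgres container, runs the test, tears it down.
    with PostgresContainer("postgres:16") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.connect() as conn:
            assert conn.execute(sqlalchemy.text("SELECT 1")).scalar() == 1
```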

Testing isn’t necessarily easy nowadays, but the problems have shifted. Why are we worried about the shape of our test suite, for example, when more than half of developers experience flaky tests on a monthly basis? Why should we argue about test type proportions when pseudo-tested methods can convince us we’re testing something we’re really not?
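A pseudo-tested method looks like this minimal sketch (ours, not from the cited research): the test runs the method and passes, yet it would keep passing if the method’s body were deleted:

```python
class Order:
    def __init__(self, total):
        self.total = total

def apply_discount(order):
    # Knock 10% off large orders.
    if order.total > 100:
        order.total *= 0.9
    return order

def test_apply_discount():
    order = apply_discount(Order(total=150))
    # Pseudo-tested: we never assert on the new total, so gutting
    # apply_discount to a bare `return order` would still pass.
    assert order is not None
```

A coverage tool would report apply_discount as fully covered here, which is exactly the false confidence at issue.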

There’s no model better than your own

Moving past a best practice doesn’t mean disrespecting it. The test pyramid served its purpose, and many teams have likely benefited from following it rather than just improvising or testing on the fly. 

The test pyramid, however, is more misleading than other best practices and poses worse risks. Follow it too closely, and testing becomes a goal instead of a tool. Put the purpose of testing first, not the best practice, and you can embrace what Holub calls “reality-driven testing.” Instead of measuring the success of your test suite by its resemblance to the test pyramid, let the practical realities of your code, your product, your industry, and your users lead the way.

We come back to another classic quote, Goodhart’s Law: "When a measure becomes a target, it ceases to be a good measure." 

The test pyramid is not even a good measure, but it has become a target anyway. We have to invert: Our only measures should be the confidence of our teams and the quality of our code. And while we can and should look to relevant examples and experiences, there’s no model better than the one we develop doing the work itself.
