The Testing Pyramid Decoded: Origins, Evolution, and Best Practices
When it comes to testing, quality assurance professionals, testers, and developers agree on only two things: One, we should probably test, and two, no one agrees on how to define each test type or how many of each test we should run.
Despite this confusion, there's one idea everyone reaches for: The test pyramid. In theory, the test pyramid should settle the debate, but more often than not, the test pyramid inspires more controversy than consensus.
The loyalty to the test pyramid despite the unending arguments makes a little more sense when you consider its origins. The original idea emerged around 2003 and 2004, and in the more than twenty years since, the test pyramid has stuck around through the rise of unit tests, the "death" of QA, and the popularization of continuous deployment.
Making any sense of it requires a detailed exploration of the essentials involved in each section of the test pyramid and how the usage of each test type and the pyramid itself has evolved. We won't promise to settle the debate and bring peace to the land of testing by the time we're done, but we can promise that you'll have better ways of thinking about and using the test pyramid — no matter who brings it up and how they try to flip and rebuild it.
Origins of the Test Pyramid
Software testing has always called for some level of systematization, but the test pyramid became popular — and eventually the norm — through Mike Cohn and his 2009 book Succeeding with Agile.
According to Martin Fowler, who summarized the history, the original model emerged from discussions in 2003 and 2004. Meanwhile, Jason Huggins independently thought about the same idea in 2006.
Thanks to Fowler's influential blog, this first-hand summary remains one of the most popular touchstones. To this day, his 2012 post is one of the most frequently linked sources for understanding the test pyramid.
"The test pyramid comes up a lot in Agile testing circles, and while its core message is sound, there is much more to say about building a well-balanced test portfolio."
— Martin Fowler
In the years following, other developers, testers, and companies have put their own spin on the test pyramid, but the first one remains an anchor for almost all the others. Even those who disagree with the three primary test types tend to agree that you should think about tests in a similar hierarchical manner.
Purpose of the Test Pyramid
At the risk of stating the obvious, testing takes a lot of time and resources. QA engineers and developers (as well as their managers) have always wanted a way to systematize this effort.
Testing suffers from a disagreement between its why and its how: Most developers agree testing, broadly, is good because it improves code quality and the functionality of the final product. However, few testers agree on how to test, when to test, how much to test, which tests to use, etc.
The original purpose of the test pyramid was to provide clarity and structure, and the multi-decade argument about it proves that its original purpose is a worthy one.
According to Tidelift research, the average developer spends 12% of their time testing, and according to Stack Overflow research, about 61% of developers use automated testing tools.
Fowler sums up the fundamental purpose of the test pyramid straightforwardly, writing:
"The test pyramid is a way of thinking about how different kinds of automated tests should be used to create a balanced portfolio. Its essential point is that you should have many more low-level Unit Tests than high-level BroadStackTests running through a GUI."
There's a valuable discussion to be had from getting into the granular details, from putting the test pyramid into practice and seeing where, when, and how it runs into limitations, but the essential purpose is a simple one. On the one hand, teams should think about a portfolio of test types, and on the other hand, teams should have many small tests and few big tests.
Three Types of Tests: Unit, Integration, and End-to-End
Three types of tests comprise the test pyramid: unit tests (the base), integration tests (the middle), and end-to-end tests (the top). The shape of the pyramid corresponds to how many of each test a test suite should include: many unit tests, fewer integration tests, and even fewer end-to-end tests.
Unit Tests
Unit tests, as the name implies, test individual units in isolation — primarily classes, modules, and functions. If end-to-end tests, which test how an entire system functions together, are on one end of the spectrum, unit tests are on the other. They are, as Chris Coyier, co-founder of CodePen, writes, "little itty bitty sanity testers."
In a functional language, for example, a unit will probably be a single function. But in an object-oriented language, a unit could be a single method on the smaller side or an entire class on the larger side.
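To make the "unit" concrete, here is a minimal sketch in Python. The apply_discount function and its test are hypothetical, invented purely for illustration:

```python
# A minimal, hypothetical unit test: one tiny function, one itty-bitty sanity tester.

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # Ideal case: a straightforward discount.
    assert apply_discount(100.0, 25.0) == 75.0
    # Edge cases: no discount and a full discount.
    assert apply_discount(100.0, 0.0) == 100.0
    assert apply_discount(100.0, 100.0) == 0.0

test_apply_discount()  # passes silently; an AssertionError means the unit broke
```

A test like this runs in microseconds, which is exactly what lets a suite of them run on every save or commit.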
Different people also disagree about how to test classes called by the class you're testing (sometimes called "collaborators"). Some argue that you should substitute all collaborators with mocks or stubs, and some argue you should only substitute the slowest, biggest collaborators.
Luckily, Fowler cuts through the debate, writing, "To a certain extent, it's a matter of your own definition, and it's okay to have no canonical answer." Ultimately, the most important part of unit testing is that you do it, and the second most important part is that you do plenty of it.
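If you do substitute collaborators, Python's standard unittest.mock makes the pattern straightforward. OrderService and its gateway below are hypothetical names, sketched only to show the idea:

```python
# A sketch of substituting a collaborator with a test double, using Python's
# standard unittest.mock. PaymentGateway behavior is simulated by a Mock.
from unittest.mock import Mock

class OrderService:
    """Depends on a gateway collaborator to charge the customer."""
    def __init__(self, gateway):
        self.gateway = gateway

    def place_order(self, amount: float) -> str:
        return "confirmed" if self.gateway.charge(amount) else "declined"

# Substitute the slow, external collaborator with a mock.
gateway = Mock()
gateway.charge.return_value = True

service = OrderService(gateway)
result = service.place_order(49.99)

assert result == "confirmed"
gateway.charge.assert_called_once_with(49.99)  # verify the interaction
```

Whether you mock every collaborator or only the slow ones is, as Fowler says, a matter of definition; the mechanics are the same either way.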
Even though an individual unit test is often itty bitty, the accumulated effect of using an entire suite of them is anything but. Jeff Atwood, co-founder of Stack Overflow, wrote back in 2006:
"The general adoption of unit testing is one of the most fundamental advances in software development in the last 5 to 7 years."
The consensus on the importance of unit tests, however, doesn't translate into a precise consensus about what a "unit" really is. Martin Fowler explains, "If you ask three different people what 'unit' means in the context of unit tests, you'll probably receive four different, slightly nuanced answers."
Unit Tests Can Be Manual or Automated
Manual unit testing is laborious: running tests by hand takes time, and it can be difficult for developers to isolate independent units and cover all the possible faults those units could have.
Thankfully, few developers have to run their unit tests by hand, because automation is not considered a nice-to-have but rather the standard.
"Attempting to assess product quality by asking humans to manually interact with every feature just doesn't scale…When it comes to testing, there is one clear answer: automation."
— Adam Bender, Principal Engineer, Google
The advantage of automation, beyond sheer convenience, is reduced cognitive load and the ability to unit test continuously. By now, the advantages of unit testing and the advantages of automated unit testing, in particular, are typically considered synonymous.
Advantages of Unit Testing
The advantages of unit testing can be broken down into two categories: functional and psychological.
An example from Google captures the functional advantages: Back in 2005, the team supporting Google Web Server — the server that handles Google Search queries — wrote code riddled with bugs. According to Adam Bender, now Principal Engineer at Google, at one point, over 80% of code pushed to production contained user-facing bugs that they then had to roll back. The GWS team solved this problem by implementing automated testing led by suites of unit tests. All new code changes eventually required tests, and all tests were continuous.
The automated unit testing cut the number of emergency pushes in half. In the process, the GWS team became much more productive, and now, GWS has tens of thousands of tests and pushes releases almost every day.
The psychological benefits of unit testing tend to emerge as feelings of courage and confidence. For example, Tim Bray, formerly a Distinguished Engineer at Amazon, explains, "Working in a well-unit-tested codebase gives developers courage. If a little behavior change would benefit from re-implementing an API or two, you can be bold, and you can go ahead and do it. Because with good unit tests, if you screw up, you'll find out fast."
"Fail fast," a core concept of Lean and Agile, can be threaded throughout the software development cycle via testing. Immediate failure (or, really, feedback) ensures developers are iterating as they code rather than debugging an entire block of code at once.
Bray explains that this advantage is even more beneficial as companies scale. "Writing good tests helps the developer during the first development pass and doesn't slow them down," he writes. "But I know, as well as I know anything about this vocation, that unit tests give a major productivity and pain-reduction boost to the many subsequent developers who will be learning and revising this code."
In this way, well-written unit tests help developers in the same way well-written documentation does. If a developer wants to know what a given piece of code does, they can look at the test that covers it.
Though unit tests have been around for decades, managing and automating them remains difficult for many teams. Wolt, for example, a food and merchandise delivery platform, had long used unit tests, but Wolt engineers struggled to manage its tests under a mixture of spreadsheets and documents.
Wolt's QA engineers knew they would need to operationalize the testing process to make it efficient and scalable. Sidelining core engineering assets for a homegrown solution wasn't an option. With Qase, Wolt can now run over 5,000 tests per month, and almost one-fifth of them are automated — significantly raising the speed of product development.
"We needed to rapidly scale our testing and enable non-QA engineers to collaborate. It was important to be able to painlessly convince and then onboard our teams to a test management platform. Qase is so intuitive that this was easily done."
— Mikko Vaha, QA Lead
How to Write Unit Test Cases
Writing good unit tests starts with a standard but reliable rule of thumb: Use one test class for every production class.
Ham Vocke, a former engineer at Stack Overflow, writes, "You can write [unit tests] for all your production code classes, regardless of their functionality or which layer in your internal structure they belong to. You can unit test controllers just like you can unit test repositories, domain classes, or file readers." Just stick to the rule of thumb.
Beyond this basic rule, you'll also want to closely monitor what you're trying to test. The most effective unit tests cover all relevant code paths, including the ideal cases and edge cases, but don't try to capture all of the implementation details.
Vocke explains that if your unit tests are close to production code, they will break when you refactor. Adam Bender agrees, stating:
"A test should contain only the information required to exercise the behavior in question."
When writing tests, many people recommend a structure called Arrange, Act, Assert:
- Arrange: Set up the test by creating the objects, inputs, and preconditions it needs.
- Act: Exercise the behavior you're testing (e.g., call a function or method, hit an API endpoint).
- Assert: Compare the actual outcome against the expected outcome so you can see whether the test passed or failed.
Following this pattern, you want to focus on testing the observable behavior that results from your code instead of mirroring your internal code structure inside your unit tests.
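In practice, the Arrange, Act, Assert structure looks like the sketch below. ShoppingCart is a hypothetical class used only to illustrate the pattern:

```python
# Arrange-Act-Assert in miniature, against a hypothetical ShoppingCart class.

class ShoppingCart:
    def __init__(self):
        self.items = []

    def add(self, name: str, price: float) -> None:
        self.items.append((name, price))

    def total(self) -> float:
        return sum(price for _, price in self.items)

def test_cart_total():
    # Arrange: set up the object under test and its inputs.
    cart = ShoppingCart()
    cart.add("book", 12.50)
    cart.add("pen", 2.50)

    # Act: exercise the observable behavior, not the internal structure.
    result = cart.total()

    # Assert: compare the actual outcome with the expected one.
    assert result == 15.0

test_cart_total()
```

Note that the test asserts on total(), the observable behavior, rather than inspecting the internal items list, so refactoring the cart's internals won't break it.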
The workflow we described above will get you far, but perhaps more important than the quality of any individual unit test is the health of your test suite. Some amount of failure is inevitable, so your test suite needs to make it easy for developers to address test failures.
In 2021, researchers Benoit Baudry and Martin Monperrus took a critical look at the trust developers put in automated unit testing and found a startling example of how that trust can be misplaced: "pseudo-tested methods" across a range of popular projects — methods that were covered by a test suite but whose behavior wasn't actually evaluated by any test case.
"[Pseudo-tested methods] can lure development teams into a false sense of security, and they represent a serious threat to constructing reliable applications. We've found them in a variety of popular software projects, including Apache Commons Collections, PDFBox, and the SDK for Amazon Web Services."
— Benoit Baudry and Martin Monperrus
Unit tests are good, automated unit tests are better, and full test coverage is even better than that — but coverage doesn't always mean perfection.
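Here is what a pseudo-tested method can look like in miniature. The normalize() function and both tests are invented for illustration, not taken from the projects Baudry and Monperrus studied:

```python
# A sketch of a "pseudo-tested" method: the first test executes normalize(),
# so coverage tools count it as covered, but it never checks the result, so
# any behavior change would go unnoticed.

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def pseudo_test_normalize():
    normalize([1, 2, 3])  # covered, but no assertion: the body could be gutted

def real_test_normalize():
    result = normalize([1, 1, 2])
    assert result == [0.25, 0.25, 0.5]  # now the behavior is actually evaluated

pseudo_test_normalize()
real_test_normalize()
```

Both tests produce identical coverage numbers; only the second one would catch a regression.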
Good tests might fail correctly, but a bad test suite can still create a bad testing experience. Bender writes, "Allowing failing tests to pile up quickly defeats any value they were providing, so it is imperative not to let that happen." The most effective testing suites function alongside a software development process that enables developers to fix broken tests within minutes and address failed tests rapidly.
The end goal is confidence. Developers should feel comfortable enough with their unit tests that they could run them every few seconds.
Follow the Beyoncé rule: if you liked it, then you shoulda put a test on it.
Integration Tests
Integration tests are often seen as the awkward, vague middle ground of the software testing world. In the test pyramid, integration tests are literally in the middle, and even amongst teams who prioritize testing, few agree on a tight definition of integration testing.
At first glance, the meaning is clear: While unit tests cover the smallest units of code (primarily classes and methods) and end-to-end tests cover how the whole system works, integration tests cover how different components of the app work together.
From this standpoint, it seems simple, but the confusion is in the details. Even Martin Fowler is wary of the term "integration test," writing, "When I read it, I look for more context so I know which kind the author really means."
Some of the confusion that tends to erupt from discussions of integration testing comes from a history of testing that precedes modern ways of thinking about software development.
Essentially, as Martin Fowler covers, integration testing became popularized in the 1980s when the waterfall method of software development was the standard. With large projects, Fowler writes, a design phase would start the development process, and developers would each take on modules to build — all in isolation from each other. Testing, far from the test-driven development often done today, wasn't done by developers at all. "When the programmer believed it was finished, they would hand it over to QA for testing," Fowler writes.
Test-Driven Development (TDD) is a software development approach that aims to create reliable code by writing tests before the program itself. Test cases that define the desired behavior of the software are formulated from the requirements before any code is written. The code is then implemented to pass these tests and meet the specified requirements.
Unit testing tended to take precedence, as it does now, and integration testing usually came after. Here, Fowler writes, QA teams tested how these modules fit together "either into the entire system or into significant sub-systems."
Already, you might see how this model shows its age. Fowler points out, writing decades in retrospect, that this process conflated two different forms of testing: Tests meant to cover how separate modules work together and tests meant to see whether a system comprised of many modules worked as expected.
The devil is in the details, but still, there's an anchor here: Integration tests, whether you want to consider them one type of test or a bundle of slightly different tests, focus on testing how multiple components and modules work together.
With Integration Testing, the Differentiator Lies in What Is Being Tested and Why
Integration tests cover multiple components so developers can validate how they work together and check that the resulting functionality works as expected.
"API tests seem like a quintessential example," writes Chris Coyier. You could use unit tests to test different parts of an API, he continues, but the tests that use the API the way it would actually work in your application are "the juicy ones."
If you're building functionality that enables users to add a new credit card during a purchasing flow, for example, an integration test will likely be the best way to validate that the credit card number, expiration date, and CVV fields are working correctly together.
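As a hedged sketch of that credit-card flow, the validators and form function below are hypothetical stand-ins, simplified far below what real payment validation requires (no Luhn check, for instance). The integration test exercises the composed flow rather than each validator in isolation:

```python
# Hypothetical field validators plus a form that wires them together.
import datetime

def valid_card_number(number: str) -> bool:
    return number.isdigit() and len(number) == 16

def valid_expiration(month: int, year: int) -> bool:
    today = datetime.date.today()
    return (year, month) >= (today.year, today.month)

def valid_cvv(cvv: str) -> bool:
    return cvv.isdigit() and len(cvv) in (3, 4)

def submit_card_form(number: str, month: int, year: int, cvv: str) -> str:
    """The composed flow under test: all fields must validate together."""
    if not (valid_card_number(number) and valid_expiration(month, year) and valid_cvv(cvv)):
        return "rejected"
    return "accepted"

def test_add_card_flow():
    # The fields work together: one bad field fails the whole flow.
    assert submit_card_form("4111111111111111", 12, 2099, "123") == "accepted"
    assert submit_card_form("4111111111111111", 1, 2000, "123") == "rejected"

test_add_card_flow()
```

Each validator could also have its own unit tests; the integration test adds confidence that the composition behaves correctly.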
Integration Tests Provide Reassurance
The primary advantage of integration testing is reassurance. When you run unit tests, you're typically staying focused and deliberately small. You're finding errors, correcting, and testing again — every test rapid enough to keep you in your seat. But this rapidity doesn't always create confidence.
With integration tests done regularly but less frequently, you can ensure that all those modules come together — that your app or feature is actually working. And while you want a plethora of unit tests, as the test pyramid implies, a complementary series of integration tests makes the whole suite more efficient because you can cover more than a single class at a time.
Alin Turcu, Data and AI Engineering Director at Cognizant, shares two advantages learned from the realities that often get forgotten when talking about testing at an abstract level.
Code is rarely written tabula rasa, so integration testing can be particularly useful when it's time to refactor. "Attempting to refactor a system that only has single-class tests is often painful," he writes, "because developers usually have to completely refactor the test suite at the same time, invalidating the safety net." When you're refactoring, integration tests then become that safety net.
Developers rarely have as much time as they wish for testing, too, so Turcu also explains how integration tests can offer more coverage when needed. "In the ideal case," he writes, "you would have all your classes unit tested, but we all know that in real case scenarios, you probably don't have enough time to do that." Integration tests come to the rescue: "Even without a full coverage by unit tests, integration tests are useful to get more confidence in high-level user stories working well."
Types of Integration Testing
Big Bang
In the big bang testing strategy, the development team integrates all of the modules to be tested and tests them as a single unit.
The big bang strategy is most effective when developers are performing system testing on small systems or systems that have little interdependence between modules. More complex systems lend themselves to more incremental approaches.
For example, when a big bang integration test case fails, it can often be difficult to locate the source of any given error. This can be especially frustrating, given how long a big bang integration test can take. Even with these issues, the big bang approach remains useful when the systems to be tested are small.
Top-Down
In the top-down integration testing strategy, developers layer the modules to be tested in a hierarchy and, as the name implies, write and run test cases for the higher-level modules first. As each module is tested, proceeding from the top layer to the bottom, developers integrate freshly tested modules into the preceding modules and test the whole. On the way, developers simulate lower-level modules to test the higher-level ones.
Since the top-down approach emphasizes incremental testing, as opposed to the big bang strategy, it's often easier to trace an error to its source. Developers are also more likely to find the most impactful bugs because the hierarchy approach requires testing the most important modules first. Developers who prefer using stubs to drivers also tend to prefer the top-down approach, which primarily uses the former and not the latter.
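A single top-down step might look like the sketch below, where a hypothetical DatabaseStub simulates a lower-level module so the higher-level ReportService can be tested first:

```python
# Top-down integration sketch: the high-level module is tested first while
# the lower-level database module it depends on is simulated by a stub.

class DatabaseStub:
    """Stands in for a lower-level module that isn't integrated yet."""
    def fetch_sales(self):
        return [100, 250, 150]  # canned data instead of a real query

class ReportService:
    def __init__(self, db):
        self.db = db

    def summary(self) -> dict:
        sales = self.db.fetch_sales()
        return {"count": len(sales), "total": sum(sales)}

def test_report_service_with_stub():
    report = ReportService(DatabaseStub()).summary()
    assert report == {"count": 3, "total": 500}

test_report_service_with_stub()
```

Once the real database module is ready, it replaces the stub and the same test runs against the integrated pair.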
Bottom-Up
In the bottom-up integration testing strategy, a plan similar to the one laid out in a top-down strategy is made. But instead of testing from the highest layer to the lowest, developers write and run test cases from the lowest module to the highest.
Developers sometimes prefer the bottom-up approach — beyond the reasons shared with top-down vs. big bang — because of flexibility. The higher levels often take the longest to build, so with a bottom-up strategy, developers can first test the modules most likely to be complete and ready. Bottom-up testing also typically involves using test drivers instead of stubs, which some developers prefer.
Sandwich/Hybrid
In the sandwich/hybrid testing strategy, another incremental approach, the top-down and bottom-up strategies are combined.
Both top-down and bottom-up strategies face a similar limitation: They each need either the highest level or lowest level module to be coded and unit tested before integration testing can begin. In the sandwich testing approach, developers use both stubs and drivers so that they can test however the product demands, potentially even testing higher- and lower-level modules in parallel.
This hybrid integration testing approach tends to be more complex than previous approaches, but it often pays off when writing test cases for especially complex or otherwise large systems.
How to Write Integration Tests
The details of how best to write an integration test will be context-specific, but you can avoid most of the pitfalls by following a few best practices.
First, you'll want to carefully define the purpose of an integration test before you write it. Martin Fowler splits integration tests into two categories with two distinct purposes:
Narrow integration tests cover code in one service that communicates to a separate service. These narrow integration tests use test doubles and often have a similar scope to unit tests.
Broad integration tests focus on testing live versions of every service covered. That means testers need to go beyond the code that handles interactions and include, as Fowler writes, "substantial test environment and network access" so as to "exercise code paths through all services."
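A narrow integration test can look like the following sketch, where a hypothetical FakeInventoryClient doubles for a remote service so the code that interprets its responses can run without any network access:

```python
# Narrow integration test sketch: exercise the code in our service that talks
# to a remote inventory service, with the live service replaced by a double.

class FakeInventoryClient:
    """Double for the remote service, returning canned responses."""
    def get_stock(self, sku: str) -> int:
        return {"ABC-1": 7}.get(sku, 0)

def can_fulfill(client, sku: str, quantity: int) -> bool:
    """Our integration code: interprets the remote service's answer."""
    return client.get_stock(sku) >= quantity

def test_can_fulfill_against_double():
    client = FakeInventoryClient()
    assert can_fulfill(client, "ABC-1", 5) is True
    assert can_fulfill(client, "ABC-1", 8) is False
    assert can_fulfill(client, "MISSING", 1) is False

test_can_fulfill_against_double()
```

A broad version of the same test would point can_fulfill at a live instance of the inventory service in a test environment instead of the double.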
You can further break integration tests down into different layers, which tend to be dependent on the tools you're using.
As you write tests across these distinctions and purposes, Tim Bray outlines a few best practices. Integration tests, he explains, are "particularly hard in a microservices context" but still need to pass 100% of the time. "It's not OK for there to be failures that are ignored," he writes. Similarly, developers have to be careful not to let test suites get flaky. "Either the tests exercise something that might fail in production, in which case you should treat failures as blockers," he writes, "or they don't, in which case you should take them out of the damn test suite, which will then run faster."
Integration Test Checklist
- Articulate the purpose of your integration test
- Determine whether your integration test will be narrow or broad
- Identify which layers of your components the integration test will cover
- Run the integration test
- Iterate: Integration tests should pass 100% of the time and should not get flaky
- Develop benchmarks that you can use to measure how fast your integration tests run over time, how much coverage they tend to have, and how flaky they tend to be
End-to-End Testing
End-to-end testing is a form of software testing that focuses on how (and how well) a software application functions from beginning to end. End-to-end (e2e) tests are typically performed less frequently than unit and integration tests. They test the functionality of every process, component, and interaction and how they all weave together in a given workflow.
End-to-end tests typically take on the perspective of an end-user. By simulating how an end-user would use an application, testers can find any defects or bugs that might not have been caught in previous tests. By performing these tests after other, smaller tests, QA teams can look for errors that might arise from how the entire application fits together.
Testing Antipatterns
End-to-end tests, because they can be slow to run, can wreak havoc on your testing process if you're not careful with how you build them and the proportion of e2e tests you run.
Adam Bender, Principal Engineer at Google, shares two types of errors: An ice cream cone antipattern happens when developers write too many end-to-end tests, and an hourglass antipattern happens when developers write too few integration tests.
As Bender writes, the first antipattern results in test suites that "tend to be slow, unreliable, and difficult to work with." According to Bender, the second antipattern isn't as bad, but "it still results in many end-to-end test failures that could have been caught quicker and more easily with a suite of medium-scope tests."
When developers rush prototype projects to production, the first antipattern is likely. When systems are so tightly coupled that it's hard to isolate dependencies, the second is more likely.
Types of E2E Testing
There are three primary types of e2e tests: vertical, horizontal, and automatic.
The first two types of e2e tests are manual. Horizontal e2e tests assume the perspective of an end user navigating through the functionalities and workflows of the application. Horizontal end-to-end testing encompasses the entire application, so developers need to have well-built and well-defined workflows to perform these tests well.
For example, a horizontal e2e test might test a user interface, database, and an integration with a messaging tool like Slack. Testers will run through this workflow as users and look for errors, bugs, and other issues.
Vertical end-to-end tests check the functionality of an application layer by layer — typically working in a linear, hierarchical order. Testers break applications into disparate layers that they can test individually and in isolation. For example, testers might focus on particular subsystems, such as API requests and calls to a particular database, to see whether those subsystems work as intended.
Automated e2e testing, in contrast to both integration tests and unit tests, typically involves programming tests that mirror the manual ones and then integrating them into an automated testing tool and/or a CI/CD pipeline.
Choosing when to shift from manual to automated testing isn't always an obvious decision. Unlike unit tests, which many agree should be automated, e2e tests often benefit from a human touch, and this manual involvement is sometimes surprisingly practical because e2e tests are run far less frequently than unit tests.
What About User Acceptance Testing?
All this talk about testing user interfaces and taking the perspective of an end-user brings up a natural question: How does user acceptance testing fit into the test pyramid? The short answer: It doesn't.
But the answer is instructive — the purpose of the test pyramid is not to encapsulate the entire testing process or to capture every kind of testing that exists. And user acceptance testing is a good example of why.
User acceptance testing (UAT) occurs late in the software development cycle after the three stages of the test pyramid are run. Unlike the test types in the test pyramid, which are typically run and managed by developers, testers, and QA engineers, UAT is performed by product owners or a selection of actual end-users.
The primary goal of UAT is to get a user's perspective on the final version of the app or feature you're developing. UAT is important, but the test pyramid doesn't include it because it's a model that helps teams develop test suites that are internal to the testing team and within the control of that team. UAT, even though "testing" is in the name, is outside the scope of the test pyramid in the same way that writing the code is.
Manual vs. Automated Testing and Shifting Bugs Left
"If it can be automated, it should be" is a well-worn principle for developers and testers, but it doesn't mean that automation can and should be the only approach.
Sam Rose, a programmer and writer, explains that the primary goal of testing is to shift the discovery of bugs "left" (in other words, away from the user and back through early access, code review, and tests, all the way down to the compiler).
"The further to the left of that diagram a bug is found, the happier everyone will be," Rose writes. "The worst-case scenario is for a user to discover a bug, shown on the far right."
Modern software systems are complex, so automation is the only way to shift a significant number of bugs left. But another benefit of shifting so many bugs left via automated methods is that testers and QA engineers then have time to find the bugs that can only be discovered through manual methods.
With automated testing, teams can find more bugs earlier, which makes fixing them cheaper, and testers can have more time to find the bugs that filter through. In this sense, automated and manual testing are complementary, not competitive.
The E2E Testing Debate
Back in 2015, the Google testing blog wrote a post called "Just Say No to More End-to-End Tests," and it stirred up some controversy.
Among the many responses and arguments was an article from Adrian Sutton, formerly lead developer at LMAX. In the response, Sutton explains how end-to-end tests can work well, showing that, done right, end-to-end tests have been "invaluable in the way they free us to try daring things and make sweeping changes, confident that if anything is broken, it will be caught."
The details of the debate are interesting, but the most important takeaway is the throughline that emerged: End-to-end testing lives or dies based on how well an individual organization or team runs it. Most of the advantages and disadvantages of e2e testing come from how well the developers and testers design and implement the testing suite.
For example, in the ice cream cone antipattern mentioned earlier, the problem isn't the e2e tests themselves; it's the proportion of e2e tests to other tests. Martin Fowler argues that e2e tests are best thought of as a "second line of test defense." A failure in an e2e test, he writes, shows that you have, one, a bug and, two, an issue in your unit tests.
Even when e2e tests aren't working well, the upstream cause might lie beyond the e2e tests themselves. For example, research from Jez Humble, co-author of Accelerate: The Science of Lean Software and DevOps, shows that having tests "primarily created and maintained either by QA or an outsourced party is not correlated with IT performance."
Humble's theory is that code is more testable when developers write the tests and that when developers are more responsible for tests, they tend to invest more time and energy into writing and maintaining them.
Flaky Tests
Flaky tests are non-deterministic, meaning they can pass or fail without any change to the code they cover.
Generally, flakiness takes one of three forms:
- Random flakiness: Tests that pass or fail when you rerun them despite having changed nothing.
- Environmental flakiness: Tests that pass on one developer's machine but fail on another developer's machine.
- Branch flakiness: Tests that pass on a PR branch but fail once a developer merges that PR into main.
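Random flakiness, the first form above, often comes from uncontrolled randomness or timing. The sketch below uses a hypothetical retry helper to show both the flaky shape and one common deterministic fix:

```python
# A sketch of random flakiness and one fix. The flaky version samples an
# unseeded random value, so reruns can pass or fail with no code change;
# the deterministic version controls the source of randomness.
import random

def jittered_delay(base: float, rng=random) -> float:
    """Hypothetical retry helper: base delay plus random jitter."""
    return base + rng.uniform(0.0, 1.0)

def flaky_test():
    # Flaky: passes only when the unseeded jitter happens to land below 0.5.
    return jittered_delay(1.0) < 1.5

def deterministic_test():
    # Deterministic: inject a seeded Random so the jitter is reproducible,
    # and assert on the guaranteed range rather than a single sample.
    rng = random.Random(42)
    delay = jittered_delay(1.0, rng)
    assert 1.0 <= delay <= 2.0

deterministic_test()
```

The same injection trick works for clocks, network calls, and any other non-deterministic dependency.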
In a paper called The Developer Experience of Flaky Tests, researchers found that over half of developers experience flaky tests every month or more. One of the most notable findings is that developers "who experience flaky tests more often may be more likely to ignore potentially genuine test failures."
How to Write E2E Test Cases
Different teams will have different approaches and methods for writing e2e tests, but most tend to involve five steps:
- Identifying what you want to test — whether it be a vertical or horizontal test — often involves cross-departmental collaboration among product managers, developers, and testers.
- Breaking the test scenario down into discrete steps. Here, you'll want to be as specific as possible and include the expected results of each step.
- Following the steps manually. Developers and testers can assume the perspective of end users and run through the steps they identified — noting any issues along the way.
- Writing a test that can perform the manual steps automatically. At large enough scales, teams will likely want to automate these steps, which usually requires using an automated testing tool.
- Integrating the automated test into a CI/CD pipeline.
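The middle steps above can be sketched in miniature. In this hypothetical Python example, a checkout scenario is broken into discrete steps with expected results and then automated; the `Shop` class is a stand-in for a real application, which an actual suite would drive through a browser or API client:

```python
# Hypothetical e2e scenario: "a signed-in user can add an item to the cart
# and check out." Shop stands in for the real application under test.

class Shop:
    def __init__(self):
        self.signed_in = False
        self.cart = []
        self.orders = []

    def sign_in(self, user):        # Step 1: user signs in
        self.signed_in = True
        return self.signed_in

    def add_to_cart(self, item):    # Step 2: item appears in the cart
        self.cart.append(item)
        return item in self.cart

    def checkout(self):             # Step 3: checkout creates an order
        assert self.signed_in and self.cart, "precondition failed"
        self.orders.append(list(self.cart))
        self.cart = []
        return len(self.orders)

def test_checkout_flow():
    shop = Shop()
    # Each step asserts its expected result, mirroring the manual run-through.
    assert shop.sign_in("ada")              # expected: signed in
    assert shop.add_to_cart("notebook")     # expected: cart holds the item
    assert shop.checkout() == 1             # expected: exactly one order exists
    assert shop.cart == []                  # expected: cart is emptied

test_checkout_flow()
```

In practice, the final step means running a function like `test_checkout_flow` on every commit in CI and failing the build when any step's expected result doesn't hold.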
The scope needs to be thought through carefully. Testing every single way all real users could interact with an application would provide a lot of test coverage, but the process would be onerous, and the results unlikely to be worth the cost. Testing too minimally is also a mistake because it can fool teams into thinking an application is well-tested when it isn't.
"We found that there is a low to moderate correlation between coverage and effectiveness when the number of test cases in the suite is controlled for. In addition, we found that stronger forms of coverage do not provide greater insight into the effectiveness of the suite. Our results suggest that coverage, while useful for identifying under-tested parts of a program, should not be used as a quality target because it is not a good indicator of test suite effectiveness."
— Laura Inozemtseva and Reid Holmes
Bug detection, the object of all testing types, needs to be well-targeted. If a testing suite is well designed, unit tests will identify errors in business logic, whereas e2e tests will focus on the functionality of different integrations and workflows. Bugs at this stage should arise from interactions between application components and from emergent, system-level behavior that can't be predicted at the unit level.
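A minimal illustration of that division of labor, using hypothetical names: the first assertion targets the business logic in isolation, while the second targets the interaction between two components, the kind of bug that only appears when the pieces are wired together:

```python
def apply_discount(total, rate):
    """Business logic under test (hypothetical discount rule)."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return round(total * (1 - rate), 2)

class Checkout:
    """A second component that *uses* the discount logic."""
    def __init__(self, rate):
        self.rate = rate

    def total_due(self, items):
        return apply_discount(sum(items), self.rate)

# Unit-level check: an error in the business logic itself.
assert apply_discount(100.0, 0.2) == 80.0

# Broader check: the interaction between components -- the wiring between
# Checkout and apply_discount, not either piece in isolation.
assert Checkout(rate=0.1).total_due([40.0, 60.0]) == 90.0
```

A well-designed suite keeps these layers distinct, so a failing test points at the right level of the system.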
Ultimately, the best e2e tests will result from iteration, either from the development and testing teams themselves or from lessons from other teams. The LMAX team, for example, considers the following factors fundamental to effective e2e testing. End-to-end tests, Adrian Sutton writes, should:
- Run constantly through the day
- Complete in around 50 minutes, including deploying and starting the servers, running all the tests, and shutting down at the end
- Be required to pass before a version can be considered releasable
- Be owned by the whole team, including testers, developers, and business analysts
Beyond these basics, the LMAX team tries to ensure that:
- Test results are stored in a database, making them easy to query and search
- Tests are isolated so they can be run in parallel to speed things up
- Tests are dynamically distributed across a pool of servers
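The isolation LMAX describes can be sketched like this (all names here are hypothetical): each test creates its own uniquely named resource, so no test depends on shared state and a runner can execute any subset in parallel or in any order:

```python
import uuid

# A shared fake "database" stands in for the real system under test.
DB = {}

def isolated_account():
    """Create a uniquely named account for one test, so tests never
    collide on shared state."""
    name = f"acct-{uuid.uuid4().hex[:8]}"
    DB[name] = {"balance": 0}
    return name

def test_deposit():
    acct = isolated_account()
    DB[acct]["balance"] += 50
    assert DB[acct]["balance"] == 50

def test_withdrawal():
    acct = isolated_account()  # its own account: no ordering dependency
    DB[acct]["balance"] += 100
    DB[acct]["balance"] -= 30
    assert DB[acct]["balance"] == 70

# Because neither test touches the other's account, a parallel runner
# such as pytest-xdist could safely distribute them across workers.
test_deposit()
test_withdrawal()
```

The unique names are the whole trick: once tests stop sharing mutable state, distributing them across a pool of servers becomes a scheduling problem rather than a correctness problem.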
End-to-end testing, more than other testing types, depends on how well it fits into the rest of your testing suite. At all times, think about how problems might flow downstream through other tests and how the sum of your testing suite works instead of focusing too narrowly on how well any given test type functions.
The Test Pyramid and Its Discontents, or, Why No One Agrees on Anything
So far, we've dissected the test pyramid from top to bottom — discussing it at a high level and breaking it down into granular terms — but we've skipped the elephant in the room: Why do we even need a visualization, and what is it supposed to do?
The closer you look, the more you find that the controversies, debates, and continuous discontent aren't really disagreements about what a given test should look like; they're latent disagreements about what the test pyramid is actually for.
Many of these disagreements are instructive and useful, but most swap one model for another when the real problem is using a model at all. If we dig into the origins of the test pyramid and walk through the biggest arguments in this endless debate, we can see a better way to think about testing.
The Original Disagreements
To understand the disagreements, we need to refer back to the origin stories we discussed earlier in this ebook.
Martin Fowler participated in some of the first discussions that led to the test pyramid in 2003 and 2004. These discussions happened in rough parallel with work by Jason Huggins, who independently thought of the same idea in 2006. Further refinement came from Google and its well-known testing team, with percentages allocated to each type of test.
As you can see in even the original test pyramid models, consensus was never present beyond the shape of the pyramid (and even that didn't last).
Fowler's pyramid included unit, services, and UI tests; Huggins' pyramid included unit tests, functional tests, and UI tests; and Google's pyramid included unit, integration, and e2e tests alongside rough percentages of each.
The confusion goes beyond the labeling. Fowler, for example, doesn't emphasize test types; instead, he emphasizes scope, writing, "The essential point is that you should have many more low-level UnitTests than high-level BroadStackTests running through a GUI." Google, too, adds nuance beyond the diagrams, with Adam Bender explaining that tests vary in both size (small, medium, and large) and scope (narrow, medium, and large).
Across these examples and many others, there's tension between the nuance and the model. Bender can write, for example, "When considering your own mix, you might want a different balance," but you can safely assume more people have adopted the 80%, 15%, and 5% split than have really thought through Bender's more granular advice.
Part of what's fueling this confusion is the history of testing itself, which the pyramid can't really capture. Test types arose to prominence at different times, and developers adopted and implemented them in different ways.
Take integration testing as an example. Integration testing arose in the 1980s when waterfall development was prominent. Back then, one team would design modules and pass them off to the developers, who would build them in isolation before handing them over to a QA team for testing.
With hindsight, however, Fowler sees confusion: "Looking at it from a more 2010s perspective, these conflated two different things: testing that separately developed modules worked together properly and testing that a system of multiple modules worked as expected."
Now, writes Fowler, we really have "narrow" integration tests and "broad" integration tests. That means, he writes, "There is a large population of software developers for whom 'integration test' only means 'broad integration tests,' leading to plenty of confusion when they run into people who use the narrow approach."
"If you ask three different people what 'unit' means in the context of unit tests, you'll probably receive four different, slightly nuanced answers."
— Ham Vocke, former engineer at Stack Overflow
So, if the test pyramid offers more confusion than clarity, why do people keep using it at all?
Consensus Isn't Possible (and It Shouldn't Be)
The urge to finally get testing right once and for all drives the test pyramid debate. But so far, this debate has only resulted in more confusion, more pyramids, and, for many developers and QA teams, more resentment about having to test "correctly."
The debate, however, is not black and white. If it were a matter of adopting one model or another, the industry would likely settle on a standard and move on. The debate continues because there is disagreement on every level and from every perspective.
Few people agree on the size of the tests. The Google team, for example, considers resource usage (memory, processes, time, etc.) an important factor for determining test size, but other teams think about size purely in terms of code coverage.
Few people agree on the definition of certain test types, especially integration tests. The best way to define integration tests might just be to see them as the middle of a spectrum between unit tests ("little itty bitty sanity testers") and end-to-end tests ("big beefy tests").
Few people agree on the proportions of each test type. Integration tests often feel like the forgotten stepchild in many testing conversations, but Kent C. Dodds, the creator of EpicWeb.dev, says developers should write more integration tests than any other type. That leads to a different model, a "testing trophy."
Dodds is not alone in allowing a different balance to lead to a different model. Spotify's testing team, for example, recommends their own shape: a testing honeycomb. The concept of the hierarchy remains, but the pyramid collapses in favor of different shapes that emphasize different proportions of tests.
Some people accept the test pyramid shape but add different layers. Web developer Alex Riviera, for example, sticks with the pyramid but replaces the layers with developer interface tests, consumer interface tests, and user interface tests.
Others stick with the shape and types but add new dimensions, such as cost, speed, or confidence.
The Test Pyramid Is a Model, Not a Target
The disagreements over shapes, layers, and naming can feel petty or picky at first glance, but they're not just argument for argument's sake. Beneath much of this debate is a wariness of succumbing to Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure").
Given the test pyramid's long-standing popularity, many developers and testers have been in teams where the test pyramid was turned from a model into a measurement or rule. Once the focus of testing shifts to building the pyramid and conforming to its layers, its usefulness evaporates.
"Whenever we extrapolate experiences into guidelines, we tend to think of them as being valid on their own, forgetting about the circumstances that are integral to their relevancy."
— Oleksii Holub
Despite this risk and the confusion inherent to the original test pyramid formulations, the test pyramid became a best practice, and best practices have their own gravitational pull. Unlike other best practices, which are often easy to implement or ignore wholesale, the test pyramid occupies a messy middle ground: too broad to apply in a useful way, yet narrow enough to create confusion when it conflicts with reality. That middle ground only feels tense, however, when you try to make the pyramid do something it can't.
The test pyramid, ultimately, is a model: a starting point that cues you to think carefully when you stray from the beaten path. When testers talk with other testers (or with leaders or testers from across the industry), the test pyramid can still provide a common reference point for shaping agreement and disagreement.
The consensus we were once seeking shifts from everyone agreeing on how to test or use the pyramid to everyone understanding that the test pyramid is a language we use, tweak, and break. The test pyramid is a foundation, not for testing as a whole but for talking about testing and working together to test better.
Complement Your Test Pyramid Approach with the Right Tooling
Throughout the history of the test pyramid (and testing more broadly), testing frameworks and testing automation tools have emerged, become popular, and been replaced by other, better ones — and so on and so forth.
Many modern application frameworks now offer API layers that developers can use for simulation and testing. Tools like Docker allow developers to run tests that use real-world dependencies.
A whole market of testing automation tools has made writing and running automated tests far less time-consuming. The slow, resource-intensive tests the test pyramid warned us about are now faster, lighter, and more effective.
Testing isn't necessarily easy nowadays, but the problems have shifted.
Know the Rules to Break the Rules
In the same way that great novelists have to know grammatical rules well but also know when to break them, great testers need to understand the test pyramid so that they know how to start, when to diverge, and how to analyze the results.
The test pyramid still serves a purpose after all these years, and many teams benefit from thinking through it rather than just testing on the fly. Ignore it outright at your peril.
However, if teams follow the test pyramid too closely, testing becomes a goal instead of a tool. Put the purpose of testing first, not the best practice, and you can instead embrace what Holub calls "reality-driven testing."
Rather than treating testing as a goal, where you measure the success of your test suite against its resemblance to the test pyramid, you can let the practical realities of your code, your product, your industry, and your users lead the way with the test pyramid as a guide.
"The software testing pyramid isn't just a guideline—it's a blueprint for scalable, efficient testing. At Qase, we built our platform to support the entire pyramid, ensuring teams can seamlessly manage tests at every level without stitching together multiple tools. Great testing isn't about isolated efforts; it's about a unified, holistic approach that drives quality across the board."
— Nikita Fedorov, CEO of Qase
Nikita Fedorov, who founded Qase in 2019, built it to cover the entire test pyramid. Many testing tools, in contrast, are built only for particular test types or narrow environments and problems, requiring testing teams to cobble their tools together and map the result onto the pyramid.
But, as we've learned here, the test pyramid is most useful when it's a holistic model, so the best testing tools also tend to take a similar holistic approach. Testing is at its best when it's more than the sum of its parts.
References
- https://martinfowler.com/bliki/TestPyramid.html
- https://agiletesting.blogspot.com/2006/02/thoughts-on-giving-successful-talk.html
- https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code
- https://survey.stackoverflow.co/2023/#most-popular-technologies-new-collab-tools-prof
- https://qase.io/blog/wolt-optimized-their-testing-for-growth-with-qase/
- https://www.techtarget.com/whatis/definition/fail-fast
- https://martinfowler.com/articles/practical-test-pyramid.html
- https://qase.io/blog/edge-cases-lessons-learned/
- https://increment.com/reliability/testing-beyond-coverage/
- https://qase.io/blog/test-driven-development/
- https://abseil.io/resources/swe-book/html/ch11.html
- https://qase.io/blog/user-acceptance-testing-uat/
- https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
- https://www.symphonious.net/2015/04/30/making-end-to-end-tests-work
- https://bryanpendleton.blogspot.com/2015/04/on-testing-strategies-and-end-to-end.html
- https://qase.io/blog/flaky-tests/
- https://philmcminn.com/publications/parry2022a.pdf
- https://dl.acm.org/doi/10.1145/2568225.2568271
- https://kentcdodds.com/blog/write-tests
- https://engineering.atspotify.com/2018/01/testing-of-microservices/
- https://alex.party/posts/2022-11-14-thoughts-on-testing/
- https://tyrrrz.me/blog/unit-testing-is-overrated
Experts Quoted
- Martin Fowler, Author and Chief Scientist at ThoughtWorks
- Chris Coyier, Co-founder of CodePen
- Jeff Atwood, Co-founder of Stack Overflow
- Adam Bender, Principal Software Engineer at Google
- Mikko Vaha, QA and Software Engineering Lead
- Tim Bray, formerly a Distinguished Engineer at Amazon
- Ham Vocke, former Principal Software Developer at Stack Overflow
- Benoit Baudry, Researcher and Professor of Software Engineering
- Martin Monperrus, Researcher and Professor of Software Engineering
- Alin Turcu, Data and AI Engineering Director at Cognizant
- Sam Rose, Programmer and Writer
- Adrian Sutton, former Lead Developer at LMAX
- Bryan Pendleton, former Software Engineer at Google
- Jez Humble, author of Accelerate: The Science of Lean Software and DevOps
- Laura Inozemtseva, Author and Senior Software Engineer
- Reid Holmes, Professor
- Kent C. Dodds, Software Engineer Educator and creator of EpicWeb.dev
- Alex Riviere, Web Developer
- Oleksii Holub, Software Developer and Dev Tooling Consultant
- Nikita Fedorov, Founder and CEO of Qase