A complete guide to canary testing
Canary testing is the practice of deploying new functionality to a small subset of users in order to mitigate the risk of exposing all users to new and potentially dangerous changes. It is sometimes called a canary deployment or a canary release.
Named after the practice of using canaries in coal mines to detect toxic gases, canary testing involves gradually rolling out a new software update to a small subset of users or servers before deploying it to the entire system.
Table of contents:
- Goals of canary testing
- How canary testing works
- How canary testing differs from A/B testing and blue-green deployments
- Benefits of canary testing
- Challenges of canary testing
- How to determine when canary testing makes sense
- How to perform canary testing
Canaries were employed in mines starting in the late 1800s to detect gases like carbon monoxide. While this gas is lethal to both humans and canaries in large quantities, canaries are much more sensitive to small amounts, so they react to it sooner than humans do. The birds were carried in small resuscitation cages: once a canary exhibited signs of carbon monoxide poisoning, the cage door would be closed and a valve opened to release oxygen from a tank on top, reviving the bird. The coal miners would then evacuate the dangerous area.
In IT, the small number of users picked for the deployment of new functionality act as the canaries. The development team and managers watch dashboards and metrics closely, and if “the canary” reacts badly to the fresh release, the new functionality is quickly disabled. If the canary subset of users does not show any signs of critical issues with the new code, the release is deployed to all users (sometimes gradually).
Goals of canary testing
The purpose of canary testing is to ensure the new release is stable and does not introduce critical issues. Testing prior to release can only cover so much: it is impossible to guarantee that the changes introduced to the system will not cause unintended consequences in production with real users. One reason for this is that developers and QA engineers are influenced by their mental sets; the way we think is heavily shaped by our profession and experience, one might even say we are framed by our work to think in a certain way. Testing on a small but representative number of real users is another safeguard against potentially overlooked issues.
How canary testing works
The canary testing procedure is quite straightforward. As soon as you have a release that you think might be risky to roll out to everyone, you follow these steps:
1. Select the canary test group for the release
The dev and QA team select a small subset of the user base to act as testers for this release. This group should be large enough to produce results that allow for meaningful statistical analysis but small enough to minimize risk.
Psychology studies have demonstrated a particular phenomenon: when people are aware they are being watched or studied, their behavior changes (the so-called Hawthorne effect). This is why people are selected for the testing group without being made aware of it: their behavior stays precisely as it was before, so monitoring will reveal any issues caused by the new release rather than by the users’ awareness of the experiment.
2. Set up the testing environment
The team sets up a testing environment that operates in parallel with the existing live environment but with no real users.
The new release is deployed to this new test environment.
Then the system’s load balancer is configured to route requests from the designated canary users to the new environment. This ensures that only the selected subset of users interacts with the new version.
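As a rough illustration, here is a minimal TypeScript sketch of the routing decision a load balancer or reverse proxy makes per request. All names here (CANARY_UPSTREAM, canaryUserIds, resolveUpstream) are hypothetical; a real setup would typically express this as load balancer or service mesh configuration rather than application code.

```typescript
// Hypothetical sketch: per request, decide whether a user goes to the
// canary environment or the stable production environment.

const STABLE_UPSTREAM = "https://app.example.com";        // current production
const CANARY_UPSTREAM = "https://canary.app.example.com"; // new release

// The canary group: an explicit allow-list of user IDs selected for the test.
const canaryUserIds = new Set<string>(["user-42", "user-1337", "user-2048"]);

export function resolveUpstream(userId: string): string {
  // Only the selected subset of users ever reaches the new environment.
  return canaryUserIds.has(userId) ? CANARY_UPSTREAM : STABLE_UPSTREAM;
}

// Example: the load balancer or reverse proxy calls this for every request.
console.log(resolveUpstream("user-42"));   // -> canary environment
console.log(resolveUpstream("user-9001")); // -> stable environment
```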
3. Monitor the test
The team continuously monitors the canary test environment for important metrics such as error rates, response times, latency, CPU and memory utilization, I/O load, etc. This step lasts long enough to gather sufficient data to evaluate the new version’s performance and stability.
If any of the important metrics show an unexpected drop or surge, the canary test is turned off, the canary test users are rerouted back to the original version, and the new version is pulled back for further development.
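A minimal sketch of such a guardrail check, assuming a hypothetical fetchCanaryMetrics() backed by your monitoring stack and a disableCanary() that reroutes canary users back to the stable version; the thresholds are invented for illustration.

```typescript
// Hypothetical guardrail: compare canary metrics against thresholds and
// trigger a rollback automatically if any of them is breached.

interface Metrics {
  errorRate: number;      // fraction of failed requests, 0..1
  p95LatencyMs: number;   // 95th percentile response time
  cpuUtilization: number; // 0..1
}

const THRESHOLDS: Metrics = { errorRate: 0.02, p95LatencyMs: 800, cpuUtilization: 0.9 };

// Stub standing in for a real metrics query (e.g., Prometheus or Datadog).
async function fetchCanaryMetrics(): Promise<Metrics> {
  return { errorRate: 0.01, p95LatencyMs: 420, cpuUtilization: 0.55 };
}

// Stub standing in for the real rollback: reroute canary users to the stable version.
async function disableCanary(reason: string): Promise<void> {
  console.log(`Canary disabled: ${reason}`);
}

export async function checkCanaryHealth(): Promise<void> {
  const current = await fetchCanaryMetrics();
  const breached = (Object.keys(THRESHOLDS) as (keyof Metrics)[])
    .filter((key) => current[key] > THRESHOLDS[key]);

  if (breached.length > 0) {
    // Any unexpected surge ends the test: users go back to the stable version.
    await disableCanary(`thresholds breached: ${breached.join(", ")}`);
  }
}
```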
4. Evaluate the results and rollout or rollback
If the new version meets predefined deployment criteria (e.g., performance benchmarks, low error rates, positive user feedback), it is deemed ready for wider release.
If there is still some doubt about the code changes and the team decides to get more monitoring data, more users are routed to the canary test environment.
If there are no apparent issues, the new release is rolled out to the normal production environment and reaches all users. Then, the canary users are rerouted to the normal production environment as well. As soon as the canary test environment has no users, it is shut down.
At this stage, the canary test is considered successfully performed and finished.
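The promote-or-rollback decision in this step can also be expressed as a small piece of logic. The sketch below compares canary metrics against a baseline from the normal production environment; the metric names, sample-size cutoff, and tolerances are invented for illustration and would need tuning for a real system.

```typescript
// Hypothetical promotion decision: compare canary metrics with the baseline
// from the normal production environment.

interface Snapshot {
  errorRate: number;    // fraction of failed requests
  p95LatencyMs: number; // 95th percentile response time
  sampleSize: number;   // number of requests observed
}

type Decision = "promote" | "extend" | "rollback";

export function decide(baseline: Snapshot, canary: Snapshot): Decision {
  // Not enough traffic yet: keep the test running (or route more users to it).
  if (canary.sampleSize < 10_000) return "extend";

  // Regression beyond tolerance: pull the release back.
  if (canary.errorRate > baseline.errorRate * 1.5) return "rollback";
  if (canary.p95LatencyMs > baseline.p95LatencyMs * 1.2) return "rollback";

  // Canary looks at least as healthy as production: roll out to everyone.
  return "promote";
}

// Example run with made-up numbers.
const baseline: Snapshot = { errorRate: 0.004, p95LatencyMs: 350, sampleSize: 2_000_000 };
const canary: Snapshot = { errorRate: 0.005, p95LatencyMs: 360, sampleSize: 25_000 };
console.log(decide(baseline, canary)); // -> "promote"
```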
How canary testing differs from similar testing types
Canary testing vs. A/B testing
Canary testing simply ensures the new release is stable and does not introduce critical issues, while A/B testing determines which version of a feature or design performs better in achieving specific goals.
Canary testing also focuses more on operational metrics like error rates, performance, and stability, while A/B testing focuses on user behavior metrics like click-through rates, conversion rates, engagement, and user satisfaction.
One can say that canary testing minimizes risk by limiting exposure to potential issues in new software releases, while A/B testing enables more informed decisions about feature design and UX.
Canary testing is much simpler to perform, as you only need to select a small but representative group of users for the test. A/B testing requires much more careful selection of users for the testing groups (e.g., group A and group B). For A/B testing, both groups must be statistically representative samples of the overall user base. This is crucial because the results need to be valid and generalizable to the entire user population.
Additionally, for A/B tests, users need to be randomly assigned to the groups to avoid biases that could affect the outcome. Randomization ensures that differences in performance between versions are due to the changes made and not to underlying differences between user groups. When choosing groups for A/B testing, factors like demographics, user behavior patterns, and other relevant attributes must be evenly distributed across the test groups to ensure the comparability of results.
Canary testing vs. blue-green deployment
Blue-green deployment is a strategy used to minimize downtime during deployment while also giving QA and engineers a way to test the new code in an environment identical to the one serving all users. The approach achieves this by maintaining two identical production environments, with only one serving users while the second sits idle. When new functionality is released, it is first deployed to the idle environment and tested by QA engineers and developers. As soon as the engineers are satisfied, all users are routed to this environment, and the other environment becomes idle. If the user load does cause issues, engineers can quickly reroute all users back to the now-idle environment running the old version.
Canary testing utilizes the same routing functionality and requires the same environment preparation steps as blue-green deployment, but it also lets you gradually route users to the test environment.
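To make the routing difference concrete, here is a hypothetical sketch: blue-green switches all traffic at once between two identical environments, while a canary rollout moves a growing percentage of users. The environment names and percentages are illustrative only.

```typescript
// Illustrative contrast between blue-green switching and a canary ramp-up.

const BLUE = "https://blue.example.com";   // environment running the old version
const GREEN = "https://green.example.com"; // environment running the new version

// Blue-green: a single switch decides which environment serves 100% of users.
let activeEnvironment: "blue" | "green" = "blue";

function blueGreenUpstream(): string {
  return activeEnvironment === "blue" ? BLUE : GREEN;
}

// Canary: a percentage of users is routed to the new environment, and that
// percentage is increased gradually as confidence grows.
let canaryPercent = 5;

function canaryUpstream(userId: string): string {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < canaryPercent ? GREEN : BLUE;
}

// Cutting over in blue-green is all-or-nothing for everyone at once...
activeEnvironment = "green";
// ...while a canary rollout ramps up step by step (5% -> 25% -> 100%).
canaryPercent = 25;
console.log(blueGreenUpstream(), canaryUpstream("user-42"));
```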
Benefits of canary testing
Canary testing is a good strategy for minimizing the risks associated with potentially dangerous changes introduced to the system. When the new version reaches only a small subset of users, you have an early warning system: potential issues will only affect this small group, reducing the risk of widespread problems. Problems and bugs can be identified by observing real users interacting with the new system early in the deployment process, allowing teams to address the issues before a full rollout.
Testing in a live environment with real users generally provides more accurate and relevant feedback compared to pre-production testing environments.
If significant issues are detected during canary testing, the canary group can quickly be reverted to the previous stable version, minimizing disruption and user dissatisfaction.
Additionally, specific user groups can be targeted for testing new features, allowing for more controlled and focused feedback.
Successful canary phases build confidence in the new release, both within the development team and among stakeholders.
Overall, canary testing is a strategic approach that balances the need for innovation and quick releases of the new features with the imperative of maintaining system stability and user satisfaction.
Challenges of canary testing
Canary testing, like any other practice, comes with its own set of challenges.
1. Choosing a representative canary group
To be effective at mitigating risk, canary testing should be done on a small group of users that properly represents the overall user base. Selecting such a group can be difficult. The canary group needs to include a variety of users who interact with the application in different ways, from different locations, devices, and network conditions.
For example, if the canary group is chosen randomly, it might comprise users who rarely touch certain functionality in the product. If this particular part of the product has a serious issue in the new release, canary testing with such an underrepresented group will not uncover any issues.
While the canary group has to be small enough to limit the potential impact of any issues, it also needs to be big enough to provide statistically significant data. The share of the user base chosen for canary testing typically ranges from 1% to 5%.
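One way to get both representativeness and a sensible group size is to sample within strata (for example, country and device type) instead of sampling the whole user base at once. The sketch below is a rough illustration with invented field names; real selection would typically run against the user database or an experimentation platform.

```typescript
// Hypothetical stratified selection of a roughly 2% canary group.

interface User {
  id: string;
  country: string;
  device: "ios" | "android" | "desktop";
}

export function selectCanaryGroup(users: User[], percent = 2): User[] {
  // Group users into strata so every country/device combination is represented.
  const strata = new Map<string, User[]>();
  for (const user of users) {
    const key = `${user.country}:${user.device}`;
    const bucket = strata.get(key) ?? [];
    bucket.push(user);
    strata.set(key, bucket);
  }

  // Take roughly `percent` of each stratum rather than of the whole base,
  // so rarely used configurations still end up in the canary group.
  // A real implementation would sample randomly within each stratum.
  const selected: User[] = [];
  for (const group of strata.values()) {
    const count = Math.max(1, Math.round((group.length * percent) / 100));
    selected.push(...group.slice(0, count));
  }
  return selected;
}
```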
2. Effective monitoring and metrics analysis
It might be worthwhile to monitor granular metrics specific to the new release, as general metrics might not reveal specific issues introduced by the new version. For example, if the new release changes the way images are served to the client, you should monitor the rates of HTTP errors such as 403 Forbidden or 404 Not Found. This way, you can quickly identify if there are issues with the new method of serving image assets.
If your production environment doesn’t allow you to easily add metrics to monitoring, canary testing will not make much sense.
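As an illustration, per-release error tracking can be as simple as counting responses by status code separately for the canary and stable versions. The sketch below uses invented names and is no substitute for a real metrics stack such as Prometheus or Datadog.

```typescript
// Hypothetical per-release counter for HTTP responses, split by status code.

type Release = "stable" | "canary";

const counts = new Map<string, number>(); // key: "<release>:<status>"

export function recordResponse(release: Release, status: number): void {
  const key = `${release}:${status}`;
  counts.set(key, (counts.get(key) ?? 0) + 1);
}

// Error rate for the statuses we care about in this release, e.g. image
// serving errors such as 403 Forbidden and 404 Not Found.
export function errorRate(release: Release, statuses = [403, 404]): number {
  let errors = 0;
  let total = 0;
  for (const [key, count] of counts) {
    const [rel, status] = key.split(":");
    if (rel !== release) continue;
    total += count;
    if (statuses.includes(Number(status))) errors += count;
  }
  return total === 0 ? 0 : errors / total;
}

// Comparing the two releases side by side shows whether the new way of
// serving images introduces a regression.
recordResponse("canary", 200);
recordResponse("canary", 404);
console.log(errorRate("canary"), errorRate("stable")); // 0.5 0
```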
3. Implementing a seamless rollback process
To minimize disruption if the canary test reveals significant issues, it’s crucial to implement a seamless rollback process. There are two primary ways to achieve this: feature flags and routing changes.
Feature flags essentially introduce a set of IF conditions into the code, separating the new code from the existing code. They act like light switches, controlled by changing just one line in the backend configuration or even through an admin web interface. These conditions check if the user is part of the canary group. If true, the system uses the code and functions from the new release; otherwise, it uses the old code.
One minor complication with feature flags is that they need to be removed after the canary test is completed, regardless of the outcome. Leaving them in the code can lead to unnecessary complexity and potential technical debt.
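Here is a minimal sketch of the feature-flag approach, assuming a hypothetical in-process flag store and an isCanaryUser check; most teams would use a feature-flag service or library instead of hand-rolled conditions, but the shape of the code is the same.

```typescript
// Hypothetical feature flag guarding the new code path for canary users only.

const flags: Record<string, boolean> = {
  // Flipping this to false (via config or an admin UI) disables the canary
  // instantly, without a redeploy.
  "new-image-pipeline": true,
};

const canaryUserIds = new Set<string>(["user-42", "user-1337"]);

function isCanaryUser(userId: string): boolean {
  return canaryUserIds.has(userId);
}

export function serveImage(userId: string, imageId: string): string {
  if (flags["new-image-pipeline"] && isCanaryUser(userId)) {
    return serveImageViaNewPipeline(imageId); // new release code path
  }
  return serveImageViaOldPipeline(imageId);   // existing, proven code path
}

// Stubs standing in for the real implementations.
function serveImageViaNewPipeline(imageId: string): string {
  return `https://cdn-v2.example.com/${imageId}`;
}
function serveImageViaOldPipeline(imageId: string): string {
  return `https://cdn.example.com/${imageId}`;
}
```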
Another way to disable the canary test is to quickly reroute traffic for the canary users from the testing environment back to the normal production environment. However, this requires careful handling of session data: if the testing environment has different session storage from the production environment, canary users will lose their session data after rerouting. Additionally, rerouting the canary users to the normal version of the system should be done swiftly to minimize any negative impact on users if issues arise with the new release.
The rule of thumb for choosing between these two methods is simple: choose rerouting if it can be done quickly and efficiently.
How to determine when canary testing makes sense
Canary testing makes the most sense when the team is about to deploy an update that could significantly impact user experience or system performance, such as major architectural changes or overhauls of core functionality, or when introducing new features incrementally to gather user feedback and performance data without affecting the entire user base.
However, with a proper investment in the testing environment and deployment process, you can roll out every new release through a canary release pipeline.
For example, Google offers canary builds of their Chrome browser to anyone (who’s keen on testing the new shiny but potentially broken stuff) specifically to get some real users’ feedback.
For their mobile apps, Facebook does canary releases daily, gradually routing more and more users to the new release and monitoring the important metrics and feedback.
How to perform canary testing
I once worked for a company where the mobile version of our website featured a chat window that used XMLHttpRequest on the frontend to send and receive messages. This chat functionality was crucial, as over 40% of users engaged with it daily.
The XMLHttpRequest-based chat worked as follows: when a user sent a message, a new XMLHttpRequest object was created in the browser to transmit the message to the server. Upon receiving and processing the request, the server checked for new messages for the user and included them in the response payload.
However, when the user wasn’t sending anything, we had to schedule empty XMLHttpRequest requests to poll the server for new messages. The challenge with this regular server polling approach was finding the right balance between chat responsiveness and client browser load. Polling every 10-100ms ensured timely message delivery from the server to the client but imposed a significant load on the browser and phone battery. Conversely, polling every two seconds risked displaying multiple messages simultaneously, degrading user experience: it felt as if the network wasn’t fast enough.
At that time, WebSockets emerged as a promising alternative to traditional XMLHttpRequests. WebSockets allowed for a persistent server connection, enabling the server to push messages to the client whenever available.
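For illustration, here is roughly what the two client-side approaches look like, reduced to a minimal browser-side TypeScript sketch. The endpoint URLs and payload shape are made up for the example and were not our actual API.

```typescript
// Old approach: poll the server on a timer with XMLHttpRequest.
function startPolling(intervalMs: number, onMessages: (msgs: string[]) => void) {
  setInterval(() => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", "/chat/poll"); // hypothetical endpoint
    xhr.onload = () => {
      const { messages } = JSON.parse(xhr.responseText) as { messages: string[] };
      if (messages.length > 0) onMessages(messages);
    };
    xhr.send();
  }, intervalMs);
}

// New approach: keep one WebSocket open and let the server push messages
// the moment they are available, with no polling interval to tune.
function startWebSocketChat(onMessage: (msg: string) => void) {
  const socket = new WebSocket("wss://chat.example.com/ws"); // hypothetical endpoint
  socket.onmessage = (event) => onMessage(event.data as string);
  socket.onopen = () => socket.send(JSON.stringify({ type: "hello" }));
  return socket;
}
```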
In theory, this new approach seemed perfect for managing chat connectivity. We built a prototype of the new chat system and performed extensive internal testing. However, we recognized that our clients had a much more diverse set of usage patterns for the chat, and we would never be able to uncover all of their use cases ourselves.
Thus, testing the new chat system on a small portion of live users appeared to be an excellent strategy.
We decided to employ canary testing to deploy the WebSocket-based chat system. Here’s how we approached it:
- We chose a small group of users from 10 different countries using two mainstream browsers at the time: Google Android 4.4.1 browser and Apple Mobile Safari 7, which we thought would be a representative group of users.
- The new WebSocket-based chat was deployed to this selected group; we used the existing A/B testing infrastructure for the routing. The rest of the users continued using the existing XMLHttpRequest-based chat.
- We closely monitored performance metrics such as connection stability, message delivery times, “message drop” rate, etc.
- We saw no real issues with the new WebSocket functionality in that small group, so over the next month we gradually routed more and more users to the WebSocket-based chat.
- Within two months, we removed the old XMLHttpRequest-based code completely, essentially phasing out support for a module that was six years old.
The canary testing approach allowed us to validate the WebSocket-based chat in a real-world environment with minimal risk. It also enabled us to slowly and gradually release the new functionality to more and more users while keeping an eye on the key metrics. Ultimately, the new chat system provided a more responsive and efficient experience, significantly improving user satisfaction.
Canary testing helps you balance growth and risk
Canary testing is a powerful strategy for managing the risks associated with deploying new software updates. By rolling out changes to a small, representative group of users first, it allows development teams to detect and address issues early, ensuring a smoother and more stable release for the broader user base. The step-by-step process of canary testing, from selecting the test group and setting up the environment to monitoring and evaluating results, helps maintain system performance and user satisfaction.
Whether you’re dealing with major architectural changes or incremental feature updates, canary testing offers a balanced approach to innovation and reliability. By understanding its benefits and addressing its challenges, you can make informed decisions that enhance your software deployment strategy and user experience.