QA myth busting: Quality can be measured
Let’s bust some QA myths.
First myth: Quality can be measured.
Everyone wants to measure quality, but the idea that quality can be definitively measured is a myth.
Imagine trying to measure the quality of a family road trip. What would make the road trip an indisputable success for the entire family? You’ll have a hard time finding metrics that could give you a solid answer.
Of course, that doesn’t mean that we shouldn’t use the tools and metrics at our disposal to analyze and improve quality. It just means we need to have a broader understanding of what quality measurements tell us and what to do with that information.
Definitions of quality give us a general idea to work towards
Without a clear, shared understanding of “quality,” there is too much ambiguity in what the end goal is. This is particularly important for such a complicated topic as quality and quality assurance.
ISO 9000 defines quality as “a degree to which a set of inherent characteristics fulfills requirements.” This implies that the quality of a product is how well it satisfies clients’ needs and desires, expressed in the form of requirements.
ISO 25000 offers a more specific definition for software quality: “capability of software product to satisfy stated and implied needs when used under specified conditions.” This focuses on the software’s ability to meet both explicit and implicit client needs within the client’s context.
Combining these perspectives, we can formulate a unified definition: “Quality is a match between what’s desired and what’s produced”.
Quality is subjective
Quality can only be achieved if we, the IT folks, first understand what the client wants and desires, and then are able to produce what we understand. If the clients use our software and are happy with it, this should mean we succeeded in understanding them and producing what they wanted.
Science has struggled with measuring and quantifying quality for this particular reason: quality is very subjective. People will only consider a product or service to be high quality if they feel that the product or service pleases them or helps them solve their problems.
Accepting the fact that quality is very subjective, is it even possible to fully measure it?
Let’s go back to the family road trip. You, the driver, have a destination (project completion) and planned stops (project milestones). Let’s say you made all the planned stops and made it to your destination. That would add up to a high quality road trip, right? No, because quality is not measured by getting to the destination, but rather the experience along the way. Let’s say one passenger got food poisoning, you spent several hours stranded with a flat tire, and a hotel at one of the planned stops lost your reservation so you had to sleep in the car one night. Would you still perceive that as a quality road trip?
Quality is heavily influenced by people’s perceptions
The overall quality of a product is ultimately determined by how people perceive it, and perceptions can’t be fully measured. You can’t get into the heads of the customers and see what they think of your product.
However, there are various ways to check whether clients are happy:
- Employing a dogfooding practice, where the employees become the first clients of the product, giving them a much better understanding of the customer;
- Organizing customer surveys and focus-groups, getting direct feedback on the product;
- Coming up with proxy metrics for quality, such as Net Promoter Score (NPS), Customer Satisfaction Score (CSAT), and Customer Effort Score (CES).
A proxy metric is used when it is expensive, very difficult, or impossible to measure something directly. Proxy metrics only correlate with the intended state rather than measuring it directly.
If the NPS metric is going down, it might be that the quality has not changed, but the NPS is now measured in a market with a different culture. Similarly, CSAT and CES only correlate with quality because people’s responses are often significantly influenced by various response biases (when people simply don’t share the truth in surveys). One can say that usually when quality goes down, NPS, CSAT, and CES go down, but it’s not guaranteed.
While quality from the clients’ perspective can’t be fully measured, there is still some value in the proxy metrics: they serve as a signal for further analysis.
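To make the idea of a proxy metric concrete, here is a minimal sketch of how NPS is typically computed: respondents scoring 9-10 count as promoters, 0-6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. The survey answers below are made-up example data.

```python
# Minimal sketch of computing Net Promoter Score (NPS) from survey answers
# to "How likely are you to recommend us?" on a 0-10 scale.
def nps(scores):
    promoters = sum(1 for s in scores if s >= 9)    # 9-10
    detractors = sum(1 for s in scores if s <= 6)   # 0-6
    return round(100 * (promoters - detractors) / len(scores))

survey_answers = [10, 9, 8, 6, 10, 7, 3, 9, 9, 5]   # made-up data
print(nps(survey_answers))  # 20 (50% promoters - 30% detractors)
```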
Continuing with the family road trip example, let’s say your destination is San Francisco, California. The quality of the journey depends on your family’s perception of the trip — and there is no definitive way to measure their perceptions. Before the trip, you may have agreed to hit specific sightseeing spots along the way and not to drive faster than 90 miles per hour, which serve as proxy metrics for the quality of the trip.
If you rely only on the proxy metrics, you’d think driving at a reasonable speed and spending 5 minutes at each stop would be enough. But if your family is frustrated by being rushed through planned stops, you play music everyone hates, and ignore all their requests and complaints throughout the trip, your family is unlikely to perceive the trip as a high quality experience.
Now let’s imagine gathering feedback from your family during the trip. At certain points, you ask your family to rate the quality of the road trip experience on a scale of 1-5. As you continue the journey, the score is gradually decreasing, even though you are staying below 90 mph and stopping at all the planned sightseeing spots. Two proxy metrics (speed and planned stops) tell you that you are achieving quality while another (feedback score) tells you that you are not.
The value in the proxy metrics is not to definitively determine quality, but rather to serve as a signal for further analysis:
- If you are driving under 90 mph but you notice some family members are looking nauseous, you should analyze the situation. Are you swerving too aggressively? Braking too suddenly? Is under 90 mph still too fast?
- If you see that the feedback score is dropping during the drive, you should look into the cause. Is your family unhappy with the time spent at each sightseeing spot? Do they need more snacks? Could the music be improved?
Internal quality metrics and employee perceptions
So far we’ve talked about quality from the client’s point of view, or “external quality.” There’s also the “internal quality” of the product, or, as Martin Fowler puts it, “how easy it is to keep modifying the code.”
Internal quality is all about the sustainability of development. Every engineer has had to deal with “messy code” or “bad architecture,” where it was ridiculously hard to make any changes to the code. There are a multitude of reasons why an engineer can perceive code as “bad” or “low quality,” but this perception is also very subjective.
There is no universal metric showing how “good” or “bad” the code is, as all engineers, along with their skills and knowledge of the codebase, are different and have different opinions. Of course, there are general signs of “bad” code, such as high cyclomatic complexity of a module or a function. However, there are valid cases where it is not possible to reduce cyclomatic complexity, for example, when building finite state machines. In any case, all the signs of “bad” code are still subject to interpretation of the development team, and any human interpretation is inherently subjective.
For example, one might think that the percentage of test coverage would be a good objective metric of code quality, but there are cases where the code coverage percentage might be decreasing while the internal quality is improving. This might happen, for instance, when low-value, extremely flaky tests are removed from the codebase, or when a trustworthy third-party library replaces custom-built modules.
Like external quality, all internal quality metrics are proxy metrics, only correlating with internal quality. And, once again, the value in the proxy metrics is mainly to serve as a signal for further analysis.
Here we need to talk about the road trip in terms of the driver, passenger, and car experience. As the driver, you have to watch the speedometer, monitor the map, and follow all the passenger requests, even if they are unreasonable. You become so focused on external quality (your family’s perception of the trip) that you ignore other signs of quality. Let’s say two of your family members insist you drive off road through a desert. You know that your car is not built for such tough driving conditions, but your passengers (the customers) want to see the desert, so you drive off the paved road and ignore signs of tire wear and strange sounds coming from the engine. You might even push through extreme fatigue to keep driving your passengers where they want to go.
Maybe your family is happy and perceives the road trip as high quality for a time, but while focusing on the external quality (your family’s happiness), you neglect the internal quality (the car’s capabilities and your happiness), which eventually leads to the car (and you) breaking down.
Clearly, neglecting internal quality in the desire to satisfy customers isn’t an ideal approach either. And just like external quality, there is no way to fully measure internal quality.
Metrics are signals, not definitive measurements or goals
We’ve stated that both internal and external quality cannot be fully measured, but there are certain correlating proxy metrics.
Since correlation does not mean that a change in the metric value always indicates a change in quality, the main value of proxy metrics is to serve as a signal for analysis.
If we see that the test coverage percentage is dropping, we should analyze why. Is it that we removed redundant tests which weren’t yielding much value, or have we started rushing and deploying new features without any tests?
When metrics change, we should never aim to improve the metrics themselves instead of analyzing the underlying reasons for their change. Only the analysis will show whether any actions are needed, and if so, which ones and where.
A friend once worked for a startup, building a multimedia player app for Android tablets. They started with their own custom database engine, and even managed to cover 40% of it with tests. However, the clients kept complaining about data being sometimes lost or corrupted. The team discovered that most of these problems were due to issues in the database engine. After extensive research and multiple test integrations, they replaced the custom database with SQLite. Clients stopped reporting issues.
Switching to SQLite significantly improved the overall quality of the product, even though the test coverage dropped significantly.
What if, instead of improving the quality, they had set the goal to get 100% test coverage? They would then have two options:
- Spend years writing tests for the database while stopping or slowing down other development, essentially risking the whole business
- Pay SQLite hundreds of thousands of dollars to get access to TH3 (extensive test harness for SQLite) for little reason — SQLite always tests their releases with the same TH3 harness.
Both options would lead to unintended, harmful consequences of setting the metric value as the goal: when metrics are set as goals, Goodhart’s law kicks in.
Back to the road trip. You learned from last time and decided to set some metrics for measuring the quality of the road trip. You create a clear plan for what parts of San Francisco you’ll be visiting, assign someone to be in charge of music for the entire ride, and set a goal to hit at least 15 roadside attractions and spend at least 20 minutes at each stop. You’ve improved all measurable metrics, so your road trip is indisputably “high quality” right?
Unfortunately, no. By planning for more events in San Francisco, you reduced the food budget for the entire trip so now your passengers are unhappy with the meal choices. The frequent and lengthier stops at roadside attractions slow you down and interrupt the flow of the music that another passenger painstakingly planned out. And this time around, your partner is pregnant and much more prone to motion sickness and sensitive to food smells in the car — their perception of quality changed, just like customers’ perception of quality does over time.
Quality can’t be fully measured, but there is value in information
Different metrics correlate with various aspects of the quality of the product and can hint at where and what to analyze.
The broader the metric, the harder the analysis:
- If your NPS is decreasing, you will need to analyze pretty much everything in your product: from market changes to your product’s UX and design, from performance to defects in code
- If you see the number of requests to customer support growing, you should discuss with CS, QA, and Dev to determine what’s going on
- If you see the number of flaky tests growing, you need to meet with platform engineers, QA, and Dev to investigate the cause
- If you see the number of returns from QA to Dev growing, you need to meet with QA and Dev to figure out the reason
- If you see the code review average time growing, you need to figure out the reasons with the Dev team
The reasons why each metric changed can vary. The measurement alone doesn’t tell you anything; it only hints at where to start the analysis. Like in the road trip example: everyone liked the food the first time, so you keep it the same, but the satisfaction score for food goes down on the second trip. The decrease in food satisfaction doesn’t necessarily mean the food is the problem, it just signals that you should look into it. Further investigation reveals that your partner’s pregnancy makes them particularly sensitive to greasy food, and their complaints about everyone else’s food smells in the car make the rest of the family have a less enjoyable time.
How to use metrics and measurements
Take the following steps:
1. Choose the metric, start measuring, and visualize it for monitoring
With internal software quality, there are plenty of metrics to select for monitoring. In their book Software Metrics: A Rigorous and Practical Approach, Norman Fenton and James Bieman list several. Here are some to consider (a short computation sketch follows the list):
- Cyclomatic complexity measures the number of linearly independent paths through a program’s source code.
- Code coverage measures the percentage of code that is covered by automated tests.
- Defect density measures the number of defects per unit of code.
- Defect resolution time is the average time taken to fix reported defects.
- Cycle time is the time taken from the start of a development task to its completion.
- Lead time is the total time from the initial request to the delivery of the feature.
- The number of returns from QA to Dev is how many times tickets are passed back to development after QA finds bugs.
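As a small illustration, here is a sketch of how two of these metrics are commonly computed; the numbers are made up, and defect density is expressed per thousand lines of code (KLOC), one common convention.

```python
# Made-up example numbers for two of the metrics listed above.

# Defect density: defects per unit of code, here per KLOC (1,000 lines of code).
defects_found = 46
lines_of_code = 57_500
defect_density = defects_found / (lines_of_code / 1000)
print(f"defect density: {defect_density:.2f} defects per KLOC")        # 0.80

# Defect resolution time: average time from a defect being reported to being fixed.
resolution_days = [2, 5, 1, 8, 3, 4]   # per-defect resolution time, in days
avg_resolution = sum(resolution_days) / len(resolution_days)
print(f"average defect resolution time: {avg_resolution:.1f} days")    # 3.8
```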
For more detailed guidance, check our article on defect management.
Let’s take “the number of returns from QA to Dev” as an example. Say we have an old-school process where developers work on features, and each feature has a ticket in Jira. When a developer completes the work, they pass the Jira ticket from “in development” to “ready for testing” status, so that testers can pick it up for testing. If testers find defects, they log all the necessary information and pass the ticket back to the “in development” status so that the developers can fix found issues.
Then, we decide to monitor how many times tasks are “returned” from QA to Dev, because any defect found late slows down development and we really want developers to assure quality before passing the feature on to testers. We decide that there’s some correlation between “how many tickets are passed back from QA to Dev” and quality. We think that when testers start passing more tickets back to developers, it’s a good signal for us to analyze the situation.
To start monitoring the metric, we need to display it. Jira allows us to do so via JQL, custom fields, and automation. Essentially, you will see how many times tickets go from QA back to Dev in a certain period of time, such as one week.
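As an illustration, here is a minimal sketch in Python that counts these transitions from issue changelogs via the standard Jira REST search endpoint. The instance URL, credentials, project key, and status names are assumptions to adapt to your own workflow, and a real script would also handle pagination.

```python
# Minimal sketch: count weekly "QA -> Dev" returns from Jira issue changelogs.
# The URL, credentials, project key, and status names below are hypothetical.
import requests
from collections import Counter
from datetime import datetime

JIRA_URL = "https://yourcompany.atlassian.net"   # hypothetical instance
AUTH = ("bot@yourcompany.com", "api-token")      # hypothetical credentials

# JQL narrows the search to issues that were ever passed back from QA to Dev.
jql = 'project = APP AND status CHANGED FROM "Ready for Testing" TO "In Development"'

resp = requests.get(
    f"{JIRA_URL}/rest/api/2/search",
    params={"jql": jql, "expand": "changelog", "maxResults": 100},
    auth=AUTH,
)
resp.raise_for_status()

returns_per_week = Counter()
for issue in resp.json()["issues"]:
    for history in issue["changelog"]["histories"]:
        for item in history["items"]:
            # Each matching changelog entry is one "return" from QA to Dev.
            if (item["field"] == "status"
                    and item["fromString"] == "Ready for Testing"
                    and item["toString"] == "In Development"):
                week = datetime.strptime(history["created"][:10], "%Y-%m-%d").isocalendar()[:2]
                returns_per_week[week] += 1   # keyed by (year, ISO week)

for (year, week), count in sorted(returns_per_week.items()):
    print(f"{year} week {week}: {count} returns")
```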
2. Apply Shewhart’s control charts to the measurements
Shewhart control charts are tools used in statistical process control to monitor how a process changes over time. They help you distinguish between normal variation (random fluctuation) and actual issues (signals that something is wrong), making it easier to understand whether the process is in a state of control or whether there are variations that need attention. They are beneficial because every work item is different, so the metric value might fluctuate a little naturally; when only natural fluctuation occurs, there’s no need for analysis or any other action.
Applying a Shewhart control chart to the metric tracking how many times QA returns tasks to Dev is quite straightforward (a minimal sketch in code follows the list):
- Get the number of times QA returns each Jira ticket to Dev each week
- Calculate the statistics:
- Mean (average): the average number of returns per week
- Range: the difference between the highest and lowest number of returns in the dataset
- Standard deviation: the amount of variation or dispersion from the average
- Create the control chart:
- X-Axis: weeks
- Y-Axis: number of returns from QA to Dev
- Center Line (CL): the average number of returns
- Control Limits: calculate and plot the Upper Control Limit (UCL) and Lower Control Limit (LCL). Typically, these are set at three standard deviations above and below the mean.
- Plot the data: for each week, plot the number of returns on the chart
- Analyze the chart and look for patterns or trends that indicate a problem:
- Points outside control limits indicate a potential issue that needs further analysis
- A run of points above or below the center line suggests a shift in the process that might also need further analysis
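Here is a minimal sketch in Python of the calculation described above: the center line, three-sigma control limits, and the simplest out-of-control checks. The weekly counts are made-up example data.

```python
# Minimal sketch of a Shewhart-style control chart for weekly "returns from
# QA to Dev". Control limits are mean +/- 3 standard deviations, as above.
import statistics

weekly_returns = [4, 6, 5, 7, 5, 1, 9, 10, 6, 5]   # made-up data for weeks 1-10

mean = statistics.mean(weekly_returns)              # Center Line (CL)
stdev = statistics.stdev(weekly_returns)
ucl = mean + 3 * stdev                              # Upper Control Limit
lcl = max(mean - 3 * stdev, 0)                      # Lower Control Limit (counts can't be negative)

print(f"CL={mean:.1f}  UCL={ucl:.1f}  LCL={lcl:.1f}")

# Flag points outside the control limits.
for week, value in enumerate(weekly_returns, start=1):
    note = "  <-- outside control limits, analyze" if value > ucl or value < lcl else ""
    print(f"week {week}: {value}{note}")

# Flag a run of 8 consecutive points on one side of the center line,
# a common signal of a shift in the process.
side = ["above" if v > mean else "below" for v in weekly_returns]
for i in range(len(side) - 7):
    if len(set(side[i:i + 8])) == 1:
        print(f"possible process shift: weeks {i + 1}-{i + 8} are all {side[i]} the center line")
```

In real use, you would plot these values over time and recalculate the limits as more data accumulates.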
Shewhart control charts help us filter out statistically insignificant variations in the metric value, giving us a better understanding of whether the monitored metric is signaling the need to analyze potential issues. Suppose the chart shows a sudden drop in tasks returned to Dev in week 6 and a spike in weeks 7-8 before returning to average: we’d want to investigate why. At first glance you might think that week 6 was a great week. But the reality might be that half of the QA team was on vacation in week 6, so fewer tasks were returned to Dev during that week, and the numbers spiked in weeks 7 and 8 because the team was catching up.
3. When detecting a surge or a drop, analyze the reasons
If we observe a statistically significant surge or drop, we need to analyze the reasons.
The reasons always differ, as processes, people, product, organizational structure, and company culture form a unique context.
The number of returns from QA to Dev growing could be attributed to various reasons such as:
- A new skilled tester is hired and they start finding more bugs. Nothing needs to change, as the developers see it as fair and start paying more attention to quality and testing. Even with the metric going up, quality eventually improves.
- KPIs for QAs focusing on finding bugs are introduced by management, leading to even minor issues being treated as bugs and tickets being sent back to development. The metric goes up while the quality goes down alongside relations between developers and QA. The solution would be to remove the KPIs.
- Layoffs hit the development team hard: the most experienced and therefore well-paid developers are fired, and the quality diminishes. The analysis shows that QAs did start finding more true defects in the code. There is no solution except for waiting until the remaining developers learn or for the company to realize their mistake and hire more experienced developers.
- A new manager starts imposing deadlines on the development team, causing them to cut corners and rush to push the tickets to QA. The analysis shows that QAs did start finding real defects, but the only solution is to inform the new manager of the damage he is causing to the product.
- The company is purchased by a larger one and is forced to change its approach to work. Previously, QAs would simply talk to developers and fix bugs together right away, so tickets were rarely passed back from QA to development. Now, QAs are forced to do their work after development is finished, resulting in more tickets being passed back. The quality remains the same, but the delays increase.
4. Reassess the metrics regularly
As we agreed, different metrics only correlate with various aspects of product quality and can only hint at where and what to analyze. There might be cases when the correlation disappears.
For example, a bank might decide to gradually replace an old COBOL codebase that had 80% test coverage with a new system built in Java. The team agrees to fully cover only the core functionality with autotests, meaning that the refactoring project will constantly show an overall drop in the test coverage metric. In this scenario, there is no need to keep analyzing what’s behind the dropping test coverage trend. It would make perfect sense to ignore this metric until the refactoring is complete.
The rule of thumb for reassessing the metric is simple: if upon a few instances of analysis you see that there’s no correlation between the metric change and the product quality, consider pausing monitoring for this metric or replacing it with a different one.
Stop measuring quality metrics and start using them as signals instead
Remember that quality is not definitively measurable. You can plan the perfect road trip with your family, hit every milestone, and reach the intended destination and still end up with a car full of unhappy passengers.
If all your metrics are showing “good” performance but your employees are fleeing the company (passengers bailing on the road trip), you’re doing something really wrong. If all your metrics are showing “good” performance but your customers are choosing your competitor’s product (family chooses alternate vacation options), you’re also doing something wrong.
Metrics can only serve as a signal to analyze the reasons and improve the processes upon analysis. Embrace the inherent subjectivity of quality, use metrics wisely, and employ Total Quality Management (TQM).