How to review and improve your test strategy
This article closes the risk-based testing cycle: how to measure whether your testing portfolio is actually working, when to review it, what to track, and how to rebalance when the evidence says to.
In the previous articles, we identified and quantified risks, ranked them by exposure, and built a testing portfolio traceable to those risks. Like any investor, we care about returns: how do we know the portfolio is actually working?
Not a one-time plan
A testing strategy is an investment portfolio, and like any portfolio, the conditions around it change. When GDPR hit, security and data protection risks jumped overnight for every company handling EU data. When a competitor gains serious market share simply because their UX is better, usability risks that were "low priority" suddenly threaten revenue. A strategy that was right six months ago may be wasting resources today, or missing new risks entirely.
Is our testing reducing the risks we care about? Are we doing it at reasonable cost? If you don't know, you're just doing testing for the sake of doing it. Investing blind is rarely a good option.
When to review
When GDPR was announced, companies had two years before enforcement. Those who started reviewing their quality and security practices early invested less and were ready on time. Those who waited for the deadline ended up in crisis mode. Same principle here: the earlier you review your testing portfolio, the cheaper the adjustment.
I think of reviews as happening in three modes.
Scheduled: per release, quarterly, or semi-annually, depending on how fast your risk landscape changes. Catching problems before they hurt is always cheaper than reacting after they do.
Event-driven: when the context changes materially. An architecture change, a new high-priority risk, a significant team restructuring, a new competitor eating your market share. These don't wait for the next scheduled review because the portfolio may already be misaligned.
Early-warning: watching for signals. A spike in escaped defects, increased support tickets, delivery slowdowns, performance degradation. When they appear, I wouldn't wait for the next scheduled review either.
Leading vs lagging indicators
Knowing when to review is one part; the other is knowing what to look at. Some metrics you watch continuously, others you evaluate during the review itself.
Leading indicators are early signals that tell you whether your strategy is working before failure costs materialize:
- Mean time to detect: how quickly defects are found after introduction
- Change failure rate: what percentage of changes cause production failures
- Static analysis trend: are critical findings decreasing over time?
- Flaky test rate: noisy tests erode evidence quality, so a rising flakiness trend is worth watching on its own
- Mutation score: how well your tests detect injected faults
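Two of the leading indicators above, mean time to detect and change failure rate, can be computed directly from tracker data. A minimal sketch, assuming hypothetical defect records with introduction and detection timestamps (the data here is illustrative, not from any real tracker):

```python
from datetime import datetime

# Hypothetical defect records: (introduced, detected) timestamp pairs.
defects = [
    (datetime(2024, 3, 1), datetime(2024, 3, 3)),    # detected after 2 days
    (datetime(2024, 3, 5), datetime(2024, 3, 6)),    # 1 day
    (datetime(2024, 3, 10), datetime(2024, 3, 17)),  # 7 days
]

def mean_time_to_detect(records):
    """Average days between a defect's introduction and its detection."""
    deltas = [(found - introduced).days for introduced, found in records]
    return sum(deltas) / len(deltas)

def change_failure_rate(total_changes, failed_changes):
    """Share of changes that caused a production failure."""
    return failed_changes / total_changes

print(mean_time_to_detect(defects))
print(change_failure_rate(40, 6))  # 0.15
```

The point isn't the arithmetic; it's that both numbers fall out of data most teams already have, so they can be tracked continuously rather than computed only at review time.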
Lagging indicators are the actual outcomes:
- Escaped defects, severity-weighted
- Production incidents: count, severity, user impact
- Customer churn attributable to quality
- External failure cost: the realized cost of production failures (response, remediation, support load, revenue loss)
Leading indicators let you adjust before things go wrong; lagging indicators confirm whether risk actually went down.
None of these metrics measures quality directly; they are proxies.
We can never know exactly how much value our testing added, because we can't compare "what happened" with "what would have happened without it". But we can watch trends: are escaped defects going down? Are high-severity incidents becoming rarer? Is external failure cost decreasing? If the trends are moving in the right direction, the portfolio is working. If not, something needs to change.
Defect containment
Of all the lagging indicators, defect containment has been the most useful in my experience: the ratio of defects found pre-release versus post-release, split by severity. If you're catching high-severity issues before release, that's a good signal your controls are working where it matters.
The escape rate is worth tracking separately: escaped defects divided by total defects discovered, where "escaped" means first found in production. Same with time-to-detect alongside time-to-fix, tracked separately for pre- and post-release defects.
For the insulin pump from the first article, a high-severity escape isn't just a metric: it's a patient safety event. For the internal CMS, a post-release defect might be a formatting issue. Same measurement, completely different stakes, which is why severity-weighting matters.
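Severity-weighted containment is easy to make concrete. A sketch, where the weights and counts are illustrative assumptions you would tune to your own context:

```python
# Hypothetical severity weights; pick values that reflect your failure costs.
WEIGHTS = {"critical": 10, "major": 5, "minor": 1}

def weighted_containment(pre_release, post_release):
    """Severity-weighted share of defect 'weight' caught before release.

    pre_release / post_release: dicts of severity -> defect count.
    Returns a value in [0, 1]; higher means better containment.
    """
    pre = sum(WEIGHTS[s] * n for s, n in pre_release.items())
    post = sum(WEIGHTS[s] * n for s, n in post_release.items())
    return pre / (pre + post)

def escape_rate(total_defects, escaped_defects):
    """Escaped = first found in production."""
    return escaped_defects / total_defects

# One escaped critical drags the weighted ratio down far more
# than a raw pre/post count would suggest.
print(weighted_containment({"critical": 4, "major": 10, "minor": 30},
                           {"critical": 1, "major": 2, "minor": 5}))
```

With the weights above, a single escaped critical counts as much as ten escaped minors, which is exactly the "same measurement, different stakes" point.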
Cost of Quality as a diagnostic
The CoQ framework introduced in the first article in this series isn't just a way to categorize spending. Tracked over time, it becomes a diagnostic.
I track all four categories: prevention, appraisal, internal failure, and external failure. The two ratios that tell me the most are the shift-left ratio (prevention plus appraisal, divided by internal plus external failure) and the external failure ratio (external failure divided by total CoQ).
The pattern I see most often is this: teams are spending heavily on external failure (incidents, support, hotfixes, rework) and very little on prevention. That's backwards from an economic standpoint. The idea is to invest in prevention and appraisal in a way that causes external failure costs to decrease over time. Early on, external failure often dominates. As the portfolio matures and prevention becomes more deliberate, you should see the distribution shift.
If your external failure ratio isn't trending down, your prevention and appraisal spend isn't translating into fewer escapes.
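Both ratios fall straight out of the four CoQ buckets. A sketch with illustrative numbers for a quarter that shows the common anti-pattern (heavy external failure spend, light prevention):

```python
def coq_ratios(prevention, appraisal, internal_failure, external_failure):
    """Shift-left ratio and external failure ratio from the four CoQ buckets.

    All inputs in the same unit (e.g. person-days or currency per quarter).
    """
    shift_left = (prevention + appraisal) / (internal_failure + external_failure)
    total = prevention + appraisal + internal_failure + external_failure
    external_ratio = external_failure / total
    return shift_left, external_ratio

# Illustrative quarter: most spend goes to cleaning up production failures.
sl, ext = coq_ratios(prevention=5, appraisal=20,
                     internal_failure=15, external_failure=60)
print(f"shift-left ratio: {sl:.2f}, external failure ratio: {ext:.2f}")
```

Tracked quarter over quarter, a rising shift-left ratio and a falling external failure ratio are the distribution shift described above.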
Customer impact and risk reduction
Internal metrics tell one part of the story; customer impact tells the other.
Support ticket volume and severity, customer churn attributable to quality, NPS trend, feature adoption: these reflect how well the system actually serves users in real-world conditions. For the insulin pump, that's a clinical outcome. For the internal CMS, it's whether editors can do their jobs without friction.
Risk reduction closes the loop back to the prioritization you did earlier. The high-priority risks you identified and scored should be traceable to indicators here. Are those specific risks decreasing in exposure? Are the payment regression, performance degradation, SQL injection, and code complexity risks from our running example actually moving in the right direction? If you can't answer that per-risk, you've lost the traceability that made the prioritization worth doing.
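Per-risk traceability can be as simple as keeping each risk's exposure score from every review and checking the direction. A crude sketch, with made-up scores for the running example's risks:

```python
# Hypothetical exposure scores per risk, oldest review to newest.
# Risk names match the running example; the numbers are invented.
exposure_history = {
    "payment regression":      [12, 9, 6],
    "performance degradation": [8, 8, 10],
    "SQL injection":           [10, 7, 7],
    "code complexity":         [6, 7, 9],
}

def trend(scores):
    """Crude direction check: compare the latest score to the first."""
    if scores[-1] < scores[0]:
        return "decreasing"
    if scores[-1] > scores[0]:
        return "increasing"
    return "flat"

for risk, scores in exposure_history.items():
    print(f"{risk}: {trend(scores)}")
```

Even something this crude answers the per-risk question: in this invented data, payment regression is responding to the portfolio while performance and complexity are not.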
Efficiency: are we spending in the right places?
Knowing that risks are going down is half the answer. The other half is whether we're paying a reasonable price for that reduction.
The metrics I find most useful here: effort per risk by testing approach, and cost per high-severity defect found. The severity qualifier matters: "cost per defect found" without it is easy to game by counting low-value findings.
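The severity-qualified cost metric is simple to compute. The sketch below assumes severity-labelled defect counts per approach, and treats "critical" and "major" as high severity, which is itself a choice you would tune:

```python
def cost_per_high_severity_defect(effort_cost, defects_by_severity,
                                  high_severities=("critical", "major")):
    """Cost of a testing approach divided by the high-severity defects it found.

    Counting only high severities keeps the metric hard to game with
    low-value findings. Returns None if nothing high-severity was found.
    """
    high = sum(n for s, n in defects_by_severity.items() if s in high_severities)
    return effort_cost / high if high else None

# Illustrative comparison of two approaches, cost in person-days.
print(cost_per_high_severity_defect(10, {"critical": 2, "major": 3, "minor": 20}))  # 2.0
print(cost_per_high_severity_defect(30, {"critical": 1, "minor": 40}))              # 30.0
```

The unweighted version would rank the second approach as far more productive (41 defects versus 25); the severity qualifier reverses that conclusion.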
Delivery speed matters too: testing should produce timely feedback, not be the dominant source of waiting time. If your regression cycle is holding up releases, that's worth investigating.
I've seen only a couple of teams that actually did redundancy analysis. It's worth doing, though. Overlapping coverage isn't always bad when the checks are genuinely different (static plus dynamic, unit plus contract, specification-based plus experience-based). But when multiple tests check the same thing in the same way at the same level, that's just cost for no added confidence.
Rebalancing: turning evidence into decisions
So you've looked at the numbers. Now what do you change?
Every rebalancing decision falls into one of five types:
- Increase: the control shows strong risk reduction per unit cost
- Reduce or stop: low return, correlated with other checks, or obsolete
- Add: new risks emerged or coverage gaps were found
- Accept residual risk: with documented rationale, owner, and review date
- Reallocate: every increase names what gets reduced, because budget is finite
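The five decision types, plus the constraints on accepting residual risk and naming offsets, fit naturally in a lightweight decision record. A sketch; the field names are my assumptions, not any standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RebalancingDecision:
    risk: str
    action: str                      # "increase" | "reduce" | "add" | "accept" | "reallocate"
    rationale: str
    funded_by: Optional[str] = None  # what gets reduced to pay for an increase
    owner: Optional[str] = None      # required when accepting residual risk
    review_date: Optional[str] = None

    def validate(self):
        # Budget is finite: every increase or addition names its offset.
        if self.action in ("increase", "add") and not self.funded_by:
            raise ValueError("every increase must name what gets reduced")
        # Accepted residual risk needs an owner and a review date.
        if self.action == "accept" and not (self.owner and self.review_date):
            raise ValueError("accepted risk needs an owner and review date")

d = RebalancingDecision(
    risk="payment regression",
    action="add",
    rationale="defects found late; E2E suite brittle",
    funded_by="reduce E2E to happy-path scenarios",
)
d.validate()  # passes: the addition names its offset
```

Encoding the constraints as validation, rather than as a convention, is what keeps "accept" from silently becoming "ignore" and "increase" from silently becoming "accumulate".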
Here's what rebalancing looks like for the risks we scored, ranked, and built the portfolio around in the previous articles:
- Payment regression is covered only by slow end-to-end tests. The signal: defects are being found late, and the E2E suite is brittle. The decision: add unit tests for payment logic with explicit boundary coverage, and reduce E2E to happy-path scenarios. Faster feedback, lower cost, earlier detection.
- Security vulnerabilities are escaping to production. The signal: incidents show current controls are insufficient. The decision: add SAST in CI, make security-focused code review mandatory, and run periodic dynamic security testing. The portfolio moves from late-stage pen testing only to a static-plus-dynamic mix. Fewer escapes, lower incident frequency.
- Manual regression takes two weeks per release. The signal: high effort, diminishing returns, most checks duplicate existing automated coverage. The decision: automate critical regression scenarios, narrow manual scope to exploratory and high-uncertainty areas. Faster releases, maintained coverage, lower recurring cost.
- A new high-traffic feature is coming. Performance risk is unaddressed in the current portfolio. The decision: add load and stress tests with thresholds tied to the risk statement, and telemetry guardrails for rollout. Issues found pre-release.
- Code complexity is rising and delivery is slowing. The signal: rising cycle time and rework, issues discovered late. The decision: add static controls (linters, architecture review gates, code review focus areas) and track trends over time. Earlier detection of maintainability risk, less rework downstream.
Each decision also names what gets reduced. An increase in security investment means something else gets less. That constraint is what keeps the portfolio an actual portfolio, not an accumulation of everything you've ever added.
The feedback loop
This review is not the end of the process; it's the point where the cycle restarts.
Where the cycle takes you depends on what the review reveals:
- New risks nobody anticipated? Go back to identifying and quantifying them.
- Priorities changed because the product entered a new market? Re-score with updated exposure.
- The portfolio is underperforming against a specific risk? Adjust which controls produce evidence against it.
- A major change event like a new architecture or a new regulation? Restart from risk identification.
Here's how this looks in practice. You built the portfolio three months ago around payment risks. At review time, payment risks are down. But new security risks are surfacing, and performance signals are degrading. So the security risks go through identification and quantification. The performance signals do too. Both get re-prioritized alongside everything else, the portfolio is adjusted, and the review cycle continues.
Once again, the whole process is a continuous cycle, not a checklist you complete once.
Closing the loop
The flaky test suite from the first article in this series (70% coverage, two man-days a week lost to noise, a suite that nobody trusted) was actually a review problem. The portfolio was being maintained without anyone asking whether it was working. When we finally ran the numbers, the evidence was clear: the investment wasn't producing reliable risk reduction. So we cut it, accepted the lower coverage number, and rebuilt deliberately.
What we did instinctively then is what this process systematizes. Start with what could go wrong and what it would cost. Score and rank the risks so everyone makes a balanced decision together. Build a portfolio of testing investments traceable to those risks. Then measure, review, and rebalance continuously, because the risk landscape is always changing, and a portfolio left to run without review becomes stale.
The insulin pump and the internal CMS need completely different portfolios. What they share is this: good testing decisions come from knowing what you're investing against, tracking whether it's working, and adjusting when it isn't.
The full research behind this series is at BeyondQuality.