
In the previous article, I looked at how AI is already improving test automation: from test case generation to self-healing locators and assisted test conversions. These tools are in production today, delivering real value for many teams.
But not all AI applications in testing are ready for prime time. Some ideas are still evolving in research labs, internal experiments, or early adopter environments. They offer real promise, but also come with caveats: data quality challenges, high integration costs, and unpredictable results.
In this second part of the series, I’ll explore what’s emerging in AI-powered testing: where the research is heading, what’s being trialled in practice, and what hurdles still stand in the way of adoption.
Defect Prediction Using Historical Data
Defect prediction involves using historical software data to identify components of a system that are likely to contain defects in the future. This approach aims to optimise testing efforts and improve software quality by focusing resources on the most error-prone areas.
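To make the idea concrete, here is a minimal sketch of how a defect predictor might be trained on per-file history using scikit-learn. The features, labels, and data are invented placeholders; a real pipeline would mine them from version control and the issue tracker.

```python
# Minimal sketch: training a defect predictor on per-file change history.
# The features and data below are illustrative placeholders, not a real dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# One row per file per release: churn, number of authors, past bug fixes, etc.
history = pd.DataFrame({
    "lines_changed":   [120, 5, 300, 42, 15, 210, 8, 95],
    "num_authors":     [4, 1, 6, 2, 1, 5, 1, 3],
    "past_bug_fixes":  [7, 0, 12, 3, 1, 9, 0, 4],
    "file_age_months": [30, 2, 48, 12, 6, 36, 3, 18],
    "had_defect":      [1, 0, 1, 0, 0, 1, 0, 1],  # label derived from issue-tracker links
})

X = history.drop(columns="had_defect")
y = history["had_defect"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))

# In practice, unseen files would be ranked by predicted defect probability
# so that testing effort can be focused on the riskiest areas first.
risk_scores = model.predict_proba(X_test)[:, 1]
print("predicted defect risk:", risk_scores)
```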
Despite promising results in research, defect prediction using historical data remains difficult to adopt in real-world settings.
The first challenge is data quality: prediction models rely on accurate, consistent records of past defects and code changes, but many projects lack the necessary tagging discipline or historical depth. Even when data is available, models often don’t generalise well — what works for one team or codebase may perform poorly on another due to differing architectures, coding practices, or defect definitions.
Another issue is interpretability. More advanced models, especially deep learning-based ones, can act like black boxes, offering predictions without clear reasoning. This undermines trust and makes it harder for teams to take action based on the output. Finally, integrating these tools into existing workflows isn’t straightforward. Development and QA teams often need to adapt their processes and tooling to benefit from predictive models — and that can be a tough sell when the models themselves still show inconsistent reliability.
Test Suite Optimisation and Reduction
As software systems evolve, their test suites often grow substantially, leading to longer execution times and higher maintenance overhead. Test suite optimisation and reduction techniques address this by identifying and eliminating redundant or low-impact test cases, so that the most critical tests are prioritised.
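A common baseline in this space is greedy, coverage-based reduction: repeatedly keep the test that covers the most not-yet-covered code until nothing new is gained. A minimal sketch, with invented coverage data standing in for output from a real coverage tool:

```python
# Minimal sketch: greedy coverage-based test suite reduction.
# Coverage data here is invented for illustration; in practice it would come
# from running a coverage tool once per test case.
coverage = {
    "test_login":        {"auth.py", "session.py"},
    "test_login_again":  {"auth.py"},               # subsumed by test_login
    "test_checkout":     {"cart.py", "payment.py"},
    "test_cart_add":     {"cart.py"},
    "test_profile_edit": {"profile.py", "session.py"},
}

def reduce_suite(coverage):
    """Greedily pick tests until no remaining test adds new coverage."""
    remaining = set().union(*coverage.values())
    selected = []
    while remaining:
        # Pick the test covering the most still-uncovered elements.
        best = max(coverage, key=lambda t: len(coverage[t] & remaining))
        gain = coverage[best] & remaining
        if not gain:
            break
        selected.append(best)
        remaining -= gain
    return selected

print(reduce_suite(coverage))
# ['test_login', 'test_checkout', 'test_profile_edit']: same coverage, fewer tests
```

The over-reduction risk discussed below is visible even here: a "redundant" test may exercise the same lines but assert a different behaviour, which line coverage alone cannot capture.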
Despite a growing body of research, test suite optimisation and reduction remain difficult to apply reliably in real-world projects.
One of the biggest limitations is data quality: effective optimisation requires detailed test execution histories, failure logs, and accurate code coverage, data that many teams simply don’t collect or maintain consistently. Even when data is available, optimisation models often don’t transfer well across projects. Techniques that work in one codebase may be ineffective in another due to differences in architecture, test strategy, or how failures manifest.
Another concern is the risk of over-reduction: models might classify some test cases as redundant, even though they catch rare or edge-case bugs. This can lead to reduced fault detection, especially if the optimisation process isn’t well validated. Finally, integration adds friction. Teams need to adjust existing workflows and tools to take advantage of optimisation algorithms, and the return on investment isn’t always obvious, particularly if test suites aren’t huge to begin with.
ML-Powered Visual Testing
Visual regression testing focuses on identifying unintended visual changes in a user interface (UI) after code modifications. Traditional methods often involve pixel-by-pixel comparisons, which can be sensitive to minor, non-critical differences, leading to false positives. Machine learning (ML) introduces a more intelligent approach by understanding the context and significance of visual changes.
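As an illustration of what "understanding context" can mean in practice, the sketch below compares two screenshots using embeddings from a pretrained CNN (via torchvision) rather than raw pixels, so small rendering noise moves the score far less than a genuine layout change. The model choice, file paths, and similarity threshold are all assumptions made for the example, not a recommendation of any particular tool.

```python
# Minimal sketch: comparing two UI screenshots by pretrained-CNN embeddings
# instead of raw pixel difference. Model, paths, and threshold are illustrative.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Feature extractor: a ResNet with its classification head removed.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(image).squeeze(0)

# Placeholder paths for a baseline screenshot and the freshly captured one.
baseline = embed("baseline.png")
candidate = embed("candidate.png")

similarity = torch.nn.functional.cosine_similarity(baseline, candidate, dim=0).item()
# High similarity suggests cosmetic noise; low similarity flags a potential
# regression. The 0.98 threshold is an arbitrary example value.
print("visual regression?", similarity < 0.98)
```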
Despite the growing interest and promising research, ML-powered visual testing still faces several practical limitations. One key challenge is the prevalence of false positives: models may flag harmless or acceptable changes, such as font rendering differences or minor layout shifts, as regressions. This forces teams to manually inspect many issues, which undermines the promised efficiency gains.
Another limitation is the need for diverse and representative training data. To distinguish meaningful changes from noise, models must be exposed to a wide variety of UIs, resolutions, themes, and languages. Gathering this data at scale is difficult. Integration also remains a barrier: these tools often require additional infrastructure or changes in test pipelines, making adoption harder for teams with limited time or tooling flexibility.
Finally, computational cost can become an issue. Image-based testing with ML adds processing overhead, particularly in CI pipelines where speed is critical. As a result, while ML-powered visual testing is promising (especially for responsive UIs and localisation), it still requires careful human oversight and isn’t a drop-in replacement for conventional UI testing practices.
Synthetic Test Data Generation
Generating realistic test data is critical when privacy constraints or limited availability make real data hard to use. AI-driven synthetic data generation promises to bridge this gap — but how close are we to truly useful synthetic datasets? In the full whitepaper, I examine current capabilities, remaining limitations, and when it’s actually worth using synthetic data in testing workflows.
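As a small taste of what this looks like in code, here is a minimal sketch that uses the Faker library to produce privacy-safe, realistic-looking records. The schema is invented for illustration; real projects would mirror their own data model and often need to preserve statistical properties, not just plausible-looking values.

```python
# Minimal sketch: generating privacy-safe synthetic customer records with Faker.
# The schema is illustrative and does not model correlations in real data.
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible test data

def synthetic_customer() -> dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
        "order_total": round(fake.pyfloat(min_value=5, max_value=500), 2),
    }

customers = [synthetic_customer() for _ in range(5)]
for customer in customers:
    print(customer)
```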
LLM-Powered Test Log Summarisation and Triage
As test logs grow into gigabytes of noise, summarising them with AI becomes very tempting. Large Language Models (LLMs) can help — but not without serious pitfalls. From brittle prompts to domain-specific jargon and compute costs, the challenges are real. The whitepaper explores practical lessons from early adopters and what’s needed to make log triage with AI truly reliable.
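To show the shape of the problem, here is a minimal sketch of the pre-processing step that log triage tends to need before any model sees the data: filter and deduplicate failure lines, then build a compact prompt. The log format and prompt wording are invented, and the actual LLM call is left as a stub because vendor APIs and model choices vary.

```python
# Minimal sketch: condensing a noisy test log into a prompt for LLM triage.
# The log format and prompt wording are invented; the LLM call is stubbed out.
import re

RAW_LOG = """
[10:01:03] INFO  starting suite checkout
[10:01:07] ERROR test_payment_declined: AssertionError: expected 402, got 500
[10:01:07] ERROR test_payment_declined: AssertionError: expected 402, got 500
[10:01:12] WARN  retrying flaky network call
[10:01:19] ERROR test_cart_total: TimeoutError: page did not load in 30s
[10:02:44] INFO  suite finished: 2 failed, 148 passed
"""

def condense(log: str, max_lines: int = 50) -> str:
    """Keep unique ERROR lines (order preserved) and cap what is sent to the model."""
    errors = re.findall(r"^\[.*?\]\s+ERROR\s+(.*)$", log, flags=re.MULTILINE)
    unique = list(dict.fromkeys(errors))  # dedupe, keep first occurrence
    return "\n".join(unique[:max_lines])

def build_prompt(condensed: str) -> str:
    return (
        "You are triaging automated test failures.\n"
        "Group the failures below by likely root cause and suggest an owner area.\n\n"
        f"{condensed}"
    )

prompt = build_prompt(condense(RAW_LOG))
print(prompt)
# response = some_llm_client.complete(prompt)  # vendor-specific call, omitted here
```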
AI-Assisted Exploratory Test Agents at Scale
The idea of AI agents exploring apps like human testers sounds great on paper. But how much of this is actually happening in practice? Can AI simulate diverse user behaviours without human guidance? In the full whitepaper, I cover real-world experiments, benefits, and why scaling exploratory agents remains a high-effort, high-maintenance strategy for now.
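For a sense of the mechanics, here is a deliberately naive exploratory loop built on Selenium: it clicks random interactive elements and records which states it has visited. The start URL and selectors are placeholders, and real research prototypes layer learned action policies, oracles, and crash detection on top of a loop like this, which is exactly where the effort and maintenance costs come from.

```python
# Minimal sketch: a naive exploratory agent that randomly clicks through a web app
# and tracks visited states. The start URL and selectors are placeholders.
import random
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder app under test

visited_states = set()
for step in range(30):
    visited_states.add(driver.current_url)
    clickable = driver.find_elements(By.CSS_SELECTOR, "a, button")
    if not clickable:
        break
    try:
        random.choice(clickable).click()
    except Exception:
        # Stale or hidden elements are common during blind exploration; back out.
        driver.back()

print(f"explored {len(visited_states)} distinct URLs in {step + 1} steps")
driver.quit()
```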
Takeaway
AI is steadily expanding its role in test automation — but not everything is production-ready yet.
Techniques like defect prediction, test suite optimisation, and visual testing show real promise, but they face hurdles in data quality, scalability, and integration. In the full whitepaper, I go deeper into these emerging areas and also cover synthetic data, LLM-driven log analysis, and exploratory AI agents.