
In the first article of this series, I focused on where AI is already proving useful: generating test cases and assisting with test automation. In the second, I explored what’s still emerging: promising ideas like defect prediction and visual testing that work in some places but aren’t widely adopted yet.
In this final part, I want to address what isn’t working, at least not yet. Some AI testing ideas sound great in theory but fall apart in practice. Others are driven more by marketing than engineering reality. And a few remain persistent myths, like the idea that AI will soon replace human testers.
Let’s look at what’s still hype.
Autonomous Black-Box Test Generation from User Behaviour
Automatically generating test cases from user interactions is an interesting idea. It promises to reduce manual effort and improve coverage by learning from real behaviour. But in practice, it remains largely out of reach.
The core issue is a lack of intent understanding. These systems can replicate click paths or navigation patterns, but they don’t understand why users act the way they do. As a result, they often generate tests that execute without errors but validate nothing meaningful. Worse, because most behavioural data reflects common usage, the models tend to overfit to happy paths, precisely the flows least likely to reveal serious bugs.
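To make that concrete, here is a minimal sketch of the kind of frequency-based path mining such a system might use. The data shapes and function names are invented for illustration, not taken from any vendor's product; the point is that ranking candidate tests by how often a path occurs guarantees the output is dominated by the same happy paths users already exercise.

```python
# Minimal sketch: rank candidate tests by how often a click path occurs in
# telemetry. Data shapes and names are invented for illustration only.
from collections import Counter

def mine_candidate_tests(sessions, top_n=5):
    """sessions: list of click paths, e.g. [["login", "dashboard", "logout"], ...]"""
    path_counts = Counter(tuple(path) for path in sessions)
    # Frequency-ranked output: common (happy) paths win, rare flows are dropped.
    return [list(path) for path, _ in path_counts.most_common(top_n)]

sessions = [
    ["login", "dashboard", "logout"],
    ["login", "dashboard", "logout"],
    ["login", "settings", "change_email", "logout"],  # rare flow, unlikely to surface
]
print(mine_candidate_tests(sessions, top_n=1))  # only the happy path survives
```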
Even when test cases are generated, the signal-to-noise ratio is poor. Outputs are often redundant, flaky, or misaligned with testing goals, requiring heavy manual review and cleanup. Deploying these systems at all demands significant infrastructure: telemetry pipelines, data cleaning, and orchestration layers that many teams simply don’t have. And even then, generalisation is a problem. What works on one app or platform often fails elsewhere without substantial fine-tuning.
In short, while the concept is attractive, current implementations are brittle, narrow in scope, and costly to maintain — far from the plug-and-play vision often sold in marketing.
Self-Directed Exploratory Testing Agents
Some of the most ambitious claims in AI testing involve autonomous agents that can explore applications like human testers, clicking through interfaces, discovering edge cases, and identifying unexpected behaviours without explicit scripts or scenarios. While this idea has gained traction in marketing materials, there’s no real practical application so far.
The research is ongoing but not production-ready. In one recent paper, researchers proposed new methods to help large models like GPT-4o perform goal-directed exploration. While the paper reports encouraging results (up to 30% improvement on benchmark tasks), it ultimately presents a research direction, not a deployable system. The agent’s behaviour is still fragile, compute-heavy, and heavily reliant on curated environments and fine-tuning.
Additionally, the risk of misalignment is currently unquantifiable: autonomous agents can pursue unexpected or even harmful strategies when left unsupervised. Because these agents interact with systems dynamically, AI-to-AI interactions may produce emergent behaviours that developers cannot predict or audit. The legal and ethical implications are still poorly understood.
What’s worse, models trained on observational data often behave irrationally when turned into agents. They suffer from issues like auto-suggestive delusions and predictor-policy incoherence, where the agent’s own actions distort its internal state or expectations. These issues can only be resolved by training the model on its own actions, a costly and complex process that rules out using generic LLMs like ChatGPT or Gemini in agentic roles.
Most existing agent frameworks depend on extensive scaffolding: reflection loops, planning graphs, state evaluation heuristics, and external tools like Monte Carlo Tree Search. Without these, the agents behave like click bots, lacking prioritisation, meaningful hypotheses, or judgement.
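To illustrate how much of that scaffolding lives outside the model, here is a skeleton exploration loop in Python. Every name in it (App, Planner, the novelty heuristic) is a stub I have invented for illustration, not part of any real framework; the point is that planning, reflection, and state evaluation all have to be engineered around the LLM rather than emerging from it.

```python
# Illustrative only: the scaffolding an exploratory agent needs before it is
# more than a click bot. Every class and function here is a stand-in stub.

NOVELTY_THRESHOLD = 0.1

class App:
    """Stub UI driver: observe the current state, perform an action."""
    def observe(self): return {"screen": "login", "elements": ["user", "pass", "submit"]}
    def perform(self, action): return {"action": action, "outcome": "ok"}

class Planner:
    """Stub for the LLM-backed planning and reflection calls."""
    def propose_action(self, state, history): return "click:submit"
    def reflect(self, state, action, result): return "expected"

def novelty(result, history):
    """Crude state-evaluation heuristic: how often have we seen this outcome?"""
    seen = sum(1 for _, _, r, _ in history if r == result)
    return 1.0 / (1 + seen)

def explore(app, planner, max_steps=20):
    history = []
    for _ in range(max_steps):
        state = app.observe()
        action = planner.propose_action(state, history)    # planning
        result = app.perform(action)                        # acting
        verdict = planner.reflect(state, action, result)    # reflection loop
        history.append((state, action, result, verdict))
        if novelty(result, history) < NOVELTY_THRESHOLD:    # stop when nothing new is found
            break
    return history

print(len(explore(App(), Planner())))
```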
Generalised Test Case Prioritisation via ML
Using ML to prioritise test cases sounds like an obvious win, but in practice it remains elusive. Data dependency, model opacity, and hefty infrastructure demands keep this promising idea out of reach for most teams. In the full whitepaper, I explain why ML-driven test case prioritisation is harder than it looks and when it might actually pay off.
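For a feel of what even a basic version involves, here is a minimal sketch assuming scikit-learn is available. The features and execution history are invented; assembling clean data like this over months of runs is exactly the dependency that makes the approach hard in practice.

```python
# A minimal, hypothetical sketch of ML-based test case prioritisation:
# rank tests by predicted failure probability. All data here is invented.
from sklearn.linear_model import LogisticRegression

# Per-test features: [recent failure rate, files changed in touched area, runtime in minutes]
X_history = [[0.30, 5, 2.0], [0.05, 1, 0.5], [0.50, 8, 4.0], [0.01, 0, 0.2]]
y_history = [1, 0, 1, 0]  # 1 = test failed on that historical run

model = LogisticRegression().fit(X_history, y_history)

candidate_tests = {"checkout_flow": [0.25, 4, 1.5], "profile_edit": [0.02, 1, 0.3]}
ranked = sorted(candidate_tests,
                key=lambda t: model.predict_proba([candidate_tests[t]])[0][1],
                reverse=True)
print(ranked)  # tests most likely to fail are scheduled first
```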
From Agents to Autonomy: the final leap or a hallucination?
The idea of AI agents independently exploring and learning from software has undeniable appeal. If an AI can navigate an app, detect regressions, and even learn from mistakes, what’s left for a human tester to do? That’s the line of thought behind one of the most persistent and misleading claims in AI test automation today: that testers will soon be fully replaced.
Replacing Testers with AI: still a myth
Will AI replace human testers? The myth persists, but reality keeps proving otherwise. From contextual judgement gaps to economic impracticalities, replacing testers remains a fantasy. I cover this in detail in the full whitepaper — including why domain understanding and dynamic prioritisation are still uniquely human strengths.
Conclusion: AI testing is real, and so are its limits
AI is already transforming parts of the testing lifecycle, from generating test cases to helping teams prioritise what to run. These capabilities are not futuristic: they are in production use, delivering real value today. At the same time, we must be honest about what AI cannot yet do, especially when it comes to adaptive reasoning, contextual judgement, and exploratory testing.
Some emerging ideas, like agentic testing or test log summarisation, show promise but remain fragile or unproven at scale. Others, like the dream of fully autonomous testing systems, are, at best, speculative. The risks of overreliance on such promises are high: wasted effort, distrust, and missed bugs.
That’s why critical thinking is so essential right now. Understanding what’s real, what’s emerging, and what’s still just hype allows teams to invest wisely — not just in AI, but in the humans who guide, evaluate, and improve its use. The future of AI in testing isn’t about replacement. It’s about partnership: helping testers focus on what only they can do, while machines take on what they do best.