Deterministic test suites cannot judge probabilistic systems, so the QA engineer's job moved from asserting exact outputs to designing evals. When the same prompt can return two correct answers, a string assertion fails good responses and passes bad ones; the new craft is grading behaviour against a golden reference.
QA used to have a clear contract with the codebase: given this input, assert that output. Cover the edge cases, automate the regression suite, and a green build meant the behaviour you tested still held.
That contract assumes determinism. Agentic systems break it. Ask an agent the same question twice and you may get two different, both-correct answers. A test that asserts an exact string now fails on a perfectly good response, and a test that asserts nothing meaningful passes on a bad one.
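A minimal sketch of that mismatch, in Python. The responses, the expected string, and the check names are all hypothetical, invented for illustration; the point is only that an exact-match assertion rejects a correct paraphrase while a too-weak assertion accepts a wrong answer.

```python
import re

# Hypothetical agent responses to the same question. A and B are both correct.
RESPONSE_A = "You can return the item within 30 days of delivery."
RESPONSE_B = "Returns are accepted up to 30 days after the item is delivered."
BAD_RESPONSE = "Returns are not possible."

EXPECTED = "You can return the item within 30 days of delivery."


def exact_match(response: str) -> bool:
    """The old contract: pass only if the output equals one fixed string."""
    return response == EXPECTED


def says_anything(response: str) -> bool:
    """An assertion too weak to mean anything: any non-empty text passes."""
    return len(response) > 0


def mentions_return_window(response: str) -> bool:
    """A behavioural check: the response must state the 30-day window."""
    return re.search(r"\b30\s+days\b", response, flags=re.IGNORECASE) is not None


for name, response in [("A", RESPONSE_A), ("B", RESPONSE_B), ("bad", BAD_RESPONSE)]:
    print(name, exact_match(response), says_anything(response), mentions_return_window(response))
```

Only the behavioural check sorts all three responses correctly: it accepts A and B and rejects the bad one.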
From assertions to evals
In the AIDLC method, this is the Eval phase, and it is where QA's leverage has grown the most. Instead of asserting exact outputs, the QA engineer designs evals: golden datasets of input-and-acceptable-output pairs, rubric-based scoring, and LLM-as-judge suites that grade whether a response stays inside the boundary the spec defined.
This is harder and more valuable than writing assertions. Deciding what "good enough" means for a probabilistic system, encoding it so a machine can grade it, and keeping the dataset honest as the product evolves is a senior discipline. It is the difference between a system that ships confidently and one that regresses silently.
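One way to encode "good enough" so a machine can grade it is a golden dataset paired with a simple rubric. The sketch below is an assumption about what that could look like, not a prescribed AIDLC format: `GoldenCase`, `rubric_score`, `run_eval`, and the stub agent are hypothetical names, and the rubric is plain substring matching. An LLM-as-judge grader would sit behind the same signature as `rubric_score`; it is left out so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden-dataset entry: a prompt plus the boundary for acceptable answers."""
    prompt: str
    must_include: tuple[str, ...]      # facts every acceptable answer states
    must_not_include: tuple[str, ...]  # content the spec rules out


def rubric_score(case: GoldenCase, response: str) -> float:
    """Return the fraction of rubric criteria the response satisfies (0.0 to 1.0)."""
    text = response.lower()
    checks = [term.lower() in text for term in case.must_include]
    checks += [term.lower() not in text for term in case.must_not_include]
    return sum(checks) / len(checks) if checks else 1.0


def run_eval(agent: Callable[[str], str], dataset: list[GoldenCase]) -> list[float]:
    """Run the agent on every golden prompt and score each response."""
    return [rubric_score(case, agent(case.prompt)) for case in dataset]


# Hypothetical dataset and a stub agent standing in for a real model call.
DATASET = [
    GoldenCase(
        prompt="What is the return window?",
        must_include=("30 days",),
        must_not_include=("no returns",),
    ),
    GoldenCase(
        prompt="Can I get a refund without a receipt?",
        must_include=("receipt",),
        must_not_include=("guaranteed",),
    ),
]


def stub_agent(prompt: str) -> str:
    if "return window" in prompt:
        return "Returns are accepted up to 30 days after delivery."
    return "A refund is guaranteed in every case."


scores = run_eval(stub_agent, DATASET)
print(scores)
```

The first case passes both criteria; the second fails both, because the stub never mentions a receipt and promises a guaranteed refund. Keeping the dataset honest means revising `must_include` and `must_not_include` whenever the spec changes.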
The new gate
Evals are not a phase you run at the end. In AIDLC they run on every change, the same way unit tests used to, gating each diff the agent generates. A behavioural regression that would have reached a user gets caught in CI instead. The QA engineer owns that gate, and owning it makes them one of the highest-leverage people on the team.
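A gate like this can be as small as one function that turns eval scores into an exit code. The sketch below assumes per-case scores between 0.0 and 1.0; `eval_gate` and both thresholds are hypothetical values chosen for illustration, not AIDLC defaults.

```python
def eval_gate(scores: list[float], pass_score: float = 0.8, min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 lets the diff through, 1 blocks it."""
    if not scores:
        return 1  # fail closed: no eval results means nothing was verified
    passed = sum(1 for s in scores if s >= pass_score)
    pass_rate = passed / len(scores)
    print(f"eval pass rate: {pass_rate:.0%} ({passed}/{len(scores)})")
    return 0 if pass_rate >= min_pass_rate else 1
```

In a CI job the result would typically be handed to `sys.exit(...)`, so a drop below the threshold fails the build the same way a red unit test does.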
They also test the agent's decisions, not just its output. Did it call the right tool, in the right order, and stop when it had enough information? Those are the failures that matter in agentic systems, and catching them is the new craft.
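Those three questions can be checked against a recorded trace of tool calls. The sketch below assumes the agent framework can export the tool names it called, in order, as a list of strings; the tool names, `check_trajectory`, and the `max_calls` budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryCheck:
    right_tools: bool      # every expected tool was called
    right_order: bool      # expected tools appear in the expected relative order
    stopped_in_time: bool  # the agent did not exceed its call budget

    @property
    def passed(self) -> bool:
        return self.right_tools and self.right_order and self.stopped_in_time


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` appears in `actual` in order, allowing other calls between."""
    remaining = iter(actual)
    return all(step in remaining for step in expected)


def check_trajectory(calls: list[str], expected_order: list[str], max_calls: int) -> TrajectoryCheck:
    """Grade one recorded run of tool calls against the expected path."""
    return TrajectoryCheck(
        right_tools=set(expected_order) <= set(calls),
        right_order=is_subsequence(expected_order, calls),
        stopped_in_time=len(calls) <= max_calls,
    )


EXPECTED = ["search_orders", "get_order", "issue_refund"]

good = check_trajectory(["search_orders", "get_order", "issue_refund"], EXPECTED, max_calls=4)
wrong_order = check_trajectory(["issue_refund", "search_orders", "get_order"], EXPECTED, max_calls=4)
kept_going = check_trajectory(
    ["search_orders", "search_orders", "get_order", "get_order", "issue_refund"],
    EXPECTED,
    max_calls=4,
)
print(good.passed, wrong_order.passed, kept_going.passed)
```

The second run calls the right tools but issues the refund before looking up the order, and the third reaches the right answer only after exceeding its budget. An output-only check would miss both.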
If your team ships agent features with no eval suite, you are running probabilistic software with a deterministic safety net that no longer fits. The mismatch is where your incidents come from.
The QA engineers who win
They build eval harnesses, not just test suites. They curate golden datasets as an asset. They grade behaviour and decisions, not just strings. And they measure their work in regressions caught before release, which is the only metric that survives the move to probabilistic systems.
