
The QA Engineer in the Agentic Era: From Writing Tests to Designing Evals

AIDLC, QA, Testing, Evals, Agentic, Software Lifecycle

Deterministic test suites cannot judge probabilistic systems, so the QA engineer's job has moved from asserting exact outputs to designing evals. When the same prompt can return two correct answers, an exact-string assertion fails good responses, and an assertion loosened enough to stay green passes bad ones; the new craft is grading behaviour against a golden reference.

QA used to have a clear contract with the codebase. Given this input, assert that output. Cover the edge cases, automate the regression suite, and a green build meant the behaviour you tested still held.

That contract assumes determinism. Agentic systems break it. Ask an agent the same question twice and you may get two different answers that are both correct. A test that asserts an exact string now fails on a perfectly good response, and a test that asserts nothing meaningful passes on a bad one.
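A minimal sketch of that failure mode, using made-up responses to a hypothetical refund-policy question. The strings and the two check functions are illustrative assumptions, not taken from any real suite.

```python
# Hypothetical agent responses to "What is the refund window?"
# The first two are both correct; only the wording differs.
response_a = "Refunds are accepted within 30 days of purchase."
response_b = "You can get a refund up to 30 days after you buy."
wrong = "Refunds are accepted within 90 days of purchase."

EXPECTED = "Refunds are accepted within 30 days of purchase."


def exact_match(response: str) -> bool:
    """Classic deterministic assertion: pass only on the exact string."""
    return response == EXPECTED


def non_empty(response: str) -> bool:
    """Over-loosened check: passes anything that is not blank."""
    return bool(response.strip())


print(exact_match(response_a))  # True
print(exact_match(response_b))  # False, although the answer is correct
print(non_empty(wrong))         # True, although the answer is wrong
```

Neither check measures what matters here, which is whether the response states the right policy.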

From assertions to evals

In the AIDLC method, this is the Eval phase, and it is where QA's leverage has grown the most. Instead of asserting exact outputs, the QA engineer designs evals: golden datasets that pair each input with its acceptable outputs, rubric-based scoring, and LLM-as-judge suites that grade whether a response stays inside the boundary the spec defined.

This is harder and more valuable than writing assertions. Deciding what "good enough" means for a probabilistic system, encoding it so a machine can grade it, and keeping the dataset honest as the product evolves is a senior discipline. It is the difference between a system that ships confidently and one that regresses silently.
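One way to picture a golden case and a rubric score is the sketch below. It assumes a deliberately simple rubric (required facts and forbidden claims matched as substrings); `GoldenCase`, `score`, and the example data are hypothetical names chosen for illustration. A real suite would likely replace the substring check with an LLM-as-judge call or a semantic comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One entry in a golden dataset (hypothetical shape)."""
    prompt: str
    must_include: tuple[str, ...]           # facts an acceptable answer states
    must_not_include: tuple[str, ...] = ()  # claims outside the spec boundary


def score(case: GoldenCase, response: str) -> float:
    """Return a rubric score in [0, 1]; any forbidden claim scores 0."""
    text = response.lower()
    if any(bad.lower() in text for bad in case.must_not_include):
        return 0.0
    if not case.must_include:
        return 1.0  # nothing required and nothing forbidden was said
    hits = sum(1 for fact in case.must_include if fact.lower() in text)
    return hits / len(case.must_include)


case = GoldenCase(
    prompt="What is the refund window?",
    must_include=("refund", "30 days"),
    must_not_include=("90 days",),
)

print(score(case, "Refunds are accepted within 30 days of purchase."))   # 1.0
print(score(case, "You can get a refund up to 30 days after you buy."))  # 1.0
print(score(case, "Refunds are accepted within 90 days of purchase."))   # 0.0
```

Both correct phrasings score the same, and the wrong answer is rejected even though it is well-formed. That is the property an exact-string assertion cannot give you.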

The new gate

Evals are not something you run once at the end. In AIDLC they run on every change, the same way unit tests used to, gating each diff the agent generates. A behavioural regression that would have reached a user gets caught in CI instead. The QA engineer owns that gate, and owning it makes them one of the highest-leverage people on the team.
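As a rough sketch, the gate itself can be a small function over per-case eval scores. The function name `gate` and the two thresholds are assumptions for illustration; the right values depend on the product and how noisy the graders are.

```python
from typing import Sequence


def gate(
    scores: Sequence[float],
    pass_mark: float = 0.8,      # assumed per-case score needed to count as a pass
    min_pass_rate: float = 0.9,  # assumed share of cases that must pass
) -> bool:
    """Return True when enough golden cases pass to let the change merge."""
    if not scores:
        return False  # an empty eval run should never count as green
    passed = sum(1 for s in scores if s >= pass_mark)
    return passed / len(scores) >= min_pass_rate


print(gate([1.0, 1.0, 0.9, 1.0]))  # True: every case clears the pass mark
print(gate([1.0, 0.5, 0.4, 1.0]))  # False: only half the cases pass
print(gate([]))                    # False: no evidence, no merge
```

In CI, the calling script would exit non-zero when `gate` returns False, so a failed eval run blocks the diff the same way a failed unit test would.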

They also test the agent's decisions, not just its output. Did it call the right tool, in the right order, and stop when it had enough information? Those are the failures that matter in agentic systems, and catching them is the new craft.
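Those three questions can be turned into checks over a recorded trace of tool calls. The sketch below assumes the trace is available as a plain list of tool names; the tool names and both helper functions are hypothetical.

```python
from typing import Iterable, Sequence


def follows_order(trace: Iterable[str], expected: Sequence[str]) -> bool:
    """True when `expected` appears in `trace` as an ordered subsequence."""
    remaining = iter(trace)
    # `step in remaining` advances the iterator, so each expected step
    # must be found after the one before it.
    return all(step in remaining for step in expected)


def stopped_in_time(trace: Sequence[str], max_calls: int) -> bool:
    """True when the agent stopped within its tool-call budget."""
    return len(trace) <= max_calls


good = ["search_orders", "get_order", "issue_refund"]
bad = ["issue_refund", "get_order"]  # refunded before looking up the order
required = ["get_order", "issue_refund"]

print(follows_order(good, required))  # True
print(follows_order(bad, required))   # False
print(stopped_in_time(good, 5))       # True
```

A subsequence check tolerates extra harmless calls between the required steps, which suits an agent that may take slightly different routes to a correct result. A stricter suite could require an exact sequence instead.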

If your team ships agent features with no eval suite, you are running probabilistic software with a deterministic safety net that no longer fits. The mismatch is where your incidents come from.

The QA engineers who win

They build eval harnesses, not just test suites. They treat golden datasets as an asset to be curated. They grade behaviour and decisions, not just strings. And they measure their work in regressions caught before release, which is the only metric that survives the move to probabilistic systems.



About Pooya Golchian

